HTML Parsing – Extracting and Removing an HTML Tag of a Specific Class From a String in PHP

There might be a time when you're scraping a page and would like to extract a specific div block from the returned HTML. I find that regular expression searches do not always work on large blocks of text. I have made a PHP function that takes an identifier value plus start and end delimiter tags and extracts the exact tag block you want.

In this example you have the following snippet of HTML. Let's say you want to extract just the div block with the class name 'content'.

<html>
<body>
    <div class="wrapper">
        <div class="header"></div>
        <div class="content">some content</div>
        <div class="footer"></div>
    </div>
</body>
</html>

You can extract the div block using the extractTag() function, which will return '<div class="content">some content</div>'. The $str parameter is the HTML text being searched.


$tag = extractTag($str,'class="content"','<div','/div>');

You can also extract a block of text without an id value. Pass an empty string for the id and just supply the delimiters; the first block that starts with the start delimiter is returned. These functions will work on any block of text with identifiable delimiters.


$tag = extractTag($str,'','<head','/head>');


function extractTag($str,$id,$start_tag,$end_tag)
{
    //str - string to search
    //id - text to search for
    //start_tag - start delimiter
    //end_tag - end delimiter

    if((string)$id !== '')
    {
        //get position of id value
        $pos_srch = strpos($str,$id);
        if($pos_srch === false)
            return ''; //id value not found

        //extract string up to id value
        $beg = substr($str,0,$pos_srch);

        //get position of the closest start delimiter before the id value
        $pos_start_tag = strrpos($beg,$start_tag);
    }
    else
    {
        //if no id value get first tag found
        $pos_start_tag = strpos($str,$start_tag);
    }

    if($pos_start_tag === false)
        return ''; //start delimiter not found

    //get position of end delimiter
    $pos_end_tag = strpos($str,$end_tag,$pos_start_tag);
    if($pos_end_tag === false)
        return ''; //end delimiter not found

    //length of end delimiter
    $end_tag_len = strlen($end_tag);
    //length of string to extract
    $len = ($pos_end_tag+$end_tag_len)-$pos_start_tag;
    //extract the tag
    $tag = substr($str,$pos_start_tag,$len);

    return $tag;
}
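
To see why the function returns the expected block, the same steps can be traced by hand with PHP's built-in string functions. The snippet below is only an illustration: it flattens the sample markup onto one line (an assumption made for readability) and hard-codes the arguments from the first example.

```php
<?php
// Sample markup from the example above, flattened onto one line (assumption).
$str = '<div class="wrapper"><div class="header"></div>'
     . '<div class="content">some content</div><div class="footer"></div></div>';

// 1. Locate the identifier.
$pos_srch = strpos($str, 'class="content"');

// 2. Search backwards from the identifier for the closest start delimiter.
$pos_start_tag = strrpos(substr($str, 0, $pos_srch), '<div');

// 3. Search forwards from the start delimiter for the first end delimiter.
$pos_end_tag = strpos($str, '/div>', $pos_start_tag);

// 4. Cut out everything from the start delimiter through the end delimiter.
$tag = substr($str, $pos_start_tag, $pos_end_tag + strlen('/div>') - $pos_start_tag);

echo $tag, PHP_EOL; // <div class="content">some content</div>
```

Step 2 is what ties the identifier to its own tag: searching backwards picks the nearest '<div' before the class attribute rather than the first one in the document.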


Removing tags from string

The following will remove all instances of a block of text within a string. This could be useful if you wanted to remove multiple instances of an HTML tag of a certain class from your HTML string.


//Will remove all hr tags of page-break class from $html string
$parseStr = removeTag($html,'page-break','<hr','/>');

function removeTag($str,$id,$start_tag,$end_tag)
{
    //str - string to search
    //id - text to search for
    //start_tag - start delimiter to remove
    //end_tag - end delimiter to remove

    //an empty identifier cannot be located, so return the string unchanged
    if((string)$id === '')
        return $str;

    //get length of end tag
    $end_tag_len = strlen($end_tag);

    //find position of tag identifier. loops until all instances of text are removed
    while(($pos_srch = strpos($str,$id)) !== false)
    {
        //get text before identifier
        $beg = substr($str,0,$pos_srch);
        //get position of the closest start tag before the identifier
        $pos_start_tag = strrpos($beg,$start_tag);
        //find position of end tag after the identifier
        $pos_end_tag = strpos($str,$end_tag,$pos_srch);

        //stop if the identifier is not enclosed by both delimiters
        if($pos_start_tag === false || $pos_end_tag === false)
            break;

        //keep text before the start tag and text after the end tag
        $str = substr($str,0,$pos_start_tag).substr($str,$pos_end_tag+$end_tag_len);
    }

    //return processed string
    return $str;
}
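
As a quick check of the behavior described above, here is a small run on a made-up input. The $html value is hypothetical, and the condensed function body is only a stand-in so the snippet runs on its own; it is skipped when removeTag() from this post is already loaded.

```php
<?php
// Condensed stand-in for removeTag() so this snippet is self-contained.
if (!function_exists('removeTag')) {
    function removeTag($str, $id, $start_tag, $end_tag)
    {
        while ($id !== '' && ($pos_srch = strpos($str, $id)) !== false) {
            $pos_start_tag = strrpos(substr($str, 0, $pos_srch), $start_tag);
            $pos_end_tag = strpos($str, $end_tag, $pos_srch);
            if ($pos_start_tag === false || $pos_end_tag === false) {
                break; // identifier is not enclosed by both delimiters
            }
            $str = substr($str, 0, $pos_start_tag)
                 . substr($str, $pos_end_tag + strlen($end_tag));
        }
        return $str;
    }
}

// Hypothetical input: two page-break rules and one plain rule.
$html = '<p>One</p><hr class="page-break" /><p>Two</p><hr />'
      . '<p>Three</p><hr class="page-break" />';

$parseStr = removeTag($html, 'page-break', '<hr', '/>');

echo $parseStr, PHP_EOL; // <p>One</p><p>Two</p><hr /><p>Three</p>
```

Only the hr tags that contain the identifier are removed; the plain hr in the middle is left alone because the backwards search always starts from an occurrence of 'page-break'.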


2 Comments.

  1. Thanks for the script

    pretty usefull

  2. This does not parse out a tag that has child elements of the same tag. For example, you could not parse out the entire 'wrapper' div tag in the above example. Will work on correcting this behavior.
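
The limitation in the second comment comes from stopping at the first end delimiter after the start tag. One possible way around it, sketched here as an assumption and not as the author's eventual fix, is to count opening and closing tags until the depth returns to zero. The function name extractNestedTag() and its $tag_name parameter are hypothetical, and the sketch assumes balanced markup in which no other tag name begins with the same letters (for example, '<div' would also match a '<divider' tag).

```php
<?php
// Hypothetical helper, not part of the original post: returns the whole element
// that carries $id, including child elements of the same tag name.
function extractNestedTag($str, $id, $tag_name)
{
    $open = '<' . $tag_name;
    $close = '</' . $tag_name . '>';

    $pos_srch = strpos($str, $id);
    if ($pos_srch === false) {
        return ''; // identifier not found
    }
    $pos_start = strrpos(substr($str, 0, $pos_srch), $open);
    if ($pos_start === false) {
        return ''; // no opening tag before the identifier
    }

    $depth = 0;
    $pos = $pos_start;
    while (true) {
        $next_open = strpos($str, $open, $pos);
        $next_close = strpos($str, $close, $pos);
        if ($next_close === false) {
            return ''; // unbalanced markup
        }
        if ($next_open !== false && $next_open < $next_close) {
            $depth++; // entered the element itself or one of its children
            $pos = $next_open + strlen($open);
        } else {
            $depth--; // left an element
            $pos = $next_close + strlen($close);
            if ($depth === 0) {
                return substr($str, $pos_start, $pos - $pos_start);
            }
        }
    }
}

// Sample markup from the post, flattened onto one line (assumption).
$str = '<div class="wrapper"><div class="header"></div>'
     . '<div class="content">some content</div><div class="footer"></div></div>';

$wrapper = extractNestedTag($str, 'class="wrapper"', 'div');
$content = extractNestedTag($str, 'class="content"', 'div');

echo $wrapper, PHP_EOL; // the whole wrapper div, children included
echo $content, PHP_EOL; // <div class="content">some content</div>
```

Unlike extractTag(), this variant takes a tag name in place of free-form delimiters, since it has to tell opening tags from closing tags in order to track depth.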
