Is regex the right tool to find HTML strings?

Question


I have a PHP script that pulls some content from a server, but the problem is that the line the content is on changes every day, so I can't just pull out a specific line. However, the content is contained in a div that has a unique id. Is it possible (and is it the best way) to use a regex to search for this unique id and then pass the line it is on to my script?

Example:

HTML file:

    <html><head><title>Example</title></head>
    <body>
    <div id="Alpha"> Blah blah blah </div>
    <div id="Beta"> Blah Blah Blah </div>
    </body>
    </html>

So let's say I'm looking for the line with the opening div tag whose id is Alpha. The code should return 3, because the div with the id Alpha is on the third line.

+4

html php regex

waiwai933 Nov 19 '09 at 3:34


7 answers

According to Jeff Atwood, you should never parse HTML with regex.

+3

Asaph Nov 19 '09 at 3:35


At the risk of giving more votes to Jeff, who has already crossed the mountains of madness... see here

The argument goes back and forth, but if it's a simple one-off or rarely used script that you're writing, then sure, use a regex. If it's more complex and needs to stay reliable with little future tweaking, then I'd suggest using an HTML parser. HTML is a nasty, often irregular beast to tame. Use the right tool for the job: perhaps in your case it's a regex, or maybe it's a full-blown parser.

+3

beggs Nov 19 '09 at 3:44


Generally NO. But if you are sure that the div will always be on one line and that there is no other div inside it, you can use a regex without any problems. Something like /<div id="mydivid">(.*?)<\/div>/ or similar.
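As a minimal sketch of that pattern in use (the sample strings and variable names are made up for illustration), the following shows one case where it works and the nested-div case where the lazy match stops too early:

```php
<?php
// The pattern from above; the slash in </div> is escaped because / is the delimiter.
$pattern = '/<div id="mydivid">(.*?)<\/div>/';

// Case 1: the div sits on one line and contains no other div.
$simple = '<p>intro</p><div id="mydivid">Hello <b>world</b></div>';
preg_match($pattern, $simple, $simpleMatch);
echo $simpleMatch[1], "\n";   // Hello <b>world</b>

// Case 2: a nested div. The lazy (.*?) stops at the FIRST </div> it meets.
$nested = '<div id="mydivid">a <div>b</div> c</div>';
preg_match($pattern, $nested, $nestedMatch);
echo $nestedMatch[1], "\n";   // a <div>b
```

The second result is the reason for the "no other div in it" condition: the capture is cut off inside the inner div.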

Otherwise, a DOMDocument would be a more sensible way.
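A rough sketch of the DOMDocument route, assuming the goal is the text inside the div and that PHP's DOM extension is available. The helper name getDivText is hypothetical, and the id is assumed to contain no double quotes:

```php
<?php
// Hypothetical helper: returns the trimmed text of the div with the given id,
// or null when the document has no such div.
function getDivText(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes, so it can be embedded directly.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // textContent also includes the text of any nested elements.
    return trim($nodes->item(0)->textContent);
}

$html = "<html><head><title>Example</title></head>\n<body>\n"
      . "<div id=\"Alpha\"> Blah blah blah </div>\n"
      . "<div id=\"Beta\"> Blah Blah Blah </div>\n</body>\n</html>";

echo getDivText($html, "Alpha"), "\n";
```

Unlike the regex, this does not care whether the div spans several lines or contains other divs.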

EDIT: Having seen your HTML example, my answer is YES. RegEx is a very good tool for this.

I assume that you have the HTML as one continuous string, not as an array of lines (which would be slightly different). I also assume that you want the line number rather than the content of the line.

Here is the crude PHP code to extract it. (just to give some idea)

    $HTML = "<html><head><title>Example</title></head>\n"
          . "<body>\n"
          . "<div id=\"Alpha\"> Blah blah blah </div>\n"
          . "<div id=\"Beta\"> Blah Blah Blah </div>\n"
          . "</body>\n"
          . "</html>";
    $ID = "Alpha";

    function GetLineOfDIV($HTML, $ID) {
        // The div has to sit on a line of its own for this pattern to match.
        $RegEx_Alpha = '/\n(<div id="' . preg_quote($ID, '/') . '">.*?<\/div>)\n/';
        if (!preg_match($RegEx_Alpha, $HTML, $Match, PREG_OFFSET_CAPTURE))
            return -1;   // No such div.

        // $Match[1] is the group in '(...)': [matched string, byte offset].
        // Only the offset is needed, since you do not want the string itself.
        $MatchOffset = $Match[1][1];

        $StartLines = preg_split("/\n/", $HTML, -1, PREG_SPLIT_OFFSET_CAPTURE);
        foreach ($StartLines as $I => $StartLine) {
            $LineOffset = $StartLine[1];
            if ($MatchOffset <= $LineOffset)
                return $I + 1;   // Line numbers start at 1.
        }
        return count($StartLines);
    }

    echo GetLineOfDIV($HTML, $ID);

Hope this gives you some ideas.
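A shorter way to turn a match offset into a line number is to count the newlines that come before it. This is only a sketch under stated assumptions: the function name lineOfDiv is hypothetical, the HTML is one string with "\n" line endings, and the id attribute is written with double quotes.

```php
<?php
// Hypothetical helper: 1-based line number of the opening div tag with the
// given id, or -1 when it is not found.
function lineOfDiv(string $html, string $id): int
{
    // Matches the opening tag only, so the div may span several lines.
    $pattern = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"/';
    if (!preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE)) {
        return -1;
    }
    $offset = $match[0][1];   // byte offset where the tag starts
    // Every "\n" before the tag means one more line above it.
    return substr_count(substr($html, 0, $offset), "\n") + 1;
}

$html = "<html><head><title>Example</title></head>\n<body>\n"
      . "<div id=\"Alpha\"> Blah blah blah </div>\n"
      . "<div id=\"Beta\"> Blah Blah Blah </div>\n</body>\n</html>";

echo lineOfDiv($html, "Alpha"), "\n";   // 3
```

Because only the opening tag is matched, nested or multi-line divs do not affect the result.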

+3

Nawaman Nov 19 '09 at 3:52


Instead of RegEx, use a parser created specifically for processing (messy) HTML. This will make your application less fragile if the HTML changes a bit, and you won't have to tweak a regex by hand every time you want to pull out a new piece of data.

See this page: Mature HTML Parsers for PHP

+1

philfreo Nov 19 '09 at 3:36


The fact that a unique identifier is involved sounds promising, but since it will be a DIV, and not necessarily a single line of HTML, it will be difficult to construct a regular expression, and the usual objections to parsing HTML with regular expressions apply.

Not recommended.

+1

pavium Nov 19 '09 at 3:45


@OP, since your requirement is so simple, you can just use string methods:

    $f = fopen("file", "r");
    if ($f) {
        $i = 0;   // current line number
        // fgets() returns false at end of file, which ends the loop
        while (($line = fgets($f, 4096)) !== false) {
            $i += 1;
            if (stripos($line, '<div id="Alpha">') !== false) {
                print "line number: $i\n";
            }
        }
        fclose($f);
    }

0

ghostdog74 Nov 19 '09 at 6:28


Iain fraser · Accepted Answer · 2009-11-19T05:40:48+0000

Since the line number is what matters to you, not the actual contents of the div, I would be inclined not to use regex at all. I would suggest that you explode() the string into an array of lines and loop through that array looking for your marker. For instance:

    <?php
    $myContent = "[your string of html here]";
    $myArray = explode("\n", $myContent);
    $arraylen = count($myArray); // So you don't waste time counting the array at every loop
    $lineNo = 0;
    for ($i = 0; $i < $arraylen; $i++) {
        $pos = strpos($myArray[$i], 'id="Alpha"');
        if ($pos !== false) {
            $lineNo = $i + 1;
            break;
        }
    }
    ?>

Disclaimer: I do not have a php installation available for testing, so some debugging may be required.

Hope this helps, as I think it would be a waste of your time to implement a parsing engine just to do something this simple, especially if it's a one-off.

Edit: If the content is also important at this stage, you can use this in combination with the other answers that provide an adequate regular expression for the job.

Edit #2: Oh hey, here are my two cents:

"/<div.*?id=\"Alpha\".*?>.*?(<div.*<\/div>)*.*?<\/div>/s"

(<div.*<\/div>)* is intended to tell the regex engine that it may find nested div tags and should just include them if it does, rather than stopping at the first </div>. However, this solves the problem only if there is a single level of nesting. If there are more, then regex is not for you, sorry :(.

/s makes . match line breaks as well, so you don't need to pollute your expression with [\S\s] everywhere. (/m, by contrast, only changes where ^ and $ match.)
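For reference, in PHP's PCRE it is the s modifier that lets . cross line breaks, while m only changes where ^ and $ match. A small sketch with a made-up multi-line div shows the difference:

```php
<?php
// A div whose content spans several lines (made-up sample input).
$html = "<div id=\"Alpha\">\nBlah blah blah\n</div>";
$base = '/<div id="Alpha">(.*?)<\/div>/';

$plain     = preg_match($base, $html);            // . stops at "\n": no match
$multiline = preg_match($base . 'm', $html);      // m only affects ^ and $: no match
$dotall    = preg_match($base . 's', $html, $m);  // s lets . match "\n": match

var_dump($plain, $multiline, $dotall);            // int(0) int(0) int(1)
```

With the s modifier, $m[1] holds the div's content including its line breaks.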

Again, sorry, I do not have an environment to test this at the moment, so it may need some debugging.

Cheers, Iain
