Regex pull all attributes from all meta tags

I am trying to pull the meta tags from an HTML page so I can compare two versions of it (live and dev) and confirm they are equivalent for SEO after redesigning / refactoring the site. I need to compare the page titles, meta tags (description, Open Graph, etc.), H1s, our analytics (Omniture) and our ad tags (DoubleClick), and check that they are all the same.

My problem is getting the meta tags: get_meta_tags() (http://php.net/manual/en/function.get-meta-tags.php) only works on tags that have a name= attribute, and the same goes for the solution by "mariano at cricava dot com".

I don't want to limit it to the presence of certain attributes. I could assume that all of our meta tags have either name=, property= or http-equiv= and modify the regular expression accordingly, but that list can't be assumed to be complete: this is a massive website and any random shit can be in the tags (which is exactly what this tool should check!), so I would like to keep it as dynamic as possible.

This is what I have:

$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);

but each repetition of the subpattern overwrites the previous one, so this only returns the last attribute-name = attribute-value pair of each tag:

Array
(
    [0] => Array
        (
            [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            [1] => content
            [2] => text/html; charset=UTF-8
        )

    [1] => Array
        (
            [0] => <meta name="description" content="some description" />
            [1] => content
            [2] => some description
        )

    [2] => Array
        (
            [0] => <meta property="og:type" content="website" />
            [1] => content
            [2] => website
        )

    ...

I need all the attributes of all the meta tags. I could do this in two steps, by pulling the contents of <meta ([^>]*)> and then running a second regular expression over the results, but that seems like it should be unnecessary given the power of regular expressions. Is it?

3 answers

"But back to the original question, forgetting about the HTML for now: is there no way to make preg_match_all return every repetition of a subpattern, rather than just the last match?"

This is impossible with preg_* / PCRE (or with any other regex flavor that I know of, although in Perl you can use the hack (?{ push @list, $^N })).
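
A repeated group still cannot report every capture, but a commonly used workaround in PCRE is to match one attribute per match and chain the matches together with the \G anchor, which only succeeds where the previous match ended. The following is a minimal sketch of that idea, not a general HTML parser: the function name meta_attributes_single_pass is hypothetical, and it assumes every attribute value is double-quoted (a tag whose first attribute is unquoted or valueless is skipped entirely).

```php
<?php
// Hypothetical helper: collects the attributes of every <meta> tag
// with a single preg_match_all() call.
function meta_attributes_single_pass(string $html): array
{
    // Each match is ONE attribute. Group 1 is "<meta" when the match starts
    // a new tag; otherwise \G resumes exactly where the previous match ended.
    // (?!\A) stops \G from matching at the very start of the subject.
    $pattern = '~(?:(<meta\b)|\G(?!\A))\s+([^\s=/<>"\']+)\s*=\s*"([^"]*)"~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $m) {
        if ($m[1] !== '') {
            $tags[] = []; // a new <meta> tag begins here
        }
        // Append the attribute to the tag currently being collected.
        $tags[count($tags) - 1][strtolower($m[2])] = $m[3];
    }
    return $tags;
}
```

Each element of the returned array is one tag, keyed by lower-cased attribute name, so the grouping that the repeated subpattern loses is rebuilt in the loop instead.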

Try this:

  preg_match_all("#<meta\\s*(?:(?:\\b(\\w|-)+\\b\\s*(?:=\\s*(?:\"[^\"]*\"|'[^']*'|[^\"'<> ]+)\\s*)?)*)/?\\s*>#i", $content, $meta);


I do it in two steps. First, pull out the meta tags with the following regex:

 string regex = "<meta\\s(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>"; 

I found the regex in the question "RegEx match open tags except XHTML self-contained tags".

Then pull out the attributes with another regular expression that would be pretty simple to write.
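
A minimal PHP sketch of this two-step approach follows. The function name extract_meta_attributes is hypothetical, the first pattern is a simplified form of the regex above, and the second pattern is only one possible choice that assumes attribute values are wrapped in double or single quotes (unquoted values are ignored).

```php
<?php
// Hypothetical helper: returns one associative array of attributes per <meta> tag.
function extract_meta_attributes(string $html): array
{
    // Step 1: grab every <meta ...> tag, stepping over quoted values so that
    // a ">" inside an attribute value does not end the tag early.
    preg_match_all('#<meta\s(?:"[^"]*"|\'[^\']*\'|[^\'">])+>#i', $html, $metaTags);

    $tags = [];
    foreach ($metaTags[0] as $tag) {
        // Step 2: pull name="value" or name='value' pairs out of this one tag.
        preg_match_all(
            '#([^\s=/<>"\']+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')#',
            $tag,
            $pairs,
            PREG_SET_ORDER
        );

        $attributes = [];
        foreach ($pairs as $pair) {
            // Group 2 holds a double-quoted value, group 3 a single-quoted one.
            $value = $pair[2] !== '' ? $pair[2] : ($pair[3] ?? '');
            $attributes[strtolower($pair[1])] = $value;
        }
        $tags[] = $attributes;
    }
    return $tags;
}
```

Running the function on both the live and the dev page gives two arrays that can be compared directly, for example with ==.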
