I am trying to pull meta tags from an HTML page so I can compare two pages (live and dev) and check that they are the same SEO-wise after redesigning / refactoring the site. I need to compare titles, meta tags (description, Open Graph, etc.), H1s, our analytics (Omniture) and our ad tags (DoubleClick), all of it.
My problem is that get_meta_tags (http://php.net/manual/en/function.get-meta-tags.php) only works for meta tags that have a name= attribute, and the same goes for the solution from "mariano at cricava dot com".
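To illustrate the limitation, here is a minimal sketch. The sample markup and the temporary file are made up for the demo (get_meta_tags() needs a filename or URL to read from); as far as I can tell, only the tag that carries a name= attribute comes back.

```php
<?php
// Made-up sample markup: one http-equiv tag, one name tag, one property tag.
$html = <<<HTML
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="some description" />
<meta property="og:type" content="website" />
</head><body></body></html>
HTML;

// get_meta_tags() reads from a file or URL, so write the sample to a temp file.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

// Returns an array keyed by the value of each tag's name attribute.
$tags = get_meta_tags($file);
unlink($file);

print_r($tags);
```

The http-equiv and property tags are not in the result, which is exactly what makes the function unusable for this comparison.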
I don't want to limit it to the presence of certain attributes. I could assume that all of our meta tags have either name=, property= or http-equiv= and adjust the regular expression accordingly, but I can't be completely certain of that: this is a massive website and any random junk can end up in the tags (which is exactly what this tool should be checking!), so I would like to keep it as dynamic as possible.
I have
$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);
but the captures from the repeated subpattern overwrite each other, so this only returns the last attribute-name = attribute-value pair of each tag:
Array
(
    [0] => Array
        (
            [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            [1] => content
            [2] => text/html; charset=UTF-8
        )
    [1] => Array
        (
            [0] => <meta name="description" content="some description" />
            [1] => content
            [2] => some description
        )
    [2] => Array
        (
            [0] => <meta property="og:type" content="website" />
            [1] => content
            [2] => website
        )
    ...
I need all the attributes for all the meta tags. I could do this in two steps, first capturing the attribute string of each tag with <meta ([^>]*)> and then running a second regular expression over the results, but that seems like it should be unnecessary given the power of regular expressions. Can it be done in a single expression?
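For reference, the two-step fallback might look roughly like the sketch below. It is only a sketch under some assumptions: extract_meta_attributes() is a hypothetical helper name, the sample markup stands in for the fetched page, attribute values are assumed to be double-quoted, and no value is assumed to contain a ">" character.

```php
<?php
// Made-up sample markup standing in for the fetched page.
$page = <<<HTML
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="some description" />
<meta property="og:type" content="website" />
</head>
HTML;

// Hypothetical helper (not a PHP built-in): returns one associative
// array of attribute => value per <meta> tag found in $html.
function extract_meta_attributes(string $html): array
{
    $tags = [];

    // Step 1: capture the raw attribute string of every <meta> tag.
    preg_match_all('#<meta\b([^>]*)>#si', $html, $metaMatches);

    foreach ($metaMatches[1] as $attrString) {
        // Step 2: pull every name="value" pair out of that string.
        preg_match_all(
            '#([^\s=/]+)\s*=\s*"([^"]*)"#s',
            $attrString,
            $pairs,
            PREG_SET_ORDER
        );

        $attrs = [];
        foreach ($pairs as $pair) {
            // Lowercase the attribute name so NAME= and name= compare equal.
            $attrs[strtolower($pair[1])] = $pair[2];
        }
        $tags[] = $attrs;
    }

    return $tags;
}

$meta = extract_meta_attributes($page);
print_r($meta);
```

This keeps every attribute of every tag regardless of its name, so the live and dev results can be compared directly, but it still takes two passes.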