Regex to find all URLs and link titles

I would like to extract all the URLs and link titles from a paragraph of text.

Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>. 

I can get every href with the following regex, but I don't know how to also capture the text between the <a> and </a> tags.

 preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls); 

Ideally, the result would be an array of associative arrays like this:

 [0] => Array
     (
         [title] => XXX
         [link] => http://test.com/blop
     )
 [1] => Array
     (
         [title] => XXX
         [link] => http://test.com
     )

Thanks for the help.

5 answers

If you still insist on using regular expressions to solve this problem, you can parse it with this regular expression:

 <a.*?href="(.*?)".*?>(.*?)</a> 

Note that this pattern relies on lazy quantifiers (.*?) and must not be combined with the U modifier you used, since U would invert them back to greedy.

Update: To accept single quotes as well as double quotes, you can use the following pattern:

 <a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a> 
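
A minimal sketch of how the quote-tolerant pattern could be used to build the title/link array from the question. The function name extractLinks, the ~ delimiter, and the i and s modifiers are assumptions added for illustration, not part of the answer. With the alternation, the URL ends up in group 1 (double quotes) or group 2 (single quotes), and the link text in group 3.

```php
<?php
// Hypothetical helper: returns a list of ['title' => ..., 'link' => ...] arrays.
function extractLinks(string $html): array
{
    // Group 1: double-quoted href, group 2: single-quoted href, group 3: link text.
    // "~" is used as the delimiter so "</a>" needs no escaping.
    $pattern = '~<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>~is';

    $links = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'title' => $m[3],
                // The href group that did not take part in the match is an empty string.
                'link'  => $m[1] !== '' ? $m[1] : $m[2],
            ];
        }
    }
    return $links;
}

// Sample input based on the question; the second link uses single quotes on purpose.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a>'
    . " sur les remakes et suites souhaités sont <a href='http://test.com'>dans le blog</a>.";

print_r(extractLinks($message));
```

This still shares the usual limits of regex-based HTML parsing: unquoted attributes, nested tags inside the link, or a ">" inside an attribute value would not be handled.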

As mentioned in the comments, do not use a regular expression; use a DOM parser instead.
For example:

 <?php
 $doc = new DOMDocument;
 $doc->loadHTML( getExampleData() );
 $xpath = new DOMXPath($doc);

 foreach( $xpath->query('/html/body/p[@id="abc"]//a') as $node ) {
     echo $node->getAttribute('href'), ' - ', $node->textContent, "\n";
 }

 function getExampleData() {
     return '<html><head><title>...</title></head><body>
         <p> not <a href="wrong">this one</a> but .... </p>
         <p id="abc"> Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>. </p>
         </body></html>';
 }

see http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath


You should not use a regex for this; use an XML/DOM parser instead. Here is a quick version using DOMDocument.

 $links = array();
 $dom = new DOMDocument;
 // @ suppresses the warnings libxml emits for the incomplete HTML fragment.
 @$dom->loadHTML('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
 $xPath = new DOMXPath($dom);
 $a = $xPath->query('//a');

 for($i=0; $i<$a->length; $i++){
     $e = $a->item($i);
     $links[] = array(
         'title' => $e->nodeValue,
         'link' => $e->getAttribute('href')
     );
 }
 print_r($links);

DEMO: http://codepad.org/2LEn2CAJ

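 preg_match_all('/<a[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>/', $v['message'], $urls, PREG_SET_ORDER);

should give you what you want. The slash in </a> has to be escaped because / is also the pattern delimiter. The result is not an associative array, but it is a nested array close to the desired format: with PREG_SET_ORDER, each entry holds the full match at index 0, the URL at index 1, and the link text at index 2.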
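
As a sketch of the remaining step, the nested PREG_SET_ORDER result can be mapped to the title/link keys from the question. The variable names $message and $links and the i modifier are assumptions for illustration; the pattern is the one above with the slash in </a> escaped.

```php
<?php
// Sample input taken from the question.
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a>'
    . ' sur les remakes et suites souhaités sont'
    . ' <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// Group 1 captures the href value, group 2 the text between <a ...> and </a>.
$pattern = '/<a[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>/i';

preg_match_all($pattern, $message, $matches, PREG_SET_ORDER);

// Each $m is [0 => full match, 1 => href, 2 => link text]; rename to the desired keys.
$links = array_map(
    function (array $m) {
        return array('title' => $m[2], 'link' => $m[1]);
    },
    $matches
);

print_r($links);
```

Because the text group is ([^<]*), links that contain nested markup such as <b> inside the anchor would be skipped by this pattern.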


To those suggesting the DOM: yes, using the DOM would be nice. But surely you would not pull in a full DOM parser just to extract a couple of URLs and titles!

Just use a regex like this:

 /<a.*href="([^" ]*)".*>(.*)<\/a>/iU 