Regex to find all urls and headers

Question

I would like to extract all the URLs and their titles (the link text) from a paragraph of text.

Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.

I can get all the href values with the following regex, but I don't know how to also capture the text between the <a></a> tags.

 preg_match_all('/<a.*href="?([^" ]*)" /iU', $v['message'], $urls);

Ideally I would end up with an associative array like this:

 [0] => Array ( [title] => XXX [link] => http://test.com/blop )
 [1] => Array ( [title] => XXX [link] => http://test.com )

Thanks for the help.

+4

url php regex

Simon taisne Oct 24 '11 at 16:12

5 answers

As mentioned in the comments, do not use a regular expression for this; use a DOM parser.
For instance:

 <?php
 $doc = new DOMDocument;
 $doc->loadHTML(getExampleData());
 $xpath = new DOMXPath($doc);
 // Select only the links inside <p id="abc">.
 foreach ($xpath->query('/html/body/p[@id="abc"]//a') as $node) {
     echo $node->getAttribute('href'), ' - ', $node->textContent, "\n";
 }
 function getExampleData() {
     return '<html><head><title>...</title></head><body>
         <p> not <a href="wrong">this one</a> but .... </p>
         <p id="abc"> Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>. </p>
         </body></html>';
 }

see http://docs.php.net/DOMDocument and http://docs.php.net/DOMXPath

+3

Volkerk Oct 24 '11 at 16:22

You should not use a regex for this; use an XML/DOM parser instead. Here is a quick version using DOMDocument.

 $links = array();
 $dom = new DOMDocument;
 // The @ suppresses the warnings loadHTML raises for an HTML fragment.
 @$dom->loadHTML('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
 $xPath = new DOMXPath($dom);
 $a = $xPath->query('//a');
 for ($i = 0; $i < $a->length; $i++) {
     $e = $a->item($i);
     $links[] = array(
         'title' => $e->nodeValue,
         'link' => $e->getAttribute('href')
     );
 }
 print_r($links);

DEMO: http://codepad.org/2LEn2CAJ
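A variation on the same idea, as a rough sketch rather than a definitive version: `libxml_use_internal_errors()` can stand in for the `@` operator, and the accented text may need an encoding hint. The helper name `extractLinks` is hypothetical, and the `<?xml encoding="UTF-8">` prefix assumes the input is UTF-8.

```php
<?php
// Hypothetical helper (not part of the answer): returns title/link pairs
// for every <a> element found in an HTML fragment.
function extractLinks($html)
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    // Assumption: $html is UTF-8. This prefix is a common way to hint that to libxml.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = array(
            'title' => $a->textContent,
            'link'  => $a->getAttribute('href'),
        );
    }
    return $links;
}

$links = extractLinks('Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.');
print_r($links);
```

Wrapping the logic in a function keeps the libxml error setting from leaking into the rest of the script.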

+2

Rocket hazmat Oct 24 '11 at 16:21

 preg_match_all("/<a[^>]*href=\"([^\"]*)[^>]*>([^<]*)<\/a>/", $v['message'], $urls, PREG_SET_ORDER);

should give you what you want. The result is not an associative array, but a nested array in a similar shape: each match holds the URL at index 1 and the link text at index 2.
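If the associative shape from the question is needed, the `PREG_SET_ORDER` matches can be remapped in a short loop. This is a minimal sketch: the `$message` variable stands in for `$v['message']`, and `~` is used as the delimiter (my choice, not the answer's) so the `/` in `</a>` needs no escaping.

```php
<?php
// Sample input taken from the question; stands in for $v['message'].
$message = 'Les <a href="http://test.com/blop" class="c_link-blue">résultats du sondage</a> sur les remakes et suites souhaités sont <a href="http://test.com" class="c_link-blue">dans le blog</a>.';

// With PREG_SET_ORDER, each element of $matches is one complete match:
// index 0 is the whole tag, 1 the href value, 2 the text between the tags.
preg_match_all('~<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>~i', $message, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    $links[] = array('title' => $m[2], 'link' => $m[1]);
}

print_r($links);
```

Note that `([^<]*)` only captures plain text, so a link whose text contains nested tags (for example `<b>`) would not match.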

+1

Glyphgryph Oct 24 '11 at 16:23

To those suggesting the DOM: using it would be nice, but surely you would not bring in a full DOM parser just to extract a couple of URLs and titles!

Just use a regex like this:

 /<a.*href="([^" ]*)".*>(.*)<\/a>/iU

0

Yousf Oct 24 '11 at 16:24

Marcus · Accepted Answer · 2011-10-24T16:20:08+0000

If you still insist on using regular expressions to solve this problem, you can parse it with this regular expression:

 <a.*?href="(.*?)".*?>(.*?)</a>

Please note that, unlike your pattern, it does not use the U modifier; it relies on lazy quantifiers (.*?) instead.
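The difference matters because the U modifier inverts the default greediness, so adding it to a pattern that already uses `.*?` makes those quantifiers greedy again. A small sketch with a made-up subject string (the `~` delimiter and variable names are only for illustration):

```php
<?php
// Made-up subject with two links on one line.
$subject = '<a href="u1">one</a> <a href="u2">two</a>';

// Greedy by default: .* runs as far as it can, so the capture is the last link text.
preg_match('~<a.*>(.*)</a>~', $subject, $greedy);

// U makes every quantifier lazy by default, so the first link is matched on its own.
preg_match('~<a.*>(.*)</a>~U', $subject, $withU);

// The explicit lazy form behaves the same way without the modifier.
preg_match('~<a.*?>(.*?)</a>~', $subject, $lazy);

// Combining both flips .*? back to greedy.
preg_match('~<a.*?>(.*?)</a>~U', $subject, $both);

echo $greedy[1], ' ', $withU[1], ' ', $lazy[1], ' ', $both[1], "\n";
```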

Update: To accept single quotes as well as double quotes, you can use the following pattern:

 <a.*?href=(?:"(.*?)"|'(.*?)').*?>(.*?)</a>
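With this pattern the URL lands in group 1 for double-quoted attributes and in group 2 for single-quoted ones, so the calling code has to pick whichever is non-empty. A minimal sketch, assuming a made-up input string and the `~` delimiter:

```php
<?php
// Made-up input mixing single- and double-quoted href attributes.
$html = "<a href='http://a.example/x'>one</a> and <a href=\"http://b.example/y\" class=\"k\">two</a>";

// Group 1: double-quoted URL, group 2: single-quoted URL, group 3: link text.
$pattern = '~<a.*?href=(?:"(.*?)"|\'(.*?)\').*?>(.*?)</a>~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    // The group that did not take part in the match is an empty string.
    $url = $m[1] !== '' ? $m[1] : $m[2];
    $links[] = array('title' => $m[3], 'link' => $url);
}

print_r($links);
```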
