Crawling a site and getting only links that start with http://

I am using the following code to extract links from <a> tags, but I would like to make some changes.

  • I would only like to return links that start with "http://"
  • I would like to include image links and script links that start with "http://"

It would be even better if it could return links from all tags, as long as they start with "http://".

Here is the current code:

<?php

$html = file_get_contents('http://mattressandmore.com/in-the-community/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       echo $url.'<br />';
}
?>
1 answer

You need to apply the XPath function starts-with to the href attribute of the a element. Only the query changes:

...
$hrefs = $xpath->evaluate("/html/body//a[starts-with(@href, \"http:\")]");
...

Full code:

<?php

$html = file_get_contents('http://mattressandmore.com/in-the-community/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[starts-with(@href, \"http:\")]");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       echo $url.'<br />';
}
?>

The same approach works for img and script elements, which carry their URL in the src attribute. For images:

...
$hrefs = $xpath->evaluate("/html/body//img[starts-with(@src, \"http:\")]");
...
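
The question also asks for links from all tags. One possible way to do that, as a rough sketch rather than a definitive solution, is to select the attribute nodes themselves instead of the elements. The function name extractHttpUrls and the sample markup are made up for illustration, and the sketch assumes that only href and src attributes are of interest. It parses an inline string so it can run without a network connection.

```php
<?php

// Hypothetical helper (not part of the original answer): returns every href
// or src value in $html that starts with "http://", whatever tag carries it.
function extractHttpUrls(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Select attribute nodes named href or src, on any element, whose value
    // starts with "http://".
    $attributes = $xpath->query(
        '//@*[(name() = "href" or name() = "src") and starts-with(., "http://")]'
    );

    $urls = [];
    foreach ($attributes as $attribute) {
        $urls[] = $attribute->value;
    }

    return $urls;
}

// Sample markup standing in for a downloaded page (assumed, for illustration).
$sample = <<<HTML
<html><head><script src="http://example.com/app.js"></script></head>
<body>
<a href="http://example.com/page">absolute</a>
<a href="/relative">relative</a>
<img src="http://example.com/logo.png">
<img src="https://example.com/secure.png">
</body></html>
HTML;

foreach (extractHttpUrls($sample) as $url) {
    echo $url, PHP_EOL;
}
```

For the sample markup this should list the script, the absolute link and the http image, and skip the relative and https URLs. To use it on a real page, pass the result of file_get_contents() to the function instead of the sample string.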