How to implement a web scraper in PHP?

What PHP built-in functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

+58
php screen-scraping
Aug 25 '08 at 21:28
15 answers

There is a book on this subject called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/cURL - see the review here

php|architect reviewed it in a well-written article by Matthew Turland in the December 2007 issue

+30
Aug 25 '08 at 23:21

Scraping usually involves three steps:

  • first you issue a GET or POST request to the specified URL
  • then you receive the HTML that comes back as the response
  • finally you parse that HTML for the text you want to extract

To handle steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages via GET or POST. Once you have the HTML back, you can use regular expressions for step 3, pulling out the text you want to scrape.

For regular expressions, my favorite tutorial site is: Regular Expression Tutorial

My favorite regex tool is Regex Buddy. I would advise trying the demo of this product even if you have no intention of buying it. It is an invaluable tool and will even generate code for your regular expressions in your language of choice (including PHP).

Usage:

 $curl = new Curl();
 $html = $curl->get("http://www.google.com");

// now, do your regex work against $html

PHP class:

 <?php

 class Curl
 {
     public $cookieJar = "";
     private $curl; // cURL handle, created in get()/postForm()

     public function __construct($cookieJarFile = 'cookies.txt')
     {
         $this->cookieJar = $cookieJarFile;
     }

     function setup()
     {
         $header = array();
         $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
         $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
         $header[] = "Cache-Control: max-age=0";
         $header[] = "Connection: keep-alive";
         $header[] = "Keep-Alive: 300";
         $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
         $header[] = "Accept-Language: en-us,en;q=0.5";
         $header[] = "Pragma: "; // browsers keep this blank

         curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
         curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
         curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
         curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
         curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
         curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
         curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
     }

     function get($url)
     {
         $this->curl = curl_init($url);
         $this->setup();
         return $this->request();
     }

     function getAll($reg, $str)
     {
         preg_match_all($reg, $str, $matches);
         return $matches[1];
     }

     function postForm($url, $fields, $referer = '')
     {
         $this->curl = curl_init($url);
         $this->setup();
         curl_setopt($this->curl, CURLOPT_URL, $url);
         curl_setopt($this->curl, CURLOPT_POST, 1);
         curl_setopt($this->curl, CURLOPT_REFERER, $referer);
         curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
         return $this->request();
     }

     function getInfo($info)
     {
         return ($info == 'lasturl')
             ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL)
             : curl_getinfo($this->curl, $info);
     }

     function request()
     {
         return curl_exec($this->curl);
     }
 }
 ?>
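As a sketch of step 3, here is a minimal example of running a regular expression against fetched HTML. The HTML is hard-coded here so the snippet runs without a network request; in practice you would use the $html string returned by $curl->get().

```php
<?php
// Sample HTML standing in for the response from $curl->get($url).
$html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>';

// Capture the href attribute of every anchor tag.
preg_match_all('/<a\s+href="([^"]+)"/i', $html, $matches);
$links = $matches[1];

print_r($links);
```

Note that regex-based extraction like this is fragile against markup changes; it works best for quick, one-off jobs.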
+47
Sep 19 '08 at 16:40

I would like to recommend this class that I recently came across: Simple HTML DOM Parser
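A quick sketch of what using it looks like, based on the library's documented API (you need to download simple_html_dom.php first; the URL is just a placeholder):

 <?php
 // Requires the Simple HTML DOM library file in the include path.
 include 'simple_html_dom.php';

 // Load a page and query it with CSS-like selectors.
 $html = file_get_html('http://www.example.com/');

 // Print the href of every link on the page.
 foreach ($html->find('a') as $element) {
     echo $element->href, "\n";
 }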

+37
Apr 21 '09 at 7:43

I recommend Goutte, a simple PHP web scraper.

Usage example:

Create an instance of the Goutte client (which extends Symfony\Component\BrowserKit\Client):

 use Goutte\Client;

 $client = new Client();

Execute requests using the request() method:

 $crawler = $client->request('GET', 'http://www.symfony-project.org/'); 

The request() method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on the links:

 $link = $crawler->selectLink('Plugins')->link();
 $crawler = $client->click($link);

Submit form:

 $form = $crawler->selectButton('sign in')->form();
 $crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));

Extract data:

 $nodes = $crawler->filter('.error_list');

 if ($nodes->count()) {
     die(sprintf("Authentication error: %s\n", $nodes->text()));
 }

 printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());
+14
May 26 '12 at 4:08

ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby, or PHP; I was able to get a simple scraper going in a few minutes.

+11
Sep 24 '10 at 4:50

Here is an OK tutorial on building a web scraper using cURL and file_get_contents; the follow-up parts of the series are also worth a look.

(direct hyperlink removed due to malware warnings)

http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php

+5
Aug 25 '08 at 21:34

I am actually looking to scrape BibleGateway.com, as they do not provide a verse-access API for the web application I am looking to create.

It sounds like you may want hotlinking rather than scraping, i.e. updating in real time based on the site's content?

This tutorial is not bad:

http://www.merchantos.com/makebeta/php/scraping-links-with-php/

You can also take a look at Prowser.

+3
Dec 23 '09 at 7:40

If you need something that is easy to maintain, rather than fast to execute, it can help to use a scriptable browser such as SimpleTest's.

+2
Sep 19 '08 at 21:49

Here is another one: a simple PHP scraper without regex.

+1
Jun 19 '10 at 13:41

A scraper can be quite complex, depending on what you want to do. Read this tutorial series on the basics of writing a scraper in PHP and see if you can handle it.

You can use similar methods to automate form registration, logins, even fake clicks on ads! The main limitation of cURL is that it does not support JavaScript, so if you are trying to scrape a site that uses AJAX for pagination, for example, things can get a little complicated... but then again, there are ways around that!

+1
Jan 22 '15 at 17:41

file_get_contents() can take a remote URL and give you the source. You can then use regular expressions (with the Perl-compatible functions) to grab what you need.
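A minimal sketch of that approach. The HTML is inlined here so the example runs offline; in practice you would fetch it with $source = file_get_contents('http://example.com/'); (which requires allow_url_fopen to be enabled).

```php
<?php
// Stand-in for the source returned by file_get_contents() on a remote URL.
$source = '<html><head><title>Example Domain</title></head><body>...</body></html>';

// Use a Perl-compatible regex to pull out the piece you need, here the page title.
if (preg_match('/<title>(.*?)<\/title>/i', $source, $m)) {
    echo $m[1], "\n";
}
```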

Out of curiosity, what are you trying to scrape?

0
Aug 25 '08 at 21:31

I would use libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

0
Aug 25 '08 at 21:39

A scraper class from my framework:

 <?php

 /*
 Example:
     $site    = $this->load->cls('scraper', 'http://www.anysite.com');
     $excss   = $site->getExternalCSS();
     $incss   = $site->getInternalCSS();
     $ids     = $site->getIds();
     $classes = $site->getClasses();
     $spans   = $site->getSpans();

     print '<pre>';
     print_r($excss);
     print_r($incss);
     print_r($ids);
     print_r($classes);
     print_r($spans);
 */

 class scraper
 {
     private $url = ''; // holds the page source, fetched in the constructor

     public function __construct($url)
     {
         $this->url = file_get_contents($url);
     }

     public function getInternalCSS()
     {
         preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
         $result = array();
         array_push($result, $patterns[2]);
         array_push($result, count($patterns[2]));
         return $result;
     }

     public function getExternalCSS()
     {
         preg_match_all('/(href=")(\w.*\.css)"/i', $this->url, $patterns);
         $result = array();
         array_push($result, $patterns[2]);
         array_push($result, count($patterns[2]));
         return $result;
     }

     public function getIds()
     {
         preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
         $result = array();
         array_push($result, $patterns[2]);
         array_push($result, count($patterns[2]));
         return $result;
     }

     public function getClasses()
     {
         preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
         $result = array();
         array_push($result, $patterns[2]);
         array_push($result, count($patterns[2]));
         return $result;
     }

     public function getSpans()
     {
         preg_match_all('/(<span>)(.*)(<\/span>)/', $this->url, $patterns);
         $result = array();
         array_push($result, $patterns[2]);
         array_push($result, count($patterns[2]));
         return $result;
     }
 }
 ?>
0
Dec 26 '09 at 6:19

A nice PHP scraping ebook here:

https://leanpub.com/web-scraping

0
Jul 11 '13 at 4:39

The cURL library allows you to download web pages. You should then look into regular expressions to do the scraping.
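A bare-bones sketch of that combination (the URL is a placeholder, and the snippet assumes network access):

 <?php
 // Fetch a page with cURL.
 $ch = curl_init('http://www.example.com/');
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
 $page = curl_exec($ch);
 curl_close($ch);

 // Then scrape with a regular expression, e.g. grab the <title>.
 if (preg_match('/<title>(.*?)<\/title>/is', $page, $m)) {
     echo $m[1], "\n";
 }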

-2
Aug 25 '08 at 21:30
