How to download only HTML (and skip media files)

I am optimizing my simple web crawler (currently PHP with curl_multi).

The goal is to crawl entire websites while skipping non-HTML content. I tried CURLOPT_NOBODY to send only HEAD requests, but that does not work on every site (some servers do not support HEAD); when that happens, curl_exec() stalls for a long time, sometimes much longer than loading the page itself would take.
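For reference, the HEAD approach described above looks roughly like this (a minimal sketch; the URL is a placeholder, and the timeout value is an arbitrary choice to cap the stall on servers that mishandle HEAD):

```php
<?php
// Sketch of the HEAD-request approach: CURLOPT_NOBODY makes cURL send
// HEAD instead of GET, so only the headers come back.
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_NOBODY, true);          // send HEAD, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // bound the wait on servers that stall on HEAD
curl_exec($ch);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);
echo $type;                                      // e.g. "text/html; charset=UTF-8"
?>
```

The problem in the question is exactly that some servers reject or ignore HEAD, which is why the answers below try to decide from the headers of a normal GET instead.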

Is there another way to get the content type without downloading the whole body, or to make cURL abort the transfer if the file is not HTML?

(Writing my own HTTP client is not an option, because I intend to use cURL features such as cookies and SSL later.)

4 answers

The right way to do this is to use

curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'curlHeaderCallback'); 

The callback takes two parameters: the first is the cURL handle, the second is a single header line. It is called once for each header as it arrives, and must return the number of bytes it handled; returning 0 aborts the transfer.

    $acceptable = array('application/xhtml+xml', 'application/xml',
                        'text/plain', 'text/xml', 'text/html');

    function curlHeaderCallback($resURL, $strHeader) {
        global $acceptable;
        if (stripos($strHeader, 'content-type') === 0) {
            // "Content-Type: text/html; charset=..." -> "text/html"
            $parts = explode(':', $strHeader);
            $value = explode(';', array_pop($parts));
            $type  = strtolower(trim(array_shift($value)));
            if (!in_array($type, $acceptable)) {
                return 0; // abort the transfer
            }
        }
        return strlen($strHeader);
    }
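To tie it together, the callback might be registered like this (a sketch; the URL is a placeholder). Note that when the header callback returns 0, curl_exec() itself fails, so a failed transfer here may simply mean "not an acceptable content type" rather than a network error:

```php
<?php
// Sketch: wiring up the header callback from the answer above.
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'curlHeaderCallback');
$body = curl_exec($ch);
if ($body === false) {
    // The callback returning 0 aborts the transfer, which surfaces
    // as a cURL error here -- likely just a non-HTML resource.
    echo 'Skipped: ' . curl_error($ch);
} else {
    echo 'Got acceptable content, ' . strlen($body) . ' bytes';
}
curl_close($ch);
?>
```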


I have not tried it, but see CURLOPT_PROGRESSFUNCTION . I bet you can watch the transfer as it progresses, check the Content-Type once the headers have arrived, and abort via the callback's return value if you are not interested in what is loading.

    CURLOPT_PROGRESSFUNCTION: a callback invoked repeatedly during the transfer. Its first parameter is the cURL resource, and the remaining parameters report the expected and transferred byte counts for download and upload (the exact signature varies by PHP version). Returning a non-zero value from the callback aborts the transfer.

http://www.php.net/manual/en/function.curl-setopt.php
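An untested sketch of that idea, assuming the PHP 5.5+ callback signature (the URL is a placeholder): with CURLOPT_NOPROGRESS disabled, the progress callback fires repeatedly, curl_getinfo() can report the Content-Type as soon as the headers are in, and a non-zero return value aborts the transfer:

```php
<?php
// Untested sketch: abort mid-transfer from a progress callback
// once the Content-Type is known and is not HTML.
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOPROGRESS, false);   // progress callback is disabled by default
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION,
    function ($ch, $dlTotal, $dlNow, $ulTotal, $ulNow) {
        $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        // $type is empty until the headers have arrived.
        if (is_string($type) && $type !== '' && stripos($type, 'text/html') === false) {
            return 1;   // non-zero aborts the transfer
        }
        return 0;       // keep going
    });
$body = curl_exec($ch); // false if the callback aborted
curl_close($ch);
?>
```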


Have you looked at fsockopen ?

You can open a socket to the remote server and read only as much as you need. Once you have seen the Content-Type header, you can close the connection.

    <?php
    $type = 'Unknown';
    $fp = fsockopen("www.example.com", 80, $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)<br />\n";
    } else {
        $out  = "GET / HTTP/1.1\r\n";
        $out .= "Host: www.example.com\r\n";
        $out .= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);
        $in = '';
        while (!feof($fp)) {
            $in .= fgets($fp, 128);
            if (preg_match('/Content-Type: (.+)\n/i', $in, $matches)) {
                $type = $matches[1];
                break;
            }
        }
        fclose($fp);
    }
    echo $type;
    ?>

This worked for me (note that it still downloads the full response; it only checks the type afterwards):

    <?php
    $handle = curl_init('http://www.google.com');
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_HEADER, true);
    $result = curl_exec($handle);
    $type = curl_getinfo($handle, CURLINFO_CONTENT_TYPE);
    if (strpos($type, 'text/html') !== false) {
        echo 'The URL is an HTML page.';
    }
    ?>
