The fastest way to ping thousands of websites using PHP

I am currently pinging URLs using cURL from PHP. But in my script, a request is sent, then it waits for the response, then the next request is sent... If each response takes ~3 s, pinging 10k links takes more than 8 hours!

Is there a way to send multiple requests at the same time, for example, some kind of multithreading?

Thanks.

+4
9 answers

You can either fork your PHP process using pcntl_fork, or take a look at this write-up on multithreading in PHP with cURL: https://web.archive.org/web/20091014034235/http://www.ibuildings.co.uk/blog/archives/811-Multithreading-in-PHP-with-CURL.html
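A minimal sketch of the forking approach (the URL list and worker count are made-up examples; requires the pcntl extension, which is available in CLI PHP on Unix):

```php
<?php
// Split the URL list across N child processes with pcntl_fork().
// The URL list and worker count here are illustrative.
$urls    = ['http://example.com/a', 'http://example.com/b',
            'http://example.com/c', 'http://example.com/d'];
$workers = 2;
$batches = array_chunk($urls, (int) ceil(count($urls) / $workers));

$pids = [];
foreach ($batches as $batch) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        exit("fork failed\n");
    } elseif ($pid === 0) {
        // Child process: ping its share of the URLs, then exit.
        foreach ($batch as $url) {
            // curl_exec(...) or a real ping would go here
        }
        exit(0);
    }
    $pids[] = $pid; // parent keeps track of its children
}

// Parent waits for every child to finish.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}
```

Each child gets its own copy of the process, so there is no shared state to worry about; results would have to be written to files, a database, or a queue.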

+3

Use the curl_multi_* functions available in PHP's cURL extension. See http://www.php.net/manual/en/ref.curl.php

You should group the URLs into smaller sets: adding all 10k links at the same time is unlikely to work. So create a loop around the following code and use a subset of the URLs (e.g. 100) as the $urls variable.

    $all = array();
    $handle = curl_multi_init();
    foreach ($urls as $url) {
        $all[$url] = curl_init();
        // Set curl options for $all[$url]
        curl_multi_add_handle($handle, $all[$url]);
    }
    $running = 0;
    do {
        curl_multi_exec($handle, $running);
    } while ($running > 0);
    foreach ($all as $url => $curl) {
        $content = curl_multi_getcontent($curl);
        // do something with $content
        curl_multi_remove_handle($handle, $curl);
    }
    curl_multi_close($handle);
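Putting the batching advice together, a runnable sketch (the batch size, timeout, and status-code bookkeeping are illustrative choices; curl_multi_select() is added so the loop sleeps on socket activity instead of spinning):

```php
<?php
// Batched curl_multi sketch: the URL list stands in for the real 10k.
$urls    = ['http://example.com/', 'http://example.org/'];
$results = [];

foreach (array_chunk($urls, 100) as $batch) {
    $handles = [];
    $mh = curl_multi_init();
    foreach ($batch as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive the transfers; curl_multi_select() waits for socket
    // activity rather than burning CPU in a tight loop.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    foreach ($handles as $url => $ch) {
        // 0 means the host could not be reached at all.
        $results[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}
```

With 100 handles per batch, each batch takes roughly as long as its slowest URL instead of the sum of all of them.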
+3

First of all, I would like to note that this is not the kind of task you can run on just any hosting provider. I guess you would get banned.

So, I assume that you control the machine (a VPS?) and can run long-running processes in the background (using php-cli). I would use Redis (I liked predis as a PHP client library) and push messages onto a list. (P.S.: I would prefer to write this in node.js/python — the explanation below is for PHP — because I think this task can be coded in those languages pretty quickly. I am going to try writing it and post the code on GitHub later.)

Redis:

Redis is an advanced key-value store. It is similar to memcached, but the dataset is not volatile, and values can be strings, exactly as in memcached, but also lists, sets, and sorted sets. All these data types can be manipulated with atomic operations to push/pop elements, add/remove elements, perform server-side union, intersection, and difference between sets, and so forth. Redis supports a variety of sorting abilities.

Then start several worker processes that take (blocking if none are available) messages from the list.

Blpop:

This is where Redis gets really interesting. BLPOP and BRPOP are the blocking equivalents of LPOP and RPOP. If the list at any of the keys they specify has an element in it, that element will be popped and returned. If it does not, the Redis client blocks until a key becomes available (or the timeout expires; specify 0 for an unlimited timeout).

Curl is not exactly a ping (ICMP echo), and I think some servers may block such HTTP requests (security). I would first try a real ping (using something like nmap), and fall back to curl if the ping fails, because pinging is faster than using curl.

Libcurl:

A free client-side URL transfer library, supporting FTP, FTPS, Gopher, HTTP, HTTPS, SCP, SFTP, TFTP, TELNET, DICT, FILE, LDAP, LDAPS, IMAP, POP3, SMTP and RTSP (the last four only in versions newer than 7.20.0, released February 9, 2010).

Ping:

Ping is a computer network administration utility used to test the reachability of a host on an Internet Protocol (IP) network and to measure the round-trip time for messages sent from the originating host to a destination computer. The name comes from active sonar terminology. Ping operates by sending Internet Control Message Protocol (ICMP) echo request packets to the target host and waiting for an ICMP echo reply.

But then you should issue a HEAD request and fetch only the headers to check whether the host is up. Otherwise you will also download the contents of the URL (which costs time and bandwidth).

HEAD:

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
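A sketch of such a HEAD check with plain curl (the helper name head_status and the timeout value are made up for illustration):

```php
<?php
// Hypothetical helper: returns the HTTP status of a HEAD request,
// or 0 if the host is unreachable.
function head_status(string $url): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD: skip the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo anything
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return (int) $code;
}
```

A live host would return e.g. 200, an unreachable one 0, and no response body is ever transferred.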

Then each worker process should use curl_multi. I think this link provides a good implementation of it (minus that it does not use HEAD requests), to get some concurrency within each process.

+3

PHP does not have true multi-threaded capabilities.

However, you can always make your CURL requests asynchronously.

This will allow you to run batches of pings instead of one at a time.

Link: How to make an asynchronous GET request in PHP?

Edit: just keep in mind that you need to make your PHP script wait until all the responses have returned before it exits.
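One common trick from that linked question, sketched here with fsockopen (the function name, port, and timeout are illustrative; as the edit above warns, a script that exits immediately may cut requests short):

```php
<?php
// Fire-and-forget sketch: open a raw socket, write a HEAD request,
// and close without reading the reply.
function fire_and_forget(string $host, string $path = '/'): bool
{
    $fp = @fsockopen($host, 80, $errno, $errstr, 2); // 2 s connect timeout
    if ($fp === false) {
        return false; // could not even connect
    }
    fwrite($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: Close\r\n\r\n");
    fclose($fp); // don't wait for the response
    return true;
}
```

This only tells you whether a TCP connection could be opened; if you need the HTTP status, you have to read the response before closing.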

  • Christian
+1

curl has a "multi request" facility, which is essentially a way to execute requests in parallel. Check out the example on this page: http://www.php.net/manual/en/function.curl-multi-exec.php

+1

I would use system() and launch the ping script as a new process. Or several processes.

You can make a centralized queue with all the addresses to ping, and then have several ping-script workers take tasks from it.

Just check:

If a program is started with this function, in order for it to continue running in the background, the output of the program must be redirected to a file or another output stream. Failing to do so will cause PHP to hang until the execution of the program ends.
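A sketch of that background-launch pattern (the echo command is a stand-in for a real "php ping_worker.php" invocation; the marker file only exists so the example can observe the worker's result):

```php
<?php
// The redirection to /dev/null and the trailing & are what let
// exec() return immediately instead of waiting for the command.
$tmp = tempnam(sys_get_temp_dir(), 'pingq');
exec(sprintf('(echo done > %s) > /dev/null 2>&1 &', escapeshellarg($tmp)));

// The parent continues immediately; here we just poll briefly
// to see the background worker's result appear.
$found = false;
for ($i = 0; $i < 50 && !$found; $i++) {
    usleep(100000); // 100 ms
    $found = trim((string) @file_get_contents($tmp)) === 'done';
}
```

In a real setup each worker would pop addresses from the shared queue instead of writing a marker file.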

0

You can use the PHP function exec () to execute unix commands such as wget to execute this.

 exec('wget -O - http://example.com/url/to_ping > /dev/null 2>&1 &'); 

This is by no means an ideal solution, but it does the job: it sends the output to /dev/null and runs the command in the background, so you can move on to the next "ping" without waiting for a response.

Note. On some servers, the exec () function is disabled for security purposes.

0

To cope with tasks like this, try an I/O multiplexing strategy. In short, the idea is that you create a bunch of sockets, hand them to your OS (say, using epoll on Linux / kqueue on FreeBSD) and sleep until an event occurs on some socket. The OS kernel can handle hundreds or even thousands of sockets in parallel within a single process.

You can handle not only TCP sockets this way, but also timers and file descriptors, all in parallel.
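The same pattern is available in plain PHP via stream_select() (a sketch using an in-process socket pair as a stand-in for real network sockets, so it is self-contained):

```php
<?php
// Watch several streams at once and wake only when one is readable.
$pairs = [];
for ($i = 0; $i < 3; $i++) {
    $pairs[] = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
}

// Make one end of the second pair readable.
fwrite($pairs[1][1], "ping\n");

$read   = array_column($pairs, 0); // the ends we want to wait on
$write  = null;
$except = null;
// Blocks until at least one stream is readable (or 1 s passes);
// on return $read holds only the streams with pending data.
$ready   = stream_select($read, $write, $except, 1);
$payload = $ready > 0 ? fgets(reset($read)) : null;
```

stream_select() wraps select(2), so it tops out around 1024 descriptors; for thousands of sockets an epoll/kqueue-backed event loop like the one linked below scales better.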

Back to PHP, look at something like https://github.com/reactphp/event-loop , which provides a good API and hides a lot of low-level details.

0

Run multiple PHP processes:

Process 1: pings 1-1000 sites

Process 2: pings sites 1001-2000

...

-1
