Loading pages in parallel using PHP

I need to scrape a website where I have to fetch multiple URLs and then process them one by one. The current process goes something like this:

I fetch the base URL and collect all the secondary URLs from that page. Then, for each secondary URL, I fetch the page, parse it, upload some photos (which takes quite a lot of time), store the data in the database, and then move on to the next URL and repeat the process.

I think most of the time in this process is spent fetching the secondary URL at the start of each iteration. So I am trying to fetch the upcoming URLs in parallel while the first iteration is still being processed.

My idea is to have the main process call a separate PHP script, say a downloader, that fetches all the URLs (using curl_multi or wget) and stores them in some database.
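
Roughly, I imagine the hand-off looking something like this (downloader.php and the temp file are just placeholder names I made up for illustration):

    <?php
    // Sketch of the hand-off I have in mind; downloader.php is a
    // hypothetical script that reads the URL list and fetches the pages.
    $urls = array('http://example.com/page1', 'http://example.com/page2');

    // Write the URL list somewhere the downloader can pick it up.
    $listFile = tempnam(sys_get_temp_dir(), 'urls_');
    file_put_contents($listFile, implode("\n", $urls));

    // Fire and forget (on a Unix-like system): redirect output and background
    // the process so exec() returns immediately instead of waiting for it.
    exec('php downloader.php ' . escapeshellarg($listFile) . ' > /dev/null 2>&1 &');

    // ... the main script keeps processing the current page meanwhile ...

Whether this is the right way to kick it off is exactly what I am unsure about, hence the questions below.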

My questions

  • How do I call such a downloader asynchronously? I do not want my main script to wait for the downloader to complete.
  • Is there any place to store the downloaded data, such as shared memory - anything other than a database?
  • Is there a chance that the data will get corrupted while being stored and retrieved, and how can I avoid that?
  • Also, please let me know if anyone has a better plan.
+7
5 answers

When I hear that someone is using curl_multi_exec, it usually turns out that they just load, say, 100 URLs, wait until all of them are done, process them all, and then start over with the next 100 URLs. I did it that way too, but then I found out that it is possible to remove/add curl_multi handles while transfers are still in progress, and that really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a request queue with callbacks; of course, I won't post the full version here (the "small" library is still quite a bit of code), but here is a simplified version to give you the general idea:

    public function launch() {
        $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
        $activeJobs = array();
        $running = 0;
        do {
            // pick jobs for free channels:
            while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
                // take free channel, (re)init curl handle and let
                // queued object set options
                $chId = key($freeChannels);
                if (empty($channels[$chId])) {
                    $channels[$chId] = curl_init();
                }
                $job = array_pop($this->jobQueue);
                $job->init($channels[$chId]);
                curl_multi_add_handle($this->master, $channels[$chId]);
                $activeJobs[$chId] = $job;
                unset($freeChannels[$chId]);
            }
            $pending = count($activeJobs);

            // launch them:
            if ($pending > 0) {
                while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM); // poke it while it wants
                curl_multi_select($this->master); // wait for some activity, don't eat CPU
                while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                    // some connection(s) finished, locate that job and run response handler:
                    $pending--;
                    $chId = array_search($info['handle'], $channels);
                    $content = curl_multi_getcontent($channels[$chId]);
                    curl_multi_remove_handle($this->master, $channels[$chId]);
                    $freeChannels[$chId] = NULL; // free up this channel
                    if ( !array_key_exists($chId, $activeJobs) ) { // impossible, but...
                        continue;
                    }
                    $activeJobs[$chId]->onComplete($content);
                    unset($activeJobs[$chId]);
                }
            }
        } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
    }

In my version, the $jobs are actually instances of a separate class, not controllers or models. They just set the cURL options, parse the response, and call the given onComplete callback. With this structure, new requests start as soon as something in the pool finishes.
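
To give a rough idea, such a job could be as small as this sketch (only init() and onComplete() are implied by launch() above; the class name, constructor and fields are made up for the example):

    <?php
    // Illustrative sketch of a "job" object for the pool above.
    class PageJob
    {
        private $url;
        private $onComplete;

        public function __construct($url, $onComplete)
        {
            $this->url = $url;
            $this->onComplete = $onComplete;
        }

        // Called by the pool with a (re)used cURL handle.
        public function init($ch)
        {
            curl_setopt($ch, CURLOPT_URL, $this->url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        }

        // Called by the pool with the response body once the transfer finishes.
        public function onComplete($content)
        {
            call_user_func($this->onComplete, $this->url, $content);
        }
    }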

Of course, this does not really save you if it is not just the fetching that takes time but the processing as well... and it is not true parallel processing. But I still hope it helps. :)

P.S. It did help. :) An 8-hour job now finishes 3-4 times faster, using a pool of 50 connections. I can't describe the feeling. :) I really didn't expect it to work as planned, because with PHP things rarely work exactly as intended... It was like: "okay, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"

+6

You can use curl_multi: http://www.somacon.com/p537.php

You might also consider running this client-side, using JavaScript.


Another solution is to write a hunter/gatherer that you send an array of URLs to; it does the work in parallel and returns a JSON array once it completes.

Put another way: if you had 100 URLs, you could send that array (probably as JSON) to mysite.tld/huntergatherer - it does whatever it wants, in whatever language you need, and just returns JSON.
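
For example, the calling side could be as simple as the sketch below (the endpoint name comes from the example above; the JSON-in/JSON-out contract is assumed):

    <?php
    // Sketch: POST a JSON array of URLs to the hypothetical
    // mysite.tld/huntergatherer endpoint and decode the JSON it returns.
    $urls = array('http://example.com/a', 'http://example.com/b');

    $ch = curl_init('http://mysite.tld/huntergatherer');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($urls));
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $results = json_decode(curl_exec($ch), true); // one entry per URL
    curl_close($ch);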

+2

Aside from the curl multi solution, the other option is simply having a batch of background workers. If you go this route, I have found supervisord a nice way to start a load of worker daemons.
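
For instance, a supervisord program section along these lines would keep a handful of workers alive (worker.php is a hypothetical script that pulls URLs or jobs from your queue; paths and names are made up for the sketch):

    ; Sketch only: worker.php is a made-up queue-consuming script.
    [program:url-worker]
    command=php /path/to/worker.php
    numprocs=5
    process_name=%(program_name)s_%(process_num)02d
    autostart=true
    autorestart=true
    redirect_stderr=true
    stdout_logfile=/var/log/url-worker.log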

+2

Things you should look at in addition to curl multi:

  • Non-blocking streams (example: PHP-MIO)
  • ZeroMQ for spawning off many workers that run requests asynchronously (see the sketch after this answer)

While node.js, Ruby's EventMachine, or similar tools are great for this kind of thing, the ones I mentioned make it pretty easy in PHP too.
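
As a rough illustration of the ZeroMQ route (this assumes the pecl zmq extension; the socket address and the plain-URL message format are just choices made for the example), a producer script would PUSH URLs and several copies of a worker like this would PULL and fetch them in parallel:

    <?php
    // worker.php - sketch only; requires the pecl "zmq" extension.
    // Run several copies of this script; a producer PUSHes one URL
    // per message to tcp://*:5557 and the workers share the load.
    $context  = new ZMQContext();
    $receiver = $context->getSocket(ZMQ::SOCKET_PULL);
    $receiver->connect('tcp://127.0.0.1:5557');

    while (true) {
        $url = $receiver->recv(); // blocks until a URL arrives

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        curl_close($ch);

        // ... store $content, handle photos, etc. ...
    }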

+1

Try executing python-pycurl scripts from your PHP script. It is lighter and faster than PHP's curl.

0
