When I hear that somebody uses curl_multi_exec, it usually turns out that they just load it with, say, 100 URLs, wait until they all complete, process them all, and then start over with the next 100 URLs... I did it that way too, but then I found out that it is possible to remove and add curl_multi handles while other transfers are still in progress, and it really saves a lot of time, especially if you reuse already-open connections. I wrote a small library to handle a request queue with callbacks; of course, I'm not posting the full version here ("small" is still quite a bit of code), but here is a simplified version to give you the general idea:
```php
public function launch() {
    $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
    $activeJobs = array();
    $running = 0;
    $mrc = CURLM_OK; // keep the loop condition defined even if the queue starts empty
    do {
        // pick jobs for free channels:
        while (!(empty($freeChannels) || empty($this->jobQueue))) {
            // take a free channel, (re)init the curl handle and let
            // the queued object set its options
            $chId = key($freeChannels);
            if (empty($channels[$chId])) {
                $channels[$chId] = curl_init();
            }
            $job = array_pop($this->jobQueue);
            $job->init($channels[$chId]);
            curl_multi_add_handle($this->master, $channels[$chId]);
            $activeJobs[$chId] = $job;
            unset($freeChannels[$chId]);
        }
        $pending = count($activeJobs);

        // launch them:
        if ($pending > 0) {
            // poke it while it wants to be poked
            while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
            // wait for some activity, don't eat CPU; some libcurl builds
            // return -1 here immediately, so back off briefly in that case
            if (curl_multi_select($this->master) === -1) {
                usleep(10000);
            }
            while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                // some connection(s) finished; locate the job and run its response handler:
                $pending--;
                $chId = array_search($info['handle'], $channels);
                $content = curl_multi_getcontent($channels[$chId]);
                curl_multi_remove_handle($this->master, $channels[$chId]);
                $freeChannels[$chId] = NULL; // free up this channel
                if (!array_key_exists($chId, $activeJobs)) { // impossible, but...
                    continue;
                }
                $activeJobs[$chId]->onComplete($content);
                unset($activeJobs[$chId]);
            }
        }
    } while (($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue));
}
```
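Note that launch() refers to $this->master, $this->jobQueue, and $this->maxConnections, which the excerpt never shows being set up. A minimal sketch of that scaffolding, assuming names from the snippet (the class name RequestQueue and the enqueue() method are my own inventions, not the library's API), could look like this:

```php
class RequestQueue
{
    private $master;             // the curl_multi handle that launch() drains
    private $jobQueue = array(); // jobs waiting for a free channel
    private $maxConnections;     // size of the connection pool

    public function __construct($maxConnections = 10)
    {
        $this->master = curl_multi_init();
        $this->maxConnections = $maxConnections;
    }

    public function enqueue($job)
    {
        $this->jobQueue[] = $job;
    }

    // ... launch() from above goes here ...
}
```

Since launch() takes jobs with array_pop(), appending like this makes the queue LIFO; if ordering matters to you, pull jobs with array_shift() instead.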
In my version, the jobs are instances of a separate class, not controllers or models. They just handle setting the cURL options, parsing the response, and calling the given onComplete callback. With this structure, new requests start as soon as something in the pool finishes.
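To make that contract concrete, here is a hypothetical job class. The name HttpJob and its constructor are illustrative only; the part launch() actually relies on is the init($ch) / onComplete($content) shape:

```php
class HttpJob
{
    private $url;
    private $callback;

    public function __construct($url, $callback)
    {
        $this->url = $url;
        $this->callback = $callback;
    }

    // called by launch() with a fresh or reused curl handle
    public function init($ch)
    {
        curl_reset($ch); // PHP 5.5+; wipe options left on this handle by the previous job
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // so curl_multi_getcontent() returns the body
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    }

    // called by launch() with the raw response body
    public function onComplete($content)
    {
        call_user_func($this->callback, $content);
    }
}

// Usage with the RequestQueue sketch above:
$urls = array('http://example.com/a', 'http://example.com/b');
$queue = new RequestQueue(50);
foreach ($urls as $url) {
    $queue->enqueue(new HttpJob($url, function ($body) {
        // parse/process $body; a new request starts as soon as this channel frees up
    }));
}
$queue->launch();
```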
Of course, this won't really save you if it's not just the fetching that takes time, but the processing as well... And it's not true parallel processing: the callbacks all still run in a single PHP process. But I still hope this helps. :)
P.S. It helped. :) A job that used to take 8 hours now completes in 3-4 minutes, using a pool of 50 connections. I can't describe the feeling. :) I really didn't expect it to work as planned, because with PHP it rarely works exactly as intended... It was like: "OK, hope it finishes in an hour at least... Wha... Wait... Already?! 8-O"