Take a look at the very capable Typhoeus and Hydra combo. The two make it easy to process multiple URLs concurrently.
The "Times" example should get you up and running quickly. In the on_complete block, put the code that writes your statuses to the database. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, and then loop for another group; it's up to you.
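For example, here's a bare-bones sketch of that batching pattern (not from the Typhoeus docs; the URL list, batch size, and record_status method are placeholders for your own data and database layer):

#!/usr/bin/env ruby

require 'typhoeus'

urls = %w[http://example.com/a http://example.com/b http://example.com/c]

# Placeholder: replace with whatever writes the status to your database.
def record_status(url, code)
  puts "#{url} -> #{code}"
end

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

# Queue a set number, let them all run to completion, then loop for the next group.
urls.each_slice(50) do |batch|
  batch.each do |u|
    request = Typhoeus::Request.new(u, :method => :head)
    request.on_complete do |resp|
      record_status(u, resp.code)
    end
    hydra.queue(request)
  end
  hydra.run
end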
Paul Dix, the original author, talked about his design goals on his blog.
This is some sample code I wrote to download archived mailing lists so I could search them locally. I deliberately removed the URL to avoid subjecting the site to DOS attacks if people start running the code:
#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

# Grab the index page and parse out the links to the gzipped archives.
url  = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc  = Nokogiri::HTML(resp.body)

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request  = Typhoeus::Request.new(gzip_url.to_s)

  # When the request finishes, write the body to a local file.
  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'w') do |fo|
      fo.write resp.body
    end
  end

  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run
Running that on my several-year-old MacBook Pro pulled in 76 files totaling 11 MB in just 20 seconds, over wireless to DSL. If you're only doing HEAD requests, your throughput will be better. You'll want to experiment with the concurrency setting, because there is a point where more simultaneous sessions only slow you down and needlessly consume resources.
I give it an 8 out of 10; it's got a great beat and I can dance to it.
EDIT:
To check for dead URLs, you can use a HEAD request, or a GET with an If-Modified-Since header. Either can give you responses you can use to determine the freshness of your URLs.
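For example, a rough sketch of such a check, using the same Typhoeus style as the code above (the URL, timestamp, and status handling are placeholders, not part of the original code):

require 'typhoeus'
require 'time'

url          = 'http://example.com/some/page'   # placeholder URL
last_checked = Time.now - 86_400                # e.g. when you last saw it

request = Typhoeus::Request.new(url,
  :method  => :head,
  :headers => { 'If-Modified-Since' => last_checked.httpdate })

request.on_complete do |resp|
  case resp.code
  when 200      then puts "#{url} is alive and has changed"
  when 304      then puts "#{url} is alive and unchanged"
  when 404, 410 then puts "#{url} looks dead"
  else               puts "#{url} returned #{resp.code}; worth a closer look"
  end
end

hydra = Typhoeus::Hydra.new
hydra.queue(request)
hydra.run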
the tin man