Best way to concurrently check the HTTP status (200, 301, 404, etc.) of many URLs stored in a database

Here is what I am trying to do. Say I have 100,000 URLs stored in a database, and I want to check each of them for its HTTP status and store that status. I want to be able to do this concurrently, in a relatively small amount of time.

I was wondering what the best way to do this is. I was thinking about using some kind of queue with workers/consumers, or some specific model, but I don't have enough experience to know what would work best in this scenario.

Ideas?

ruby database concurrency
3 answers

Take a look at the very capable Typhoeus and Hydra. The two make it easy to process multiple URLs concurrently.

The "Times" example should get you up and running quickly. In the on_complete block, put the code that records your statuses to the database. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, and then loop for another group; it's up to you.
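As a rough sketch of that idea adapted to your case (status checks rather than downloads), something like the following should work; `urls` and `save_status` are hypothetical placeholders for however you read rows from, and write results back to, your database:

    require 'typhoeus'

    hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

    urls.each do |url|
      request = Typhoeus::Request.new(url, :method => :head)

      request.on_complete do |response|
        # response.code is the HTTP status (200, 301, 404, ...), or 0 if the
        # connection itself failed; persist it however your schema expects.
        save_status(url, response.code)
      end

      hydra.queue(request)
    end

    hydra.run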

Paul Dix, the original author, talked about his design goals on his blog.

This is some example code I wrote to download archived mailing lists so I could run local searches. I deliberately removed the URL so as not to expose the site to DOS attacks if people start running the code:

    #!/usr/bin/env ruby
    require 'nokogiri'
    require 'addressable/uri'
    require 'typhoeus'

    BASE_URL = ''

    url  = Addressable::URI.parse(BASE_URL)
    resp = Typhoeus::Request.get(url.to_s)
    doc  = Nokogiri::HTML(resp.body)

    hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

    doc.css('a').map { |n| n['href'] }.select { |href| href[/\.gz$/] }.each do |gzip|
      gzip_url = url.join(gzip)
      request = Typhoeus::Request.new(gzip_url.to_s)

      request.on_complete do |resp|
        gzip_filename = resp.request.url.split('/').last
        puts "writing #{gzip_filename}"
        File.open("gz/#{gzip_filename}", 'w') do |fo|
          fo.write resp.body
        end
      end

      puts "queuing #{ gzip }"
      hydra.queue(request)
    end

    hydra.run

Running that on my several-year-old MacBook Pro pulled in 76 files totaling 11 MB in just 20 seconds, over wireless to DSL. If you only make HEAD requests, your throughput will be better. You will want to experiment with the concurrency setting, because there is a point where more simultaneous sessions only slow you down and needlessly use resources.

I give it an 8 out of 10; it's got a great beat and I can dance to it.


EDIT:

When checking the remote URLs, you can use a HEAD request, or a GET with If-Modified-Since. They can give you responses you can use to determine the freshness of your URLs.
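For illustration, here is a hedged sketch of the If-Modified-Since variant using Net::HTTP from the standard library; `url` and `last_checked_at` are assumed to come from your database:

    require 'net/http'
    require 'uri'
    require 'time'

    uri = URI(url)
    Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      req = Net::HTTP::Get.new(uri)
      req['If-Modified-Since'] = last_checked_at.httpdate

      res = http.request(req)
      # 304 means unchanged since last_checked_at, 200 means fresh content,
      # and anything else (301, 404, ...) is the status you want to record.
      puts res.code
    end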


I haven't done anything multithreaded in Ruby, only in Java, but it looks pretty simple: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm

From what you described, you do not need a queue and workers (well, I'm sure you could do it that way too, but I doubt you would get much benefit). Just split your URLs between multiple threads, and let each thread process its chunk and update the database with the results. For example, create 100 threads and give each thread a range of 1,000 database rows to process.
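A rough sketch of that idea, assuming `urls` holds the rows already loaded from the database and `save_status` is your own persistence helper (both names are made up here); with 100,000 URLs, each_slice(1000) gives you the 100 threads mentioned above:

    require 'net/http'
    require 'uri'

    threads = urls.each_slice(1000).map do |chunk|
      Thread.new do
        chunk.each do |url|
          uri = URI(url)
          res = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
            http.head(uri.request_uri)   # HEAD is enough when only the status matters
          end
          save_status(url, res.code)
        end
      end
    end

    threads.each(&:join)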

You could even create 100 separate processes and give them the row ranges as arguments, if you would rather deal with processes than threads.
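If you prefer processes, a hypothetical variant could spawn a worker script and pass each one its offset and row count as arguments (check_urls_worker.rb is made up; each worker would read its own rows and write its own results):

    pids = (0...100).map do |i|
      spawn('ruby', 'check_urls_worker.rb', (i * 1000).to_s, '1000')
    end

    pids.each { |pid| Process.wait(pid) }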

To get the status of a URL, I think you would do an HTTP HEAD request, which I believe is http://apidock.com/ruby/Net/HTTP/request_head in Ruby.
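For example, a quick sketch with a placeholder host and path:

    require 'net/http'

    res = Net::HTTP.start('example.com', 80) do |http|
      http.request_head('/some/page')
    end

    puts res.code   # => "200", "301", "404", ...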


The work_queue gem is the easiest way to execute tasks asynchronously and concurrently in your application.

    require 'work_queue'
    require 'net/http'
    require 'uri'

    # ten worker threads pulling tasks from the queue
    wq = WorkQueue.new 10

    urls.each do |url|
      wq.enqueue_b do
        response = Net::HTTP.get_response(URI(url))
        puts response.code
      end
    end

    wq.join
