Why is curl in Ruby slower than command line curl?

I am trying to download more than 1 million pages (URLs ending with a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:

    curl = Curl::Easy.new

    batch_urls.each { |url_info|
      curl.url = url_info[:url]
      curl.perform
      file = File.new(url_info[:file], "wb")
      file << curl.body_str
      file.close
      # ... some other stuff
    }

I tried to download a sample of 8,000 pages. Using the above code, I get 1,000 pages in 2 minutes. When I write all the URLs to a file and run this in a shell:

 cat list | xargs curl 

I get all 8,000 pages in two minutes.

The thing is, I need it in Ruby code, because there is other monitoring and processing code.

I tried:

  • Curl::Multi: it is somehow faster, but it skips 50-90% of the files (it does not download them and gives no reason or error code)
  • multiple threads with Curl::Easy: roughly the same speed as single-threaded

Why is a reused Curl::Easy slower than successive curl invocations on the command line, and how can I make it faster? Or what am I doing wrong?

I'd rather fix my download manager code than handle downloading differently for this one case.

Before that, I was calling the wget command line tool, supplying it with a file containing the list of URLs. However, not all errors were handled, and it was also impossible to specify a separate output file for each URL when using a URL list.

Now it seems to me that the best way would be to use multiple threads with system calls to the 'curl' command. But why do that, when I can use curl directly from Ruby?
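
For reference, the fallback I have in mind would look roughly like this (just a sketch; the thread count is arbitrary):

    require 'thread'   # for Queue on older Rubies

    queue = Queue.new
    batch_urls.each { |url_info| queue << url_info }

    threads = (1..10).map do
      Thread.new do
        loop do
          url_info = queue.pop(true) rescue break   # non-blocking pop; stop when the queue is empty
          # shell out to curl: silent, write the body to the per-URL output file
          system("curl", "-s", "-o", url_info[:file], url_info[:url])
        end
      end
    end
    threads.each { |t| t.join }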

The code for the download manager is here, in case it helps: Download Manager (I have played with timeouts, from not setting them at all to various values; it did not seem to help).

Any hints appreciated.

+7
ruby curl download curb
6 answers

This may be a suitable task for Typhoeus.

Something like this (untested):

    require 'typhoeus'

    def write_file(filename, data)
      file = File.new(filename, "wb")
      file.write(data)
      file.close
      # ... some other stuff
    end

    hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

    batch_urls.each do |url_info|
      req = Typhoeus::Request.new(url_info[:url])
      req.on_complete do |response|
        write_file(url_info[:file], response.body)
      end
      hydra.queue req
    end

    hydra.run

Come to think of it, you might run into memory problems because of the huge number of files. One way to prevent that is to never store the data in a variable, but instead stream it directly to a file. You could use em-http-request for that:

    EventMachine.run {
      http = EventMachine::HttpRequest.new('http://www.website.com/').get
      http.stream { |chunk| print chunk }
      # ...
    }
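
To stream straight to a file instead of printing, something like this should work (untested; url_info follows the naming from the question):

    require 'em-http-request'

    EventMachine.run {
      file = File.open(url_info[:file], "wb")
      http = EventMachine::HttpRequest.new(url_info[:url]).get
      http.stream { |chunk| file.write(chunk) }   # each chunk goes straight to disk
      http.callback {
        file.close
        EventMachine.stop
      }
    }
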
+5

So, if you don't set an on_body handler, curb will buffer the download in memory. If you are downloading files, you should use an on_body handler. If you want to download multiple files with Ruby curl, try the Curl::Multi.download interface:

    require 'rubygems'
    require 'curb'

    urls_to_download = [
      'http://www.google.com/',
      'http://www.yahoo.com/',
      'http://www.cnn.com/',
      'http://www.espn.com/'
    ]
    path_to_files = [
      'google.com.html',
      'yahoo.com.html',
      'cnn.com.html',
      'espn.com.html'
    ]

    Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}

If you just want to download a single file:

 Curl::Easy.download('http://www.yahoo.com/') 
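
And if you do need the on_body handler mentioned above, here is an untested sketch with Curl::Easy (the URL and filename are placeholders):

    require 'curb'

    file = File.open("page.html", "wb")
    curl = Curl::Easy.new("http://www.example.com/")
    # stream each chunk to disk; the handler should return the number
    # of bytes it consumed, otherwise curb aborts the transfer
    curl.on_body { |data| file.write(data); data.size }
    curl.perform
    file.close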

Here is a good resource: http://gist.github.com/405779

+3

There have been benchmarks comparing curb with other approaches such as HTTPClient. The winner, in almost all categories, was HTTPClient. In addition, there have been some documented scenarios in which curb does not work in multi-threaded setups.

Like you, I have had my own experience here. I ran curl system commands in 20+ concurrent threads, and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.

I have since switched to HTTPClient, and the difference is huge. It runs as fast as 20 concurrent curl system commands, and uses less CPU as well.
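
For what it's worth, a minimal sketch of the threaded HTTPClient approach (untested; batch_urls and url_info are assumed from the question):

    require 'httpclient'
    require 'enumerator'   # for each_slice on 1.8

    client = HTTPClient.new   # a single instance is thread-safe and reuses connections
    slice_size = (batch_urls.size / 20.0).ceil   # split the work across 20 threads

    threads = batch_urls.each_slice(slice_size).map do |slice|
      Thread.new do
        slice.each do |url_info|
          File.open(url_info[:file], "wb") do |file|
            file.write(client.get_content(url_info[:url]))   # get_content returns the body
          end
        end
      end
    end
    threads.each { |t| t.join }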

+1

First let me say that I know almost nothing about Ruby.

What I do know is that Ruby is an interpreted language; it is not surprising that it is slower than highly optimized code compiled for a particular platform. Every Ruby file operation probably comes with checks that curl does not perform. The "some other stuff" will slow things down even more.

Have you tried profiling your code to see where the time is being spent?
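
For example (my Ruby is shaky, so treat this as a rough sketch reusing the names from the question), you could split the timing with the standard Benchmark library:

    require 'benchmark'
    require 'curb'

    download_time = 0.0
    write_time = 0.0

    curl = Curl::Easy.new
    batch_urls.each do |url_info|
      download_time += Benchmark.realtime do
        curl.url = url_info[:url]
        curl.perform
      end
      write_time += Benchmark.realtime do
        File.open(url_info[:file], "wb") { |f| f << curl.body_str }
      end
    end

    puts "downloads: #{download_time}s, file writes: #{write_time}s"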

0

Stiivi, is there any chance that Net::HTTP would be enough for simply downloading HTML pages?
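
Something as simple as this, say (untested; batch_urls as in your question):

    require 'net/http'
    require 'uri'

    batch_urls.each do |url_info|
      body = Net::HTTP.get(URI.parse(url_info[:url]))   # one-shot GET, returns the body
      File.open(url_info[:file], "wb") { |f| f.write(body) }
    end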

0

You did not specify your Ruby version, but threads in 1.8.x are user-space threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, all interfering with concurrency. Since you are trying to maximize network throughput, you are probably underutilizing the CPUs.

Spawn as many processes as the machine has memory for, and limit the reliance on threads.
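
A minimal sketch of that idea (untested; fork is Unix-only, and the worker count is an assumption to tune against available memory):

    require 'curb'
    require 'enumerator'   # for each_slice on 1.8

    num_workers = 4   # assumption: tune to memory and cores
    batch_urls.each_slice((batch_urls.size / num_workers.to_f).ceil) do |slice|
      fork do
        curl = Curl::Easy.new
        slice.each do |url_info|
          curl.url = url_info[:url]
          curl.perform
          File.open(url_info[:file], "wb") { |f| f << curl.body_str }
        end
      end
    end
    Process.waitall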

0