I am trying to download more than 1 million pages (URLs ending with a sequence ID). I have implemented a multipurpose download manager with a configurable number of download threads and one processing thread. Each download worker fetches files like this:
    curl = Curl::Easy.new
    batch_urls.each { |url_info|
        curl.url = url_info[:url]
        curl.perform
        file = File.new(url_info[:file], "wb")
        file << curl.body_str
        file.close
    }
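(For context, the overall structure is roughly the sketch below: a pool of download threads feeding one processing thread. This is only an illustration assuming the batch_urls array of {:url, :file} hashes from above and the curb gem; the thread count and queue names are made up, not my actual manager code.)

    require 'curb'
    require 'thread'

    download_queue   = Queue.new
    processing_queue = Queue.new
    batch_urls.each { |url_info| download_queue << url_info }

    # N download threads, each with its own Curl::Easy handle
    downloaders = 4.times.map do
      Thread.new do
        curl = Curl::Easy.new
        loop do
          url_info = download_queue.pop(true) rescue break
          curl.url = url_info[:url]
          curl.perform
          File.open(url_info[:file], "wb") { |f| f << curl.body_str }
          processing_queue << url_info
        end
      end
    end

    # one processing thread consuming finished downloads
    processor = Thread.new do
      while (url_info = processing_queue.pop)
        break if url_info == :done
        # ... process the downloaded file here ...
      end
    end

    downloaders.each(&:join)
    processing_queue << :done
    processor.join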
I tried downloading a sample of 8000 pages. Using the code above, I get 1000 downloads in 2 minutes. When I write all the URLs into a file and run this in the shell:
cat list | xargs curl
I get all 8000 pages in two minutes.
The thing is, I need this to happen in the Ruby code, because there is other control and processing code around it.
I tried:
- Curl::Multi is somehow faster, but skips 50-90% of files (does not download them and gives no reason/error code); see the Curl::Multi sketch after this list
- multiple threads with Curl::Easy - runs at the same speed as the single-threaded version
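For the Curl::Multi attempt, the approach was roughly the sketch below (reconstructed, not my exact code); attaching curb's on_success/on_failure callbacks to each handle should at least surface a reason when a file is skipped. batch_urls and url_info are the same assumed structures as above.

    require 'curb'

    multi = Curl::Multi.new
    batch_urls.each do |url_info|
      easy = Curl::Easy.new(url_info[:url])
      easy.on_success do |c|
        File.open(url_info[:file], "wb") { |f| f << c.body_str }
      end
      easy.on_failure do |c, err|
        # surface the reason instead of silently skipping the file
        warn "FAILED #{url_info[:url]}: #{c.response_code} #{err.inspect}"
      end
      multi.add(easy)
    end
    multi.perform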
Why is reusing one Curl::Easy handle slower than successive curl calls on the command line, and how can I make it faster? Or what am I doing wrong?
I would rather fix my download manager code than handle downloading for this case differently.
Before this, I was calling command-line wget and feeding it a file with the list of URLs. However, not all errors were handled, and it was also not possible to specify a separate output file for each URL when using the URL list.
Now it seems to me that the best way would be to use multiple threads with a system call to the 'curl' command. But why should I, when I can use Curl directly from Ruby?
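What I have in mind for that fallback is roughly the following sketch (the thread count and the -s/-f/-o curl flags are just illustrative choices):

    require 'thread'

    queue = Queue.new
    batch_urls.each { |url_info| queue << url_info }

    threads = 8.times.map do
      Thread.new do
        loop do
          url_info = queue.pop(true) rescue break
          # -s silent, -f fail on HTTP errors, -o write to the given file
          ok = system("curl", "-s", "-f", "-o", url_info[:file], url_info[:url])
          warn "download failed: #{url_info[:url]}" unless ok
        end
      end
    end
    threads.each(&:join)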
The code for the download manager is here, in case it helps: Download Manager (I have played with timeouts, from not setting them at all to various values; it did not seem to help).
Any hints appreciated.
Stiivi