I have a problem where I need to download, unzip, and then process a very large CSV file line by line. To give you an idea of how big the file is:
- big_file.zip ~ 700 MB
- big_file.csv ~ 23 GB
Here are some things I would like to do:
- Not have to download the whole file before starting to unpack it
- Not have to unpack the whole file before starting to parse CSV lines
- Keep memory and disk usage low throughout
I do not know if this is possible or not. Here is what I sketched:

```ruby
require 'open-uri'
require 'zip'  # the rubyzip gem is required as 'zip'
require 'csv'

open('http://foo.bar/big_file.zip') do |zipped|
  Zip::InputStream.open(zipped) do |unzipped|
    # advance through the archive until we reach the CSV entry
    entry = unzipped.get_next_entry
    entry = unzipped.get_next_entry until entry.nil? || entry.name == 'big_file.csv'
    CSV.foreach(unzipped) do |row|
      # process row
    end
  end
end
```
Here are the issues I know about:
- open-uri reads the whole response and saves it to a Tempfile, which is not suitable for a file of this size. I will probably need to use Net::HTTP directly, but I'm not sure how to do that and still get an IO I can hand to the unzipper.
- Will Zip::InputStream work the way I showed it? Can it unzip a file that hasn't been fully downloaded yet? And I don't know how fast any of this will be.
- Will CSV.foreach work with rubyzip's InputStream? Does that stream behave enough like a File for CSV to parse lines from it? Will it break when it wants to read but the buffer is empty?
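To explore the first and third points, here is a minimal offline sketch of the pattern I have in mind: a writer thread stands in for `Net::HTTP#read_body` feeding chunks into an `IO.pipe`, and `CSV.new` (rather than `CSV.foreach`, which wants a filename) parses rows from the reader end. The "download" is faked with a `StringIO` so this runs without a network; everything else is stdlib.

```ruby
require 'csv'
require 'stringio'

# Fake "download": StringIO stands in for the HTTP response body.
fake_body = StringIO.new("a,b\n1,2\n3,4\n")

reader, writer = IO.pipe

# Producer thread: stands in for
#   response.read_body { |chunk| writer.write(chunk) }
producer = Thread.new do
  while (chunk = fake_body.read(4)) # small chunks, as over the wire
    writer.write(chunk)
  end
  writer.close # signals EOF to the reader end
end

# CSV.foreach expects a filename; CSV.new accepts an IO and parses
# rows lazily as bytes become available on the pipe.
rows = CSV.new(reader).map { |row| row }

producer.join
reader.close
```

If this pattern holds up, the reader end of the pipe could presumably be handed to `Zip::InputStream.open` in place of the `CSV.new` call here, so no full copy of the archive ever sits in memory or on disk.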
I do not know if a blocking approach like this is even the right fit. Perhaps some EventMachine-based solution would be better (I have never used EventMachine before, but if it works better for something like this, I'm all for it).
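On the second point, I convinced myself that decompress-on-demand is at least plausible using the stdlib's `Zlib::GzipReader`. This is gzip rather than zip, so it is only an analogy (though both wrap DEFLATE streams): the reader inflates only as much of the underlying IO as the consumer actually asks for.

```ruby
require 'zlib'
require 'stringio'

# Compress a few lines in memory; StringIO stands in for the
# (possibly still-downloading) compressed stream.
compressed = StringIO.new(Zlib.gzip("row1\nrow2\nrow3\n"))

gz = Zlib::GzipReader.new(compressed)

# Each #gets inflates only as much as it needs for one line; the
# rest of the data stays compressed in the underlying IO.
first  = gz.gets
second = gz.gets
gz.close
```

Whether rubyzip's `Zip::InputStream` behaves the same way on a stream that is still arriving is exactly what I'm unsure about.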