Stream and unzip a large CSV file in Ruby

I need to download, unzip, and then process a very large CSV file line by line. To give you an idea of how big the file is:

  • big_file.zip ~ 700 MB
  • big_file.csv ~ 23 GB

Here are some things I would like to do:

  • Not have to download the whole file before unzipping it
  • Not have to unzip the whole file before parsing CSV lines
  • Not use too much memory or disk while doing all this

I don't know whether this is possible. Here is what I had in mind:

    require 'open-uri'
    require 'zip' # the rubyzip gem is loaded with require 'zip'
    require 'csv'

    URI.open('http://foo.bar/big_file.zip') do |zipped|
      Zip::InputStream.open(zipped) do |unzipped|
        # parentheses make sure entry is assigned before && is evaluated
        sleep 10 until (entry = unzipped.get_next_entry) && entry.name == 'big_file.csv'
        CSV.foreach(unzipped) do |row|
          # process the row, maybe write out to STDOUT or some file
        end
      end
    end

Here are the issues I know about:

  • open-uri reads the whole response and saves it in a Tempfile, which is not suitable for a file of this size. I will probably need to use Net::HTTP directly, but I'm not sure how to do that and still get an IO.
  • I don't know how fast the download will be, or whether Zip::InputStream works the way I have shown. Can it unzip part of the file before the rest has arrived?
  • Does CSV.foreach work with rubyzip's InputStream? Does it behave enough like a File for lines to be parsed? Will it misbehave if it wants to read but the buffer is empty?
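
On the open-uri point, one option is to let Net::HTTP hand over the response body in chunks instead of buffering it. The following is only a minimal sketch under some assumptions: `download_in_chunks` is a hypothetical helper name, and it writes each chunk straight to a file rather than returning an IO, so memory use stays small but the zip still ends up on disk.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: streams the response body of `url` into `dest_path`
# chunk by chunk and returns the number of bytes written.
def download_in_chunks(url, dest_path)
  uri = URI(url)
  bytes = 0
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    # Passing a block to #request keeps Net::HTTP from reading the body up front.
    http.request(Net::HTTP::Get.new(uri)) do |response|
      response.value # raises unless the status is 2xx
      File.open(dest_path, 'wb') do |file|
        response.read_body do |chunk|
          file.write(chunk)
          bytes += chunk.bytesize
        end
      end
    end
  end
  bytes
end
```

Called as `download_in_chunks('http://foo.bar/big_file.zip', '/path/to/big_file.zip')`, this would only ever hold one chunk in memory at a time. It does not follow redirects or retry, which a real download of this size would probably need.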

I'm not sure the open-uri and rubyzip approach is the right one at all. Perhaps an EventMachine solution would be better (I have never used EventMachine, but if it works better for something like this, I'm all for it).

1 answer

It has been some time since I posted this question, and in case someone else comes across it, I thought it might be worth sharing what I found.

The solution I settled on was to download the file to disk, and then use Ruby's IO.popen and the Linux unzip utility to stream the uncompressed CSV file out of the zip.

    # IO.popen is part of core Ruby, so no require is needed
    IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
      while (line = io.gets)
        # do stuff to process the CSV line
      end
    end

The -p switch makes unzip write the extracted file to standard output. IO.popen then uses a pipe to expose that output as an IO object in Ruby. It works very well. You could use it with CSV too if you want that extra parsing; it was too slow for me.

    require 'csv'

    IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
      # CSV.new wraps an already-open IO; CSV.foreach traditionally expects a file path
      CSV.new(io).each do |row|
        # process the row
      end
    end
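
The same pipe pattern can be wrapped in a small reusable method. This is a sketch rather than the code I actually ran: `each_line_from_command` is a made-up name, and the exit-status check is an addition of mine. Passing the command as an array means no shell is involved, so a path containing spaces needs no quoting.

```ruby
# Hypothetical helper: runs a command and yields each line of its standard
# output as it arrives, so the full output is never held in memory.
def each_line_from_command(*argv)
  IO.popen(argv, 'rb') do |io|
    io.each_line { |line| yield line.chomp }
  end
  # After the block form of IO.popen returns, $? holds the child's exit status.
  raise "#{argv.first} exited with status #{$?.exitstatus}" unless $?.success?
end

# Assumed usage with the paths from the answer:
# each_line_from_command('unzip', '-p', '/path/to/big_file.zip', 'big_file.csv') do |line|
#   fields = line.split(',') # naive split: only safe if no field is quoted or contains a comma
# end
```

The naive `split(',')` in the usage comment is the kind of shortcut that avoids the CSV overhead, but it is only safe when the data has no quoted fields or embedded commas.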