How can I handle large files in Ruby?

Question

I am new to programming, so please go easy on me. I am trying to extract ISBNs from a .dat database file. I wrote code that works, but it only gets through half of the 180 MB file. How can I configure it to search the entire file? Or how can I write a program that splits the .dat file into manageable pieces?

edit: Here is my code:

    export = File.new("resultsfinal.txt","w+")
    File.open("bibrec2.dat").each do |line|
      line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
        export.puts x
      end
      line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
        export.puts x
      end
    end

+6

ruby file-io

Nick Jul 07 '09 at 4:30

6 answers

Yoann le touche · Answer 1 · 2009-07-07T10:33:14+0000

You should try catching exceptions to check whether the problem really lies in the read block.

For what it's worth, I have already used a script with the same syntax to search a really large file (about 8 GB) without any problems.

    export = File.new("resultsfinal.txt","w+")
    File.open("bibrec2.dat").each do |line|
      begin
        line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
          export.puts x
        end
        line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
          export.puts x
        end
      rescue
        puts "Problem while adding the result"
      end
    end
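Another way to check whether the read loop really stops early is to compare the number of bytes the loop has seen with the size reported by the file system. The following is only a diagnostic sketch and not part of the original answer: `read_report` is a hypothetical helper name, and opening the file in binary mode (`"rb"`) is an assumption made so that the bytes are counted exactly as they are stored on disk.

```ruby
# Hypothetical diagnostic helper (not from the original answer):
# counts the lines and bytes that a line-by-line read actually sees.
def read_report(path)
  lines = 0
  bytes = 0
  # "rb" = binary mode, so bytes are counted exactly as stored on disk.
  File.open(path, "rb") do |file|
    file.each_line do |line|
      lines += 1
      bytes += line.bytesize
    end
  end
  { lines: lines, bytes_read: bytes, file_size: File.size(path) }
end
```

Calling `read_report("bibrec2.dat")` and comparing `:bytes_read` with `:file_size` should show whether the loop reached the end of the file. If the two numbers agree, the read is probably fine and the missing results are more likely a matching problem.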

pguardiario · Answer 2 · 2011-12-14T02:48:09+0000

The main thing is to clean up and combine the regular expressions, which also helps performance. You should also always use the block form of File.open to make sure the file descriptor is closed properly. File#each does not load the entire file into memory; it reads one line at a time:

    File.open("resultsfinal.txt","w+") do |output|
      File.open("bibrec2.dat").each do |line|
        output.puts line.scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/)
      end
    end
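As a variation on the same idea, here is a minimal sketch that lets Ruby close both files: the output through the block form of `File.open`, the input through `File.foreach`. The names `ISBN_PATTERN` and `extract_isbns` are hypothetical and chosen for illustration; the pattern itself is the combined one from the code above.

```ruby
# Combined pattern from the answer: "a", ten ISBN characters, then either
# three more ISBN characters (13 in total) or one non-word character.
ISBN_PATTERN = /a[\dxX]{10}(?:[\dxX]{3}|\W)/

# Hypothetical helper: writes every match found in input_path to
# output_path, one per line, and returns the number of matches written.
def extract_isbns(input_path, output_path)
  count = 0
  File.open(output_path, "w") do |output|
    # File.foreach yields one line at a time and closes the file afterwards.
    File.foreach(input_path) do |line|
      line.scan(ISBN_PATTERN) do |match|
        output.puts match
        count += 1
      end
    end
  end
  count
end
```

With the file names from the question, this would be called as `extract_isbns("bibrec2.dat", "resultsfinal.txt")`. Writing inside the `scan` block means nothing is written for lines without a match.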

Stevenr12 · Answer 3 · 2011-12-13T23:25:59+0000

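    # `export` is the output file from the question's code.
    export = File.new("resultsfinal.txt", "w+")
    file = File.new("bibrec2.dat", "r")
    while (line = file.gets)
      line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
        export.puts x
      end
      line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
        export.puts x
      end
    end
    file.close
    export.close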

Mike woodhouse · Answer 4 · 2009-07-07T08:22:36+0000

As for the performance issue, I don’t see anything special about the file size: 180 MB should not create any problems. What happens with memory usage when running the script?

I am not sure, however, that your regular expressions do what you want. For example, this one:

    /[a]{1}[1234567890xX]{10}\W/

matches (I think):

one "a". Are you sure you want to match a literal "a"? If so, "a" would be enough; "[a]{1}" is not needed.
exactly 10 of (digit or "x" or "X")
one non-word character, i.e. not a-z, A-Z, 0-9 or underscore

There are a couple of sample ISBN regexes here and here, although they seem to correspond more closely to the format we see on the back cover of a book, and I assume that your input file has had some of that formatting stripped out.
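The breakdown of the pattern above can be checked against a few strings. This is a small sketch with made-up sample values (they are not taken from the asker's file); `ORIGINAL`, `SIMPLIFIED` and `SAMPLES` are hypothetical names. `Regexp#match?` needs Ruby 2.4 or later.

```ruby
ORIGINAL   = /[a]{1}[1234567890xX]{10}\W/
SIMPLIFIED = /a[\dxX]{10}\W/ # same meaning for ASCII input, easier to read

# Made-up sample strings, for illustration only.
SAMPLES = [
  "a0306406152 ", # "a", ten digits, then a space
  "a030640615X;", # a trailing X counts as one of the ten characters
  "a0306406152_", # underscore is a word character, so \W does not match
  "0306406152 ",  # no leading "a"
  "a030640615 "   # only nine characters after the "a"
]

SAMPLES.each do |text|
  puts "#{text.inspect}: original=#{ORIGINAL.match?(text)} " \
       "simplified=#{SIMPLIFIED.match?(text)}"
end
```

Only the first two samples match, and both patterns agree on every sample. An ISBN followed by an underscore, or one that is not preceded by "a", is skipped.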

iGbanam · Answer 5 · 2011-12-15T12:05:48+0000

You could look into File#truncate and IO#seek and split the work in half, binary-search style. #truncate is destructive, so you should run it on a duplicate of the file (I know this is a hassle).

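    require "fileutils"

    middle = File.size("my_huge_file.dat") / 2

    # Truncate a copy, not the original ("first_half.dat" is a placeholder name).
    FileUtils.cp("my_huge_file.dat", "first_half.dat")
    File.truncate("first_half.dat", middle)
    # run search algorithm on 'first_half.dat'

    File.open("my_huge_file.dat") do |huge_file|
      huge_file.seek(middle) # a line may straddle this offset
      # run search algorithm from here
    end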

The code is untested, fragile and incomplete, but I hope it gives you something to build on.
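If the goal is only to process the file in pieces, a non-destructive alternative is to skip the copy and the truncation and use IO#seek alone. The sketch below is one possible approach, not the answerer's method: `each_line_in_range` is a hypothetical helper that yields every line starting inside a byte range, so a line that straddles the split point is handled exactly once, by the piece in which it begins.

```ruby
# Hypothetical helper: yields each complete line whose first byte lies in
# the byte range from...to of the file at path. The file is not modified.
def each_line_in_range(path, from, to)
  File.open(path, "rb") do |f|
    if from > 0
      # Step back one byte and read to the next newline. This skips a line
      # that started before `from` (it belongs to the previous range) and
      # leaves the position at the first line that starts at or after `from`.
      f.seek(from - 1)
      f.gets
    end
    while f.pos < to && (line = f.gets)
      yield line
    end
  end
end
```

For two pieces, the ranges would be `0...middle` and `middle...File.size(path)`. Each call opens the file independently, so the pieces can be processed one after another or by separate runs of the script.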

drudru · Answer 6 · 2009-07-07T04:37:20+0000

If you are programming on a modern operating system and the machine has enough memory (say 512 megabytes), Ruby should have no problem reading the entire file into memory.

Things usually start to get unreliable once the working set reaches about 2 gigabytes on a typical 32-bit OS.
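A minimal sketch of the read-everything approach, assuming the file fits comfortably in memory: `isbns_in_memory` is a hypothetical helper name, and the default pattern is borrowed from pguardiario's answer rather than from this one.

```ruby
# Hypothetical helper: reads the whole file into one string and returns
# an array with every match of the pattern.
def isbns_in_memory(path, pattern = /a[\dxX]{10}(?:[\dxX]{3}|\W)/)
  File.read(path).scan(pattern)
end
```

For a 180 MB file this holds at least 180 MB in a single string, plus the array of matches, so the line-by-line versions remain the safer choice when memory is tight.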
