How can I handle large files in Ruby?

I am new to programming, so please go easy on me. I am trying to extract ISBNs from a .dat database file. I wrote code that works, but it only looks at half of the 180 MB file. How can I configure it to search the entire file? Or how can I write a program to split the .dat file into manageable pieces?

edit: Here is my code:

    export = File.new("resultsfinal.txt","w+")
    File.open("bibrec2.dat").each do |line|
      line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
        export.puts x
      end
      line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
        export.puts x
      end
    end
+6
ruby file-io
6 answers

You should try catching exceptions to check whether the problem is really in the block that reads the file.

For what it's worth, I have already written a script with the same syntax to search a really large file (~8 GB) without any problems.

    export = File.new("resultsfinal.txt","w+")
    File.open("bibrec2.dat").each do |line|
      begin
        line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
          export.puts x
        end
        line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
          export.puts x
        end
      rescue
        puts "Problem while adding the result"
      end
    end
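
Another way to check whether the read loop really stops early is to compare the number of bytes the loop consumed with the size the file system reports. This is only a minimal sketch: the helper name `read_report` is hypothetical, and opening the file in binary mode ("rb") is my assumption, chosen so that the byte count is directly comparable with File.size.

```ruby
require "tempfile"

# Hypothetical helper: returns [lines_read, bytes_read, file_size] for path.
def read_report(path)
  lines = 0
  bytes = 0
  # "rb" is an assumption: binary mode leaves line endings untouched,
  # so the byte count can be compared directly with File.size.
  File.open(path, "rb") do |file|
    file.each_line do |line|
      lines += 1
      bytes += line.bytesize
    end
  end
  [lines, bytes, File.size(path)]
end

# Small stand-in file; point read_report at bibrec2.dat for the real check.
sample = Tempfile.new("read_report_demo")
sample.write("first line\nsecond line\n")
sample.close

lines, bytes, size = read_report(sample.path)
puts "read #{lines} lines, #{bytes} of #{size} bytes"
sample.unlink
```

If the two byte figures differ on the real file, the loop is stopping before the end of the file; if they agree, the missing results are more likely a matching problem than a reading problem.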
+4

The main thing is to clean up and combine the regular expressions to improve performance. Also, you should always use the block form of File.open to make sure the file descriptor is closed properly. File#each does not load the entire file into memory; it reads one line at a time:

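    File.open("resultsfinal.txt", "w+") do |output|
      # File.foreach closes the input file when the block finishes
      File.foreach("bibrec2.dat") do |line|
        # write one match per output line; lines without a match write nothing
        line.scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/) { |isbn| output.puts isbn }
      end
    end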
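
To see that the combined expression finds the same strings as the two original ones, here is a small comparison. It is a sketch only: the sample line is made up, since the real layout of bibrec2.dat is not shown in the question, and the helper names are hypothetical.

```ruby
# Hypothetical sample line; the real record layout is an assumption.
SAMPLE = "a0306406152 a9780306406157 b1234567890 "

# The two patterns from the question, written with \d for brevity.
TWO_PASS = [/a[\dxX]{10}\W/, /a[\dxX]{13}/]
# The single combined pattern from this answer.
COMBINED = /a[\dxX]{10}(?:[\dxX]{3}|\W)/

# Scans the line once per pattern and concatenates the results.
def two_pass_matches(line)
  TWO_PASS.flat_map { |pattern| line.scan(pattern) }
end

# Scans the line a single time with the combined pattern.
def combined_matches(line)
  line.scan(COMBINED)
end

puts two_pass_matches(SAMPLE).inspect
puts combined_matches(SAMPLE).inspect
```

The two-pass version groups its results by pattern, while the combined version returns them in the order they appear in the line, so sort both lists before comparing them on real data.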
+3
    # same output file as in the question
    export = File.new("resultsfinal.txt", "w+")
    file = File.new("bibrec2.dat", "r")
    while (line = file.gets)
      line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
        export.puts x
      end
      line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
        export.puts x
      end
    end
    file.close
    export.close
+2

As for the performance issue, I don’t see anything special about the file size: 180 MB should not create any problems. What happens with memory usage when running the script?

I am not sure, however, that your regular expressions do what you want. Take this one, for example:

 /[a]{1}[1234567890xX]{10}\W/ 

It matches (I think):

  • one "a". Are you sure you want to match a literal "a"? If so, a plain "a" is enough; "[a]{1}" is redundant.
  • exactly 10 of (digit or "x" or "X")
  • one non-word character, i.e. not a-z, A-Z, 0-9 or underscore
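
The breakdown above can be checked on a few strings. This is a minimal sketch with made-up inputs (the real record layout is not shown in the question); `CONCISE` is my shorter spelling of the same pattern, not something from the original code.

```ruby
# The pattern from the question and a shorter spelling of the same thing.
VERBOSE = /[a]{1}[1234567890xX]{10}\W/
CONCISE = /a[\dxX]{10}\W/

# Hypothetical inputs chosen to exercise each part of the pattern.
SAMPLES = [
  "a030640615X\n",  # "a", ten ISBN characters, then a newline (non-word)
  "a0306406152",    # nothing after the ten characters, so \W cannot match
  "b0306406152 ",   # does not start with the literal "a"
  "a03064061522 "   # eleven ISBN characters, so \W sees a digit instead
]

SAMPLES.each do |text|
  puts "#{text.inspect} -> #{text.scan(CONCISE).inspect}"
end
```

Note that the trailing non-word character is part of each match, so it ends up in the output file as well.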

There are several sample ISBN patterns here and here, although they seem to correspond more closely to the format we see on the back cover of a book, and I assume that your input file has stripped out some of that formatting.

+1

You could look into File#truncate and IO#seek and use a binary-search-style approach. #truncate is destructive, so you should work on a copy of the file (I know this is a hassle).

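    require "fileutils"

    path = "my_huge_file.dat"
    middle = File.size(path) / 2

    # truncate is destructive, so cut down a copy instead of the original
    FileUtils.cp(path, "first_half.dat")
    File.truncate("first_half.dat", middle)
    # run search algorithm on "first_half.dat"

    File.open(path) do |huge_file|
      huge_file.seek(middle) # start at the first byte missing from the copy
      # run search algorithm from here
    end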

The code is untested, fragile and incomplete, but I hope it gives you a starting point to build on.
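
A byte offset such as the middle of the file will usually land inside a line. One way to handle that without truncating anything is to assign each line to the range in which it starts. This is a sketch under my own assumptions: the helper name `each_line_in_range` is hypothetical, and it swaps truncation for read-only seeking.

```ruby
require "tempfile"

# Hypothetical helper: yields every line that STARTS at a byte offset
# in [start_offset, end_offset). The file is only read, never modified.
def each_line_in_range(path, start_offset, end_offset)
  File.open(path, "rb") do |file|
    if start_offset > 0
      # Step back one byte and discard up to the next newline, so a line
      # that began before start_offset is left to the previous range.
      file.seek(start_offset - 1)
      file.gets
    end
    while file.pos < end_offset && (line = file.gets)
      yield line
    end
  end
end

# Small stand-in for the real data file.
demo = Tempfile.new("range_demo")
demo.write("aaa\nbbb\nccc\n")
demo.close

size = File.size(demo.path)
middle = size / 2
each_line_in_range(demo.path, 0, middle) { |line| puts "first half:  #{line}" }
each_line_in_range(demo.path, middle, size) { |line| puts "second half: #{line}" }
demo.unlink
```

Because every line belongs to exactly one range, the pieces can be searched separately without missing or double-counting a record that straddles the split point.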

+1

If you are programming on a modern operating system and the computer has enough memory (say 512 megabytes), Ruby should have no problem reading the entire file into memory.

Things usually only start to go wrong once the working set reaches about 2 gigabytes on a typical 32-bit OS.
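
For illustration, reading everything at once could look like the sketch below. The helper name `scan_whole_file` is hypothetical, the pattern is the combined one from another answer in this thread, and the temporary file is only a stand-in for bibrec2.dat.

```ruby
require "tempfile"

# Combined pattern taken from another answer in this thread.
ISBN_PATTERN = /a[\dxX]{10}(?:[\dxX]{3}|\W)/

# Hypothetical helper: loads the whole file into one string,
# then scans that string in a single pass.
def scan_whole_file(path)
  File.read(path).scan(ISBN_PATTERN)
end

# Tiny stand-in for the real data file.
sample = Tempfile.new("bibrec_sample")
sample.write("a0306406152 title one\nno match here\na9780306406157 title two\n")
sample.close

puts scan_whole_file(sample.path).inspect
sample.unlink
```

This keeps the whole file contents in memory for the duration of the scan, which is the trade-off described above.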

-2
