How to efficiently parse large text files in Ruby

I am writing an import script that processes a file with potentially hundreds of thousands of lines (a log file). With a very simple approach (see below), it took so much time and memory that I felt it would take down my MBP at any moment, so I killed the process.

    #...
    File.open(file, 'r') do |f|
      f.each_line do |line|
        # do stuff here to line
      end
    end

In this file, in particular, there are 642,868 lines:

    /code/src/myimport $ wc -l nginx.log
    642868 ../nginx.log

Does anyone know a more efficient way (memory / processor) to process each line in this file?

UPDATE

The code inside the f.each_line loop simply matches a regular expression against the line. If the match fails, I add the line to the @skipped array. If it succeeds, I format the matches into a hash (keyed by the "fields" of the match) and append it to the @results array.

    # regex built in `def initialize` (not on each line iteration)
    @regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/

    #... loop lines
    match = line.match(@regex)
    if match.nil?
      @skipped << line
    else
      @results << convert_to_hash(match)
    end

I am completely fine with this turning out to be an inefficient process on my side. I could make the code inside convert_to_hash use a precomputed lambda rather than working out the computation every time. I guess I just assumed that the line iteration itself was the problem, and not the per-line code.
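To illustrate the precomputation idea, here is a minimal sketch of a convert_to_hash that pairs the captures with a field list built once. The names in FIELDS are hypothetical (the real field names are not shown in the question), and LINE_RE is a simplified stand-in for @regex, not the original pattern.

```ruby
# Hypothetical field names, one per capture group; built once and frozen.
FIELDS = %i[ip method path status bytes].freeze

# Simplified stand-in for @regex (an assumption, not the original pattern).
LINE_RE = /\A(\S+) - - \[[^\]]+\] "(GET|POST|PUT|DELETE) (\S+) HTTP\/1\.1" (\d+) (\d+)/

# Pair each precomputed field name with the corresponding capture.
def convert_to_hash(match)
  FIELDS.zip(match.captures).to_h
end

line = '127.0.0.1 - - [10/Oct/2010:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "curl"'
match = LINE_RE.match(line)
p(match ? convert_to_hash(match) : :skipped)
```

Whether this is faster than the existing convert_to_hash depends on what that method currently does, so it would need measuring.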

+6
ruby text-processing
3 answers

I just tried this on a test file of 600,000 lines, and it iterated over the file in less than half a second. My guess is that the slowness is not in the file loop but in the line parsing. Can you also paste your parsing code?
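One way to check this is to time a bare pass over the file against a pass that also runs the regex. The sketch below is only an illustration: write_sample_log and compare_passes are hypothetical helpers, the log lines are synthetic, and LOG_RE is a simplified stand-in for the regex in the question.

```ruby
require 'benchmark'
require 'tempfile'

# Simplified stand-in for the question's regex (assumption).
LOG_RE = /\A(\S+) - - \[([^\]]+)\] "(GET|POST|PUT|DELETE) (\S+) HTTP\/1\.1" (\d+) (\d+)/

# Hypothetical helper: write `count` synthetic log lines to a temp file.
def write_sample_log(count)
  file = Tempfile.new('sample_log')
  count.times do |i|
    file.puts %(10.0.0.#{i % 256} - - [01/Jan/2011:00:00:00 +0000] "GET /p/#{i} HTTP/1.1" 200 #{i})
  end
  file.flush
  file
end

# Hypothetical helper: time a bare pass over the file, then a pass that
# also matches every line against `regex`.
def compare_passes(path, regex)
  lines = 0
  matched = 0
  iterate_s = Benchmark.realtime { File.foreach(path) { lines += 1 } }
  parse_s = Benchmark.realtime do
    File.foreach(path) { |line| matched += 1 if regex.match?(line) }
  end
  { lines: lines, matched: matched, iterate_s: iterate_s, parse_s: parse_s }
end

sample = write_sample_log(20_000)
result = compare_passes(sample.path, LOG_RE)
puts format('iterate: %.3fs, iterate + match: %.3fs', result[:iterate_s], result[:parse_s])
sample.close!
```

If the second number is far larger than the first, the per-line work is the part worth optimizing. Pointing compare_passes at the real nginx.log with the real regex would give the relevant figures.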

+5

This blog post covers several approaches to parsing large log files; perhaps it can serve as inspiration. Also take a look at the file-tail gem.

+4

If you use bash (or similar), you can optimize as follows:

In input.rb:

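    # Run with `ruby -n`: Ruby supplies the `while gets ... end` loop,
    # and the current line is available in $_.
    x = $_
    # Parse x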

then in bash:

  cat nginx.log | ruby -n input.rb 

The -n flag tells ruby to assume a 'while gets(); ... end' loop around your script, with the current line available in $_, which might make it do something special to optimize.
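For comparison, the loop that -n supplies can also be written out explicitly, which keeps the script runnable without the flag. This is only a sketch: parse_stream is a hypothetical helper, REQUEST_RE is a simplified stand-in pattern, and it tallies lines instead of storing them so that only one line is held at a time.

```ruby
require 'stringio'

# Simplified stand-in pattern (assumption): request method and path.
REQUEST_RE = /"(GET|POST|PUT|DELETE) (\S+) HTTP\/1\.1"/

# Hypothetical helper: read `io` line by line, as the implicit -n loop would,
# and tally matched versus skipped lines without keeping them in memory.
def parse_stream(io)
  counts = { matched: 0, skipped: 0 }
  io.each_line do |line|
    key = REQUEST_RE.match?(line) ? :matched : :skipped
    counts[key] += 1
  end
  counts
end

# In a script fed by a pipe this would be: parse_stream($stdin)
demo = StringIO.new(%(1.2.3.4 - - [x] "GET / HTTP/1.1" 200 1\nnot a log line\n))
p parse_stream(demo)
```

Because the helper accepts any IO object, the same code works with $stdin, an open File, or a StringIO in tests.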

You might also want to look for a pre-written solution to this problem, as it will likely be faster.

+1