I am writing an import script that processes a file with potentially hundreds of thousands of lines (a log file). Using a very simple approach (see below), it took so much time and memory that I felt my MBP was going to choke at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
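For reference, `File.foreach` is another way to stream the file line by line; like the `each_line` loop above, it reads one line at a time rather than loading the whole file into memory. A minimal sketch, assuming the same `file` variable:

# Lazy, line-by-line read; memory behavior is the same as the each_line loop above.
File.foreach(file) do |line|
  # do stuff here to line
end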
This particular file has 642,868 lines:
$ wc -l ../nginx.log
  642868 ../nginx.log
Does anyone know a more efficient way (memory / processor) to process each line in this file?
UPDATE
The code inside f.each_line above just matches a regex against the line. If the match fails, I add the line to the @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to the @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/

#... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
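For context, convert_to_hash just maps the captures to field names. The actual implementation isn't shown here; a hypothetical sketch (field names are illustrative only) could look like this:

# Hypothetical field names, one per capture group in the regex above.
FIELDS = [:ip, :ident, :timestamp, :method, :path, :protocol, :status, :bytes, :user_agent]

def convert_to_hash(match)
  # Zip the positional captures with the field names into a hash.
  Hash[FIELDS.zip(match.captures)]
end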
I am completely open to this being an inefficient process. I could make the code inside convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed the problem was the line iteration itself, and not the per-line code.
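One way to test that assumption is to time bare iteration against iteration plus the regex match. A rough sketch using Ruby's standard Benchmark module (the path is assumed from the wc output above):

require 'benchmark'

file  = '../nginx.log'  # assumed path
regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/

Benchmark.bm(12) do |x|
  # Bare line iteration, no per-line work.
  x.report('iterate')    { File.foreach(file) { |line| } }

  # Iteration plus the regex match, to see what the per-line work adds.
  x.report('iter+regex') { File.foreach(file) { |line| line.match(regex) } }
end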
ruby text-processing
localshred