Regexp search on a very large file

I need to scan a 300 MB text file with a regular expression.

  • Reading the entire file into a variable eats more than 700 MB of RAM and then fails with the error "cannot allocate memory".
  • A match can span two or three lines, so I cannot simply iterate over the file line by line.

Is there a lazy way to scan the whole file with a regular expression without first reading it into a single variable?

UPD

Done. You can now use this function to read the file in pieces. Adapt it to your purposes.

 def prepare_session_hash(fname, regex, start = 0)
   @session_login_hash = {}
   File.open(fname, 'rb') do |f|
     fsize = f.size
     bsize = fsize / 8
     f.seek(start) if start > 0
     overlap = 200  # bytes re-read at each step so a match on a chunk boundary is not lost
     loop do
       f.seek(f.tell - overlap) if f.tell >= overlap && f.tell < fsize
       buffer = f.read(bsize)
       return @session_login_hash unless buffer
       # scan with the caller-supplied regexp; each match's first two
       # capture groups become a key/value pair
       buffer.scan(regex) { |match| @session_login_hash[match[0]] = match[1] }
     end
   end
 end
1 answer
  • Move through the file in chunks rather than line by line, where chunk boundaries are occurrences of a frequently occurring character or pattern, say "X".
  • Choose "X" so that it can never occur inside a match of your regular expression, i.e. your pattern will never span an "X".
  • Match your regex against the current chunk, collect the matches, and move on to the next chunk.

Example:

 This is string with multline numbers -2000 2223434 34356666 444564646 . These numbers can occur at 34345 567567 places, and on 67 87878 pages . The problem is to find a good way to extract these more than 100 0 regexes without memory hogging. 

In this text, suppose the desired pattern is a numeric string, e.g. /\d+/ to match runs of digits. Then, instead of loading and processing the whole file, you can pick a chunking pattern, say the FULL STOP "." in this case, read and process only up to its next occurrence, and then move on to the next chunk.

CHUNK # 1:

 This is string with multline numbers -2000 2223434 34356666 444564646 . 

CHUNK # 2:

 These numbers can occur at 34345 567567 places, and on 67 87878 pages 

etc.
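The chunking idea above can be sketched in a few lines of Ruby. This is a minimal illustration, not the answerer's code: the name scan_in_chunks is made up here, and it assumes the chosen delimiter never falls inside a match. Ruby's File.foreach accepts a custom separator, so each delimiter-terminated chunk is yielded without ever holding the whole file in memory.

```ruby
# Sketch: scan a large file chunk by chunk, where chunks end at a
# delimiter that the regex can never match across. The function name
# and default delimiter are illustrative assumptions.
def scan_in_chunks(fname, regex, delimiter = ".")
  results = []
  # File.foreach with a separator string yields one delimiter-terminated
  # chunk at a time instead of one line at a time.
  File.foreach(fname, delimiter) do |chunk|
    chunk.scan(regex) { |m| results << m }
  end
  results
end
```

For the sample text above, scan_in_chunks("big.txt", /\d+/) would walk the file one FULL-STOP-terminated chunk at a time and collect every run of digits.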

EDIT: Adding @Ranty's suggestion from the comments:

Or simply read a fixed number of lines, say 20. When a match is found inside the window, discard everything up to the end of that match and read another 20 lines. With this approach there is no need for a frequent "X".
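A rough sketch of this windowed variant, assuming no match spans more than one window. The name scan_in_windows and the overlap handling are my additions, not @Ranty's exact proposal: instead of tracking match offsets, it keeps a short tail of lines between windows and deduplicates, which is simpler but would merge genuinely repeated identical matches.

```ruby
# Sketch: scan a file in windows of N lines, keeping a small tail of
# lines between windows so a match crossing a window boundary is still
# seen. Matches in the tail are scanned twice, hence the final uniq.
def scan_in_windows(fname, regex, window_lines = 20, overlap_lines = 3)
  results = []
  window = []
  File.foreach(fname) do |line|
    window << line
    if window.size >= window_lines
      window.join.scan(regex) { |m| results << m }
      # keep a short tail so a boundary-spanning match survives
      window = window.last(overlap_lines)
    end
  end
  window.join.scan(regex) { |m| results << m } unless window.empty?
  results.uniq
end
```

At any moment only about window_lines lines are held in memory, regardless of the file size.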

