Regexp search on a very large file

I need to scan a 300 MB text file with a regular expression.

  • Reading the entire file into a variable eats more than 700 MB of RAM and then fails with the error "cannot allocate memory".
  • A match can span two or three lines, so I cannot simply iterate over the file line by line.

Is there a lazy way to scan the whole file with a regular expression without first reading it into a single variable?

UPD

Done. You can now use this function to read the file in pieces. Adapt it to your purposes.

 def prepare_session_hash(fname, regex, start = 0)
   @session_login_hash = {}
   File.open(fname, 'rb') do |f|
     fsize = f.size
     bsize = fsize / 8
     f.seek(start) if start > 0
     overlap = 200  # bytes re-read at each step so a match on a chunk boundary is not lost
     loop do
       f.seek(f.tell - overlap) if f.tell >= overlap && f.tell < fsize
       buffer = f.read(bsize)
       return @session_login_hash unless buffer
       # scan with the caller-supplied regexp; each match's first two
       # capture groups become a key/value pair
       buffer.scan(regex) { |match| @session_login_hash[match[0]] = match[1] }
     end
   end
 end
1 answer
  • Move through the file in chunks rather than line by line, where chunk boundaries are occurrences of a frequently occurring character or pattern, say "X".
  • Choose "X" so that it can never occur inside a match of your regular expression, i.e. your pattern will never span an "X".
  • Match your regex against the current chunk, collect the matches, and move on to the next chunk.

Example:

 This is string with multline numbers -2000 2223434 34356666 444564646 . These numbers can occur at 34345 567567 places, and on 67 87878 pages . The problem is to find a good way to extract these more than 100 0 regexes without memory hogging. 

In this text, suppose the desired pattern is a numeric string, e.g. /\d+/ to match runs of digits. Then, instead of loading and processing the whole file, you can pick a chunking pattern, say the FULL STOP "." in this case, read and process only up to its next occurrence, and then move on to the next chunk.

CHUNK # 1:

 This is string with multline numbers -2000 2223434 34356666 444564646 . 

CHUNK # 2:

 These numbers can occur at 34345 567567 places, and on 67 87878 pages 

etc.
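The chunking idea above can be sketched in a few lines of Ruby. This is a minimal illustration, not the answerer's code: the name scan_in_chunks is made up here, and it assumes the chosen delimiter never falls inside a match. Ruby's File.foreach accepts a custom separator, so each delimiter-terminated chunk is yielded without ever holding the whole file in memory.

```ruby
# Sketch: scan a large file chunk by chunk, where chunks end at a
# delimiter that the regex can never match across. The function name
# and default delimiter are illustrative assumptions.
def scan_in_chunks(fname, regex, delimiter = ".")
  results = []
  # File.foreach with a separator string yields one delimiter-terminated
  # chunk at a time instead of one line at a time.
  File.foreach(fname, delimiter) do |chunk|
    chunk.scan(regex) { |m| results << m }
  end
  results
end
```

For the sample text above, scan_in_chunks("big.txt", /\d+/) would walk the file one FULL-STOP-terminated chunk at a time and collect every run of digits.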

EDIT: Adding @Ranty's suggestion from the comments:

Or simply read a fixed number of lines, say 20. When a match is found inside the window, discard everything up to the end of that match and read another 20 lines. With this approach there is no need for a frequent "X".
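A rough sketch of this windowed variant, assuming no match spans more than one window. The name scan_in_windows and the overlap handling are my additions, not @Ranty's exact proposal: instead of tracking match offsets, it keeps a short tail of lines between windows and deduplicates, which is simpler but would merge genuinely repeated identical matches.

```ruby
# Sketch: scan a file in windows of N lines, keeping a small tail of
# lines between windows so a match crossing a window boundary is still
# seen. Matches in the tail are scanned twice, hence the final uniq.
def scan_in_windows(fname, regex, window_lines = 20, overlap_lines = 3)
  results = []
  window = []
  File.foreach(fname) do |line|
    window << line
    if window.size >= window_lines
      window.join.scan(regex) { |m| results << m }
      # keep a short tail so a boundary-spanning match survives
      window = window.last(overlap_lines)
    end
  end
  window.join.scan(regex) { |m| results << m } unless window.empty?
  results.uniq
end
```

At any moment only about window_lines lines are held in memory, regardless of the file size.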

