Trim a large log file

I am doing performance tests on several Java applications. During testing, the applications produce very large log files (7-10 GB). I need to trim these log files between specific dates and times. I currently use a Python script that parses each log timestamp into a datetime object and prints only the matching lines, but this solution is very slow: a 5 GB log takes about 25 minutes to process. Since the entries in the log file are sequential, I should not need to read the whole file line by line. I thought about reading the file from the beginning and from the end until the condition is met, and then printing the lines in between, but I do not know how to read the file from the end without loading it all into memory.

Can you suggest a solution for this case?

Here is the relevant part of the Python script:

lfmt = '%Y-%m-%d %H:%M:%S'
file = open(filename, 'rU')
normal_line = ''
for line in file:
    if line[0] == '[':
        ltimestamp = datetime.strptime(line[1:20], lfmt)
        if ltimestamp >= start and ltimestamp <= end:  # start/end are the datetime bounds
            normal_line = 'True'
        else:
            normal_line = ''
    if normal_line:
        print line,
+7
4 answers

Since the data is sequential, if the start and end of the region of interest are near the beginning of the file, then reading from the end of the file (to find a matching end point) is still a poor approach!

I have written some code that will quickly find the start and end points as needed. The approach is called binary search and is similar to the classic children's guessing game "higher or lower"!

The script reads a trial line halfway between lower_bound and upper_bound (initially the start and end of the file) and checks it against the match criteria. If the sought line is earlier, it guesses again by reading a line halfway between lower_bound and the previous trial (and if it is later, it splits between the trial and the upper bound). So you keep iterating between the upper and lower bounds - this gives the fastest possible "on average" solution.

This should make it a really quick solution (log base 2 of the number of lines!). For example, in the worst case of finding line 999 out of 1000 lines, binary search needs only about 10 reads (and a billion lines would need only about 30...).
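As a quick sanity check of those numbers (plain arithmetic, nothing specific to the script below):

import math

for n in (1000, 10**9):
    reads = int(math.ceil(math.log(n, 2)))   # worst-case number of bisection reads
    print("%d lines -> at most %d reads" % (n, reads))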

Assumptions for the code below:

  • Each line starts with time information.
  • The times are unique. If not, when a match is found you will need to step backwards or forwards to include or exclude all entries with the matching time, as required.
  • The search function is recursive, so the number of lines in your file is limited to 2 ** 1000 (fortunately, that allows for quite a large file...).

Further:

  • This can be adapted to read in arbitrary blocks, rather than line by line, if necessary, as suggested by J. F. Sebastian.
  • In my original answer I suggested this approach but using linecache.getline; while possible, it is not suitable for large files because it reads the whole file into memory (so file.seek() is superior) - thanks to TerryE and J. F. Sebastian for pointing this out.

import datetime

def match(line):
    lfmt = '%Y-%m-%d %H:%M:%S'
    if line[0] == '[':
        return datetime.datetime.strptime(line[1:20], lfmt)

def retrieve_test_line(position):
    file.seek(position, 0)
    file.readline()  # avoids reading a partial line, which would mess up the match attempt
    new_position = file.tell()  # gets the start-of-line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position, 0)
    new_position = file.tell()  # gets the start-of-line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = int((lower_bound + upper_bound) / 2)
    inspection_text, position = retrieve_test_line(trial)
    if position == upper_bound:
        text, position = check_lower_bound(lower_bound)
        if match(text) == target:
            return position
        return  # no match for target within range
    matched_position = match(inspection_text)
    if matched_position == target:
        return position
    elif matched_position < target:
        return find_line(target, position, upper_bound)
    elif matched_position > target:
        return find_line(target, lower_bound, position)
    else:
        return  # no match for target within range

lfmt = '%Y-%m-%d %H:%M:%S'
# start_target = first line you are trying to find:
start_target = datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
# end_target = last line you are trying to find:
end_target = datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)

file = open("log_file.txt", "r")
lower_bound = 0
file.seek(0, 2)  # find the upper bound (file size in bytes)
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)

if sequence_start or sequence_start == 0:  # allow for starting at zero - corner case
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if not sequence_end:
        print "start_target match: ", sequence_start
        print "end match is not present in the current file"
else:
    print "start match is not present in the current file"

if (sequence_start or sequence_start == 0) and sequence_end:
    print "start_target match: ", sequence_start
    print "end_target match: ", sequence_end
    print
    print start_target, 'target'
    file.seek(sequence_start, 0)
    print file.readline()
    print end_target, 'target'
    file.seek(sequence_end, 0)
    print file.readline()
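Once sequence_start and sequence_end have been found, the actual trimming is a plain chunked copy. A minimal sketch (the output file name and block size are illustrative; it assumes the variables from the code above are still in scope, and that offsets are byte positions as in that Python 2 code; one extra readline() pulls in the final matching line itself):

# assumes `file`, `sequence_start` and `sequence_end` from the code above
file.seek(sequence_end, 0)
file.readline()                      # step past the last matching line so it is included
slice_end = file.tell()

file.seek(sequence_start, 0)
out = open("trimmed.log", "w")
remaining = slice_end - sequence_start
while remaining > 0:
    block = file.read(min(1024 * 1024, remaining))  # copy in 1 MB blocks
    if not block:
        break
    out.write(block)
    remaining -= len(block)
out.close()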
+5

"A 5 GB log is analyzed for about 25 minutes"

That is ~3 MB/s, so even a sequential O(n) scan in Python could do much better (~500 MB/s for wc-l.py), i.e. performance should be limited only by I/O.
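As an illustration of that point, here is a minimal sketch (the file name and date bounds are placeholders) of a sequential filter that mirrors the question's logic but compares the timestamp text directly; since '%Y-%m-%d %H:%M:%S' strings sort chronologically, the per-line strptime call, the likely bottleneck, can be dropped:

import sys

start_s = '2012-02-01 13:10:00'   # placeholder range bounds, as text
end_s = '2012-02-01 13:39:00'

in_range = False
with open('log_file.txt') as f:
    for line in f:
        if line.startswith('['):
            # lexicographic comparison of the timestamp text, no datetime parsing
            in_range = start_s <= line[1:20] <= end_s
        if in_range:              # continuation lines inherit the last status
            sys.stdout.write(line)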

To perform a binary search on the file, you could adapt FileSearcher, which uses fixed-size records, to work on lines instead, using an approach similar to the tail -n implementation in Python (it is O(n) to scan for '\n').
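For reference, a rough sketch of that tail-like backward scan, reading the file from the end in fixed-size chunks via seek() so that only one chunk is in memory at a time (the chunk size and function names are illustrative):

import os

def read_backwards(path, chunk_size=64 * 1024):
    # yield chunks from the end of the file towards the start
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        while position > 0:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            yield f.read(step)

def last_lines(path, n):
    # collect the last n lines without loading the whole file
    tail = b''
    for chunk in read_backwards(path):
        tail = chunk + tail
        if tail.count(b'\n') > n:   # enough complete lines buffered
            break
    return tail.splitlines()[-n:]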

To avoid O(n) behaviour (when the date range selects only a small portion of the log), you could use an approximate search that works on large fixed-size chunks and allows some records to be missed because they lie on a chunk boundary; for example, use an unmodified FileSearcher with record_size=1MB and a custom Query class:

class Query(object):

    def __init__(self, query):
        self.query = query  # e.g., '2012-01-01'

    def __lt__(self, chunk):
        # assume a line starts with a date; find the start of the next full line
        i = chunk.find('\n')
        # assert '\n' in chunk and len(chunk) > (len(self.query) + i)
        # e.g., '2012-01-01' < '2012-03-01'
        return self.query < chunk[i+1:i+1+len(self.query)]

To take into account that the date range can span multiple chunks, you could modify FileSearcher.__getitem__ to return (filepos, chunk) and search twice (bisect_left(), bisect_right()) to find the approximate filepos_mindate and filepos_maxdate. Afterwards you could perform a linear search (for example, using the tail -n approach) around the given file positions to find the exact first and last log entries.
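A rough sketch of how that chunk-level bisection could look. ChunkedFile below is a hypothetical stand-in for FileSearcher (its real interface may differ), the file is read in binary mode, the date strings are placeholders, and this Query variant additionally defines __gt__ because bisect_left relies on the reflected comparison chunk < query:

import bisect
import os

class ChunkedFile(object):
    # hypothetical stand-in for FileSearcher: a read-only sequence of fixed-size chunks
    def __init__(self, f, chunk_size=1024 * 1024):
        self.f = f
        self.chunk_size = chunk_size
        f.seek(0, os.SEEK_END)
        self.size = f.tell()

    def __len__(self):
        return (self.size + self.chunk_size - 1) // self.chunk_size

    def __getitem__(self, i):
        self.f.seek(i * self.chunk_size)
        return self.f.read(self.chunk_size)

class Query(object):
    def __init__(self, query):
        self.query = query                    # e.g. b'2012-02-01 13:10'

    def _first_line_date(self, chunk):
        i = chunk.find(b'\n')                 # skip the partial line at the chunk start
        return chunk[i+1:i+1+len(self.query)]

    def __lt__(self, chunk):                  # query < chunk, used by bisect_right
        return self.query < self._first_line_date(chunk)

    def __gt__(self, chunk):                  # query > chunk, used (reflected) by bisect_left
        return self.query > self._first_line_date(chunk)

with open('log_file.txt', 'rb') as f:
    chunks = ChunkedFile(f)
    lo = bisect.bisect_left(chunks, Query(b'2012-02-01 13:10'))
    hi = bisect.bisect_right(chunks, Query(b'2012-02-01 13:39'))
    # approximate byte positions; scan linearly around them for the exact records
    filepos_mindate = max(lo - 1, 0) * chunks.chunk_size
    filepos_maxdate = min(hi, len(chunks)) * chunks.chunk_size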

+2

7 to 10 GB is a large amount of data. If I had to analyze data like that, I would either have the applications log directly to a database or upload the log files into a database. There are many analyses you can then run efficiently in the database. If you are using a standard logging framework such as Log4J, logging to a database should be fairly simple. Just suggesting an alternative solution.
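For illustration, a minimal sketch (the file and table names and the timestamp format are assumptions) of loading such a log into SQLite with the standard sqlite3 module and then selecting a date range with SQL:

import sqlite3

conn = sqlite3.connect('logs.db')
conn.execute('CREATE TABLE IF NOT EXISTS log (ts TEXT, line TEXT)')

# bulk-load: lines are assumed to start with "[YYYY-mm-dd HH:MM:SS"
def rows(path):
    with open(path) as f:
        ts = None
        for line in f:
            if line.startswith('['):
                ts = line[1:20]
            yield (ts, line.rstrip('\n'))   # continuation lines keep the last timestamp

conn.executemany('INSERT INTO log VALUES (?, ?)', rows('log_file.txt'))
conn.execute('CREATE INDEX IF NOT EXISTS log_ts ON log (ts)')
conn.commit()

# the trimming then becomes a plain range query
for ts, line in conn.execute(
        'SELECT ts, line FROM log WHERE ts BETWEEN ? AND ? ORDER BY rowid',
        ('2012-02-01 13:10:00', '2012-02-01 13:39:00')):
    print(line)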

You can read more about database logging in this post:

Good database logging application for Java?

+1

If you have access to a Windows environment, you can use MS LogParser to read the files and collect any information you may need. It uses SQL syntax, which makes this tool a joy to use. It also supports a large number of input types.

As an added bonus, it also supports the iCheckPoint switch, which creates a checkpoint file when working with sequential log files. For more information, see the topic on incremental parsing under "Advanced Features" in the Log Parser help.


0
