Because the data is sequential, if both the start and the end of the region of interest are near the beginning of the file, then reading backwards from the end of the file (to find a matching end point) is still a poor choice!
I've written some code that quickly finds the start and end points as required. The approach is called binary search, and it is similar to the classic children's "higher or lower" guessing game!
The script reads a trial line midway between lower_bound and upper_bound (initially the SOF and EOF) and checks it against the match criteria. If the sought line is earlier, it guesses again, reading the line halfway between lower_bound and the previous trial position (if the sought line is later, it splits the range between its guess and upper_bound instead). You keep iterating like this, narrowing the gap between the upper and lower bounds, which gives the fastest possible "on average" solution.
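To make the guessing loop concrete, here is a minimal sketch of the same idea on an in-memory sorted list rather than a file. The function name binary_search and the probe counter are only for illustration and are not part of the script in this answer.

```python
def binary_search(sorted_items, target):
    """Return (index, probes) for target, or (None, probes) if it is absent."""
    lower, upper = 0, len(sorted_items)  # search the half-open range [lower, upper)
    probes = 0
    while lower < upper:
        middle = (lower + upper) // 2
        probes += 1
        if sorted_items[middle] == target:
            return middle, probes
        if sorted_items[middle] < target:
            lower = middle + 1  # target is later: discard the lower half
        else:
            upper = middle      # target is earlier: discard the upper half
    return None, probes


timestamps = list(range(1000))  # stand-in for 1000 sorted, unique timestamps
index, probes = binary_search(timestamps, 998)
print(index, probes)
```

The file-based version works the same way, except that the bounds are byte offsets and each probe has to skip forward to the start of the next full line.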
This should be a really quick solution (it takes about log to the base 2 of the number of lines!). For example, in the worst case (finding line 999 out of 1000 lines), binary search needs only about 10 line reads. (From a billion lines it would take only about 30 ...)
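As a rough check of those figures, the sketch below counts the worst-case number of probes for a textbook binary search over n items, which is floor(log2(n)) + 1. The helper name worst_case_reads is made up for this example, and the count assumes one read per probe.

```python
def worst_case_reads(line_count):
    # Each probe halves the remaining range, so at most floor(log2(n)) + 1
    # probes are needed; int.bit_length() gives exactly that for n >= 1.
    return line_count.bit_length()


for n in (1000, 1_000_000_000):
    print(n, worst_case_reads(n))
```

The script in this answer bisects byte offsets rather than line numbers, so its actual read count depends on the file size in bytes, but the growth is logarithmic in the same way.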
Assumptions for the code below:
- Each line starts with time information.
- The times are unique. If not, then when a match is found you will have to check backwards or forwards to include or exclude all entries with the matching time, as appropriate (if required).
- The implementation is a recursive function, and Python's default recursion limit is 1000, so the number of lines in your file is limited to about 2 ** 1000 (fortunately, this allows for a fairly large file ...)
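As a side note on the uniqueness assumption: the sketch below shows, on a small in-memory list, how the first and last of several entries with the same time can be located. It uses the standard bisect module instead of the file-seeking code in this answer, and the sample timestamps are invented for the example.

```python
import bisect
import datetime

fmt = '%Y-%m-%d %H:%M:%S'
times = [datetime.datetime.strptime(s, fmt) for s in (
    "2012-02-01 13:09:00",
    "2012-02-01 13:10:00",
    "2012-02-01 13:10:00",
    "2012-02-01 13:10:00",
    "2012-02-01 13:11:00",
)]
target = datetime.datetime.strptime("2012-02-01 13:10:00", fmt)

first = bisect.bisect_left(times, target)   # index of the first entry >= target
after = bisect.bisect_right(times, target)  # index just past the last entry == target
print(first, after)
```

Slicing times[first:after] includes every duplicate of the target time; on a file the same effect needs a short scan backwards or forwards from the matched position.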
Further:
- This can be adapted to read in arbitrary blocks, rather than line by line, if necessary, as suggested by J. F. Sebastian.
- In my original answer I suggested this approach but using linecache.getline. While that is possible, it is not appropriate for large files because it reads the whole file into memory (so file.seek() is superior). Thanks to TerryE and J. F. Sebastian for pointing that out.
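For the block-based variant mentioned above, one possible shape is sketched here: read fixed-size blocks from an arbitrary byte offset and pick out the first complete line that starts after it. The name first_full_line and the block_size parameter are hypothetical, and this is only an illustration of the idea, not the approach J. F. Sebastian proposed in detail.

```python
import io


def first_full_line(stream, offset, block_size=4096):
    """Return (start, line) for the first complete line beginning after offset.

    stream must be opened in binary mode. Returns (None, b"") if no newline
    follows offset, and (eof_position, b"") if offset is inside the last line.
    """
    stream.seek(offset)
    buffer = b""
    while True:
        block = stream.read(block_size)
        buffer += block
        first_newline = buffer.find(b"\n")  # end of the partial line we landed in
        if first_newline != -1:
            second_newline = buffer.find(b"\n", first_newline + 1)
            if second_newline != -1 or not block:
                start = offset + first_newline + 1
                end = second_newline + 1 if second_newline != -1 else len(buffer)
                return start, buffer[first_newline + 1:end]
        if not block:  # end of file reached without finding a line start
            return None, b""


sample = io.BytesIO(b"[a] one\n[b] two\n[c] three\n")
print(first_full_line(sample, 3, block_size=4))
```

A tiny block_size is used in the example only to exercise the loop; for a real log file a block of a few kilobytes would normally cover a whole line in one read.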
    import datetime

    lfmt = '%Y-%m-%d %H:%M:%S'

    def match(line):
        # Lines are expected to look like b"[2012-02-01 13:10:00] ...".
        # Returns the parsed timestamp, or None if the line has no leading '['.
        if line.startswith(b'['):
            return datetime.datetime.strptime(line[1:20].decode('ascii'), lfmt)

    def retrieve_test_line(position):
        file.seek(position, 0)
        file.readline()  # avoids reading a partial line, which would mess up the match attempt
        new_position = file.tell()  # start position of the next full line
        return file.readline(), new_position

    def scan_range(target, lower_bound, upper_bound):
        # Fallback when the range is too small to bisect: check every line
        # that starts in [lower_bound, upper_bound), including the first one.
        file.seek(lower_bound, 0)
        position = lower_bound
        while position < upper_bound:
            if match(file.readline()) == target:
                return position
            position = file.tell()
        return None  # no match for target within range

    def find_line(target, lower_bound, upper_bound):
        trial = (lower_bound + upper_bound) // 2
        inspection_text, position = retrieve_test_line(trial)
        if position >= upper_bound:
            # No full line starts between trial and upper_bound.
            return scan_range(target, lower_bound, upper_bound)
        matched_time = match(inspection_text)
        if matched_time is None:
            return None  # line has no timestamp, so it cannot be compared
        if matched_time == target:
            return position
        elif matched_time < target:
            return find_line(target, position, upper_bound)
        else:
            return find_line(target, lower_bound, position)

    # first line you are trying to find:
    start_target = datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
    # last line you are trying to find:
    end_target = datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)

    file = open("log_file.txt", "rb")  # binary mode, so seek()/tell() are plain byte offsets
    lower_bound = 0
    file.seek(0, 2)  # seek to the end of the file to find the upper bound
    upper_bound = file.tell()

    sequence_start = find_line(start_target, lower_bound, upper_bound)
    sequence_end = None

    if sequence_start is not None:  # a match at position zero is valid - corner case
        sequence_end = find_line(end_target, sequence_start, upper_bound)
        if sequence_end is None:
            print("start_target match: ", sequence_start)
            print("end match is not present in the current file")
    else:
        print("start match is not present in the current file")

    if sequence_start is not None and sequence_end is not None:
        print("start_target match: ", sequence_start)
        print("end_target match: ", sequence_end)
        print()
        print(start_target, 'target')
        file.seek(sequence_start, 0)
        print(file.readline().decode())
        print(end_target, 'target')
        file.seek(sequence_end, 0)
        print(file.readline().decode())

    file.close()