I iterate over the lines of a large number of downloaded text files and run a regular expression match on each line. Usually a match takes less than a second. However, at times a match takes several minutes, and sometimes it does not finish at all and the code just hangs (I waited an hour a couple of times, then gave up). So I need to introduce some kind of timeout and tell the regex match code to stop after 10 seconds or so. I can live with losing the data that the regular expression was supposed to return.
I tried the following (note that this already shows two different thread-based approaches in one code sample):
    from threading import Thread, Timer

    def timeout_handler():
        print('timeout_handler called')

    if __name__ == '__main__':
        timer_thread = Timer(8.0, timeout_handler)
        # args must be a tuple of positional arguments for parse_data_files
        parse_thread = Thread(target=parse_data_files, args=my_args)
        timer_thread.start()
        parse_thread.start()
        parse_thread.join(12.0)
        print('do we ever get here ?')
but I get neither the "timeout_handler called" line nor the "do we ever get here ?" line in the output; the code just gets stuck in parse_data_files.
Even worse, I can't even stop the program with CTRL-C; instead I have to look up the Python process ID and kill that process. Some research showed that the Python developers are aware of runaway regular expression code: http://bugs.python.org/issue846388
I did have some success using signals:
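    from signal import signal, alarm, SIGALRM

    # timeout_handler must accept (signum, frame) when used as a signal handler
    signal(SIGALRM, timeout_handler)
    alarm(8)
    data_sets = parse_data_files(config(), data_provider)
    alarm(0)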
This gets me the "timeout_handler called" line in the output, and I can still stop my script with CTRL-C. If I now change timeout_handler as follows:
    class TimeoutException(Exception):
        pass

    def timeout_handler(signum, frame):
        raise TimeoutException()
and wrap the actual call to re.match(...) in try ... except TimeoutException, the regex match does get interrupted. Unfortunately, this only works in my simple, single-threaded sandbox script that I use to try things out. There are several problems with this solution:
- the signal fires only once; if there is more than one problem line, I am stuck on the second one.
- the timer starts counting as soon as alarm() is called, not when the parsing starts.
- due to the GIL, I have to do all the signal setup in the main thread, and signals are only received in the main thread. This clashes with the fact that several files are meant to be parsed simultaneously in separate threads. Also, only one global timeout exception is raised, and I do not see how to find out in which thread I should react to it.
- I have read several times that threads and signals do not mix very well.
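To make the first two points concrete, here is a minimal sketch of re-arming the timer around every single match instead of once around the whole parse. It assumes Unix, Python 3, and that everything runs in the main thread, so it does nothing for the threading problems; `time_limit` and `match_or_none` are hypothetical helper names, not part of my real code.

```python
import re
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    pass


def _raise_timeout(signum, frame):
    # Signal handlers are called with (signum, frame).
    raise TimeoutException()


@contextmanager
def time_limit(seconds):
    """Arm a fresh SIGALRM timer for the enclosed block (Unix, main thread only)."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # accepts fractional seconds
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel a timer that has not fired
        signal.signal(signal.SIGALRM, previous)  # restore the earlier handler


def match_or_none(pattern, line, seconds=10):
    """Return the match object, or None on no match or when the timer fires."""
    try:
        with time_limit(seconds):
            return pattern.match(line)
    except TimeoutException:
        return None
```

Calling `match_or_none(compiled_pattern, line)` once per line gives each line its own timer, so a second problem line is also cut off. A no-match and a timeout both come back as None here, which is fine for me because I can afford to lose that line either way.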
I also considered running the regex in a separate process, but before I go down that road, I thought it would be better to check here whether someone has run into this problem before and could give me some hints on how to solve it.
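What I have in mind for the separate-process idea is roughly the following. This is only a sketch under assumptions: Python 3 on a Unix-like system (it uses the "fork" start method), and `match_in_subprocess` and `_worker` are hypothetical names. Match objects cannot be pickled, so the child sends back the groups tuple instead.

```python
import multiprocessing
import re


def _worker(conn, pattern, line):
    # Runs in the child process; send back plain data, not the match object.
    m = re.match(pattern, line)
    conn.send(m.groups() if m else None)
    conn.close()


def match_in_subprocess(pattern, line, seconds=10):
    """Return the groups tuple, None for no match, or raise TimeoutError."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    parent_conn, child_conn = ctx.Pipe(duplex=False)  # (receive end, send end)
    proc = ctx.Process(target=_worker, args=(child_conn, pattern, line))
    proc.start()
    child_conn.close()  # the parent only reads
    try:
        if parent_conn.poll(seconds):  # wait for a result, at most `seconds`
            return parent_conn.recv()
        raise TimeoutError("regex did not finish within %s s" % seconds)
    finally:
        if proc.is_alive():
            proc.terminate()  # kills a runaway match from the outside
        proc.join()
        parent_conn.close()
```

Since the child is killed from the outside, this should not depend on signals reaching the main thread, so it could be called from any of the parser threads. The obvious drawback is the cost of starting one process per line; handing a whole file to each child would probably be more realistic.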
Update
The regular expression looks like this (one of them, anyway; the problem occurs with other regexes too, this is just the simplest one):
'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'
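To make the shape of that pattern explicit, here is a small sketch that assembles it exactly as above (only the raw-string prefixes are added) and inspects the result. The variable names are just for illustration.

```python
import re

# Pattern from above: 6 leading fields, 37 repeated lazy fields, 1 final field.
pattern = (r'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,'
           + 37 * r' (.*?),'
           + r' (.*?)$')
compiled = re.compile(pattern)

print(compiled.groups)     # number of capturing groups: 41
print(pattern.count(','))  # literal commas in the pattern: 43
```

So a line has to contain at least 43 ", " separators, that is 44 fields, before this pattern can match at all.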
sample data:
95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -
As I said, the regex usually runs fine: I can parse several hundred files with several hundred lines each in less than a minute. That is when the files are complete, though. The code seems to hang on files with incomplete lines, for example:
`95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999
I also get cases where the regex seems to return immediately and reports no match.
Update 2
I just skimmed the catastrophic backtracking article, but as far as I can tell so far, that is not the cause here: I am not nesting any repetition operators.
I am on Mac OS X, so I cannot use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl regex engine internally), and that runs away, too.