Python: how to abort regex

I iterate over the lines of a large number of loaded text files and run a regular expression match on each line. Usually, a match takes less than a second. At times, however, a match takes several minutes; sometimes the match does not finish at all and the code just hangs (I waited an hour a couple of times, then gave up). So I need some kind of timeout that tells the regex matching code to stop after 10 seconds or so. I can live with losing the data the regular expression would have returned.

I tried the following (these are two different thread-based solutions shown in one code example):

    from threading import Thread, Timer

    def timeout_handler():
        print('timeout_handler called')

    if __name__ == '__main__':
        timer_thread = Timer(8.0, timeout_handler)
        parse_thread = Thread(target=parse_data_files, args=(my_args,))
        timer_thread.start()
        parse_thread.start()
        parse_thread.join(12.0)
        print('do we ever get here ?')

but I get neither the timeout_handler called line nor do we ever get here ? in the output; the code just stays stuck in parse_data_files .

Even worse, I cannot even stop the program with CTRL-C ; instead I have to look up the Python process id and kill the process. Some research showed that the Python developers are aware of regular expression code running away: http://bugs.python.org/issue846388

I did have some success using signals:

    from signal import signal, alarm, SIGALRM

    signal(SIGALRM, timeout_handler)
    alarm(8)
    data_sets = parse_data_files(config(), data_provider)
    alarm(0)

this gives me the timeout_handler called line in the output - and I can still stop my script with CTRL-C . If I now change timeout_handler as follows:

    class TimeoutException(Exception):
        pass

    def timeout_handler(signum, frame):
        raise TimeoutException()

and wrap the actual re.match(...) call in try ... except TimeoutException , the regular expression match is actually aborted. Unfortunately, this only works in my simple single-threaded sandbox script, which I use to try things out. There are several problems with this solution (a combined sketch follows the list below):

  • the alarm fires only once; if there is more than one problematic line, I get stuck on the second one.
  • the alarm starts counting immediately, not when the actual parsing begins.
  • due to the GIL, I have to do all the signal set-up in the main thread, and signals are only received in the main thread; this clashes with the fact that several files are meant to be parsed simultaneously in separate threads - there is also only one global timeout exception, and I cannot see how to tell in which thread I need to react to it.
  • I have read in several places that threads and signals do not mix very well.
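
For reference, here is a minimal single-threaded sketch combining the pieces above (the safe_match wrapper name is mine, not an existing API). It re-arms the alarm for every line, which would address the first two points, but it still only works in the main thread on Unix, exactly as the third point says:

    import re
    from signal import signal, alarm, SIGALRM

    class TimeoutException(Exception):
        pass

    def timeout_handler(signum, frame):
        raise TimeoutException()

    signal(SIGALRM, timeout_handler)   # must happen in the main thread

    def safe_match(pattern, line, seconds=10):
        alarm(seconds)                 # (re)arm the alarm for this line
        try:
            return re.match(pattern, line)
        except TimeoutException:
            return None                # give up on this line, keep going
        finally:
            alarm(0)                   # cancel the alarm on the fast path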

I have also considered running the regex in a separate process, but before I dive into that, I thought I had better check here whether anyone has run into this problem before and can give me some hints on how to solve it.

Update

the regular expression looks like this (the problem also arises with other regular expressions; this is the simplest one):

r'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * r' (.*?),' + r' (.*?)$'

sample data:

95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -

As said, the regex usually performs fine - I can parse several hundred files, each with several hundred lines, in less than a minute. That is when the files are complete; the code seems to freeze on files with incomplete lines, such as:

95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999

I also get cases where the regex seems to return immediately and reports a non-match.

Update 2

I just had a quick read of the article on catastrophic backtracking, but as far as I can tell, that is not the cause here - I am not nesting any repetition operators.

I am on Mac OS X, so I cannot use RegexBuddy to analyse my regex. I tried RegExhibit (which apparently uses the Perl regex engine internally) - and that runs away too.

4 answers

You are facing catastrophic backtracking; not because of nested quantifiers, but because the characters your quantifiers match can also match the separators, and since there are a lot of them, you get exponential time in certain cases.

Apart from this being more of a job for a CSV parser, try the following:

 r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]+),' + r' ([^,]+)$' 

By explicitly disallowing the comma to match between the separators, you speed up the regex enormously.

If commas may occur inside quoted strings, simply replace [^,]+ (in the places where you expect this) with

 (?:"[^"]*"|[^,]+) 

To illustrate:

Using your regex on the first (complete) example, RegexBuddy reports a successful match after 793 steps of the regex engine. On the second (incomplete) example, it reports a match failure after 1,000,000 steps of the regex engine (which is where RegexBuddy gives up; Python will keep churning much longer).

Using my regex, a successful match occurs in 173 steps, a failure in 174.
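
As a rough sanity check (not RegexBuddy's test data - the lines below are synthetic, built in the shape of the samples in the question), the tightened pattern both matches and fails quickly:

    import re

    pattern = re.compile(
        r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,'
        + 37 * r' ([^,]+),' + r' ([^,]+)$'
    )

    # a complete 44-field line and a truncated one, shaped like the samples
    fields = ['95756', '"KURN "', '20110311', '2130'] + ['-9999'] * 39 + ['-']
    complete = ', '.join(fields)
    truncated = ', '.join(fields[:36])

    print(bool(pattern.match(complete)))    # True
    print(bool(pattern.match(truncated)))   # False, and it fails fast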


Instead of trying to fix the problem of freezing regular expressions with timeouts, it might pay off to consider a completely different approach. If your data really is comma-separated values, you should get much better performance with the csv module, or simply with line.split(",") .
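
For illustration, a minimal sketch of the csv route (the sample line is shortened here); splitting is linear in the line length, so no timeout is needed:

    import csv

    line = '95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, -9999'

    # csv.reader accepts any iterable of lines; skipinitialspace strips the
    # blank after each comma, and the quotes around "YMGD " are handled for us
    for row in csv.reader([line], skipinitialspace=True):
        print(row)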


You cannot do this with threads. Go ahead with your idea of doing the match in a separate process.
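
A minimal sketch of what that could look like (the match_with_timeout helper and its names are made up for illustration; on Windows, the calling code needs to live under an if __name__ == '__main__' guard):

    import re
    from multiprocessing import Process, Queue

    def match_worker(pattern, line, queue):
        # runs in a child process, so it can be killed if the match runs away
        queue.put(bool(re.match(pattern, line)))

    def match_with_timeout(pattern, line, seconds=10):
        queue = Queue()
        proc = Process(target=match_worker, args=(pattern, line, queue))
        proc.start()
        proc.join(seconds)
        if proc.is_alive():      # still matching after the deadline
            proc.terminate()     # kill the runaway match
            proc.join()
            return None          # caller treats None as "timed out"
        return queue.get()

Spawning a process per line is expensive, so in practice you would hand a whole file to each worker process; the terminate-on-timeout idea stays the same.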


Threading in Python is a strange beast. The Global Interpreter Lock is essentially one big lock around the interpreter, which means that only one thread at a time gets to execute inside the interpreter.

Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of "instructions". So when Python is busy inside the regular expression engine, it never reaches the point where it tells the OS that another thread may try to take the lock. Hence the suggestion to use signals; they are the only way to interrupt.
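
To make that "certain number of instructions" concrete (assuming CPython; the defaults below are the documented ones):

    import sys

    # CPython 2.x: a thread switch is considered every N bytecode instructions
    print(sys.getcheckinterval())       # default: 100

    # CPython 3.2+ uses a time-based interval instead:
    # print(sys.getswitchinterval())    # default: 0.005 seconds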

I am with Nosco: go ahead and use separate processes. Or try to rewrite the regular expression so that it does not run away; see the problems associated with backtracking. Backtracking may or may not be the cause of the poor regex performance, and changing your regex may not be possible. But if it is the cause and the regex can be changed, you will save yourself a lot of headache by avoiding multiple processes.

