I wrote a Python utility to scan log files for known error patterns.
I tried to speed up the search by giving the regular expression more information about the pattern. For example, I am not only looking for lines containing gold; I also require that such a line start with an underscore, so I use ^_.*gold instead of gold.
Since 99% of lines do not start with an underscore, I expected a big performance gain, because the regex engine could give up on a line after reading just one character. I was surprised to find otherwise.
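To make the two patterns concrete, here is a minimal sketch of what each one accepts. Only the patterns ^_.*gold and gold come from my utility; the sample lines and the helper name matching_lines are made up for illustration.

```python
import re

# The two patterns from the text: the anchored one adds the requirement
# that the line start with an underscore.
anchored = re.compile(r"^_.*gold")
plain = re.compile(r"gold")

# Hypothetical sample lines, not taken from any real log.
SAMPLE_LINES = [
    "_error: gold threshold exceeded",
    "no underscore here, but gold appears",
    "_warning: nothing interesting",
    "an ordinary line",
]


def matching_lines(pattern, lines):
    """Return the lines in which pattern.search() finds a match."""
    return [line for line in lines if pattern.search(line)]


print(matching_lines(anchored, SAMPLE_LINES))
print(matching_lines(plain, SAMPLE_LINES))
```

The anchored pattern accepts a subset of what the plain one accepts, which is why I assumed it could only be cheaper to evaluate.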
The following program illustrates the problem:
import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_")
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = re.search(p, line)
        end = time()
        print 'Elapsed: ' + str(end-start)

main()
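For comparison, here is a rough variant of the same benchmark using the standard timeit module. It is a sketch under my own assumptions, not a definitive measurement: the function name time_pattern and the repetition count are arbitrary choices, and it calls each compiled pattern's own search method rather than re.search(p, line), so the module-level function call is left out of the timing. The numbers will vary by machine and Python version.

```python
import re
import timeit

LINE = r'I do not start with an underscore 123456789012345678901234567890'


def time_pattern(pattern_text, line=LINE, number=100000):
    """Return the seconds taken by `number` calls of compiled.search(line)."""
    compiled = re.compile(pattern_text)
    # timeit.timeit accepts a callable, so no statement string is needed.
    return timeit.timeit(lambda: compiled.search(line), number=number)


if __name__ == "__main__":
    for text in (r"^_", r"abcdefghijklmnopqrstuvwxyz"):
        print("%-30s %.3f s" % (text, time_pattern(text)))
```

Neither pattern matches the test line, so both timings measure only how long the engine takes to report a failed search.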
I tried looking at sre_compile.py for an explanation, but its code was too hairy for me.
, x8, , , (22 Vs 6 ).