Slower regex search when an initial character is specified seems illogical

I wrote a Python utility to scan log files for known error patterns.

I tried to speed up the search by giving the regular expression more information about the pattern. For example, I am not only looking for lines containing gold; I also require that such a line start with an underscore, so the pattern becomes ^_.*gold instead of gold.

Since 99% of lines do not start with an underscore, I expected a big performance gain, because the regex engine could stop reading a line after just one character. I was surprised to find otherwise.
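To make the two patterns concrete, here is a minimal sketch of what each one accepts. It uses Python 3 syntax, unlike the Python 2 listings in this post, and the sample lines are made up for illustration rather than taken from a real log.

```python
import re

anchored = re.compile(r"^_.*gold")  # underscore first, then "gold" somewhere later
plain = re.compile(r"gold")         # "gold" anywhere in the line

# Hypothetical sample lines, invented for this sketch.
samples = [
    "_warning: gold reserve low",
    "no underscore, but gold is here",
    "_underscore but nothing else",
]

# For each line, record (anchored pattern found?, plain pattern found?).
results = [(bool(anchored.search(s)), bool(plain.search(s))) for s in samples]
for s, r in zip(samples, results):
    print(r, s)
```

Only the first sample satisfies both patterns; the second satisfies the plain pattern alone, and the third satisfies neither.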

The following program illustrates the problem:

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = re.search(p, line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

I tried looking at sre_compile.py for an explanation, but its code was too hairy for me.

Why is the anchored search slower?

The first pattern was much slower than the second: a factor of x8 in one case, and 22 vs 6 in another.



To anchor at the start of the line, use match, not search. Also, instead of re.match(pattern, line), call pattern.match(line) on the compiled pattern.

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = p.match(line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()
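The difference between the module-level function and the compiled pattern's own method can be measured with the standard timeit module. The following is only a sketch in Python 3 syntax: the helper names via_module and via_method are hypothetical, the iteration count is reduced from the post's 1000*1000 to keep the run short, and the timings will vary by machine and Python version.

```python
import re
import timeit

line = "I do not start with an underscore 123456789012345678901234567890"
pattern = re.compile(r"_")

def via_module():
    # Module-level call: re.match has to resolve its pattern argument on every call.
    return re.match(pattern, line)

def via_method():
    # Method call on the already compiled pattern object.
    return pattern.match(line)

n = 100000  # assumption: fewer iterations than in the post
t_module = timeit.timeit(via_module, number=n)
t_method = timeit.timeit(via_method, number=n)
print("re.match(pattern, line): %.3f s" % t_module)
print("pattern.match(line):     %.3f s" % t_method)
```

Both calls return None for this line, since match only tries position 0 and the line does not begin with an underscore.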


if line.startswith("_") and "gold" in line:
    print "Yup, it starts with an underscore and contains gold"
else:
    print "Nope, it doesn't"
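The same check can be wrapped in a function and compared against the regular expression from the question. This is a sketch in Python 3 syntax; the function name and the sample lines are hypothetical, and it assumes each line is a single line with no embedded newline (where `.` would behave differently).

```python
import re

def starts_with_underscore_and_has_gold(line):
    # startswith also copes with an empty line, where it simply returns False.
    return line.startswith("_") and "gold" in line

pattern = re.compile(r"^_.*gold")

# Hypothetical sample lines, including an empty one.
samples = ["_gold", "_some gold here", "gold_", "", "_silver"]
flags = [starts_with_underscore_and_has_gold(s) for s in samples]
regex_flags = [bool(pattern.search(s)) for s in samples]
print(flags)
print(flags == regex_flags)
```

On these samples the plain string test and the regular expression agree, which is what makes the string test a candidate replacement for simple patterns like this one.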


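Interesting! It seems that the regexp engine does not give up after the first character fails, and instead keeps retrying at later positions in the line. That would only be needed with re.MULTILINE.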
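For reference, this is what re.MULTILINE changes about the ^ anchor. The sketch below uses Python 3 syntax and a made-up two-line string; it shows the semantics only and says nothing about how the engine is implemented internally.

```python
import re

text = "first line\n_second line"

# Without flags, ^ can only match at the very start of the string.
default_hit = re.search(r"^_", text)

# With re.MULTILINE, ^ also matches immediately after each newline.
multiline_hit = re.search(r"^_", text, re.MULTILINE)

print(default_hit)            # None
print(multiline_hit.start())  # 11
```

Without the flag, position 0 is the only place where ^_ could ever succeed, which is why one would expect a search to be able to stop early.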

With re.match for the first pattern and re.search for the second, the timing code becomes:

import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_.*") # with match, requires underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")

    # time match (anchored at the start of the line) for the first pattern
    start = time()
    for i in xrange(1000*1000):
        match = re.match(p1, line)
    end = time()
    print 'Elapsed: ' + str(end-start)
    # time search (anywhere in the line) for the second pattern
    start = time()
    for i in xrange(1000*1000):
        match = re.search(p2, line)
    end = time()
    print 'Elapsed: ' + str(end-start)

main()


The timing loop can also call search directly on each compiled pattern:

for p in patterns:
    start = time()
    for i in xrange(1000*1000):
        match = p.search(line)
    end = time() 
    print 'Elapsed: ' + str(end-start)

...


, . , . , search match, ( , ).


