Python regex string

In Python regular expressions

re.compile("x"*50000) 

gives me OverflowError: regular expression code size limit exceeded

but the next one does not fail, although it pushes the CPU to 100% and takes about 1 minute on my PC.

 >>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
 <_sre.SRE_Pattern object at 0x03FB0020>

Is this normal?

Does Python assume that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?

Tested on Python 2.6, Win32

UPDATE 1 :

It seems that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to .*?

So how about this?

 re.compile(".*?x"*50000) 

It compiles, and if it can also be reduced to ".*?x", it should match the strings "abcx" or "x", but it does not match them.

So, am I missing something?
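A small-scale check makes the difference visible. This is only a sketch: it assumes 3 repetitions behave in principle like 50000, and the sample strings are made up for illustration.

```python
import re

# ".*?x" repeated 3 times: every copy has to consume its own "x".
repeated = re.compile(".*?x" * 3)
single = re.compile(".*?x")

# The single pattern is satisfied by a string containing one "x".
print(single.match("abcx") is not None)      # True
# The repeated pattern needs at least three "x" characters.
print(repeated.match("abcx") is not None)    # False
print(repeated.match("axbxcx") is not None)  # True
```

Unlike ".*?", which can match the empty string any number of times, ".*?x" always consumes one "x", so repeating it changes what the pattern matches.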

UPDATE 2 :

My point is not to find out the maximum length of the regex source. I would like to know the reason why "x"*50000 is caught by the overflow handler but ".*?x"*50000 is not.

That doesn't make sense to me, that's why.

Is something missing in the overflow check, is this behavior just fine, or is something really overflowing?

Any advice / opinions would be appreciated.

+4
2 answers

The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?" , and "x"*50000 should generate 50,000 nodes in the FSM (or a similar structure used by the regular expression engine).

EDIT: Alright, I was wrong. It is not that smart. The reason "x"*50000 fails but ".*?x"*50000 does not is that there is a limit on the size of a single “code element”. "x"*50000 generates one long element, while ".*?x"*50000 generates many small elements. If you could somehow break up the string literal without changing the meaning of the regex, it would work, but I can't figure out how to do it.
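One possible way to split a long run into smaller pieces is nested repetition. The sketch below is an assumption, not something verified on Python 2.6: `chunked_repeat` and its `chunk` ceiling are hypothetical names chosen for illustration, and whether the result stays under the engine's limit depends on the interpreter version.

```python
import re

def chunked_repeat(atom, total, chunk):
    """Build a pattern matching `atom` exactly `total` times via nested repeats.

    `atom` must be a single character or a group; `chunk` is a
    caller-chosen ceiling for each inner repeat count (hypothetical).
    """
    full, rest = divmod(total, chunk)
    parts = []
    if full:
        parts.append("(?:%s{%d}){%d}" % (atom, chunk, full))
    if rest:
        parts.append("%s{%d}" % (atom, rest))
    return "".join(parts)

pattern = chunked_repeat("x", 12, 5)
matcher = re.compile(pattern + r"\Z")  # \Z anchors at the end of the string

print(pattern)                               # (?:x{5}){2}x{2}
print(matcher.match("x" * 12) is not None)   # True
print(matcher.match("x" * 11) is not None)   # False
```

The same helper could be called with larger numbers, for example `chunked_repeat("x", 50000, 1000)`, to try the idea against a real limit.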

+6

You want to match 50,000 "x"s, right? If so, here is an alternative without a regular expression:

 if "x"*50000 in mystring:
     print("found")

If you want to match 50,000 "x"s with a regex, you can use a repeat count:

 >>> pat=re.compile("x{50000}")
 >>> pat.search(s)
 <_sre.SRE_Match object at 0xb8057a30>

On my system, the largest repeat count it accepts is 65535:

 >>> pat=re.compile("x{65536}")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/python2.6/re.py", line 188, in compile
     return _compile(pattern, flags)
   File "/usr/lib/python2.6/re.py", line 241, in _compile
     p = sre_compile.compile(pattern, flags)
   File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
     groupindex, indexgroup
 RuntimeError: invalid SRE code
 >>> pat=re.compile("x{65535}")
 >>>

I don't know if there are tweaks in Python that we can use to increase this limit.
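To see where the limit sits on a given interpreter, a small probe can help. This is a minimal sketch: `compiles` is a hypothetical helper, and the set of exceptions it catches is an assumption based on the errors shown in this thread plus `re.error`. The results differ between Python versions, so no expected output is given.

```python
import re

def compiles(pattern):
    """Return True if re.compile accepts the pattern on this interpreter."""
    try:
        re.compile(pattern)
    except (OverflowError, RuntimeError, re.error):
        # Assumed failure modes: size or repeat limits, or a malformed pattern.
        return False
    return True

# Probe the repeat counts discussed above.
for count in (50000, 65535, 65536):
    print(count, compiles("x{%d}" % count))
```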

+1