I believe the regex

<bla><blabla>87765.*?</blabla><bla>

can cause catastrophic backtracking. Instead, use:

<bla><blabla>87765[^<]*</blabla><bla>

With atomic grouping (as far as I know, Python's re module only supports this from version 3.11 onward), the above regex becomes

<bla><blabla>(?>(.*?<))/blabla><bla>
Everything between (?> and ) is treated as one single token by the regex engine once the engine leaves the group. Because the entire group is one token, no backtracking into it can take place once the engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the whole regex must be retried at the next position in the string. Note that I needed to include the "<" in the group to ensure atomicity. Close enough.
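Here is a rough Python sketch that compares the three patterns. The sample strings are made up for illustration, and compile_atomic is just a hypothetical helper: it returns None on Python versions whose re module rejects (?>...), which I would expect before 3.11.

```python
import re

# Made-up sample inputs, only for illustration.
text = "<bla><blabla>87765 some text</blabla><bla>"
nested = "<bla><blabla>87765 a<b>c</blabla><bla>"

# Original pattern: the lazy dot may keep expanding past any "<".
lazy = re.compile(r"<bla><blabla>87765.*?</blabla><bla>")

# Suggested replacement: the negated class can never run past a "<".
negated = re.compile(r"<bla><blabla>87765[^<]*</blabla><bla>")


def compile_atomic():
    """Hypothetical helper: compile the atomic-group variant from the answer,
    or return None if this Python's re module does not accept (?>...)."""
    try:
        return re.compile(r"<bla><blabla>(?>(.*?<))/blabla><bla>")
    except re.error:
        return None


atomic = compile_atomic()

for name, pattern in (("lazy", lazy), ("negated", negated), ("atomic", atomic)):
    if pattern is None:
        print(name, "not supported by this Python version")
        continue
    print(name, bool(pattern.search(text)), bool(pattern.search(nested)))
```

The lazy version still matches the string with an extra tag inside, because .*? is allowed to grow past the "<". The other two give up at the first "<" that is not followed by /blabla>, which is the behaviour that limits backtracking.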