Emulate \G at the beginning of a regex with re.RegexObject.match
You can emulate the effect of \G at the beginning of a regex with the re module by tracking the current position yourself and passing it to re.RegexObject.match (re.Pattern.match in Python 3), which forces the match to begin at the position given by pos.
    def tokenize(w):
        index = 0                    # position where the next match must start
        m = matcher.match(w, index)  # anchored at index, like \G
        o = []                       # collected tokens
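The snippet above is cut off after its first few statements. As a minimal sketch of the position-tracking idea, and not the answer's original code, the loop might look like the following. The pattern assigned to `matcher` here is a hypothetical stand-in chosen only for illustration, and the loop body is an assumption about how the results are collected.

```python
import re

# Hypothetical pattern (not from the answer): a run of word characters or a
# run of punctuation, followed by optional whitespace.
matcher = re.compile(r'(\w+|[^\w\s]+)\s*')

def tokenize(w):
    index = 0                      # position where the next match must start
    m = matcher.match(w, index)    # match() is anchored at index, like \G
    o = []
    # Continue until the pattern fails or stops consuming input.
    while m and m.end() > index:
        o.append(m.group(1))
        index = m.end()            # the next match must start where this one ended
        m = matcher.match(w, index)
    return o

tokens = tokenize("foo, bar!")
```

Because `match` never skips ahead, scanning stops at the first position where the pattern fails, which is the behavior \G gives in Perl. For example, `tokenize(" foo")` returns an empty list with this pattern, since nothing matches at position 0.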
Caveat
The caveat to this method is that it does not work well with a regex that can match an empty string in the main match, since Python has no facility to retry the match at the same position while disallowing a zero-length match.
As an example, re.findall(r'(.??)', 'abc') returns a list of 4 empty strings ['', '', '', ''] (in Python versions before 3.7; from 3.7 on, a non-empty match may start right after an empty one), while in PCRE you can find 7 matches ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th and 6th matches begin at the same indices as the 1st, 3rd and 5th matches, respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents the empty string from matching.
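To make the caveat concrete for the position-tracking loop, here is a small sketch under assumed names (`pattern` and `scan` are hypothetical, not from the answer). When `match` returns a zero-length match, the position does not advance, so the loop has to detect that case itself instead of relying on a PCRE-style retry.

```python
import re

# Hypothetical pattern that can match the empty string: optional digits.
pattern = re.compile(r'\d*')

def scan(text):
    """Collect successive matches, each anchored at the end of the previous one."""
    pos = 0
    found = []
    while pos <= len(text):
        m = pattern.match(text, pos)
        if m is None:
            break
        found.append(m.group())
        if m.end() == pos:
            # Zero-length match: calling match() again at the same position
            # would return the same empty match, so stop instead of looping.
            break
        pos = m.end()
    return found

result = scan("123abc")
```

Stopping at the first empty match is only one possible policy. It is simply the easiest way to keep the loop from spinning forever.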
I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code could not have worked.
Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) as (.+?), as done in the question, avoids this problem, although you may want to use the re.S flag so that . also matches newlines, as the original character classes did.
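A short sketch of why re.S can matter for the rewritten (.+?); the input string and variable names are assumptions for illustration. Without the flag, the dot refuses to match a newline, so an anchored match at that position fails.

```python
import re

text = "a\nb"  # hypothetical input containing a newline

# Without re.S, "." does not match "\n".
plain = re.compile(r'(.+?)')
# With re.S (DOTALL), "." matches any character, including "\n".
dotall = re.compile(r'(.+?)', re.S)

# Position 1 is the newline character.
no_match = plain.match(text, 1)
with_flag = dotall.match(text, 1)
```

Here `no_match` is `None`, while `with_flag.group(1)` is the newline itself.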
Other comments on the regex
Since the case-insensitive flag in Python affects the entire pattern (scoped inline flags such as (?i:st) are only accepted from Python 3.6 onward), you need to rewrite the case-insensitive sub-patterns. I would rewrite (?i:st) as [sS][tT] to keep the original meaning, but go with (?:st|ST) if matching only those two forms is part of your requirement.
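The difference between the two rewrites can be checked with a small sketch. The surrounding pattern (digits followed by the suffix) and the sample strings are hypothetical, chosen only to show which spellings each form accepts.

```python
import re

# Hypothetical patterns: digits followed by an ordinal suffix.
any_case = re.compile(r'\d+[sS][tT]')     # accepts st, St, sT, ST
two_forms = re.compile(r'\d+(?:st|ST)')   # accepts only st or ST

samples = ["1st", "1ST", "1St"]
any_case_hits = [s for s in samples if any_case.match(s)]
two_forms_hits = [s for s in samples if two_forms.match(s)]
```

`any_case_hits` contains all three samples, while `two_forms_hits` leaves out the mixed-case "1St".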
Since Python supports free-spacing mode with the re.X flag, you can write your regex similarly to what you did in the Perl code:
matcher = re.compile(r''' (.+?) (?:
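The pattern above is cut off, so as a separate illustration of free-spacing mode, here is a small sketch with a hypothetical pattern (not the one from the answer). With re.X, whitespace inside the pattern is ignored and # starts a comment, which lets each part sit on its own line.

```python
import re

# Hypothetical verbose pattern, written one component per line.
verbose = re.compile(r'''
    (\d+)          # group 1: the number
    ([sS][tT])?    # group 2: optional "st" suffix in any case
    \s*            # trailing whitespace
''', re.X)

m = verbose.match("21st century")
parts = m.groups()
```

Note that a literal space in an re.X pattern has to be written as `\s`, `[ ]` or an escaped space, since bare whitespace is dropped.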