What is the Python way to write a \G-anchored regex loop?

The following is a Perl function I wrote many years ago. It is a smart tokenizer that recognizes some cases of things being stuck together that perhaps should not be. For example, given the input on the left, it divides the string as shown on the right:

    abc123  -> abc|123
    abcABC  -> abc|ABC
    ABC123  -> ABC|123
    123abc  -> 123|abc
    123ABC  -> 123|ABC
    AbcDef  -> Abc|Def    (eg CamelCase)
    ABCDef  -> ABC|Def
    1stabc  -> 1st|abc    (recognize valid ordinals)
    1ndabc  -> 1|ndabc    (but not invalid ordinals)
    11thabc -> 11th|abc   (recognize that 11th - 13th are different than 1st - 3rd)
    11stabc -> 11|stabc

Now I'm experimenting with machine learning, and I would like to run some experiments that use this tokenizer. But first I need to port it from Perl to Python. The key to this code is a loop that uses the \G anchor, which I hear does not exist in Python. I tried googling for how this is done in Python, but I'm not sure exactly what to search for, so I'm having a hard time finding an answer.

How would you write this function in Python?

    sub Tokenize
    # Breaks a string into tokens using special rules,
    # where a token is any sequence of characters, be they a sequence of letters,
    # a sequence of numbers, or a sequence of non-alpha-numeric characters
    # the list of tokens found are returned to the caller
    {
        my $value = shift;
        my @list = ();
        my $word;

        while ( $value ne '' && $value =~ m/
            \G                # start where previous left off
            ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
            ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
            (?:               # identify the token boundary
                (?=[^a-zA-Z0-9])       # next character is not a word character
            |   (?=[A-Z][a-z])         # Next two characters are upper lower
            |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
            |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                    # ordinal boundaries
            |   (?<=^1(?i:st))         # first
            |   (?<=[^1][1](?i:st))    # first but not 11th
            |   (?<=^2(?i:nd))         # second
            |   (?<=[^1]2(?i:nd))      # second but not 12th
            |   (?<=^3(?i:rd))         # third
            |   (?<=[^1]3(?i:rd))      # third but not 13th
            |   (?<=1[123](?i:th))     # 11th - 13th
            |   (?<=[04-9](?i:th))     # other ordinals
                    # non-ordinal digit-letter boundaries
            |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
            |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
            |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not first
            |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
            |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not first
            |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
            |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
            |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
            |   (?=$)                               # end of string
            )
        /xg )
        {
            push @list, $1 if $1 ne '';
            push @list, $2 if $2 ne '';
        }

        return @list;
    }

I tried using re.split() with a variation of the above. However, split() refuses to split on a zero-width match (an ability that ought to be available if you really know what you are doing).

I came up with a solution to this particular problem, but not to the general problem of "how do you do \G-based parsing." I have sample code that runs regexes in loops anchored with \G and then, in the loop body, uses another match anchored at \G to decide which way to continue parsing. So I'm still looking for an answer.

That said, here is my final working code that translates the above to Python:

    import re

    IsA    = lambda s: '[' + s + ']'
    IsNotA = lambda s: '[^' + s + ']'

    Upper           = IsA( 'A-Z' )
    Lower           = IsA( 'a-z' )
    Letter          = IsA( 'a-zA-Z' )
    Digit           = IsA( '0-9' )
    AlphaNumeric    = IsA( 'a-zA-Z0-9' )
    NotAlphaNumeric = IsNotA( 'a-zA-Z0-9' )

    EndOfString = '$'
    OR          = '|'

    ZeroOrMore          = lambda s: s + '*'
    ZeroOrMoreNonGreedy = lambda s: s + '*?'
    OneOrMore           = lambda s: s + '+'
    OneOrMoreNonGreedy  = lambda s: s + '+?'

    StartsWith      = lambda s: '^' + s
    Capture         = lambda s: '(' + s + ')'
    PreceededBy     = lambda s: '(?<=' + s + ')'
    FollowedBy      = lambda s: '(?=' + s + ')'
    NotFollowedBy   = lambda s: '(?!' + s + ')'
    StopWhen        = lambda s: s
    CaseInsensitive = lambda s: '(?i:' + s + ')'

    ST = '(?:st|ST)'
    ND = '(?:nd|ND)'
    RD = '(?:rd|RD)'
    TH = '(?:th|TH)'

    def OneOf( *args ):
        return '(?:' + '|'.join( args ) + ')'

    pattern = '(.+?)' + \
        OneOf(
            # ABC | !!! - break at whitespace or non-alpha-numeric boundary
            PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
            PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

            # ABC | Abc - break at what looks like the start of a word or sentence
            FollowedBy( Upper + Lower ),

            # abc | ABC - break when a lower-case letter is followed by an upper case
            PreceededBy( Lower ) + FollowedBy( Upper ),

            # abc | 123 - break between words and digits
            PreceededBy( Letter ) + FollowedBy( Digit ),

            # 1st | oak - recognize when the string starts with an ordinal
            PreceededBy( StartsWith( '1' + ST ) ),
            PreceededBy( StartsWith( '2' + ND ) ),
            PreceededBy( StartsWith( '3' + RD ) ),

            # 1st | abc - contains an ordinal
            PreceededBy( IsNotA( '1' ) + '1' + ST ),
            PreceededBy( IsNotA( '1' ) + '2' + ND ),
            PreceededBy( IsNotA( '1' ) + '3' + RD ),
            PreceededBy( '1' + IsA( '123' ) + TH ),
            PreceededBy( IsA( '04-9' ) + TH ),

            # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
            PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
            PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
            PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
            PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
            PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
            PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
            PreceededBy( '1' + IsA( '123' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),
            PreceededBy( IsA( '04-9' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),

            # abcde | $ - end of the string
            FollowedBy( EndOfString )
        )

    matcher = re.compile( pattern )

    def tokenize( s ):
        return matcher.findall( s )
1 answer

Emulate \G at the beginning of a regex with re.RegexObject.match

You can emulate the effect of \G at the beginning of a regex with the re module by keeping track of the current position yourself and passing it to re.RegexObject.match (documented as Pattern.match in Python 3), which forces the match to begin at the position given in pos.

    def tokenize(w):
        index = 0
        m = matcher.match(w, index)
        o = []
        # Although the index != m.end() check catches a zero-length match,
        # it is more of a guard against an accidental infinite loop.
        # Don't expect a regex which can match the empty string to work.
        # See the Caveat section.
        while m and index != m.end():
            o.append(m.group(1))
            index = m.end()
            m = matcher.match(w, index)
        return o
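The same idea extends to the more general case raised in the question, where the loop body tries a different pattern at the current position to decide how to continue. The following is only a minimal sketch: the three token classes and the names WORD, NUMBER, OTHER and scan are made up for illustration and are not the question's rules.

```python
import re

# Hypothetical token classes, chosen only to illustrate the technique.
WORD = re.compile(r'[A-Za-z]+')
NUMBER = re.compile(r'[0-9]+')
OTHER = re.compile(r'[^A-Za-z0-9]+')


def scan(text):
    """Return (kind, lexeme) pairs, trying each pattern at the current position."""
    pos = 0
    tokens = []
    while pos < len(text):
        for kind, pattern in (('word', WORD), ('number', NUMBER), ('other', OTHER)):
            # match(text, pos) succeeds only if the match starts exactly at pos,
            # which is the role \G plays in the Perl loop.
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError('no pattern matches at position %d' % pos)
    return tokens


print(scan('abc123, x'))
```

Note that when pos is passed to match, a ^ inside the pattern still refers to the real start of the string rather than to pos, so lookbehinds such as (?<=^1[sS][tT]) keep the meaning they have in the Perl version.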

Caveat

The caveat to this method is that it does not work well with a regex whose main match can be the empty string, since match() gives you no way to retry at the same position while disallowing a zero-length match.

As an example, on Python versions before 3.7, re.findall(r'(.??)', 'abc') returns a list of 4 empty strings ['', '', '', ''], while in PCRE you can find 7 matches ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th and 6th matches start at the same indices as the 1st, 3rd and 5th matches, respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents the empty string from matching.

I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code could not have worked.

Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) as (.+?), as done in the question, avoids this problem, although you may want to add the re.S flag so that . also matches newlines.
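If a pattern that can match the empty string has to be used anyway, one possible guard is to step over a single character whenever a match is zero-length. This is a sketch under that assumption, not an equivalent of what Perl does: Perl retries at the same position with empty matches disallowed, whereas this loop silently skips a character. The name anchored_matches is hypothetical.

```python
import re


def anchored_matches(pattern, text):
    """Yield successive matches, each anchored where the previous one ended."""
    pos = 0
    while pos <= len(text):
        m = pattern.match(text, pos)
        if m is None:
            break
        yield m
        # A zero-length match would leave pos unchanged and loop forever,
        # so advance by one character in that case.
        pos = m.end() if m.end() > pos else pos + 1


digits = re.compile(r'[0-9]*')
found = [m.group() for m in anchored_matches(digits, '12ab34')]
print(found)  # ['12', '', '', '34', '']
```

The empty strings in the result come from the positions of 'a', 'b' and the end of the string; the letters themselves never appear in any match, which is the price of this guard.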

Other regular expression comments

Since the case-insensitive flag in Python affects the entire pattern, the case-insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if that is part of your requirement.
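The difference between the two rewrites can be checked with a short sketch. The test strings are made up for illustration. The third pattern relies on scoped inline flags, which the re module accepts from Python 3.6 onward, so on those versions the Perl spelling can be kept as it is.

```python
import re

# Each pattern matches the ordinal suffix only when it is followed by 'abc'.
explicit = re.compile(r'1[sS][tT](?=abc)')     # any mix of upper and lower case
alternation = re.compile(r'1(?:st|ST)(?=abc)')  # all-lower or all-upper only
scoped = re.compile(r'1(?i:st)(?=abc)')         # Python 3.6+; same meaning as in Perl

for name, pattern in (('explicit', explicit),
                      ('alternation', alternation),
                      ('scoped', scoped)):
    # '1Stabc' has a mixed-case suffix; '1stABC' checks that the rest of the
    # pattern stays case-sensitive.
    print(name, bool(pattern.match('1Stabc')), bool(pattern.match('1stABC')))
```

Only the alternation form rejects the mixed-case suffix in '1Stabc'. None of the three match '1stABC', because case-insensitivity is confined to the suffix in every variant.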

Since Python supports free-spacing mode with the re.X flag, you can write your regex much as you did in the Perl code:

    matcher = re.compile(r'''
        (.+?)
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1[sS][tT])        # first
        |   (?<=[^1][1][sS][tT])   # first but not 11th
        |   (?<=^2[nN][dD])        # second
        |   (?<=[^1]2[nN][dD])     # second but not 12th
        |   (?<=^3[rR][dD])        # third
        |   (?<=[^1]3[rR][dD])     # third but not 13th
        |   (?<=1[123][tT][hH])    # 11th - 13th
        |   (?<=[04-9][tT][hH])    # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])     # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])  # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])     # digit-letter but not first
        |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])  # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])     # digit-letter but not first
        |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])  # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH]) # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH]) # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    ''', re.X)
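To see a verbose pattern and the position-tracking loop working together, here is a cut-down sketch. It is not the full tokenizer: the ordinal rules are deliberately left out and every digit-to-letter transition is treated as a plain boundary. The names simple_matcher and simple_tokenize are hypothetical.

```python
import re

# Simplified boundary rules; ordinal handling from the full pattern is omitted.
simple_matcher = re.compile(r'''
    (.+?)
    (?:                          # identify the token boundary
        (?=[^a-zA-Z0-9])         # next character is not a word character
    |   (?=[A-Z][a-z])           # next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])      # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9])   # letter followed by digit
    |   (?<=[0-9])(?=[a-zA-Z])   # digit followed by letter, no ordinal check
    |   (?=$)                    # end of string
    )
''', re.X)


def simple_tokenize(text):
    """Collect tokens by matching repeatedly from where the last match ended."""
    tokens = []
    pos = 0
    m = simple_matcher.match(text, pos)
    while m and m.end() != pos:
        tokens.append(m.group(1))
        pos = m.end()
        m = simple_matcher.match(text, pos)
    return tokens


for sample in ('abc123', 'abcABC', '123abc', 'AbcDef', 'ABCDef'):
    print(sample, '->', '|'.join(simple_tokenize(sample)))
```

Because (.+?) must consume at least one character, the main match can never be empty, so the loop ends cleanly when match returns None at the end of the string.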