Python: splitting a string by pattern

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of characters may vary, and sometimes there are dashes inside the string, for example "aaaaa-bbbbbbbbbbbbbbccccccccccc".

Is there a smart way to split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices where it was split, or just to get the indices, without looping through each row? If a dash sits between two runs, it can go with either the left or the right one, as long as it is always handled the same way.

Any idea?

+6
3 answers

Regex MatchObject instances include the match indices. What remains is to match runs of repeated characters:

    import re
    repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

This matches only if a lowercase letter (a-z) is repeated at least once, optionally followed by a dash:

    >>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
    ...     print(match.group(), match.start(), match.end())
    ...
    aaaaa 0 5
    bbbbbbbbbbbbbb 5 19
    ccccccccccc 19 30

The .start() and .end() methods of the match object give you exact positions in the input string.

Dashes are included in the matches, but non-repeating characters are not:

    >>> for match in repeat.finditer("a-bb-cccccccc"):
    ...     print(match.group(), match.start(), match.end())
    ...
    bb- 2 5
    cccccccc 5 13

If you want the a- part to match as well, replace the + quantifier with *:

    repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
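As a quick check of the * variant, here is a minimal sketch that collects each match together with its span. The helper name split_runs is made up for illustration, and the pattern is written with the a-z range, so it assumes lowercase ASCII letters only.

```python
import re

# The * variant: a lowercase letter, zero or more repeats of that same
# letter, then an optional trailing dash.
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

def split_runs(text):
    """Return (substring, start, end) for every run found in text.

    Hypothetical helper, not part of the re module.
    """
    return [(m.group(), m.start(), m.end()) for m in repeat.finditer(text)]

print(split_runs("a-bb-cccccccc"))
# [('a-', 0, 2), ('bb-', 2, 5), ('cccccccc', 5, 13)]
```

With this variant the single a and its dash come out as their own match, so every character of the input is covered by some span.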
+11

How about using itertools.groupby?

    >>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
    >>> from itertools import groupby
    >>> [''.join(v) for k, v in groupby(s)]
    ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This yields each - as its own substring, which is easy to filter out:

    >>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
    >>> [''.join(v) for k, v in groupby(s) if k != '-']
    ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
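The question also asks for indices, which groupby does not report by itself. One possible way to get them is to keep a running offset while walking the groups. The sketch below assumes that dashes should be dropped but still counted in the positions, and the function name runs_with_indices is invented for this example.

```python
from itertools import groupby

def runs_with_indices(s, skip='-'):
    """Yield (substring, start, end) for each run of equal characters.

    Runs made of the skip character are dropped, but their length still
    counts toward the positions. Hypothetical helper for illustration.
    """
    pos = 0
    for key, group in groupby(s):
        length = sum(1 for _ in group)  # consume the group to count it
        if key != skip:
            yield key * length, pos, pos + length
        pos += length  # advance past the run even when it is skipped

print(list(runs_with_indices('aaaaa-bbbbbbbbbbbbbb-ccccccccccc')))
# [('aaaaa', 0, 5), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 21, 32)]
```

The indices refer to the original string, so s[6:20] gives back the run of b characters.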
+3
    s = "aaaaabbbbbbbbbbbbbbccccccccccc"
    p = [0]  # start index of each run
    # compare every character with its right-hand neighbour
    for i, (a, b) in enumerate(zip(s, s[1:])):
        if a != b:
            p.append(i + 1)
    print(p)  # [0, 5, 19]
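The list of start indices can also be turned back into the substrings by pairing each start with the next one. This is only a sketch of that idea. The function names run_starts and slice_runs are made up here, and dashes are treated like any other character.

```python
def run_starts(s):
    """Start index of every run of equal characters in s."""
    starts = [0] if s else []
    for i, (a, b) in enumerate(zip(s, s[1:])):
        if a != b:  # a new run begins at position i + 1
            starts.append(i + 1)
    return starts

def slice_runs(s):
    """Pair each start with the next start (or len(s)) and slice."""
    starts = run_starts(s)
    ends = starts[1:] + [len(s)]
    return [s[a:b] for a, b in zip(starts, ends)]

s = "aaaaabbbbbbbbbbbbbbccccccccccc"
print(run_starts(s))  # [0, 5, 19]
print(slice_runs(s))  # ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
```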
0
