Python: split a string by pattern

Question

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of characters may vary, and sometimes there may be dashes inside the string, for example "aaaaa-bbbbbbbbbbbbbbccccccccccc".

Is there a smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices where it was split, or just get the indices, without looping through every string? If a dash sits between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.

Any idea?

+6

python string split regex

Trollbrot Apr 18 '13 at 15:19

3 answers

How about using itertools.groupby ?

    >>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
    >>> from itertools import groupby
    >>> [''.join(v) for k,v in groupby(s)]
    ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This puts each - into its own substring, which can easily be filtered out:

    >>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
    >>> [''.join(v) for k,v in groupby(s) if k != '-']
    ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
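The groupby approach above returns only the substrings, while the question also asks for the split positions. One possible way to get both is to keep a running offset while iterating over the groups. This is only a sketch: the helper name runs_with_indices and its skip parameter are made up for illustration and are not part of itertools.

```python
from itertools import groupby

def runs_with_indices(s, skip='-'):
    """Return (substring, start, end) for each run of identical characters.

    Hypothetical helper: runs made of the `skip` character are dropped,
    but they still advance the position counter.
    """
    result = []
    pos = 0
    for char, group in groupby(s):
        length = sum(1 for _ in group)  # number of characters in this run
        if char != skip:
            result.append((char * length, pos, pos + length))
        pos += length
    return result

for text, start, end in runs_with_indices('aa-bbb-c'):
    print(text, start, end)
```

The end index is exclusive, so s[start:end] gives back the substring, matching the convention of .start() and .end() on regex match objects.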

+3

mgilson Apr 18 '13 at 15:25

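    s = "aaaaabbbbbbbbbbbbbbccccccccccc"  # renamed from str to avoid shadowing the built-in
    p = [0]  # start index of each run; the first run starts at 0
    # compare every character with its successor
    for i, (a, b) in enumerate(zip(s, s[1:])):
        if a != b:
            p.append(i + 1)  # a new run starts right after position i
    print(p)  # [0, 5, 19]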

0

perreal Apr 18 '13 at 15:35

Martijn Pieters · Accepted Answer · 2013-04-18T15:25:37+0000

Regular expression MatchObject results include the indices of the match. What remains is to match the repeating characters:

    import re
    repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

This matches only if a given letter (a-z) is repeated at least once:

    >>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
    ...     print(match.group(), match.start(), match.end())
    ...
    aaaaa 0 5
    bbbbbbbbbbbbbb 5 19
    ccccccccccc 19 30

The .start() and .end() methods of the match result give you the exact positions in the input string.

Dashes are included in the matches, but non-repeating characters are not:

    >>> for match in repeat.finditer("a-bb-cccccccc"):
    ...     print(match.group(), match.start(), match.end())
    ...
    bb- 2 5
    cccccccc 5 13

If you want the a- part to be matched as well, simply replace the + with a * multiplier:

    repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
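As a rough illustration of how the * variant behaves on the same input, the sketch below collects each match with its positions. The find_runs helper name is an assumption made for this example, and the character class is written as the range a-z.

```python
import re

# Same idea as above, with * so that single, non-repeated letters also match.
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

def find_runs(s):
    """Hypothetical helper: list of (matched text, start, end) tuples."""
    return [(m.group(), m.start(), m.end()) for m in repeat.finditer(s)]

for text, start, end in find_runs("a-bb-cccccccc"):
    print(text, start, end)
# a- 0 2
# bb- 2 5
# cccccccc 5 13
```

A trailing dash always stays attached to the run on its left, which fits the requirement that dashes are handled the same way every time. If the dash should not appear in the substring, calling .rstrip('-') on the matched text is one option.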
