Split a string on the first-occurring delimiter from a set of delimiters using Python and regex

First of all, the question is tagged Python and regex, but it is not tied to them; a high-level answer is fine.

At the moment, I am splitting a string that contains several different delimiters with the following pattern. There are actually more delimiter patterns, and they are more complex, but let's simplify and limit them to two characters, # and *:

parts = re.split(r'#|\*', string)

With this approach, the string aaa#bbb*ccc#ddd is divided into 4 substrings: aaa, bbb, ccc, ddd. But what I need is to split either by the separator that occurs first in the string, or by the separator that occurs most often in the string. aaa#bbb*ccc#ddd should be divided into aaa, bbb*ccc, ddd, and aaa*bbb#ccc*ddd should be divided into aaa, bbb#ccc, ddd.

I know a simple way to achieve this is to find which separator occurs first or is the most frequent in the string, and then split on that single separator. But the method needs to be efficient, and I wonder whether this can be achieved with a single regex. The main question is how to split on the first-occurring of several separators; for the most-frequent case, the delimiter counts will almost certainly have to be computed in advance.
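For reference, the "simple way" mentioned above for the first-occurrence case could be sketched as follows. This is only a minimal illustration, not a single-regex solution; the helper name split_on_first_delimiter and its default delimiter tuple are assumptions made for the example.

```python
import re

def split_on_first_delimiter(s, delimiters=("#", "*")):
    """Split s on whichever of the delimiters appears first in s.

    Hypothetical helper for illustration only.
    """
    # Escape each delimiter so regex metacharacters such as '*' are literal.
    pattern = "|".join(re.escape(d) for d in delimiters)
    match = re.search(pattern, s)      # leftmost delimiter in the string
    if match is None:
        return [s]                     # no delimiter present at all
    return s.split(match.group(0))     # plain split on that one delimiter

print(split_on_first_delimiter("aaa#bbb*ccc#ddd"))  # ['aaa', 'bbb*ccc', 'ddd']
print(split_on_first_delimiter("aaa*bbb#ccc*ddd"))  # ['aaa', 'bbb#ccc', 'ddd']
```

This takes one regex search to find the first delimiter and one str.split() to do the actual splitting.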

Update:

The question does not require splitting by the first-occurring and the most frequent separator at the same time; either method on its own is enough. I understand that splitting on the most frequent delimiter is not possible with a regular expression without determining the delimiter first, but I think splitting on the first-occurring delimiter might be possible with a regular expression and lookarounds, without any preliminary pass.

+5
2 answers

It is necessary to split either by the separator that occurs first in the string or by the separator that occurs most often in the string.

So, you can first find all the delimiters and store them in a suitable container along with their frequencies, then find the most common one and the first one, and then split the string on them.

Now, to find the separators, you need to distinguish them from plain text by some criterion, for example that they are non-word characters, and you can use a dictionary-like container to count identical separators (in this case collections.Counter() does the job).

Demo:

    >>> import re
    >>> from collections import Counter
    >>> s = "aaa#bbb*ccc#ddd*rkfh^ndjfh*dfehb* erjg-rh@fkej *rjh"
    >>> delimiters = re.findall(r'\W', s)  # \W also matches the two spaces
    >>> first = delimiters[0]
    >>> first
    '#'
    >>> Counter(delimiters)
    Counter({'*': 5, '#': 2, ' ': 2, '^': 1, '-': 1, '@': 1})
    >>> frequent = Counter(delimiters).most_common(1)[0][0]
    >>> frequent
    '*'
    >>> re.split(r'\{}|\{}'.format(first, frequent), s)
    ['aaa', 'bbb', 'ccc', 'ddd', 'rkfh^ndjfh', 'dfehb', ' erjg-rh@fkej ', 'rjh']

Note that if you are dealing with delimiters that contain more than one character, you can use re.escape() to escape any special regular expression characters in them (e.g. *).
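As a rough sketch of that note, assuming the set of candidate delimiters is known in advance, multi-character delimiters can be escaped with re.escape() before building the pattern. The function name split_on_most_frequent and the sample delimiters "**" and "::" are made up for the example, and ties between equally frequent delimiters are not handled specially.

```python
import re
from collections import Counter

def split_on_most_frequent(s, delimiters):
    """Split s on the candidate delimiter that occurs most often.

    Hypothetical helper; the delimiters may be longer than one character.
    """
    # re.escape() makes metacharacters such as '*' match literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    found = re.findall(pattern, s)
    if not found:
        return [s]                     # none of the delimiters occurs
    most_frequent = Counter(found).most_common(1)[0][0]
    return re.split(re.escape(most_frequent), s)

print(split_on_most_frequent("a**b::c**d", ["**", "::"]))  # ['a', 'b::c', 'd']
```

If one delimiter is a prefix of another, the order of the alternatives in the pattern matters, so the longer one should probably be listed first.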

+3

I found the str.count() method to be very fast, since it is implemented in C. Anything that avoids Python-level loops will usually be faster, even if you scan the string several times. This is probably the fastest solution:

    >>> s = 'aaa*bbb#ccc*ddd'
    >>> a, b = s.count('*'), s.count('#')
    >>> if a == b: a, b = -s.find('*'), -s.find('#')
    ...
    >>> s.split('*' if a > b else '#')
    ['aaa', 'bbb#ccc', 'ddd']
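The same idea might be generalized to more than two delimiters along these lines. This is only a sketch under the assumption that the delimiters are plain substrings; the name split_on_best_delimiter is hypothetical. As in the snippet above, a tie in counts is broken in favour of the delimiter that occurs first.

```python
def split_on_best_delimiter(s, delimiters=("*", "#")):
    """Split s on the most frequent delimiter; ties go to the earliest one.

    Hypothetical generalization of the two-delimiter snippet above.
    """
    # Keep only delimiters that actually occur, so find() never returns -1.
    present = [d for d in delimiters if d in s]
    if not present:
        return [s]
    # Higher count wins; on equal counts the smaller find() position wins,
    # because negating it turns "earlier" into "larger".
    best = max(present, key=lambda d: (s.count(d), -s.find(d)))
    return s.split(best)

print(split_on_best_delimiter("aaa*bbb#ccc*ddd"))  # ['aaa', 'bbb#ccc', 'ddd']
print(split_on_best_delimiter("aaa#bbb*ccc"))      # ['aaa', 'bbb*ccc']
```

Each str.count() and str.find() call scans the string once, so the string is traversed a few times per delimiter, without any Python-level loop over characters.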
0