Python regular expression split - extraneous matches

I want to split a string using - , += , == , = , + and whitespace as delimiters, and I want to keep each delimiter in the result unless it is whitespace.

I tried to achieve this with the following code:

    def tokenize(s):
        import re
        pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
        return pattern.split(s)

    print(tokenize("hello-+==== =+ there"))

I expected the output to be

 ['hello', '-', '+=', '==', '=', '=', '+', 'there'] 

However, I got

 ['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there'] 

This is almost what I wanted, except that there are quite a few extraneous None values and empty strings.

Why is this so, and how can I change it to get what I want?

+7
5 answers

By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves points out, that is its main use):

 ['hello', '', '', '', '', '', '', '', 'there'] 

Note the empty strings between - and += , between += and == , and so on.

As the docs explain, because you use a capturing group (i.e. because you write (\-|\+\=|\=\=|\=|\+) rather than (?:\-|\+\=|\=\=|\=|\+) ), the text matched by that group is also interleaved into the result:

 ['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there'] 

The None entries appear where the \s+ half of your pattern matched; in those cases the capturing group did not capture anything.
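
To make the interleaving concrete, here is a minimal side-by-side comparison (a small illustrative snippet, not part of the original answer) of the same split with a non-capturing and a capturing group:

    import re

    s = "hello-+= there"

    # Non-capturing group: only the pieces between matches are returned.
    print(re.split(r"(?:\-|\+\=|\=\=|\=|\+)|\s+", s))
    # ['hello', '', '', 'there']

    # Capturing group: the captured delimiter (or None, when the \s+ half
    # matched) is interleaved between those pieces.
    print(re.split(r"(\-|\+\=|\=\=|\=|\+)|\s+", s))
    # ['hello', '-', '', '+=', '', None, 'there']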

Looking at the docs for re.split, I don't see a simple way to suppress the empty strings between matches, but a simple list comprehension (or filter , if you prefer) can easily drop both the None entries and the empty strings:

    def tokenize(s):
        import re
        pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
        return [x for x in pattern.split(s) if x]
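
Called on the input from your question, this filtered version produces exactly the output you expected:

    print(tokenize("hello-+==== =+ there"))
    # ['hello', '-', '+=', '==', '=', '=', '+', 'there']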

One last note: for what you have described so far, this will work fine, but depending on where your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options.

+3

Why does it behave this way?

According to the documentation for re.split:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

This is literally correct: if capturing parentheses are used, the text of all groups is returned, whether or not they matched; groups that did not participate in the match show up as None .

And, as always with split , two consecutive delimiters are treated as having an empty string between them, which is where the extra empty strings come from.
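
A tiny standalone illustration of that behaviour (not tied to the original pattern):

    import re
    print(re.split(r"-", "a--b"))   # ['a', '', 'b']
    print("a--b".split("-"))        # ['a', '', 'b']  -- str.split does the same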

How can I change it to get what I want?

The simplest solution is to filter the output:

 filter(None, pattern.split(s)) 
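
One caveat, assuming Python 3 (the answer does not say which version it targets): filter returns a lazy iterator there, so wrap the call in list() if you want a list back:

    def tokenize(s):
        import re
        pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
        # list() is needed on Python 3, where filter() returns an iterator.
        return list(filter(None, pattern.split(s)))

    print(tokenize("hello-+==== =+ there"))
    # ['hello', '-', '+=', '==', '=', '=', '+', 'there']
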
+2

Perhaps re.findall would be more suitable for you?

    >>> re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', "hello-+==== =+ there")
    ['hello', '-', '+=', '==', '=', '=', '+', 'there']
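
If you want it in the same tokenize() shape as the question, a sketch along these lines should work (the pattern is taken verbatim from the line above; the last alternative, [^-+=\s]+ , is what matches the non-delimiter runs such as hello and there):

    import re

    def tokenize(s):
        # Delimiters first (longer alternatives before their prefixes, e.g.
        # += before +), then runs of anything that is neither a delimiter
        # nor whitespace.
        return re.findall(r'-|\+=|==|=|\+|[^-+=\s]+', s)

    print(tokenize("hello-+==== =+ there"))
    # ['hello', '-', '+=', '==', '=', '=', '+', 'there']
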
+2

This pattern is closer to what you want:

 \s*(\-|\+\=|\=\=|\=|\+)\s* 

You still get an empty string between consecutive delimiters, though, as you would expect.
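
For reference, this is what that pattern produces on the original input (a quick illustrative check, not part of the original answer): the None entries are gone because the whitespace is folded into the delimiter match, but the empty strings between consecutive delimiters remain.

    import re
    print(re.split(r"\s*(\-|\+\=|\=\=|\=|\+)\s*", "hello-+==== =+ there"))
    # ['hello', '-', '', '+=', '', '==', '', '=', '', '=', '', '+', 'there']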

+1

Try the following:

    def tokenize(s):
        import re
        pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
        x = pattern.split(s)
        result = []
        for item in x:
            if item != '' and item is not None:
                result.append(item)
        return result

    print(tokenize("hello-+==== =+ there"))
0
