By default, re.split returns a list of the pieces of the string that lie between matches (as @Laurence Gonsalves points out, this is its main use):
['hello', '', '', '', '', '', '', '', 'there']
Note the empty strings between - and +=, between += and ==, and so on.
As the docs explain, because you use a capture group (i.e. because you use (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the pieces matched by the capture group are interleaved with the pieces between matches:
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
A None appears wherever the \s+ half of your pattern matched; in those cases the capture group did not take part in the match, so it captured nothing.
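To see the two behaviours side by side, here is a minimal sketch. The sample string is an assumption, chosen so that it reproduces the two lists shown above; your actual input may differ.

```python
import re

# Assumed sample input; not taken from the question.
s = "hello-+==== =+ there"

# Non-capturing group: only the text between matches is returned.
plain = re.split(r"(?:\-|\+\=|\=\=|\=|\+)|\s+", s)

# Capturing group: the captured operators are interleaved with the
# in-between text, and None marks each match made by \s+ instead.
captured = re.split(r"(\-|\+\=|\=\=|\=|\+)|\s+", s)

print(plain)
print(captured)
```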
From looking at the docs for re.split, I don't see a simple way to make it discard the empty strings between matches, although a simple list comprehension (or filter, if you prefer) can easily drop both the None entries and the empty strings:
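    import re

    def tokenize(s):
        # Raw string, so the backslashes reach the regex engine unchanged.
        pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
        # Drop the None and empty-string entries that split() produces.
        return [x for x in pattern.split(s) if x]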
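For the filter variant mentioned above, something like the following sketch should work. The name tokenize_with_filter and the sample input are hypothetical, used only for illustration.

```python
import re

# Same pattern as in tokenize, compiled once at module level.
PATTERN = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")

def tokenize_with_filter(s):
    # filter(None, ...) keeps only truthy items, so both None and ''
    # are dropped in a single pass.
    return list(filter(None, PATTERN.split(s)))

# Assumed sample input; not taken from the question.
tokens = tokenize_with_filter("hello-+==== =+ there")
print(tokens)  # ['hello', '-', '+=', '==', '=', '=', '+', 'there']
```

Both versions rely on empty strings and None being falsy, so neither would discard a real token such as '0'.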
One last note: for what you have described so far, this will work fine, but depending on where your project is heading, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options.