RegEx Tokenizer: split text into words, numbers, punctuation marks and whitespace (without dropping anything)

I almost found the answer to this question in this thread (samplebias); however, I need to split the phrase into words, numbers, punctuation, and spaces/tabs. I also need the order in which each of these things occurs to be preserved (which the code in that thread already does).

So far, I have something like this:

    >>> from nltk.tokenize import regexp_tokenize
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
    ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']

But this is the list I need it to produce:

  ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.'] 

Regex has always been one of my weaknesses, so after several hours of research I'm still at a standstill. Thanks!

3 answers

I think something like this should work for you. There is probably more in this regex than necessary, but your requirements are somewhat vague and don't exactly match the expected output.

    >>> import re
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> p = re.compile(r"\d+|[-'a-z]+|[ ]+|\s+|[.,]+|\S+", re.I)
    >>> slice_starts = [m.start() for m in p.finditer(txt)] + [None]
    >>> [txt[s:e] for s, e in zip(slice_starts, slice_starts[1:])]
    ['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']
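
Since every character of the input is matched by one of the alternatives (\s+ and \S+ between them cover everything), the matches are contiguous, so the slice trick is equivalent to taking the match texts directly. A minimal simplification sketch, assuming the same p and txt as above:

    >>> # each match text is identical to the slice between match starts
    >>> [m.group() for m in p.finditer(txt)]
    ['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']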

In the regex \w+([.,]\w+)*|\S+ , the \w+([.,]\w+)* part captures words (allowing internal dots and commas), and \S+ captures the remaining non-whitespace.

To capture spaces and tabs as well, try: \w+([.,]\w+)*|\S+|[ \t] .
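
A quick sketch of that pattern with plain re.findall (using a non-capturing group (?:...), since findall returns group contents rather than whole matches when the pattern contains capturing groups) happens to produce exactly the list asked for:

    >>> import re
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> # whitespace alternative keeps each space/tab as its own token
    >>> re.findall(r"\w+(?:[.,]\w+)*|\S+|[ \t]", txt)
    ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']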


This is not fully consistent with the expected output you provided, and more detail in the question would help, but in any case:

    >>> from nltk.tokenize import regexp_tokenize
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> regexp_tokenize(txt, pattern=r"\w+([.',]\w+)*|[ \t]+")
    ['Today', ' ', "it's", ' \t', '07.May', ' ', '2011', ' ', 'Or', ' ', '2.999']
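
If the space and the tab should come out as separate tokens, as in the expected list, a sketch with the + dropped from the whitespace alternative (so each space or tab matches on its own):

    >>> # [ \t] without + matches one whitespace character at a time
    >>> regexp_tokenize(txt, pattern=r"\w+([.',]\w+)*|[ \t]")
    ['Today', ' ', "it's", ' ', '\t', '07.May', ' ', '2011', ' ', 'Or', ' ', '2.999']

Note that this still drops the standalone periods; an extra alternative such as [.,]+ would be needed to keep them.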
