RegEx Tokenizer: split text into words, numbers, punctuation marks and whitespace (without dropping anything)

I almost found the answer to this question in this thread (samplebias); however, I need to split the phrase into words, numbers, punctuation, and spaces/tabs. I also need the order in which each of these things occurs to be preserved (which the code in that thread already does).

So far, I have something like this:

    >>> from nltk.tokenize import regexp_tokenize
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
    ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']

But this is the list I need it to produce:

  ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.'] 

Regex has always been one of my weaknesses, so after several hours of research I'm still at a standstill. Thanks!

3 answers

I think something like this should work for you. There is probably more in this regex than necessary, but your requirements are somewhat vague and don't exactly match the expected output.

    >>> import re
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> p = re.compile(r"\d+|[-'a-z]+|[ ]+|\s+|[.,]+|\S+", re.I)
    >>> slice_starts = [m.start() for m in p.finditer(txt)] + [None]
    >>> [txt[s:e] for s, e in zip(slice_starts, slice_starts[1:])]
    ['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']
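
Since every character of the input is matched by one of the alternatives (\s+ and \S+ between them cover everything), the matches are contiguous, so the slice trick is equivalent to taking the match texts directly. A minimal simplification sketch, assuming the same p and txt as above:

    >>> # each match text is identical to the slice between match starts
    >>> [m.group() for m in p.finditer(txt)]
    ['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']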

In the regex \w+([.,]\w+)*|\S+ , the \w+([.,]\w+)* part captures words (allowing internal dots and commas), and \S+ captures the remaining non-whitespace.

To capture spaces and tabs as well, try: \w+([.,]\w+)*|\S+|[ \t] .
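
A quick sketch of that pattern with plain re.findall (using a non-capturing group (?:...), since findall returns group contents rather than whole matches when the pattern contains capturing groups) happens to produce exactly the list asked for:

    >>> import re
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> # whitespace alternative keeps each space/tab as its own token
    >>> re.findall(r"\w+(?:[.,]\w+)*|\S+|[ \t]", txt)
    ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']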


This is not fully consistent with the expected output you provided, and more detail in the question would help, but in any case:

    >>> from nltk.tokenize import regexp_tokenize
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> regexp_tokenize(txt, pattern=r"\w+([.',]\w+)*|[ \t]+")
    ['Today', ' ', "it's", ' \t', '07.May', ' ', '2011', ' ', 'Or', ' ', '2.999']
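
If the space and the tab should come out as separate tokens, as in the expected list, a sketch with the + dropped from the whitespace alternative (so each space or tab matches on its own):

    >>> # [ \t] without + matches one whitespace character at a time
    >>> regexp_tokenize(txt, pattern=r"\w+([.',]\w+)*|[ \t]")
    ['Today', ' ', "it's", ' ', '\t', '07.May', ' ', '2011', ' ', 'Or', ' ', '2.999']

Note that this still drops the standalone periods; an extra alternative such as [.,]+ would be needed to keep them.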
