I almost found the answer to this question in this thread (samplebias's answer); however, I need to split a phrase into words, numbers, punctuation, and spaces/tabs. I also need to preserve the order in which these tokens occur (which the code in that thread already does).
So, I found something like this:
    from nltk.tokenize import regexp_tokenize

    txt = "Today it's \t07.May 2011. Or 2.999."
    print(regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+'))
    # ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
But this is the list I need to get:
['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']
Regex has always been one of my weaknesses, so after several hours of research I'm still at a standstill. Thanks!
hangtwenty
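One possible approach, offered as a sketch rather than a definitive answer: keep the same pattern but add an alternative that matches a single whitespace character, and run it with the standard library's `re.findall` instead of NLTK. The function name `tokenize_keep_whitespace` is made up for illustration, and the sample string assumes the input contains "it's" followed by a space and a tab, which is what the desired list implies.

```python
import re

# Alternatives are tried left to right at each position:
#   \w+(?:[.,]\w+)*  words/numbers, allowing inner '.' or ',' (e.g. 07.May, 2.999)
#   \s               exactly one whitespace character (space, tab, newline)
#   \S+              any remaining run of non-whitespace (e.g. "'s", ".")
TOKEN_PATTERN = re.compile(r"\w+(?:[.,]\w+)*|\s|\S+")


def tokenize_keep_whitespace(text):
    """Split text into tokens, keeping each whitespace character as its own token."""
    return TOKEN_PATTERN.findall(text)


txt = "Today it's \t07.May 2011. Or 2.999."
tokens = tokenize_keep_whitespace(txt)
print(tokens)
# ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']
```

Because every character matches either `\s` or `\S+`, nothing is dropped, so `"".join(tokens)` gives back the original string. The group is non-capturing on purpose: with a capturing group, `re.findall` would return the group contents instead of the whole match.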