I'm tokenizing text with NLTK, just passing sentences to wordpunct_tokenize. This splits contractions (for example, "don't" becomes "don" + "'" + "t"), but I want to keep them as one word. I'm refining my methods to get more accurate tokenization of the text, so I need to dig deeper into the nltk tokenize module beyond simple tokenization.
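The splitting behavior described above can be reproduced with a minimal snippet (wordpunct_tokenize matches the regex `\w+|[^\w\s]+`, so the apostrophe inside a contraction ends up as its own token):

```python
from nltk.tokenize import wordpunct_tokenize

# The apostrophe is neither a word character nor whitespace,
# so "don't" is broken into three tokens.
tokens = wordpunct_tokenize("I don't like it.")
print(tokens)
# → ['I', 'don', "'", 't', 'like', 'it', '.']
```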
I assume this is a common need, and I'd like feedback from others who may have dealt with this particular problem before.
edit:
Yes, I know this is a general, open-ended question.
Also, as a newbie to NLP, do I need to worry about contractions at all?
EDIT:
SExprTokenizer or TreebankWordTokenizer seems to do what I'm looking for.
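For comparison, a quick check of TreebankWordTokenizer on the same sentence: note that it still splits contractions, but into linguistically meaningful pieces ("do" + "n't") rather than stranding the apostrophe. If you truly need "don't" kept as a single token, a custom RegexpTokenizer pattern would be required instead.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("I don't like it.")
print(tokens)
# → ['I', 'do', "n't", 'like', 'it', '.']
```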