I have blocks of text that I want to tokenize, but I don't want to tokenize on whitespace and punctuation alone, as seems to be standard with tools like NLTK. There are particular phrases that I want treated as a single token instead of being split by the regular tokenizer.
For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase "West Wing" to the tokenizer, the resulting tokens would be:
- The
- West Wing
- is
- an
- American
- ...
What is the best way to achieve this? I would rather stay with tools like NLTK.
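One approach that might fit, as a minimal sketch: NLTK ships `nltk.tokenize.MWETokenizer`, which takes an already-tokenized list and merges listed multi-word expressions back into single tokens. The base tokenizer here (`RegexpTokenizer` with a simple word-or-punctuation pattern), the shortened sentence, and the helper name `tokenize_with_phrases` are assumptions chosen for illustration; any tokenizer that returns a list of strings could be used for the first pass.

```python
from nltk.tokenize import MWETokenizer, RegexpTokenizer

# First pass: split into word runs and single punctuation marks.
# (Assumed pattern; swap in another base tokenizer if preferred.)
base = RegexpTokenizer(r"\w+|[^\w\s]")

# Second pass: merge listed phrases. separator=" " joins the parts with a
# space instead of the default underscore.
mwe = MWETokenizer([("West", "Wing")], separator=" ")


def tokenize_with_phrases(text):
    """Hypothetical helper: tokenize, then re-join registered phrases."""
    return mwe.tokenize(base.tokenize(text))


sentence = (
    "The West Wing is an American television serial drama "
    "created by Aaron Sorkin."
)
tokens = tokenize_with_phrases(sentence)
print(tokens[:5])
```

Matching is done on exact token strings, so "west wing" in lowercase would not be merged unless the text or the phrase list is normalized first. More phrases can be registered later with `mwe.add_mwe(("Aaron", "Sorkin"))`.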
python tokenize nlp nltk
yavoh