How to define "untokenizable" special words for nltk.word_tokenize

I use nltk.word_tokenize to tokenize sentences that contain names of programming languages, frameworks, etc., and these names get split up incorrectly.

For example:

>>> from nltk import word_tokenize
>>> word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']

Is there a way to give this tokenizer a list of "exceptions"? I have already compiled a list of all the terms (languages, etc.) that I do not want split apart.
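One possible approach, sketched under the assumption that each exception can be expressed as the token sequence word_tokenize produces for it: NLTK's MWETokenizer can re-merge those sequences into single tokens after word_tokenize runs. The ('C', '#') entry below is just an illustration; the real entries would come from the compiled list mentioned above.

>>> from nltk import word_tokenize
>>> from nltk.tokenize import MWETokenizer
>>> # each exception is listed as the tokens word_tokenize splits it into
>>> merger = MWETokenizer([('C', '#')], separator='')
>>> merger.tokenize(word_tokenize("I work with C#."))
['I', 'work', 'with', 'C#', '.']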
