I use nltk.word_tokenize to tokenize some sentences that contain programming languages, frameworks, etc., which get incorrectly tokenized.
For example:
>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']
Is there a way to pass a list of "exceptions" like this to the tokenizer? I have already compiled a list of all the things (languages, etc.) that I do not want split.
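One possible workaround, sketched here without nltk: run a regex tokenizer that matches items from the exception list before falling back to ordinary word/punctuation splitting. The `EXCEPTIONS` list below is a hypothetical stand-in for your compiled list.

```python
import re

# Hypothetical exception list; substitute your own compiled list of
# languages, frameworks, etc. that must stay intact.
EXCEPTIONS = ["C#", "C++", ".NET", "F#"]

def tokenize_with_exceptions(text, exceptions=EXCEPTIONS):
    # Try the longest exceptions first so "C++" wins over a shorter "C" match,
    # then fall back to word characters or single punctuation marks.
    protected = "|".join(
        re.escape(tok) for tok in sorted(exceptions, key=len, reverse=True)
    )
    pattern = re.compile(rf"(?:{protected})|\w+|[^\w\s]")
    return pattern.findall(text)

print(tokenize_with_exceptions("I work with C#."))
# ['I', 'work', 'with', 'C#', '.']
```

If you want to stay inside nltk, `nltk.tokenize.MWETokenizer` can re-merge multi-token sequences after `word_tokenize`, e.g. `MWETokenizer([('C', '#')], separator='')` turns `'C', '#'` back into `'C#'`.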