How to define "untokenizable" special words for nltk.word_tokenize

I use nltk.word_tokenize to tokenize sentences that contain names of programming languages, frameworks, etc., and these names get split up incorrectly.

For example:

>>> from nltk import word_tokenize
>>> word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']

Is there a way to give this tokenizer a list of "exceptions"? I have already compiled a list of all the terms (languages, etc.) that I do not want split apart.
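One possible approach, sketched under the assumption that each exception can be expressed as the token sequence word_tokenize produces for it: NLTK's MWETokenizer can re-merge those sequences into single tokens after word_tokenize runs. The ('C', '#') entry below is just an illustration; the real entries would come from the compiled list mentioned above.

>>> from nltk import word_tokenize
>>> from nltk.tokenize import MWETokenizer
>>> # each exception is listed as the tokens word_tokenize splits it into
>>> merger = MWETokenizer([('C', '#')], separator='')
>>> merger.tokenize(word_tokenize("I work with C#."))
['I', 'work', 'with', 'C#', '.']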
