NLTK Tokenization and Contractions

I'm tokenizing text with NLTK, passing sentences to wordpunct_tokenize. This splits contractions (for example, "don't" into "don" + "'" + "t"), but I want to keep them as one word. I'm refining my methods to get more accurate tokenization of the text, so I need to dig deeper into NLTK's tokenize module, beyond simple tokenization.
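For reference, a minimal reproduction of the behavior described above (assuming NLTK is installed):

    from nltk.tokenize import wordpunct_tokenize

    # wordpunct_tokenize splits on the regex \w+|[^\w\s]+, so the apostrophe
    # becomes its own token and the contraction falls apart.
    print(wordpunct_tokenize("I don't like it"))
    # ['I', 'don', "'", 't', 'like', 'it']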

I assume this is a common problem, and I'd like to hear from others who may have dealt with this particular issue before.

edit:

Yes, I know this is a broad, vague question.

Also, as a newbie to NLP, do I need to worry about contractions at all?

EDIT:

SExprTokenizer or TreebankWordTokenizer seem to do what I'm looking for.
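For anyone comparing: TreebankWordTokenizer doesn't keep contractions entirely whole, but it handles them more gracefully than wordpunct_tokenize, splitting off the clitic as a single token rather than isolating the apostrophe:

    from nltk.tokenize import TreebankWordTokenizer

    # The Treebank tokenizer splits "don't" into "do" + "n't" instead of
    # scattering the apostrophe into a separate token.
    print(TreebankWordTokenizer().tokenize("I don't like it"))
    # ['I', 'do', "n't", 'like', 'it']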

Tags: python, nlp, nltk

3 answers

Which tokenizer you use really depends on what you want to do next. As G4dget pointed out in the comments, some part-of-speech taggers handle split contractions, in which case the splitting is a good thing. But perhaps that is not what you want. To decide which tokenizer is best, consider what you need for the next step, and then paste your text into http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
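You can also run the same comparison locally. A quick sketch (assuming NLTK's punkt data is installed for word_tokenize, and a recent enough NLTK that includes TweetTokenizer, which keeps contractions intact):

    from nltk.tokenize import TweetTokenizer, word_tokenize, wordpunct_tokenize

    text = "I don't think we're ready."

    print(wordpunct_tokenize(text))         # ['I', 'don', "'", 't', 'think', 'we', "'", 're', 'ready', '.']
    print(word_tokenize(text))              # ['I', 'do', "n't", 'think', 'we', "'re", 'ready', '.']
    print(TweetTokenizer().tokenize(text))  # ['I', "don't", 'think', "we're", 'ready', '.']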


I have worked with NLTK before this project. When I did, I found that contractions were useful to take into account.

However, I did not write a custom tokenizer; I just handled the contractions after POS tagging.
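The answer doesn't show code, but as a purely hypothetical illustration of what post-processing could look like (here applied right after tokenization rather than after POS tagging, using a made-up helper merge_contractions):

    def merge_contractions(tokens):
        # Hypothetical helper (not the answerer's actual code): re-join
        # tokens that wordpunct_tokenize split at an apostrophe,
        # e.g. ['don', "'", 't'] -> ["don't"].
        merged = []
        i = 0
        while i < len(tokens):
            if (i + 2 < len(tokens) and tokens[i + 1] == "'"
                    and tokens[i].isalpha() and tokens[i + 2].isalpha()):
                merged.append(tokens[i] + "'" + tokens[i + 2])
                i += 3
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    print(merge_contractions(['I', 'don', "'", 't', 'like', 'it']))
    # ['I', "don't", 'like', 'it']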

I suspect this is not the answer you are looking for, but I hope it helps a little.


Since the number of contractions is very small, one way to do this is to search for and replace every contraction with its full equivalent (for example, "don't" → "do not"), and then pass the updated sentences to wordpunct_tokenize.
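A minimal sketch of this approach, assuming a hand-built (and deliberately incomplete) contraction map that you would extend for your own data:

    import re
    from nltk.tokenize import wordpunct_tokenize

    # Illustrative, incomplete mapping; add the contractions you need.
    CONTRACTIONS = {
        "don't": "do not",
        "can't": "cannot",
        "won't": "will not",
        "it's": "it is",
    }

    def expand_contractions(text):
        # Build one alternation over all known contractions, matched
        # case-insensitively at word boundaries.
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
            flags=re.IGNORECASE,
        )
        return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

    print(wordpunct_tokenize(expand_contractions("I don't like it")))
    # ['I', 'do', 'not', 'like', 'it']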

