NLTK Tokenization and Contractions

I'm tokenizing text with NLTK, passing sentences to wordpunct_tokenize. This splits contractions (for example, "don't" into "don" + "'" + "t"), but I want to keep them as one word. I'm refining my methods to get more accurate tokenization of the text, so I need to dig deeper into NLTK's tokenize module, beyond simple tokenization.
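For reference, a minimal reproduction of the behavior described above (assuming NLTK is installed):

    from nltk.tokenize import wordpunct_tokenize

    # wordpunct_tokenize splits on the regex \w+|[^\w\s]+, so the apostrophe
    # becomes its own token and the contraction falls apart.
    print(wordpunct_tokenize("I don't like it"))
    # ['I', 'don', "'", 't', 'like', 'it']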

I assume this is a common problem, and I'd like to hear from others who may have dealt with this particular issue before.

edit:

Yes, I know this is a broad, vague question.

Also, as a newbie to NLP, do I need to worry about contractions at all?

EDIT:

SExprTokenizer or TreebankWordTokenizer seem to do what I'm looking for.
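For anyone comparing: TreebankWordTokenizer doesn't keep contractions entirely whole, but it handles them more gracefully than wordpunct_tokenize, splitting off the clitic as a single token rather than isolating the apostrophe:

    from nltk.tokenize import TreebankWordTokenizer

    # The Treebank tokenizer splits "don't" into "do" + "n't" instead of
    # scattering the apostrophe into a separate token.
    print(TreebankWordTokenizer().tokenize("I don't like it"))
    # ['I', 'do', "n't", 'like', 'it']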

Tags: python, nlp, nltk

3 answers

Which tokenizer you use really depends on what you want to do next. As G4dget pointed out in the comments, some part-of-speech taggers handle split contractions, in which case the splitting is a good thing. But perhaps that is not what you want. To decide which tokenizer is best, consider what you need for the next step, and then paste your text into http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
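You can also run the same comparison locally. A quick sketch (assuming NLTK's punkt data is installed for word_tokenize, and a recent enough NLTK that includes TweetTokenizer, which keeps contractions intact):

    from nltk.tokenize import TweetTokenizer, word_tokenize, wordpunct_tokenize

    text = "I don't think we're ready."

    print(wordpunct_tokenize(text))         # ['I', 'don', "'", 't', 'think', 'we', "'", 're', 'ready', '.']
    print(word_tokenize(text))              # ['I', 'do', "n't", 'think', 'we', "'re", 'ready', '.']
    print(TweetTokenizer().tokenize(text))  # ['I', "don't", 'think', "we're", 'ready', '.']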


I have worked with NLTK before this project. When I did, I found that contractions were useful to take into account.

However, I did not write a custom tokenizer; I just handled the contractions after POS tagging.
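The answer doesn't show code, but as a purely hypothetical illustration of what post-processing could look like (here applied right after tokenization rather than after POS tagging, using a made-up helper merge_contractions):

    def merge_contractions(tokens):
        # Hypothetical helper (not the answerer's actual code): re-join
        # tokens that wordpunct_tokenize split at an apostrophe,
        # e.g. ['don', "'", 't'] -> ["don't"].
        merged = []
        i = 0
        while i < len(tokens):
            if (i + 2 < len(tokens) and tokens[i + 1] == "'"
                    and tokens[i].isalpha() and tokens[i + 2].isalpha()):
                merged.append(tokens[i] + "'" + tokens[i + 2])
                i += 3
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    print(merge_contractions(['I', 'don', "'", 't', 'like', 'it']))
    # ['I', "don't", 'like', 'it']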

I suspect this is not the answer you are looking for, but I hope it helps a little.


Since the number of contractions is very small, one way to do this is to search for and replace every contraction with its full equivalent (for example, "don't" → "do not"), and then pass the updated sentences to wordpunct_tokenize.
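A minimal sketch of this approach, assuming a hand-built (and deliberately incomplete) contraction map that you would extend for your own data:

    import re
    from nltk.tokenize import wordpunct_tokenize

    # Illustrative, incomplete mapping; add the contractions you need.
    CONTRACTIONS = {
        "don't": "do not",
        "can't": "cannot",
        "won't": "will not",
        "it's": "it is",
    }

    def expand_contractions(text):
        # Build one alternation over all known contractions, matched
        # case-insensitively at word boundaries.
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
            flags=re.IGNORECASE,
        )
        return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

    print(wordpunct_tokenize(expand_contractions("I don't like it")))
    # ['I', 'do', 'not', 'like', 'it']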

