How to avoid the NLTK sentence tokenizer splitting on abbreviations?

I am currently using NLTK for language processing, and I have run into a problem with sentence tokenization.

Here's the problem: suppose I have the sentence "Fig. 2 shows a U.S.A. map." When I use the punkt tokenizer, my code looks as follows:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    abbreviation = ['U.S.A', 'fig']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')

It returns this:

    ['Fig. 2 shows a U.S.A.', 'map.']

The tokenizer cannot detect the abbreviation "U.S.A.", but it works for "fig". Now, when I use the default tokenizer that NLTK provides:

    import nltk
    nltk.tokenize.sent_tokenize('Fig. 2 shows a U.S.A. map.')

This time I get:

    ['Fig.', '2 shows a U.S.A. map.']

It recognizes the more common "U.S.A." but does not see "fig"!
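As a sanity check, you can peek at the abbreviation set inside the pre-trained English punkt model that sent_tokenize uses. This is only a rough probe of NLTK internals: _params is a private attribute, and it assumes the 'tokenizers/punkt/english.pickle' resource is downloaded.

    import nltk

    # Load the pre-trained English punkt model behind sent_tokenize.
    punkt = nltk.data.load('tokenizers/punkt/english.pickle')

    # 'u.s.a' is most likely in the learned set, while 'fig' is not,
    # which would explain the split after 'Fig.' above.
    print('u.s.a' in punkt._params.abbrev_types)
    print('fig' in punkt._params.abbrev_types)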

How can I combine these two methods? I want to use the standard abbreviations as well as add my own.

1 answer

I think using lowercase "u.s.a" in the abbreviations list will work fine for you: punkt stores abbreviation types in lowercase and without the trailing period, so your entries have to match that form. Try this:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    abbreviation = ['u.s.a', 'fig']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')

It returns this to me:

    ['Fig. 2 shows a U.S.A. map.']
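To actually combine both, one option is to load the pre-trained English punkt model and extend its learned abbreviation set with your own entries. This is just a sketch relying on NLTK internals: it assumes the 'tokenizers/punkt/english.pickle' resource is installed, and _params is a private attribute, so it may break across versions.

    import nltk

    # Load the pre-trained English punkt model that sent_tokenize uses.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    # Extend its learned abbreviations with custom entries
    # (lowercase, without the trailing period).
    tokenizer._params.abbrev_types.update(['u.s.a', 'fig'])

    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
    # expected: ['Fig. 2 shows a U.S.A. map.']

This keeps every abbreviation the stock model already knows while adding the missing "fig".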
