How to avoid the NLTK sentence tokenizer splitting on abbreviations?

I am currently using NLTK for language processing, and I have run into a problem with sentence tokenization.

Here's the problem: suppose I have the sentence "Fig. 2 shows a U.S.A. map." When I use the punkt tokenizer, my code looks as follows:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    abbreviation = ['U.S.A', 'fig']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')

It returns this:

    ['Fig. 2 shows a U.S.A.', 'map.']

The tokenizer cannot detect the abbreviation "U.S.A.", but it works for "fig". Now, when I use the default tokenizer that NLTK provides:

    import nltk
    nltk.tokenize.sent_tokenize('Fig. 2 shows a U.S.A. map.')

This time I get:

    ['Fig.', '2 shows a U.S.A. map.']

It recognizes the more common "U.S.A." but does not see "fig"!
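As a sanity check, you can peek at the abbreviation set inside the pre-trained English punkt model that sent_tokenize uses. This is only a rough probe of NLTK internals: _params is a private attribute, and it assumes the 'tokenizers/punkt/english.pickle' resource is downloaded.

    import nltk

    # Load the pre-trained English punkt model behind sent_tokenize.
    punkt = nltk.data.load('tokenizers/punkt/english.pickle')

    # 'u.s.a' is most likely in the learned set, while 'fig' is not,
    # which would explain the split after 'Fig.' above.
    print('u.s.a' in punkt._params.abbrev_types)
    print('fig' in punkt._params.abbrev_types)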

How can I combine these two methods? I want to use the standard abbreviations as well as add my own.

1 answer

I think using lowercase "u.s.a" in the abbreviations list will work fine for you: punkt stores abbreviation types in lowercase and without the trailing period, so your entries have to match that form. Try this:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    abbreviation = ['u.s.a', 'fig']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')

It returns this to me:

    ['Fig. 2 shows a U.S.A. map.']
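To actually combine both, one option is to load the pre-trained English punkt model and extend its learned abbreviation set with your own entries. This is just a sketch relying on NLTK internals: it assumes the 'tokenizers/punkt/english.pickle' resource is installed, and _params is a private attribute, so it may break across versions.

    import nltk

    # Load the pre-trained English punkt model that sent_tokenize uses.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    # Extend its learned abbreviations with custom entries
    # (lowercase, without the trailing period).
    tokenizer._params.abbrev_types.update(['u.s.a', 'fig'])

    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
    # expected: ['Fig. 2 shows a U.S.A. map.']

This keeps every abbreviation the stock model already knows while adding the missing "fig".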
