I am currently using NLTK for language processing, but I have run into a problem with sentence tokenization.
Here's the problem: suppose I have the sentence "Fig. 2 shows a USA map." When I use the punkt tokenizer, my code is as follows:
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    abbreviation = ['USA', 'fig']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a USA map.')
It returns this:
['Fig. 2 shows a USA', 'map.']
The tokenizer fails to detect the abbreviation "USA", although it worked for "Fig." Now, when I use the default tokenizer that NLTK provides:
    import nltk

    nltk.tokenize.sent_tokenize('Fig. 2 shows a USA map.')
This time I get:
['Fig.', '2 shows a USA map.']
This time it recognizes the more common "USA", but fails on "Fig."!
How can I combine these two methods? I want to use the standard abbreviations and also add my own.
python tokenize nlp nltk
joe wong