Python: phrase tokenization

I have blocks of text that I want to tokenize, but I don't want to tokenize strictly on whitespace and punctuation, as seems to be standard with tools like NLTK. There are certain phrases that I want treated as a single token rather than split apart by an ordinary tokenizer.

For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin, which was originally broadcast on NBC from September 22, 1999 to May 14, 2006" and adding the phrase "West Wing" to the tokenizer, the resulting tokens would be:

  • The
  • West Wing
  • is
  • an
  • American
  • ...

What is the best way to achieve this? I would prefer to stick with a tool like NLTK.

python tokenize nlp nltk
3 answers

If you have a fixed set of phrases that you are looking for, then one simple solution is to tokenize your input and then "reassemble" the multi-word tokens. Alternatively, do a regular-expression search and replace before tokenization, which turns The West Wing into The_West_Wing .
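A rough sketch of the search-and-replace approach (the phrase list and helper name here are illustrative, not part of the original answer):

    import re
    from nltk.tokenize import word_tokenize  # requires NLTK's 'punkt' data

    phrases = ["The West Wing"]  # fixed set of phrases to keep intact

    def tokenize_with_phrases(text):
        # Join each protected phrase with underscores before tokenizing,
        # so the standard tokenizer leaves it as one token.
        for phrase in phrases:
            text = re.sub(re.escape(phrase), phrase.replace(" ", "_"), text)
        return word_tokenize(text)

    print(tokenize_with_phrases("The West Wing is an American television serial drama"))
    # ['The_West_Wing', 'is', 'an', 'American', 'television', 'serial', 'drama']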

For more flexibility, use regexp_tokenize or see chapter 7 of the NLTK book .
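For example, regexp_tokenize can match a multi-word phrase via an alternation that is tried before a generic single-word fallback (the pattern below is my own illustration):

    from nltk.tokenize import regexp_tokenize

    text = "The West Wing is an American television serial drama"
    # Alternatives are tried left to right, so the multi-word phrase
    # wins over the generic \w+ word pattern.
    print(regexp_tokenize(text, r"The West Wing|\w+"))
    # ['The West Wing', 'is', 'an', 'American', 'television', 'serial', 'drama']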


You can use NLTK's multi-word expression tokenizer, MWETokenizer :

    from nltk.tokenize import MWETokenizer

    tokenizer = MWETokenizer()
    # Register the phrase as a tuple of its component words.
    tokenizer.add_mwe(('the', 'west', 'wing'))
    # MWETokenizer retokenizes an already-tokenized sentence,
    # so split (or word-tokenize) the input first.
    tokenizer.tokenize('Something about the west wing'.split())

You'll get:

 ['Something', 'about', 'the_west_wing'] 

If you do not know the specific phrases in advance, you could use scikit-learn's CountVectorizer class. It has the option to specify larger n-gram ranges (ngram_range) and to ignore any n-grams that do not appear in enough documents (min_df). You may identify some phrases that you had not realized were common, but you will also find some that are meaningless. It also has the option to filter out English stop words (uninformative words like "have") using the stop_words parameter, as the sketch below shows.
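A minimal sketch (the toy documents are made up for illustration; get_feature_names_out is the accessor in recent scikit-learn versions):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The West Wing is an American television drama",
        "The West Wing was created by Aaron Sorkin",
        "Aaron Sorkin also wrote the screenplay",
    ]

    # Count unigrams through trigrams, keep only n-grams appearing in
    # at least two documents, and drop English stop words.
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2, stop_words='english')
    vectorizer.fit(docs)
    print(vectorizer.get_feature_names_out())
    # ['aaron' 'aaron sorkin' 'sorkin' 'west' 'west wing' 'wing']

Here "west wing" and "aaron sorkin" surface as recurring bigrams without being specified in advance.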

