The behavior of NLTK's word_tokenize with double quotes is confusing
```python
>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]
```

See how it changes `"` into ``` `` ``` and `''`?
What's going on here? Why does it change the characters? Is there a fix? I need to search for each token in the string later.
Python 2.7.6 if that matters.
TL;DR:
nltk.word_tokenize converts starting double quotes from `"` to ``` `` ``` and ending double quotes from `"` to `''`.
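If you only need plain double quotes back, you can post-process the tokens yourself; here is a minimal sketch (the mapping is my own workaround, not an NLTK API):

```python
import nltk

tokens = nltk.word_tokenize('He said, "hello"')
print(tokens)  # ['He', 'said', ',', '``', 'hello', "''"]

# Map the Treebank quote conventions back to plain double quotes:
normalized = ['"' if tok in ('``', "''") else tok for tok in tokens]
print(normalized)  # ['He', 'said', ',', '"', 'hello', '"']
```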
In long:
First, nltk.word_tokenize tokenizes based on how the Penn Treebank was tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23:
```python
class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``.  It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
    """
```
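You can confirm that word_tokenize delegates to this class by calling the tokenizer directly:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Same output as nltk.word_tokenize('"A"') on a single sentence:
print(tokenizer.tokenize('"A"'))  # ['``', 'A', "''"]
```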
Then comes a list of regular expressions for contractions, at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48; they come from Robert MacIntyre's tokenizer, i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed. The contraction patterns split up words like "gonna", "wanna", etc.:
```python
>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
```
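To see how that step works, here is a simplified re-implementation of the contraction handling (the real list in treebank.py is longer; these two patterns are just representative):

```python
import re

# Representative subset of the Treebank contraction patterns:
CONTRACTIONS = [re.compile(r"(?i)\b(gon)(na)\b"),
                re.compile(r"(?i)\b(wan)(na)\b")]

text = "I wanna go home"
for regexp in CONTRACTIONS:
    # Split each contraction into two tokens by inserting spaces:
    text = regexp.sub(r" \1 \2 ", text)
print(text.split())  # ['I', 'wan', 'na', 'go', 'home']
```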
After that, we reach the punctuation handling you are asking about; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:

```python
def tokenize(self, text):
    #starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
```

Ah ha, the starting quotes are changed from `"` to ``` `` ```:
```python
>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'
```
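Chaining those three substitutions into one helper (the function name is mine, for illustration) makes the cumulative effect easier to see:

```python
import re

def apply_starting_quote_rules(text):
    # The three "starting quotes" substitutions, in the same order as tokenize():
    text = re.sub(r'^\"', r'``', text)               # a quote at the start of the string
    text = re.sub(r'(``)', r' \1 ', text)            # pad backtick pairs with spaces
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)   # a quote after a space/bracket opens a quotation
    return text

print(apply_starting_quote_rules('"A"'))    # ' `` A"'
print(apply_starting_quote_rules('("A")'))  # '( `` A")'
```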
Then we get to https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85, which handles the ending quotes:

```python
#ending quotes
text = re.sub(r'"', " '' ", text)
text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)
```

Applying the regexes:
```python
>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
```

So if you want to find double-quote tokens in the output of nltk.word_tokenize, just search for ``` `` ``` and `''` instead of `"`.
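For example, to locate the quote tokens after tokenization (a minimal sketch):

```python
import nltk

tokens = nltk.word_tokenize('The "quick" brown fox said "hi".')
print(tokens)
# ['The', '``', 'quick', "''", 'brown', 'fox', 'said', '``', 'hi', "''", '.']

# Search for the Treebank quote tokens instead of '"':
quote_positions = [i for i, tok in enumerate(tokens) if tok in ('``', "''")]
print(quote_positions)  # [1, 3, 7, 9]
```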