The behavior of NLTK's word_tokenize with double quotes is confusing
```python
>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]
```

See how it changes `"` into ``` `` ``` and `''`?
What's going on here? Why does it change the characters? Is there a fix? I need to search for each token in the string later.
Python 2.7.6 if that matters.
TL;DR:
nltk.word_tokenize converts starting double quotes from `"` to ``` `` ``` and ending double quotes from `"` to `''`.
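If you only need plain double quotes back, you can post-process the tokens yourself; here is a minimal sketch (the mapping is my own workaround, not an NLTK API):

```python
import nltk

tokens = nltk.word_tokenize('He said, "hello"')
print(tokens)  # ['He', 'said', ',', '``', 'hello', "''"]

# Map the Treebank quote conventions back to plain double quotes:
normalized = ['"' if tok in ('``', "''") else tok for tok in tokens]
print(normalized)  # ['He', 'said', ',', '"', 'hello', '"']
```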
In long:
First, nltk.word_tokenize tokenizes based on how the Penn Treebank was tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23:
```python
class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``.  It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
    """
```
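You can confirm that word_tokenize delegates to this class by calling the tokenizer directly:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Same output as nltk.word_tokenize('"A"') on a single sentence:
print(tokenizer.tokenize('"A"'))  # ['``', 'A', "''"]
```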
Then comes a list of regular expressions for contractions, at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48; they come from Robert MacIntyre's tokenizer, i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed. The contraction patterns split up words like "gonna", "wanna", etc.:
```python
>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
```
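To see how that step works, here is a simplified re-implementation of the contraction handling (the real list in treebank.py is longer; these two patterns are just representative):

```python
import re

# Representative subset of the Treebank contraction patterns:
CONTRACTIONS = [re.compile(r"(?i)\b(gon)(na)\b"),
                re.compile(r"(?i)\b(wan)(na)\b")]

text = "I wanna go home"
for regexp in CONTRACTIONS:
    # Split each contraction into two tokens by inserting spaces:
    text = regexp.sub(r" \1 \2 ", text)
print(text.split())  # ['I', 'wan', 'na', 'go', 'home']
```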
After that, we reach the punctuation handling you are asking about; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:

```python
def tokenize(self, text):
    #starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
```

Ah ha, the starting quotes are changed from `"` to ``` `` ```:
```python
>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'
```
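Chaining those three substitutions into one helper (the function name is mine, for illustration) makes the cumulative effect easier to see:

```python
import re

def apply_starting_quote_rules(text):
    # The three "starting quotes" substitutions, in the same order as tokenize():
    text = re.sub(r'^\"', r'``', text)               # a quote at the start of the string
    text = re.sub(r'(``)', r' \1 ', text)            # pad backtick pairs with spaces
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)   # a quote after a space/bracket opens a quotation
    return text

print(apply_starting_quote_rules('"A"'))    # ' `` A"'
print(apply_starting_quote_rules('("A")'))  # '( `` A")'
```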
Then we get to https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85, which handles the ending quotes:

```python
#ending quotes
text = re.sub(r'"', " '' ", text)
text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)
```

Applying the regexes:
```python
>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
```

So if you want to find double-quote tokens in the output of nltk.word_tokenize, just search for ``` `` ``` and `''` instead of `"`.
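For example, to locate the quote tokens after tokenization (a minimal sketch):

```python
import nltk

tokens = nltk.word_tokenize('The "quick" brown fox said "hi".')
print(tokens)
# ['The', '``', 'quick', "''", 'brown', 'fox', 'said', '``', 'hi', "''", '.']

# Search for the Treebank quote tokens instead of '"':
quote_positions = [i for i, tok in enumerate(tokens) if tok in ('``', "''")]
print(quote_positions)  # [1, 3, 7, 9]
```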