I use nltk to split text into sentences. However, I need sentences that contain quotes to be extracted as a whole. Right now, every sentence is returned as a separate part, even if it is inside a quote.
This is an example of what I'm trying to extract as a whole:
"This is a sentence. This is also a sentence," said the cat.
I currently have this code:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = '"This is a sentence. This is also a sentence," said the cat.'
# Print each detected sentence, separated by a divider line
print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))
This works very well, but I want it to keep quoted text together, even when the quote itself contains several sentences.
The above code produces:
This is a sentence.
I am trying to get all the text extracted as a whole:
"This is a sentence. This is also a sentence," said the cat.
Is there an easy way to do this with nltk, or should I use a regex instead? I was impressed with how easy it is to get started with nltk, but now I'm stuck.
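To make the desired behavior concrete, here is a minimal sketch of one possible post-processing step, not a built-in nltk feature. It assumes the quotes are straight double quotes and are balanced, and that the tokenizer returns the fragments shown in `parts`. The name `merge_quoted` is hypothetical.

```python
def merge_quoted(sentences, quote='"'):
    """Rejoin sentence fragments that were split inside a quotation.

    Assumption: quotes are balanced and all use the same character.
    """
    merged = []
    buffer = []
    open_quote = False
    for sentence in sentences:
        buffer.append(sentence)
        # An odd number of quote marks flips the open/closed state.
        if sentence.count(quote) % 2 == 1:
            open_quote = not open_quote
        # Only emit a unit once we are outside any quotation.
        if not open_quote:
            merged.append(" ".join(buffer))
            buffer = []
    # Flush leftovers if a quotation was never closed.
    if buffer:
        merged.append(" ".join(buffer))
    return merged


# Hypothetical tokenizer output for the example text.
parts = ['"This is a sentence.', 'This is also a sentence," said the cat.']
print(merge_quoted(parts))
```

The function works on any list of strings, so it could be applied to the result of `tokenizer.tokenize(text)`. Curly quotes or nested quotations would need extra handling.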
emh