Splitting sentences using nltk while preserving quotes

I use nltk to split text into sentences. However, I need sentences that contain quotes to be extracted as a whole. Right now, every sentence, even one inside a quote, is returned as a separate part.

This is an example of what I'm trying to extract as a whole:

"This is a sentence. This is also a sentence," said the cat. 

I now have this code:

    import nltk.data

    # Load the pre-trained Punkt sentence tokenizer for English
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    text = 'This is a sentence. This is also a sentence," said the cat.'
    print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))

This works very well, but I want it to keep quoted text together, even when a quote contains several sentences.

The code above produces:

    This is a sentence.
    -----
    This is also a sentence," said the cat.

What I want is the whole text extracted as a single unit:

 "This is a sentence. This is also a sentence," said the cat. 

Is there an easy way to do this with nltk, or should I use a regex instead? I was impressed with how easy it is to get started with nltk, but now I'm stuck.

2 answers

If I understand the problem correctly, this regex should do it:

    import re

    text = '"This is a sentence. This is also a sentence," said the cat.'

    # Each findall result is a tuple of the two capture groups; join them back together
    for grp in re.findall(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
        print("".join(grp))

This is a combination of two patterns joined with an OR. The first matches a quoted sentence whose final period sits inside the quotes. The second matches an ordinary sentence, or a quote followed by text that ends in a period. If you have more complex sentences, you may need additional adjustments.
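As a rough sketch of how the same pattern behaves on a longer text, the snippet below wraps it in a helper. The name `split_keeping_quotes` and the sample strings are made up for illustration. It uses the whole match (`group(0)`) rather than the capture groups, because the first alternative contains no groups and would otherwise come back as an empty string. It assumes straight double quotes and sentences that end in a period.

```python
import re

# Pattern copied from the answer above: two alternatives joined with "|".
PATTERN = re.compile(r'"[^"]*\."|("[^"]*")*([^".]*\.)')


def split_keeping_quotes(text):
    """Return sentence-like chunks, keeping double-quoted spans intact.

    Hypothetical helper: only handles straight double quotes and
    sentences terminated by a period.
    """
    # group(0) is the full match, whichever alternative matched.
    return [m.group(0).strip() for m in PATTERN.finditer(text)]


sample = '"This is a sentence. This is also a sentence," said the cat. The dog left.'
chunks = split_keeping_quotes(sample)
for chunk in chunks:
    print(chunk)
```

Text that ends in `?` or `!`, or that uses curly quotes, is not covered by this pattern and would need extra alternatives.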


Just change the print statement as follows:

    print(' '.join(tokenizer.tokenize(text, realign_boundaries=True)))

This will concatenate the sentences with a space instead of \n-----\n.
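Note that joining everything with a space also merges sentences that lie outside any quote. One possible variation, sketched here under assumptions, is to rejoin only the fragments that fall inside an open quote by counting double-quote characters. The function `merge_quoted` is a hypothetical helper, not part of nltk, and the `fragments` list is a hand-written stand-in for what `tokenizer.tokenize(text)` might return, so the snippet runs without nltk installed.

```python
def merge_quoted(sentences):
    """Rejoin consecutive sentences while a double quote is still open.

    Hypothetical post-processing step; assumes straight double quotes
    that are not nested.
    """
    merged, buffer = [], []
    for sentence in sentences:
        buffer.append(sentence)
        joined = " ".join(buffer)
        # An even number of '"' means every quote opened so far is closed.
        if joined.count('"') % 2 == 0:
            merged.append(joined)
            buffer = []
    if buffer:
        # Unterminated quote: keep the leftover text rather than drop it.
        merged.append(" ".join(buffer))
    return merged


# Stand-in for the tokenizer output (assumed, not produced by nltk here).
fragments = [
    '"This is a sentence.',
    'This is also a sentence," said the cat.',
    'The dog left.',
]
result = merge_quoted(fragments)
print('\n-----\n'.join(result))
```

With real tokenizer output, the call would be `merge_quoted(tokenizer.tokenize(text))`. Apostrophes are unaffected because only `"` is counted.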
