Splitting sentences using nltk while preserving quotes

Question

I use nltk to split text into sentences. However, I need a sentence that contains a quote to be extracted as a whole. Right now, every sentence is returned as a separate piece, even if it sits inside a quote.

This is an example of what I'm trying to extract as a whole:

"This is a sentence. This is also a sentence," said the cat.

I now have this code:

 import nltk.data
 tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
 text = 'This is a sentence. This is also a sentence," said the cat.'
 print '\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True))

This works very well, but I want it to keep quoted passages intact, even when a quote itself contains several sentences.

The above code produces:

 This is a sentence.
 -----
 This is also a sentence," said the cat.

I am trying to get the entire text extracted as a single unit:

 "This is a sentence. This is also a sentence," said the cat.

Is there an easy way to do this with nltk, or should I use a regex instead? I was impressed with how easy it is to get started with nltk, but now I'm stuck.

+7

python python-2.7 regex nltk

emh Nov 12 '13 at 15:43

2 answers

Harold ship · Answer 1 · 2015-03-15T18:40:47+0000

If I understood the problem correctly, this regex should do it:

 import re
 text = '"This is a sentence. This is also a sentence," said the cat.'
 for grp in re.findall(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
     print "".join(grp)

This is a combination of two patterns or'ed together. The first matches a quoted passage whose final period sits inside the closing quote. The second matches an ordinary sentence, or one or more quoted passages followed by text that ends in a period. If you have more complex sentences, you may need additional adjustments.
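One caveat with the loop above: when the first alternative matches, both capture groups are empty, so joining the groups gives an empty string. Below is a minimal sketch of the same pattern wrapped in a helper that returns the whole match instead. The name split_keeping_quotes is hypothetical, and the sketch assumes straight double quotes and sentences that end in a period.

```python
import re

# Same pattern as in the answer: a quoted passage ending in '."',
# or any number of quoted passages followed by text up to the next period.
SENTENCE_RE = re.compile(r'"[^"]*\."|("[^"]*")*([^".]*\.)')


def split_keeping_quotes(text):
    """Return sentence-like chunks without splitting inside double quotes.

    Hypothetical helper, not part of nltk or the re module.
    """
    # group(0) is the whole match, so it is non-empty whichever branch matched.
    return [m.group(0).strip() for m in SENTENCE_RE.finditer(text)]


text = '"This is a sentence. This is also a sentence," said the cat.'
chunks = split_keeping_quotes(text)
```

Here chunks holds a single element containing the full example sentence. Text that ends in "?" or "!" or uses curly quotes is not covered by this pattern.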

Drewness · Answer 2 · 2013-11-12T16:00:47+0000

Just change the print statement as follows:

 print ' '.join(tokenizer.tokenize(text, realign_boundaries=True))

This joins the sentences with a space instead of '\n-----\n'.
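Joining everything with a space also merges sentences that are not part of a quote. A more selective variant, sketched here under assumptions, merges consecutive tokenizer results only while a double quote is still open. The helper merge_open_quotes is hypothetical, the input list stands in for the tokenizer output, and the sketch assumes the text uses balanced straight double quotes, including the opening quote shown in the example sentence.

```python
def merge_open_quotes(sentences, quote='"'):
    """Join consecutive sentences until the quote characters are balanced.

    Hypothetical helper: `sentences` is any list of strings, for example
    the result of a sentence tokenizer.
    """
    merged = []
    buffer = ''
    for sentence in sentences:
        buffer = buffer + ' ' + sentence if buffer else sentence
        # An even count means every opened quote has been closed.
        if buffer.count(quote) % 2 == 0:
            merged.append(buffer)
            buffer = ''
    if buffer:
        # Unbalanced quote at the end of the text: keep what is left.
        merged.append(buffer)
    return merged


# Stand-in for tokenizer output on the quoted example sentence.
pieces = ['"This is a sentence.', 'This is also a sentence," said the cat.']
result = merge_open_quotes(pieces)
```

With the two pieces above, result is a single string containing the whole quoted sentence, while sentences without quotes pass through unchanged.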
