I'm not sure what the desired output is, but I think you may need to split the text into paragraphs before passing each one to nltk.sent_tokenize, i.e.:
>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
...
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...         print sent
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
If you want the strings inside the double quotes, you could try this:
>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']
Or maybe you need the following:
>>> for pg in paragraphs:
...
When reading from a file, try using the io module:
alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
>>> for sent in sent_tokenize(text):
...     print sent
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
And to split the text into paragraphs and extract the quotes:
>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...
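The body of the loop above is cut off, so here is only a minimal sketch of the idea, not the original code. It reuses the same double-quote pattern and the same blank-line paragraph split; the helper name quotes_per_paragraph is made up for illustration, and the print call is written for Python 3.

```python
import re

# Same pattern as above: capture whatever sits between two double quotes.
STR_IN_DOUBLEQUOTES = re.compile(r'"([^"]*)"')


def quotes_per_paragraph(text):
    """Hypothetical helper: one list of quoted strings per paragraph."""
    # Paragraphs are assumed to be separated by a blank line.
    return [STR_IN_DOUBLEQUOTES.findall(pg) for pg in text.split('\n\n')]


text = ('After Du died of suffocation, her boyfriend posted a heartbreaking '
        'message online: "Losing consciousness in my arms, your breath and '
        'heartbeat became weaker and weaker. Finally they pushed you out of '
        'the cold emergency room. I failed to protect you."\n\n'
        'Li Na, 23, a migrant worker from a farming family in Jiangxi '
        'province, was looking forward to getting married in 2015.')

for number, quotes in enumerate(quotes_per_paragraph(text), 1):
    print(number, quotes)
```

The first paragraph yields a single quote and the second an empty list, since it contains no double quotes.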
And now for the magic: to join the sentence that precedes the quote with the quote itself (don't blink, it looks exactly the same as above):
>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...
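Since the loop body is cut off here too, the following is a guess at the general idea rather than the original code: a single regex that grabs the lead-in clause, the colon or comma, and the quotation that follows. The names LEADIN_AND_QUOTE and leadin_with_quote are hypothetical.

```python
import re

# Lead-in clause (no sentence-final punctuation, no quote characters),
# then a colon or comma, optional whitespace, then the quotation itself.
LEADIN_AND_QUOTE = re.compile(r'[^.!?"]*[:,]\s*"[^"]*"')


def leadin_with_quote(paragraph):
    """Hypothetical helper: return each lead-in joined with its quotation."""
    return [m.group(0).strip() for m in LEADIN_AND_QUOTE.finditer(paragraph)]


paragraph = ('After Du died of suffocation, her boyfriend posted a '
             'heartbreaking message online: "Losing consciousness in my '
             'arms, your breath and heartbeat became weaker and weaker. '
             'Finally they pushed you out of the cold emergency room. '
             'I failed to protect you."')

for chunk in leadin_with_quote(paragraph):
    print(chunk)
```

On this paragraph the whole text comes back as one chunk, because the full stops inside the quotation are never treated as sentence boundaries.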
The problem with the code above is that it only works for sentences in which the quotation comes last, for example:
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
And it cannot handle sentences in which the quotation comes first:
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you," her boyfriend posted a heartbreaking message online after Du died of suffocation.
For reference, my Python and NLTK versions are:
$ python -c "import nltk; print nltk.__version__"
'3.0.3'
$ python -V
Python 2.7.6
Beyond the computational aspect of text processing, there is something subtly different about the grammar of the text in the question.
The fact that the quotation is introduced by a colon (:) is not typical of traditional English grammar. This may have been popularized by Chinese news writing, because in Chinese:
在 杜 窒息 死亡 后, 男友 在 网上 发 了 令人 心碎 的 消息: "..."
In traditional English, in a very prescriptive grammatical sense, this would be:
After Du died of suffocation, her boyfriend posted a heartbreaking message online, "..."
And when the statement follows the quote, this is signalled by a trailing comma instead of a full stop, e.g.:
"...", her boyfriend posted a heartbreaking message online after Du died of suffocation.