Identification of sentences for texts containing quotation marks

Question


Code:

    from pprint import pprint
    from nltk.tokenize import sent_tokenize
    from unidecode import unidecode
    pprint(sent_tokenize(unidecode(text)))

Output:

    ['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
     'Finally they pushed you out of the cold emergency room.',
     'I failed to protect you.',
     '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.']

Input:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The closing quotation mark should be included in the previous sentence instead of being attached to "Li. The tokenizer fails at ." How can I fix this?

Edit: How the text is extracted.

    html = open(path, "r").read()                # reads the HTML code
    article = extractor.extract(raw_html=html)   # extracts the content
    text = unidecode(article.cleaned_text)       # transliterates Unicode to ASCII

Here article.cleaned_text is Unicode. The idea behind using unidecode is to convert characters such as “ to ".
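
As a side note, unidecode transliterates every non-ASCII character, not only quotation marks. If only the quotes matter, a narrower mapping might be enough. The following is a minimal sketch for Python 3, using only the standard library; the name normalize_quotes and the choice of characters in QUOTE_MAP are assumptions made for illustration, not part of the original code.

```python
# Hypothetical helper: map common curly quotes to their ASCII equivalents
# and leave every other character (including non-Latin text) untouched.
QUOTE_MAP = {
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x201E: '"',  # double low-9 quotation mark
    0x2018: "'",  # left single quotation mark
    0x2019: "'",  # right single quotation mark
}


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)
```

The result could then be passed to sent_tokenize in place of unidecode(text), assuming the rest of the pipeline copes with the remaining non-ASCII characters.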

@alvas's solution gives the wrong result:

 ['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.', 'Finally they pushed you out of the cold emergency room.', 'I failed to protect you.', '"', 'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.' ]

Edit 2: (Updated) nltk and Python versions

    $ python -c "import nltk; print nltk.__version__"
    3.0.4
    $ python -V
    Python 2.7.9


python tokenize nlp nltk

Abhishek bhatia Aug 14 '15 at 6:03


1 answer

alvas · Answer 1 · 2015-08-14T13:12:10+0000

I'm not sure what the desired output is, but I think you might need to segment the text into paragraphs before calling nltk.sent_tokenize, i.e.:

    >>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
    ...
    ... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
    >>> from nltk import sent_tokenize
    >>> paragraphs = text.split('\n\n')
    >>> for pg in paragraphs:
    ...     for sent in sent_tokenize(pg):
    ...         print sent
    ...
    After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
    Finally they pushed you out of the cold emergency room.
    I failed to protect you."
    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
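
If the paragraph boundaries are not reliable, another option might be to post-process the list that sent_tokenize returns and move an orphaned closing quote back to the sentence it closes. The sketch below is only an illustration: reattach_closing_quotes is a hypothetical helper, it assumes straight double quotes that are never nested, and it works on any list of strings, so it does not call NLTK itself.

```python
def reattach_closing_quotes(sentences):
    """Move a leading '"' that closes an open quote back to the previous sentence.

    Assumes straight double quotes only and no nested quotations.
    """
    fixed = []
    quote_open = False  # True while an opening quote has not been closed yet
    for sent in sentences:
        stripped = sent.lstrip()
        if quote_open and fixed and stripped.startswith('"'):
            # The quote at the start of this item closes the pending quotation.
            fixed[-1] += '"'
            quote_open = False
            sent = stripped[1:].lstrip()
            if not sent:  # the item was nothing but the stray quote
                continue
        if sent.count('"') % 2 == 1:
            # An odd number of quotes toggles the open/closed state.
            quote_open = not quote_open
        fixed.append(sent)
    return fixed
```

Applied to either of the two outputs shown in the question, this would give 'I failed to protect you."' as the third sentence and a fourth sentence starting with 'Li Na'.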

If you want the strings inside the double quotes, you could try this:

    >>> import re
    >>> str_in_doublequotes = r'"([^"]*)"'
    >>> re.findall(str_in_doublequotes, text)
    ['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']

Or maybe you need the following:

    >>> for pg in paragraphs:
    ...     # Collects the quotes inside the paragraph
    ...     in_quotes = re.findall(str_in_doublequotes, pg)
    ...     for q in in_quotes:
    ...         # Keep track of the quotes with tabs.
    ...         pg = pg.replace('"{}"'.format(q), '\t')
    ...     for _pg in pg.split('\t'):
    ...         for sent in sent_tokenize(_pg):
    ...             print sent
    ...         try:
    ...             print '"{}"'.format(in_quotes.pop(0))
    ...         except IndexError: # Nothing to pop.
    ...             pass
    ...
    After Du died of suffocation, her boyfriend posted a heartbreaking message online:
    "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

When reading from a file, try using the io package:

    alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
    alvas@ubi:~$ cat in.txt
    After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
    alvas@ubi:~$ python
    Python 2.7.6 (default, Jun 22 2015, 17:58:13)
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import io
    >>> from nltk import sent_tokenize
    >>> text = io.open('in.txt', 'r', encoding='utf8').read()
    >>> print text
    After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
    >>> for sent in sent_tokenize(text):
    ...     print sent
    ...
    After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
    Finally they pushed you out of the cold emergency room.
    I failed to protect you."
    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

And with the paragraph and quote extraction hacks:

    >>> import io, re
    >>> from nltk import sent_tokenize
    >>> str_in_doublequotes = r'"([^"]*)"'
    >>> paragraphs = text.split('\n\n')
    >>> for pg in paragraphs:
    ...     # Collects the quotes inside the paragraph
    ...     in_quotes = re.findall(str_in_doublequotes, pg)
    ...     for q in in_quotes:
    ...         # Keep track of the quotes with tabs.
    ...         pg = pg.replace('"{}"'.format(q), '\t')
    ...     for _pg in pg.split('\t'):
    ...         for sent in sent_tokenize(_pg):
    ...             print sent
    ...         try:
    ...             print '"{}"'.format(in_quotes.pop(0))
    ...         except IndexError: # Nothing to pop.
    ...             pass
    ...
    After Du died of suffocation, her boyfriend posted a heartbreaking message online:
    "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

And for the magic of joining the pre-quote sentence with the quote (don't blink, it looks almost exactly the same as above; the only change is the trailing comma after print sent):

    >>> import io, re
    >>> from nltk import sent_tokenize
    >>> str_in_doublequotes = r'"([^"]*)"'
    >>> paragraphs = text.split('\n\n')
    >>> for pg in paragraphs:
    ...     # Collects the quotes inside the paragraph
    ...     in_quotes = re.findall(str_in_doublequotes, pg)
    ...     for q in in_quotes:
    ...         # Keep track of the quotes with tabs.
    ...         pg = pg.replace('"{}"'.format(q), '\t')
    ...     for _pg in pg.split('\t'):
    ...         for sent in sent_tokenize(_pg):
    ...             print sent,
    ...         try:
    ...             print '"{}"'.format(in_quotes.pop(0))
    ...         except IndexError: # Nothing to pop.
    ...             pass
    ...
    After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
    Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The problem with the code above is that it only handles sentences in which the quote follows the reporting clause, such as:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

It cannot handle sentences in which the quote comes first:

"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you," her boyfriend posted a heartbreaking message online after Du died of suffocation.

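One way to cover both orders might be to mask each quotation with a placeholder before splitting and to restore it afterwards, so that the full stops inside the quote never reach the splitter. This is a rough sketch and not the approach used above: split_keeping_quotes, the QUOTE0MASK-style placeholder, and the naive regex splitter are all assumptions made for illustration, and the text is assumed not to contain the placeholder already.

```python
import re

QUOTE_RE = re.compile(r'"[^"]*"')              # a straight-quoted span, not nested
BOUNDARY_RE = re.compile(r'(?<=[.!?])\s+')     # naive stand-in for a sentence tokenizer
MASK_RE = re.compile(r'QUOTE(\d+)MASK')


def split_keeping_quotes(paragraph, splitter=None):
    """Split a paragraph into sentences without splitting inside quotations."""
    if splitter is None:
        splitter = BOUNDARY_RE.split
    quotes = []

    def mask(match):
        # Remember the quotation and leave a numbered placeholder in its place.
        quotes.append(match.group(0))
        return 'QUOTE{}MASK'.format(len(quotes) - 1)

    masked = QUOTE_RE.sub(mask, paragraph)
    sentences = [s for s in splitter(masked) if s]
    # Put the original quotations back into the split sentences.
    return [MASK_RE.sub(lambda m: quotes[int(m.group(1))], s) for s in sentences]
```

Passing splitter=sent_tokenize should let NLTK do the actual splitting, on the assumption that it treats the placeholder as an ordinary word. Note that a quotation that ends a sentence and is directly followed by the next sentence in the same paragraph is not split off by this sketch.
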
Just to be sure, my python / nltk versions are:

    $ python -c "import nltk; print nltk.__version__"
    '3.0.3'
    $ python -V
    Python 2.7.6

Beyond the computational aspect of the text processing, there is something subtly different about the grammar of the text in the question.

The fact that the quotation is introduced by a colon (:) is untypical of traditional English grammar. This might have been popularized by Chinese news, because in Chinese:

啊杜窒息死亡后, 男友在网上发了令人心碎的消息: "..."

In traditional English, in a very prescriptive grammatical sense, this would be:

After Du died of suffocation, her boyfriend posted a heartbreaking message online, "..."

And a statement that follows the quote would be signaled by a comma at the end of the quote instead of a full stop, e.g.:

"...", her boyfriend posted a heartbreaking message online after Du died of suffocation.

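The punctuation convention described above could be turned into a small heuristic: a closing quote next to a comma means the sentence continues, while a closing quote preceded by a full stop probably ends it. The function below is a sketch under that assumption; closing_quote_ends_sentence is a hypothetical name, and quotes ending in ? or ! followed by a reporting clause would be misjudged.

```python
def closing_quote_ends_sentence(text, close_idx):
    """Guess whether the closing quote at text[close_idx] ends a sentence.

    Heuristic only: a comma just inside or just outside the closing quote
    signals that a reporting clause follows.
    """
    before = text[:close_idx].rstrip()[-1:]     # last character inside the quote
    after = text[close_idx + 1:].lstrip()[:1]   # first character after the quote
    if before == ',' or after == ',':
        return False
    return before in ('.', '!', '?')
```

For example, in the sentence where the quote is followed by "Li Na, 23, ...", the character before the closing quote is a full stop, so the heuristic reports a sentence boundary there.
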