How to extract sentences containing the names of specific individuals using R

I am using R to extract sentences containing the names of specific individuals from texts. Here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He attended the Leipzig disputation (1519) as a spectator, but participated through his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

This short paragraph contains several personal names, such as Johann Reuchlin, Melanchthon, and Johann Eck. Using the openNLP package, three of the names (Martin Luther, Paul, and Melanchthon) are correctly extracted and recognized. I have two questions:

  • How can I extract the sentences containing these names?
  • Since the output of the named-entity recognizer is not very promising, suppose I mark each name with "[[ ]]" myself, for example [[Johann Reuchlin]], [[Melanchthon]]. How can I extract the sentences containing these name expressions [[A]], [[B]], ...?
2 answers
Using `strsplit` and `grep`. First I made an object `para` containing your paragraph.

    toMatch <- c("Martin Luther", "Paul", "Melanchthon")
    unlist(strsplit(para, split = "\\."))[grep(paste(toMatch, collapse = "|"), unlist(strsplit(para, split = "\\.")))]

    [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
    [2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
    [3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
    [4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Or a little cleaner:

    sentences <- unlist(strsplit(para, split = "\\."))
    sentences[grep(paste(toMatch, collapse = "|"), sentences)]
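Splitting on `"\\."` drops the period and leaves a leading space on every sentence after the first. One possible refinement, sketched below, is to split on the whitespace that follows a period instead. The paragraph here is an abridged stand-in for `para`, shortened for illustration, and the splitting rule is only a heuristic (it would also split after abbreviations such as "Dr.").

```r
# Abridged stand-in for the example paragraph (shortened for illustration).
para <- paste(
  "He accepted a call to the University of Wittenberg by Martin Luther.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture."
)

# Split on whitespace that is preceded by a period; the lookbehind keeps
# the period attached to its sentence and needs perl = TRUE.
sentences <- unlist(strsplit(para, "(?<=\\.)\\s+", perl = TRUE))

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Keep every sentence that mentions at least one of the names.
hits <- sentences[grepl(paste(toMatch, collapse = "|"), sentences)]
print(hits)
```

With this abridged paragraph, the Leipzig sentence is the only one left out, because it mentions none of the three names.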

If you want the sentences for each person returned separately, do this:

    toMatch <- c("Martin Luther", "Paul", "Melanchthon")
    sentences <- unlist(strsplit(para, split = "\\."))
    foo <- function(Match) {
      sentences[grep(Match, sentences)]
    }
    lapply(toMatch, foo)

    [[1]]
    [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

    [[2]]
    [1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

    [[3]]
    [1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
    [2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Edit 3: To include each person's name at the top of their results, do something simple, for example:

    foo <- function(Match) {
      c(Match, sentences[grep(Match, sentences)])
    }
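An alternative to prepending the name is to return a named list, so each person's sentences can be looked up by name. The following is a minimal sketch rather than part of the original answer; the `sentences` vector is an abridged stand-in for the split paragraph, and `byName` is a name chosen here for illustration.

```r
# Abridged stand-in for the sentences produced by strsplit() above.
sentences <- c(
  "He accepted a call to the University of Wittenberg by Martin Luther",
  " Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  " He studied the Scripture, especially of Paul, and Evangelical doctrine",
  " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# sapply() over a character vector names the result after its elements;
# simplify = FALSE keeps it a list. fixed = TRUE treats each name as a
# literal string rather than a regular expression.
byName <- sapply(
  toMatch,
  function(name) trimws(sentences[grepl(name, sentences, fixed = TRUE)]),
  simplify = FALSE
)

print(byName[["Melanchthon"]])
```

Because each name is tested against every sentence independently, a sentence that mentions several of the names appears under each of them.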

EDIT 4:

And if you want to find sentences that contain several people / places / things (words) at once, just add a pattern that combines them using lookaheads, such as:

 toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)") 

and set `perl = TRUE`, since lookaheads are only supported by the Perl-compatible regex engine:

    foo <- function(Match) {
      c(Match, sentences[grep(Match, sentences, perl = TRUE)])
    }
    lapply(toMatch, foo)

    [[1]]
    [1] "Martin Luther"
    [2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

    [[2]]
    [1] "Paul"
    [2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

    [[3]]
    [1] "Melanchthon"
    [2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
    [3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

    [[4]]
    [1] "(?=.*Melanchthon)(?=.*Scripture)"
    [2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
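Writing the lookahead pattern by hand gets tedious with more than two terms. As a rough sketch, the pattern could be assembled from a vector of terms; `allOf` is a hypothetical helper introduced here, and it assumes the terms contain no regex metacharacters (those would need escaping first). The `sentences` vector is again an abridged stand-in.

```r
# Hypothetical helper: build a pattern that matches only when every
# term occurs somewhere in the string, in any order.
allOf <- function(terms) {
  paste0("(?=.*", terms, ")", collapse = "")
}

# Abridged stand-in for the split paragraph.
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)

pattern <- allOf(c("Melanchthon", "Scripture"))

# Lookaheads require the Perl-compatible engine.
both <- sentences[grepl(pattern, sentences, perl = TRUE)]
print(both)
```

Only the Johann Eck sentence contains both terms, so it is the only one returned.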

EDIT 5: Answering your other question:

Given:

    sentenceR <- "Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
    gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

This gives you the words inside the double brackets:

    > gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
    [1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
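To get the sentences that contain a bracketed name, rather than the names themselves, one option is to search for the literal `[[name]]` string. The sketch below assumes the text has already been split into sentences; the sentences and the bracket placement are made up for illustration, and `marked`, `byTag`, and `anyTag` are names chosen here.

```r
# Illustrative sentences; the [[ ]] markers were placed by hand.
sentences <- c(
  "He accepted a call to the University of [[Wittenberg]] by [[Martin Luther]].",
  "[[Melanchthon]] became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine."
)
tagged <- c("Martin Luther", "Melanchthon", "Paul")

# Wrap each name in the marker and match it literally; fixed = TRUE
# means the brackets do not need to be escaped.
marked <- paste0("[[", tagged, "]]")
byTag <- sapply(
  marked,
  function(m) sentences[grepl(m, sentences, fixed = TRUE)],
  simplify = FALSE
)

# Sentences containing any bracketed expression at all.
anyTag <- sentences[grepl("\\[\\[.*?\\]\\]", sentences, perl = TRUE)]

print(byTag[["[[Melanchthon]]"]])
print(anyTag)
```

Here "Paul" appears without brackets, so `byTag[["[[Paul]]"]]` is empty, which shows that only marked occurrences are picked up.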

Alternatively, it is much easier to use the two packages quanteda and stringi here:

    # txt holds the paragraph. In recent quanteda releases tokenize() is no
    # longer available; tokens(txt, what = "sentence") is the likely replacement.
    sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
    namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
    namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
    sentList <- split(sents, list(namesFound))

    sentList[["Melanchthon"]]
    ## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
    ## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."

    sentList
    ## $`Martin Luther`
    ## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
    ##
    ## $Melanchthon
    ## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
    ## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
    ##
    ## $Paul
    ## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."