Find string vector matches in another string vector

Question


I am trying to create a subset of a data frame of news articles, keeping only the rows that mention at least one element of a set of keywords or phrases.

    # Sample data frame of articles
    articles <- data.frame(id=c(1, 2, 3, 4),
                           text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
                                  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
                                  "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
                                  "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"))
    articles$text <- as.character(articles$text)

    # Sample vector of keywords or phrases
    keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))

    #   id text
    # 1  1 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
    # 2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
    # 3  3 quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
    # 4  4 consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse

Given the vector of keywords, the subset should contain rows 1, 2, and 4, since each of these rows contains at least one element of the vector.

Neither %in% nor grepl() works. %in% apparently requires each whole string to match a keyword exactly ( articles$text %in% keywords prints four FALSEs), and grep() does not handle a vector of patterns ( grep(keywords, articles$text) warns that the pattern has length > 1 and uses only its first element). Neither function seems to cope with vectors of different lengths: it is easy to search all rows for one keyword, but not for all three at the same time.

What is the best way to find and select all rows of a data frame that contain at least one of the elements of the keyword vector?

+8

string-matching grep r grepl

Andrew Jun 16 '13 at 4:12


1 answer

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-06-16T04:17:23+0000

You can try pasting your keywords together, separated by the pipe character ( | ), which works as "or" in a regular expression. For example:

    > articles[grepl(paste(keywords, collapse="|"), articles$text),]
      id text
    1  1 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
    2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
    4  4 consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
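
The pipe-joined pattern is interpreted as a regular expression, so a keyword containing characters such as . or ( would be read as regex syntax. If that matters for your data, one possible variation (a minimal sketch, not part of the answer above) is to test each keyword literally with fixed = TRUE and combine the results with a logical "or". The names any_hit and matched are introduced here only for illustration.

```r
# Sample data mirroring the question
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
           "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
           "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
           "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# One logical vector per keyword; fixed = TRUE matches the keyword as a
# literal substring instead of a regular expression
per_keyword <- lapply(keywords, function(k) grepl(k, articles$text, fixed = TRUE))

# Element-wise "or" across keywords: TRUE where at least one keyword matched
any_hit <- Reduce(`|`, per_keyword)

# Keep only the matching rows
matched <- articles[any_hit, ]
print(matched$id)
```

Both this sketch and the pipe-joined pattern match substrings, so "elit" also matches inside "velit" in row 4. If whole-word matches are needed, the regex version can wrap each keyword in \\b word-boundary markers instead.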
