I am trying to create a subset of a data frame of news articles, keeping only the rows whose text mentions at least one element of a set of keywords or phrases.
    # Sample data frame of articles
    articles <- data.frame(id=c(1, 2, 3, 4),
                           text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
                                  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
                                  "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
                                  "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"))
    articles$text <- as.character(articles$text)

    # Sample vector of keywords or phrases
    keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))

    #   id                                                                          text
    # 1  1      Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
    # 2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
    # 3  3      quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
    # 4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
Given the vector of keywords, the subset should contain rows 1, 2, and 4, since each of these rows contains one or more elements of the vector.
Neither %in% nor grepl() works. %in% apparently compares each whole string against the keywords rather than checking whether the string contains one ( articles$text %in% keywords prints four FALSEs), and grep() does not seem to handle a vector of patterns ( grep(keywords, articles$text) gives an error). Apparently, neither function works across both dimensions: it is easy to search for one keyword in all rows, but not for all three at the same time.
What is the best way to find and select all rows of a data frame that contain at least one of the elements of the keyword vector?
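For illustration, here is a minimal sketch of two common ways to do this, using the sample data above. It is not necessarily the best way. The names `pattern`, `hits_regex`, `hits_fixed`, and `subset_articles` are hypothetical and chosen only for this example. The first option assumes the keywords contain no regex metacharacters; the second treats each keyword as a literal string.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps text as character)
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# Option 1: collapse the keywords into a single regex alternation,
# "elit|tempor incididunt|reprehenderit", so grepl() receives one pattern.
# Assumption: the keywords contain no regex metacharacters.
pattern <- paste(keywords, collapse = "|")
hits_regex <- grepl(pattern, articles$text)

# Option 2: run grepl() once per keyword with literal (fixed) matching,
# then OR the resulting logical vectors together element by element.
hits_fixed <- Reduce(`|`, lapply(keywords, grepl, x = articles$text, fixed = TRUE))

# Keep only the rows where at least one keyword was found
subset_articles <- articles[hits_regex, ]
```

Both options match substrings, so "elit" would also match inside "velit". If whole-word matches are required, wrapping each keyword in `\\b` word boundaries before collapsing is one possibility.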
string-matching grep r grepl
Andrew