Parse a string and set it as a factor column in an R data.table

I can't find an elegant way to achieve this; please help.

I have a DT data.table:

    name,value
    "lorem pear ipsum",4
    "apple ipsum lorem",2
    "lorem ipsum plum",6

Based on the vector Fruits <- c("pear", "apple", "plum"), I would like to create a factor-type column:

    name,value,factor
    "lorem pear ipsum",4,"pear"
    "apple ipsum lorem",2,"apple"
    "lorem ipsum plum",6,"plum"

I assume this is a basic task, but I'm stuck. This is how far I got:

DT[grep("apple", name, ignore.case=TRUE), factor := as.factor("apple")]

Thanks in advance.

3 answers

You can vectorize this with regular expressions, e.g. using gsub():

Set up data:

    strings <- c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum")
    fruit <- c("pear", "apple", "plum")

Now create a regex:

    ptn <- paste0(".*(", paste(fruit, collapse="|"), ").*")
    gsub(ptn, "\\1", strings)
    [1] "pear"  "apple" "plum"

The regular expression works by separating the search terms with | and enclosing them in parentheses:

    ptn
    [1] ".*(pear|apple|plum).*"
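
To make the capture group concrete, here is a small sketch of how the backreference \\1 behaves, including one case the answer does not cover: a string that contains none of the fruits. The extra input "lorem ipsum" and the names extracted, no_match and cleaned are illustrative additions, not part of the original answer.

```r
fruit <- c("pear", "apple", "plum")
ptn <- paste0(".*(", paste(fruit, collapse = "|"), ").*")

# The last string contains no fruit (illustrative input, not from the question)
strings <- c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum")

# The leading and trailing .* consume the rest of the string, so replacing
# the whole match with \\1 leaves only the captured fruit name.
extracted <- gsub(ptn, "\\1", strings)

# When nothing matches, gsub() returns the input unchanged, so those
# entries are marked as NA explicitly.
no_match <- !grepl(ptn, strings)
cleaned <- ifelse(no_match, NA_character_, extracted)
print(cleaned)
```

Without the grepl() check, the third element would come back as the untouched input "lorem ipsum" rather than as a missing value.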

To do this inside a data.table, as in your question, it is as simple as:

    library(data.table)
    DT <- data.table(name=strings, value=c(4, 2, 6))
    # Refer to the name column rather than the global strings vector
    DT[, factor:=gsub(ptn, "\\1", name)]
    DT
                    name value factor
    1:  lorem pear ipsum     4   pear
    2: apple ipsum lorem     2  apple
    3:  lorem ipsum plum     6   plum
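
The question asks for a factor column, whereas gsub() returns character values. Below is a minimal sketch of the conversion, assuming every name contains one of the fruits in lower case. Passing levels = fruit is my own choice to pin the level order; the answer does not specify it.

```r
library(data.table)

fruit <- c("pear", "apple", "plum")
ptn <- paste0(".*(", paste(fruit, collapse = "|"), ").*")
DT <- data.table(
  name  = c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum"),
  value = c(4, 2, 6)
)

# Extract the fruit from the name column, then wrap the result in factor().
# With explicit levels, any value that is not a known fruit becomes NA.
DT[, factor := factor(gsub(ptn, "\\1", name), levels = fruit)]

str(DT$factor)
```

Because the levels are given explicitly, a name without any fruit ends up as NA instead of creating a spurious level from the unmatched text.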

I don't know if there is a "data.table" way for this, but you can try the following:

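    # For each name, pick the first fruit that matches it. Iterating over
    # name (not Fruits) keeps the result aligned with the rows of DT.
    DT[, factor := sapply(name, function(x) Fruits[which(sapply(Fruits, grepl, x, ignore.case=TRUE))[1]], USE.NAMES=FALSE)]
    DT
    #                 name value factor
    # 1:  lorem pear ipsum     4   pear
    # 2: apple ipsum lorem     2  apple
    # 3:  lorem ipsum plum     6   plum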

Here is my coded solution. The hard part is getting the matched string out of the regex. The best general solution I know of (one that extracts whatever matches any regular expression) is a combination of regexec and regmatches (see below).

    # Create the data frame
    name <- c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum")
    value <- c(4,2,6)
    DT <- data.frame(name=name, value=value, stringsAsFactors=FALSE)

    # Create the regular expression
    Fruits <- c("pear", "apple", "plum")
    myRegEx <- paste(Fruits, collapse = "|")

    # Find the matches
    r <- regexec(myRegEx, DT$name, ignore.case = TRUE)
    matches <- regmatches(DT$name, r)

    # Extract the matches, convert to factors
    factor <- sapply(matches, function(x) as.factor(x[[1]]))

    # Add to data frame
    DT$factor <- factor

This is probably a longer solution than you would like.
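
If a shorter variant is wanted, regexpr (which reports only the first match per string) can stand in for regexec. This is a sketch under my own assumptions, not this answer's method: the mixed-case "Apple" and the fruitless "lorem ipsum" are illustrative inputs, and tolower() assumes the entries of Fruits are all lower case.

```r
Fruits <- c("pear", "apple", "plum")
# Illustrative inputs: one mixed-case match and one string with no fruit
name <- c("lorem pear ipsum", "Apple ipsum lorem", "lorem ipsum")

pattern <- paste(Fruits, collapse = "|")

# regexpr() returns the position of the first match, or -1 if there is none
m <- regexpr(pattern, name, ignore.case = TRUE)

# regmatches() returns only the matched substrings, so fill them into the
# positions that matched and leave the rest as NA
found <- rep(NA_character_, length(name))
found[m != -1] <- tolower(regmatches(name, m))

result <- factor(found, levels = Fruits)
print(result)
```

Starting from a vector of NA values avoids the subscript error that x[[1]] would raise on an empty match.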
