Fastest way to filter the contents of a data.frame list column in R / Rcpp

I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame") 

with a list column (each element a character vector):

    > str(df)
    'data.frame':   3 obs. of  2 variables:
     $ id  : int  1 2 3
     $ vars:List of 3
      ..$ : chr "a"
      ..$ : chr  "a" "b" "c"
      ..$ : chr  "b" "c"

I want to filter the data.frame according to setdiff(vars, remove_this):

    library(dplyr)
    library(tidyr)
    res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))

which gets me to this:

    > res
      id vars
    1  1
    2  2 b, c
    3  3 b, c

But then, in order to drop the rows whose vars are now character(0), I have to do something like:

    res %>% unnest(vars)
    # and then do the equivalent of nest(vars) again afterwards...
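In other words, something along these lines (just a sketch; I re-nest with group_by()/summarise(list(...)) since that is the step the comment above leaves out):

    # unnest() drops the character(0) rows; group_by()/summarise() re-nests the rest
    res %>%
      unnest(vars) %>%
      group_by(id) %>%
      summarise(vars = list(vars)) %>%
      ungroup()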

Actual datasets:

  • 560K rows and 3800K rows, which also contain 10 more columns that need to be carried along.

(this is pretty slow, which leads to the question ...)

What is the fastest way to do this in R?

  • Is there a faster dplyr / data.table / other approach?
  • How could it be done using Rcpp?

UPDATE / EXTENSION:

  • can the column modification be done in place, rather than by copying the result of lapply(vars, setdiff(... ? (a rough sketch of what I mean follows this list)

  • what is the most efficient way to filter out the vars == character(0) rows, if that has to be a separate step?
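For illustration, roughly what I have in mind (just a sketch of the intent, using data.table's reference semantics and base R's lengths()):

    library(data.table)

    setDT(df)                                  # convert to data.table by reference, no copy
    df[, vars := lapply(vars, setdiff, "a")]   # := replaces the column without copying the table
    res <- df[lengths(vars) > 0]               # the character(0) filter still allocates a new subset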

3 answers

Barring any algorithmic improvements, the analogous data.table solution is automatically going to be faster, because you won't have to copy the entire thing just to add a column:

    library(data.table)
    dt = as.data.table(df)  # or use setDT to convert in place

    dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
    #   id  vars newcol
    #1:  2 a,b,c    b,c
    #2:  3   b,c    b,c

You could also delete the original column (with basically 0 cost) by adding [, vars := NULL] at the end. Or you could simply overwrite the initial column if you don't need that information, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
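A minimal sketch of those two variants, each starting again from a fresh dt:

    # variant 1: keep newcol, then drop the original list column at the end
    dt <- as.data.table(df)
    res <- dt[, newcol := lapply(vars, setdiff, "a")][sapply(newcol, length) != 0][, vars := NULL][]

    # variant 2: overwrite vars directly, then filter
    dt <- as.data.table(df)
    dt[, vars := lapply(vars, setdiff, "a")]
    res <- dt[sapply(vars, length) != 0]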


Now, as far as algorithmic improvements go: assuming your id values are unique for each vars entry (and if not, add a new unique identifier), here is a much faster solution that also automatically takes care of the filtering:

    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
    #   id vars
    #1:  2  b,c
    #2:  3  b,c

To carry the other columns along, I think it is easiest to simply merge back:

    dt[, othercol := 5:7]

    # notice the keyby
    dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    #   id vars i.vars othercol
    #1:  2  b,c  a,b,c        6
    #2:  3  b,c    b,c        7

Here's another way:

    # prep
    DT <- data.table(df)
    DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
    setkey(DT, vstr)

    get_badkeys <- function(x)
      unlist(sapply(1:length(x), function(n) combn(sort(x), n, paste0, collapse = "_")))

    # choose values to exclude
    baduns <- c("a", "b")

    # subset
    DT[!J(get_badkeys(baduns))]

It is fairly fast, but it requires building the key.
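For example, with the baduns above, the generated keys are every sorted combination of the excluded values, so any row whose entire (sorted, pasted) vars set matches one of them gets dropped:

    get_badkeys(c("a", "b"))
    # [1] "a"   "b"   "a_b"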


Benchmarks. Here is an example:

Candidates:

    hannahh <- function(df, baduns){
      df %>%
        mutate(vars = lapply(.$vars, setdiff, baduns)) %>%
        filter(sapply(vars, length) != 0)
    }

    eddi <- function(df, baduns){
      dt = as.data.table(df)
      dt[, unlist(vars), by = id][
        !V1 %in% baduns, .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    }

    stevenb <- function(df, baduns){
      df %>%
        rowwise() %>%
        do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
        mutate(length = length(newcol)) %>%
        ungroup() %>%
        filter(length > 0)
    }

    frank <- function(df, baduns){
      DT <- data.table(df)
      DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
      setkey(DT, vstr)
      DT[!J(get_badkeys(baduns))]
    }

Simulation:

    nvals  <- 4
    nbads  <- 2
    maxlen <- 4
    nobs   <- 1e4
    valset <- letters[1:nvals]  # note: valset is not defined in the original; this is one plausible choice

    exdf <- data.table(
      id   = 1:nobs,
      vars = replicate(nobs, list(sample(valset, sample(maxlen, 1)))))
    setDF(exdf)
    baduns <- valset[1:nbads]

Results:

    system.time(frank_res <- frank(exdf, baduns))
    #  user  system elapsed
    #  0.24    0.00    0.28
    system.time(hannahh_res <- hannahh(exdf, baduns))
    #  0.42    0.00    0.42
    system.time(eddi_res <- eddi(exdf, baduns))
    #  0.05    0.00    0.04
    system.time(stevenb_res <- stevenb(exdf, baduns))
    # 36.27   55.36   93.98

Verification:

    identical(sort(frank_res$id), eddi_res$id)     # TRUE
    identical(unlist(stevenb_res$id), eddi_res$id) # TRUE
    identical(unlist(hannahh_res$id), eddi_res$id) # TRUE

Discussion:

For eddi() and hannahh(), the results barely change with nvals, nbads and maxlen. In contrast, when baduns grows past 20 values, frank() becomes incredibly slow (20+ seconds, for example); it also scales with nbads and maxlen a little worse than the other two.

Scaling up nobs, eddi()'s lead over hannahh() stays roughly constant at about 10x. Against frank(), it sometimes shrinks and sometimes stays the same. In frank()'s best case, nobs = 1e5, eddi() is still 3x faster.

If we switch valset from characters to something that frank() must coerce to character for its paste0 operation, both eddi() and hannahh() beat it as nobs grows.


Benchmarks for doing this repeatedly. This is probably obvious, but if you have to do this "many" times (...how many is hard to say), it is better to create the key column once than to go through the full subsetting for each set of baduns. In the simulation above, eddi() is about 5x as fast as frank(), so I would go for the latter if I were doing this subsetting 10+ times.

    maxbadlen <- 2
    set_o_baduns <- replicate(10, sample(valset, size = sample(maxbadlen, 1)))

    system.time({
      DT <- data.table(exdf)
      DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
      setkey(DT, vstr)
      for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
    })
    #  user  system elapsed
    #  0.29    0.00    0.29

    system.time({
      dt = as.data.table(exdf)
      for (i in 1:10)
        dt[, unlist(vars), by = id][!V1 %in% set_o_baduns[[i]],
          .(vars = list(V1)), keyby = id][dt, nomatch = 0]
    })
    #  user  system elapsed
    #  0.39    0.00    0.39

    system.time({
      for (i in 1:10) hannahh(exdf, set_o_baduns[[i]])
    })
    #  user  system elapsed
    #  4.10    0.00    4.13

So, as expected, frank() takes very little additional time for further evaluations, while eddi() and hannahh() grow linearly.


Here is another idea:

    df %>%
      rowwise() %>%
      do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
      mutate(length = length(newcol)) %>%
      ungroup()

Which gives:

    #  id    vars newcol length
    #1  1       a             0
    #2  2 a, b, c   b, c      2
    #3  3    b, c   b, c      2

Then you can filter on length > 0 to keep only non-empty newcol:

    df %>%
      rowwise() %>%
      do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
      mutate(length = length(newcol)) %>%
      ungroup() %>%
      filter(length > 0)

Which gives:

    #  id    vars newcol length
    #1  2 a, b, c   b, c      2
    #2  3    b, c   b, c      2

Note: As pointed out by @Arun in the comments, this approach is rather slow. You are better off with the data.table solutions.

