Here's another way:
```r
# prep
DT <- data.table(df)
DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
setkey(DT, vstr)

get_badkeys <- function(x)
  unlist(sapply(1:length(x), function(n) combn(sort(x), n, paste0, collapse = "_")))
```
It is pretty fast, but it requires keying the table first (the `setkey` step above); bad rows are then dropped with an anti-join, `DT[!J(get_badkeys(baduns))]`.
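To see what the anti-join excludes, here is a small illustration of the `get_badkeys` helper defined above: it enumerates every non-empty subset of the bad values as a sorted, `"_"`-joined key, so a row's `vstr` matches one of these keys exactly when its `vars` consist entirely of bad values.

```r
get_badkeys <- function(x)
  unlist(sapply(1:length(x), function(n) combn(sort(x), n, paste0, collapse = "_")))

# All non-empty subsets of c("B", "A"), each sorted and joined with "_":
get_badkeys(c("B", "A"))
# [1] "A"   "B"   "A_B"
```

A row with `vars = list(c("A", "B"))` gets `vstr` `"A_B"` and is dropped; a row containing any value outside `baduns` produces a `vstr` that matches none of these keys and is kept.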
Tests. Here is an example:
Candidates:
```r
hannahh <- function(df, baduns){
  df %>%
    mutate(vars = lapply(.$vars, setdiff, baduns)) %>%
    filter(!!sapply(vars, length))
}

eddi <- function(df, baduns){
  dt = as.data.table(df)
  dt[, unlist(vars), by = id
     ][!V1 %in% baduns, .(vars = list(V1)), keyby = id
     ][dt, nomatch = 0]
}

stevenb <- function(df, baduns){
  df %>%
    rowwise() %>%
    do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
    mutate(length = length(newcol)) %>%
    ungroup() %>%
    filter(length > 0)
}

frank <- function(df, baduns){
  DT <- data.table(df)
  DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
  setkey(DT, vstr)
  DT[!J(get_badkeys(baduns))]
}
```
Simulation:
```r
nvals  <- 4
nbads  <- 2
maxlen <- 4
nobs   <- 1e4
# NB: the original snippet uses valset without defining it; any character
# vector of nvals distinct values works, e.g.:
valset <- LETTERS[1:nvals]

exdf <- data.table(
  id   = 1:nobs,
  vars = replicate(nobs, list(sample(valset, sample(maxlen, 1))))
)
setDF(exdf)
baduns <- valset[1:nbads]
```
Results:
```r
system.time(frank_res   <- frank(exdf, baduns))
#    user  system elapsed
#    0.24    0.00    0.28
system.time(hannahh_res <- hannahh(exdf, baduns))
#    0.42    0.00    0.42
system.time(eddi_res    <- eddi(exdf, baduns))
#    0.05    0.00    0.04
system.time(stevenb_res <- stevenb(exdf, baduns))
#   36.27   55.36   93.98
```
Verification:
```r
identical(sort(frank_res$id), eddi_res$id)
```
Discussion:
The results for eddi() and hannahh() hardly change as nvals, nbads and maxlen vary. In contrast, once baduns grows past about 20 values, frank() becomes incredibly slow (20+ seconds, for example); it also scales somewhat worse with nbads and maxlen than the other two.
Scaling up nobs, eddi()'s lead over hannahh() stays roughly constant at about 10x. Its lead over frank() sometimes shrinks and sometimes holds steady. In frank()'s best case, nobs = 1e5, eddi() is still about 3x faster.
If we switch valset from a character vector to something frank() must coerce to character for its paste0 operation, both eddi() and hannahh() beat it as nobs grows.
Tests for repeated execution. This is probably obvious, but if you need to do this subsetting "many" times (how many is hard to say), it is better to key the table once and then subset against each set of baduns . In the simulation above, eddi() is about 5x faster than frank() , so I would go with the latter if I were doing this subsetting 10+ times.
```r
maxbadlen    <- 2
set_o_baduns <- replicate(10, sample(valset, size = sample(maxbadlen, 1)))

system.time({
  DT <- data.table(exdf)
  DT[, vstr := paste0(sort(unlist(vars)), collapse = "_"), by = 1:nrow(DT)]
  setkey(DT, vstr)
  for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
})
```
So, as expected, frank() takes very little additional time per extra evaluation, while eddi() and hannahh() would grow linearly with the number of runs.