How to speed up searching for missing values in R data.table

I am writing a general function to handle missing values. The data can have character, numeric, factor, or integer columns. Below is sample data:

    dt <- data.table(
      num1  = c(1, 2, 3, 4, NA, 5, NA, 6),
      num3  = c(1, 2, 3, 4, 5, 6, 7, 8),
      int1  = as.integer(c(NA, NA, 102, 105, NA, 300, 400, 700)),
      int3  = as.integer(c(1, 10, 102, 105, 200, 300, 400, 700)),
      cha1  = c('a', 'b', 'c', NA, NA, 'c', 'd', 'e'),
      cha3  = c('xcda', 'b', 'c', 'miss', 'no', 'c', 'dfg', 'e'),
      fact1 = c('a', 'b', 'c', NA, NA, 'c', 'd', 'e'),
      fact3 = c('ad', 'bd', 'cc', 'zz', 'yy', 'cc', 'dd', 'ed'),
      allm  = as.integer(c(NA, NA, NA, NA, NA, NA, NA, NA)),
      miss  = as.character(c("", "", 'c', 'miss', 'no', 'c', 'dfg', 'e')),
      miss2 = as.integer(c('', '', 3, 4, 5, 6, 7, 8)),
      miss3 = as.factor(c(".", ".", ".", "c", "d", "e", "f", "g")),
      miss4 = as.factor(c(NA, NA, '.', '.', '', '', 't1', 't2')),
      miss5 = as.character(c(NA, NA, '.', '.', '', '', 't1', 't2'))
    )

I used this code to indicate missing values:

    dt[, flag := ifelse(is.na(miss5) | !nzchar(miss5), 1, 0)]

But this turns out to be very slow. In addition, I have to handle other values that should also be treated as missing. So I plan to use this to flag missing values:

    dt[miss5 %in% c(NA, '', '.'), flag := 1]

but on a 6 million-row data set it takes about 1 second to run, whereas

    dt[!nzchar(miss5), flag := 1]

takes close to 0.14 seconds to run.

My question is: can we make this as fast as possible while still treating all three values NA, blank, and dot (NA, "", ".") as missing?
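For reference, here is a small base-R check (no data.table required) of why a single `%in%` condition appeals to me: unlike `==`, `%in%` matches NA directly, so all three codes can be caught in one pass.

```r
x <- c("a", NA, "", ".", "b")

# `==` propagates NA instead of answering TRUE/FALSE for the missing element
x == ""                 # FALSE    NA  TRUE FALSE FALSE

# `%in%` matches NA itself, so one condition covers NA, "" and "."
x %in% c(NA, "", ".")   # FALSE  TRUE  TRUE  TRUE FALSE
```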

Any help is appreciated.

2 answers

`==` and `%in%` are optimized to use binary search automatically (new feature: auto-indexing). To use it, we must make sure that:

a) we use `dt[...]` instead of `set()`, since auto-indexing is not yet implemented in `set()` (#1196).

b) when the RHS of `%in%` has a higher SEXPTYPE than the LHS, auto-indexing falls back to base R to guarantee correct results (since binary search always coerces the RHS). Therefore, for integer columns, we need to make sure we pass only `NA`, not `"."` or `""`.
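The type rule can be seen in plain base R (a small sketch, independent of data.table's fast path): if the RHS contains character values, an integer LHS is coerced to character before matching, which is the slower route auto-indexing wants to avoid.

```r
# Character RHS forces the integer LHS to character: c("1", NA, "3").
# The result is still correct, but it goes through the coerced path.
c(1L, NA, 3L) %in% c(NA, "", ".")   # FALSE  TRUE FALSE

# For integer columns, pass only NA so the types stay compatible:
c(1L, NA, 3L) %in% NA               # FALSE  TRUE FALSE
```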

Using @akrun's data, here is the code and its runtime:

    in_col  = grep("^miss", names(dt), value = TRUE)
    out_col = gsub("^miss", "flag", in_col)
    system.time({
      dt[, (out_col) := 0L]
      for (j in seq_along(in_col)) {
        if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
          lookup = c("", ".", NA)
        } else lookup = NA
        expr = call("%in%", as.name(in_col[j]), lookup)
        tt = dt[eval(expr), (out_col[j]) := 1L]
      }
    })
    #   user  system elapsed
    #  1.174   0.295   1.476

How it works:

a) First, initialize all output columns to 0.

b) Then, for each column, we check its type and build the lookup accordingly.

c) Then we construct the corresponding expression for `i`: `miss(.) %in% lookup`.

d) Then we evaluate the expression in `i`, which uses auto-indexing to create an index quickly, and then uses that index to find the matching rows quickly via binary search.
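Steps (c) and (d) can be sketched with a plain data.frame (so this runs without data.table; it only demonstrates the expression mechanics, not the auto-indexed fast path):

```r
df <- data.frame(miss5 = c(NA, ".", "", "t1"), stringsAsFactors = FALSE)

# Build the i-expression programmatically, as in step (c):
expr <- call("%in%", as.name("miss5"), c("", ".", NA))
expr
# miss5 %in% c("", ".", NA)

# Evaluate it against the columns, as in step (d):
which(eval(expr, envir = df))
# [1] 1 2 3
```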

Note: if necessary, you can add `set2key(dt, NULL)` at the end of the for loop so that the created indices are removed right after use (to save space).

Compared to this run, @akrun's fastest answer takes 6.33 seconds, so this is ~4.2x faster.

Update: on 4 million rows and 100 columns it takes ~9.2 seconds, i.e. ~0.092 seconds per column.

Calling `[.data.table` 100 times can be costly. Once auto-indexing is implemented in `set()`, it would be nice to compare performance again.


You can loop over the miss columns and create the corresponding flag columns with `set`.

    library(data.table) # v1.9.5+
    ind <- grep('^miss', names(dt))
    nm1 <- sub('miss', 'flag', names(dt)[ind])
    dt[, (nm1) := 0]
    for (j in seq_along(ind)) {
      set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
    }

Benchmarks

    set.seed(24)
    df1 <- as.data.frame(matrix(sample(c(NA, 0:9), 6e6*5, replace=TRUE), ncol=5))
    set.seed(23)
    df2 <- as.data.frame(matrix(sample(c('.', '', letters[1:5]), 6e6*5, replace=TRUE), ncol=5))
    set.seed(234)
    i1 <- sample(10)
    dfN <- setNames(cbind(df1, df2)[i1], paste0('miss', 1:10))
    dt <- as.data.table(dfN)

    system.time({
      ind <- grep('^miss', names(dt))
      nm1 <- sub('miss', 'flag', names(dt)[ind])
      dt[, (nm1) := 0L]
      for (j in seq_along(ind)) {
        set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
      }
    })
    #   user  system elapsed
    #  8.352   0.150   8.496

    system.time({
      m1 <- matrix(0, nrow = 6e6, ncol = 10)
      m2 <- sapply(seq_along(dt), function(i) {
        ind <- which(dt[[i]] %in% c('.', '', NA))
        replace(m1[, i], ind, 1L)
      })
      cbind(dt, m2)
    })
    #   user  system elapsed
    # 14.227   0.362  14.582
