I am writing a general function to handle missing values. The data can have character, numeric, factor, or integer columns. Below is sample data:
library(data.table)
dt <- data.table(
  num1  = c(1,2,3,4,NA,5,NA,6),
  num3  = c(1,2,3,4,5,6,7,8),
  int1  = as.integer(c(NA,NA,102,105,NA,300,400,700)),
  int3  = as.integer(c(1,10,102,105,200,300,400,700)),
  cha1  = c('a','b','c',NA,NA,'c','d','e'),
  cha3  = c('xcda','b','c','miss','no','c','dfg','e'),
  fact1 = c('a','b','c',NA,NA,'c','d','e'),
  fact3 = c('ad','bd','cc','zz','yy','cc','dd','ed'),
  allm  = as.integer(c(NA,NA,NA,NA,NA,NA,NA,NA)),
  miss  = as.character(c("","",'c','miss','no','c','dfg','e')),
  miss2 = as.integer(c('','',3,4,5,6,7,8)),
  miss3 = as.factor(c(".",".",".","c","d","e","f","g")),
  miss4 = as.factor(c(NA,NA,'.','.','','','t1','t2')),
  miss5 = as.character(c(NA,NA,'.','.','','','t1','t2'))
)
I used this code to indicate missing values:
dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]
But this turns out to be very slow. In addition, I have to add logic for other values that should also be treated as missing. Therefore, I plan to write the missing-value check like this:
dt[miss5 %in% c(NA,'','.'),flag:=1]
but on a data set of 6 million records it takes about 1 second to run, whereas
dt[!nzchar(miss5),flag:=1] takes close to 0.14 seconds to run.
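The comparison above can be reproduced roughly with the sketch below. It is only an illustration: `big` is a synthetic stand-in for the real 6-million-row table (its values are borrowed from `miss5`), and the timings it prints will differ by machine and data.

```r
library(data.table)

n <- 6e6  # roughly the record count mentioned above
set.seed(1)

# Synthetic stand-in for the real data (assumption: one character column)
big <- data.table(miss5 = sample(c(NA, ".", "", "t1", "t2"), n, replace = TRUE))

# Time the %in% version, which covers NA, "" and "."
t_in <- system.time(big[miss5 %in% c(NA, "", "."), flag := 1])

# Reset the flag column, then time the nzchar version, which only covers ""
big[, flag := NULL]
t_nz <- system.time(big[!nzchar(miss5), flag := 1])

# Elapsed seconds for each approach
print(c(in_op = t_in[["elapsed"]], nzchar = t_nz[["elapsed"]]))
```

Note that the two expressions are not equivalent: `!nzchar(miss5)` flags only empty strings, so part of the gap is simply that it does less work.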
My question is: is there code that runs as fast as possible while still treating NA, blank, and dot (NA, "", ".") as missing?
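To make the goal concrete, here is a minimal sketch of the kind of check being asked for. `is_missing` and its `tokens` argument are hypothetical names, not from any package, and whether this is actually faster on 6 million rows is an assumption that would need to be measured. It uses data.table's `%chin%` (a character-only variant of `%in%`) and, for factors, tests the levels once instead of every row.

```r
library(data.table)

# Hypothetical helper: TRUE where x is NA or one of the "missing" tokens
is_missing <- function(x, tokens = c("", ".")) {
  if (is.factor(x)) {
    # Check the (usually few) levels once, then look up each row by its code
    bad <- levels(x) %chin% tokens
    res <- bad[as.integer(x)]
    res[is.na(res)] <- TRUE   # NA codes are missing values
    res
  } else if (is.character(x)) {
    is.na(x) | x %chin% tokens
  } else {
    # Numeric, integer, logical: only NA counts as missing
    is.na(x)
  }
}

# Small check on values taken from the sample data above
x_chr <- c(NA, NA, ".", ".", "", "", "t1", "t2")
print(is_missing(x_chr))
print(is_missing(as.factor(x_chr)))
```

With the sample table it would be used as dt[is_missing(miss5), flag := 1L], and the same call should work for any of the other columns.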
Any help is appreciated.