Delete rows with NA from data.table in R

I would like to delete all rows of data.table containing Inf in any of my columns. So far I have used this approach:

 DT <- data.table(col1 = c(1,2,3), col2 = c(4,Inf,5)) DT[,drop := apply(.SD, 1, function(x) any(is.infinite(x))), by = 1:nrow(DT)] DT <- DT[(!drop)][,drop:=NULL] 

which is coming from https://stackoverflow.com/a/312960/ However, this approach does not scale well for large amounts of data. Is there a better way to delete rows using Inf ?

+5
source share
1 answer

You can use rowSums to check if any element of a string is finite.

 DT[is.finite(rowSums(DT))] 

OR you can use the fact that Inf * 0 is NA and use complete.cases

 DT[complete.cases(DT*0)] 

Some tests show that rowSums is the fastest for smaller datasets, and complete.cases is the fastest solution for larger datasets.

 require(microbenchmark) microbenchmark( DT[is.finite(rowSums(DT))] , DT[complete.cases(DT*0)] , DT[DT[, Reduce('&', lapply(.SD, is.finite))]] ) ## ## nrow(DT) = 3000 ## Unit: microseconds ## expr min lq mean median uq max neval cld ## DT[is.finite(rowSums(DT))] 786.797 839.235 864.0215 852.8465 884.756 1021.988 100 a ## DT[complete.cases(DT * 0)] 1265.658 1326.575 1363.3985 1350.0055 1386.377 1898.040 100 c ## DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 1220.137 1275.030 1319.6226 1308.0555 1348.443 1624.023 100 b ## ## nrow(DT) = 300000 ## Unit: milliseconds ## expr min lq mean median uq max neval cld ## DT[is.finite(rowSums(DT))] 21.617935 22.687452 26.698070 25.75765 26.07942 87.56290 100 c ## DT[complete.cases(DT * 0)] 7.209252 7.567393 9.908503 10.17569 10.37473 71.31375 100 a ## DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 11.786773 12.647652 14.128624 14.78512 15.05089 15.39542 100 b 
+16
source

Source: https://habr.com/ru/post/1214733/


All Articles