Converting "NA" strings to real NA values

Some variables in my data frame contain missing values stored as strings, such as "NA". What is the most efficient way to go through all the columns of the data frame that contain them and convert them to real NAs that are detected by functions like is.na()?

I am using sqldf to query the database.

Reproducible example:

 vect1 <- c("NA", "NA", "BANANA", "HELLO")
 vect2 <- c("NA", 1, 5, "NA")
 vect3 <- c(NA, NA, "NA", "NA")
 df <- data.frame(vect1, vect2, vect3)
+6
3 answers

To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. With replace, it looks like this:

 df <- replace(df, df == "NA", NA) 
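As a small illustration of the replace call above, here is a minimal sketch on the question's data. The `stringsAsFactors = FALSE` argument and the name `cleaned` are assumptions added for the example, so that the columns stay character vectors regardless of R version.

```r
# Stand-in for the question's data; stringsAsFactors = FALSE is an
# assumption so the columns are character vectors in any R version.
df <- data.frame(
  vect1 = c("NA", "NA", "BANANA", "HELLO"),
  vect2 = c("NA", 1, 5, "NA"),
  vect3 = c(NA, NA, "NA", "NA"),
  stringsAsFactors = FALSE
)

# Before: is.na() only sees the two real NAs in vect3
sum(is.na(df))

# replace() returns a modified copy; df itself is left untouched
cleaned <- replace(df, df == "NA", NA)

# After: every "NA" string is now a real NA
colSums(is.na(cleaned))
```

Because replace returns a copy, the result has to be assigned back (as in `df <- replace(...)`) for the change to stick.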

Another alternative to consider is type.convert. This is the function R uses when reading data to convert column types automatically. Its result therefore differs from your current approach: for example, the second column is converted to a numeric type.

 df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
 df
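The column-type change can be checked on the question's small example. The sketch below is a variant of the call above, not the answer's exact code: it passes `as.is = TRUE`, which keeps text columns as character rather than factor. To my knowledge, recent versions of R (4.0 and later) warn when `as.is` is left unspecified, so stating it explicitly is the safer assumption.

```r
# Question's data, kept as character (stringsAsFactors = FALSE is an assumption)
df <- data.frame(
  vect1 = c("NA", "NA", "BANANA", "HELLO"),
  vect2 = c("NA", 1, 5, "NA"),
  vect3 = c(NA, NA, "NA", "NA"),
  stringsAsFactors = FALSE
)

# Re-infer each column's type, treating the string "NA" as missing.
# as.is = TRUE keeps non-numeric text as character instead of factor.
df[] <- lapply(df, function(x) {
  type.convert(as.character(x), na.strings = "NA", as.is = TRUE)
})

# vect2 becomes integer; vect3, being entirely missing, becomes logical
sapply(df, class)
```

With `as.is = FALSE` the text column would come back as a factor instead, which matches the str() output shown further down in this answer.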

Here is a performance comparison. The sample data is taken from @roland's answer.

Here are the functions to test:

 library(data.table)  # needed for as.data.table() in funr()

 # logical-matrix indexing: df[df == "NA"] <- NA
 funop <- function() {
   df[df == "NA"] <- NA
   df
 }

 # data.table approach: touch only character and factor columns
 funr <- function() {
   ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"),
                       FUN.VALUE = TRUE))
   as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
     is.na(x) <- x == "NA"
     x
   }), .SDcols = ind][]
 }

 # replace()
 funam1 <- function() replace(df, df == "NA", NA)

 # type.convert()
 funam2 <- function() {
   df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
   df
 }

Here's the benchmarking:

 library(microbenchmark)
 microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
 # Unit: seconds
 #      expr      min       lq     mean   median       uq      max neval
 #   funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287    10
 #    funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837    10
 #  funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706    10
 #  funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258    10

replace gives the same result as @roland's approach, which in turn is similar to @jgozal's. The type.convert approach, however, produces different column types.

 all.equal(funop(), setDF(funr()))
 all.equal(funop(), funam1())

 str(funop())
 # 'data.frame': 10000000 obs. of 3 variables:
 #  $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
 #  $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
 #  $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...

 str(funam2())
 # 'data.frame': 10000000 obs. of 3 variables:
 #  $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
 #  $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
 #  $ vect3: logi NA NA NA NA NA NA ...
+5

I found a good way to do this in another question. For this particular situation, it is simply:

 df[df == "NA"] <- NA

It took about 30 seconds with 5 million rows and ~250 variables.
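To see what this one-liner does under the hood, the sketch below looks at the comparison on its own. The tiny data frame and the name `mask` are made up for illustration. Comparing a data frame to a string yields a logical matrix with one cell per value, so on wide data a full-size mask is built before any assignment happens, which may account for some of the running time.

```r
# Tiny made-up data frame, only for illustration
df <- data.frame(
  a = c("NA", "x", NA),
  b = c("y", "NA", "z"),
  stringsAsFactors = FALSE
)

# The comparison covers every cell and returns a logical matrix
mask <- df == "NA"
is.matrix(mask)
dim(mask)

# Cells that were already NA compare as NA, not TRUE, and are left as they are
mask

# Matrix indexing then overwrites only the TRUE cells
df[mask] <- NA
colSums(is.na(df))
```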

+4

This is a bit faster:

 set.seed(42)
 df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
 df2 <- df

 system.time(df[df == "NA"] <- NA)
 #  user  system elapsed
 # 3.601   0.378   3.984

 library(data.table)
 setDT(df2)
 system.time({
   # find character and factor columns
   ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"),
                       FUN.VALUE = TRUE))
   # assign by reference
   df2[, names(df2)[ind] := lapply(.SD, function(x) {
     is.na(x) <- x == "NA"
     x
   }), .SDcols = ind]
 })
 #  user  system elapsed
 # 2.484   0.190   2.676

 all.equal(df, setDF(df2))
 # [1] TRUE
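The same column-wise idea can be sketched in base R, without data.table. This is not the code that was timed above, and it does not assign by reference, so no speed claim is made for it. The `num` column and the name `is_text` are hypothetical additions for the example.

```r
# Question's data plus a hypothetical numeric column that should be skipped
df <- data.frame(
  vect1 = c("NA", "NA", "BANANA", "HELLO"),
  vect2 = c("NA", 1, 5, "NA"),
  vect3 = c(NA, NA, "NA", "NA"),
  num   = c(1.5, 2, NA, 4),
  stringsAsFactors = FALSE
)

# Pick out text-like columns; is.character()/is.factor() always return a
# single TRUE/FALSE, even for columns whose class() has several elements
is_text <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))

# Mark "NA" strings as missing, column by column, leaving other columns alone
df[is_text] <- lapply(df[is_text], function(x) {
  is.na(x) <- which(x == "NA")
  x
})

colSums(is.na(df))
```

Using `is.character(x) || is.factor(x)` rather than `class(x) %in% ...` is a small design choice: an ordered factor has a two-element class vector, so the `%in%` test would return two values and `vapply` would stop with an error.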
+4
