Parsing "NA" strings into real NAs

Question

Some variables in my data frame contain missing values stored as strings, such as "NA". What is the most efficient way to go through all the columns of the data frame that contain them and convert them to real NAs that are detected by functions like is.na()?

I am using sqldf to query the database.

Reproducible example:

 vect1 <- c("NA", "NA", "BANANA", "HELLO")
 vect2 <- c("NA", 1, 5, "NA")
 vect3 <- c(NA, NA, "NA", "NA")
 df = data.frame(vect1, vect2, vect3)

+6

r sqldf

jgozal Jan 2 '15 at 14:30

3 answers

I found a good way to do this in another question.

So, for this particular situation, it will be simple:

 df[df == "NA"] <- NA

It took about 30 seconds with 5 million rows and about 250 variables.
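As a minimal sketch of what that one-liner does, here it is on a small version of the question's data. Setting stringsAsFactors = FALSE is an assumption added here so that the columns are plain character vectors regardless of R version; the helper names na_before and na_after are only for illustration.

```r
# Small version of the question's data. stringsAsFactors = FALSE is set
# explicitly (an assumption) so the columns are character in every R version.
df <- data.frame(
  vect1 = c("NA", "NA", "BANANA", "HELLO"),
  vect2 = c("NA", 1, 5, "NA"),
  vect3 = c(NA, NA, "NA", "NA"),
  stringsAsFactors = FALSE
)

# Before: is.na() only sees the two genuine NAs in vect3.
na_before <- colSums(is.na(df))

# df == "NA" is a logical matrix with the same shape as df;
# using it as an index replaces every matching cell.
df[df == "NA"] <- NA

# After: every former "NA" string is reported as missing.
na_after <- colSums(is.na(df))

print(na_before)
print(na_after)
```

Note that the column types do not change: vect2 is still character after the replacement, it just holds real NAs next to "1" and "5".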

+4

jgozal Jan 2 '15 at 14:55

This is a bit faster:

 set.seed(42)
 df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
 df2 <- df

 system.time(df[df == "NA"] <- NA)
 #   user  system elapsed
 #  3.601   0.378   3.984

 library(data.table)
 setDT(df2)
 system.time({
   # find character and factor columns
   ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"),
                       FUN.VALUE = TRUE))
   # assign by reference
   df2[, names(df2)[ind] := lapply(.SD, function(x) {
     is.na(x) <- x == "NA"
     x
   }), .SDcols = ind]
 })
 #   user  system elapsed
 #  2.484   0.190   2.676

 all.equal(df, setDF(df2))
 # [1] TRUE
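The per-column step in the data.table code above is the base R replacement function `is.na<-`, which works on any vector. A small sketch on made-up vectors (x and f are illustrative names, not part of the answer) shows how it behaves, including the fact that a factor keeps "NA" as an unused level afterwards:

```r
# Character vector: mark the elements equal to "NA" as missing.
x <- c("NA", "BANANA", NA, "HELLO")
is.na(x) <- x == "NA"
print(x)

# Factor: the values become NA, but "NA" is still listed as a level.
f <- factor(c("NA", "BANANA", "HELLO"))
is.na(f) <- f == "NA"
print(levels(f))

# droplevels() removes the now-unused "NA" level if that matters.
print(levels(droplevels(f)))
```

This is why a factor column can still report "NA" among its levels after the replacement, even though no element has that value any more.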

+4

Roland Jan 2 '15 at 15:31

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2016-01-02T16:11:54+0000

To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. With replace, it looks like this:

 df <- replace(df, df == "NA", NA)

Another alternative to consider is type.convert. This is the function that R uses when reading data to convert column types automatically. As a result, the output differs from your current approach: for example, the second column is converted to a numeric type.

 df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
 df
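For illustration, here is a hedged sketch of the same idea on the question's small data set. It passes as.is = TRUE, which is an addition not present in the answer: it keeps text columns as character instead of converting them to factors, and newer R versions expect the argument to be given explicitly. The name col_classes is only for illustration.

```r
# Question's data, kept as character columns (stringsAsFactors = FALSE is an
# assumption made here so the example behaves the same in every R version).
df <- data.frame(
  vect1 = c("NA", "NA", "BANANA", "HELLO"),
  vect2 = c("NA", 1, 5, "NA"),
  vect3 = c(NA, NA, "NA", "NA"),
  stringsAsFactors = FALSE
)

# Re-run type conversion on every column, treating "NA" as missing.
# as.is = TRUE (an addition) keeps text as character rather than factor.
df[] <- lapply(df, function(x) {
  type.convert(as.character(x), na.strings = "NA", as.is = TRUE)
})

# Column classes after conversion: text, integer, and an all-NA logical.
col_classes <- sapply(df, class)
print(col_classes)
print(df)
```

Whether the type change is desirable depends on the data: it is convenient when numeric columns were read as text, but it alters column classes, which the plain replacement approaches do not.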

Here is a performance comparison. The sample data is taken from @roland's answer.

Here are the functions to test:

 funop <- function() {
   df[df == "NA"] <- NA
   df
 }

 funr <- function() {
   ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"),
                       FUN.VALUE = TRUE))
   as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
     is.na(x) <- x == "NA"
     x
   }), .SDcols = ind][]
 }

 funam1 <- function() replace(df, df == "NA", NA)

 funam2 <- function() {
   df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
   df
 }

Here's the benchmarking:

 library(microbenchmark)
 microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
 # Unit: seconds
 #      expr      min       lq     mean   median       uq      max neval
 #   funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287    10
 #    funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837    10
 #  funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706    10
 #  funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258    10

replace gives the same result as @roland's approach, which in turn matches @jgozal's. The type.convert approach, however, results in different column types.

 all.equal(funop(), setDF(funr()))
 all.equal(funop(), funam1())

 str(funop())
 # 'data.frame': 10000000 obs. of 3 variables:
 #  $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
 #  $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
 #  $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...

 str(funam2())
 # 'data.frame': 10000000 obs. of 3 variables:
 #  $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
 #  $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
 #  $ vect3: logi NA NA NA NA NA NA ...