Ifelse behavior in data.table (R)

Question

I have a data.table of consumer products. I have classified each product as 'low' , 'high' or 'unknown' quality. The data are time series, and I'm interested in smoothing out some seasonality in the data. If a product's classification (the one selected by the algorithm I used to determine quality) is 'low' in period X, but its original classification was 'high' in period X-1, I reclassify the product as 'high' for period X. This is done within each product group.

For this, I have something like the following:

    require(data.table)

    # lag takes a column and lags it by one period,
    # padding with NA
    lag <- function(var) {
      lagged <- c(NA, var[1:(length(var)-1)])
      return(lagged)
    }

    set.seed(120)
    foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                      period = c(1:16),
                      quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))
    foo[, quality_lag := lag(quality), by = group]
    foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high', 'high', quality)]

Looking at foo :

        group period quality quality_lag quality_1
     1:     A      1 unknown          NA   unknown
     2:     B      2     low          NA        NA
     3:     C      3    high          NA      high
     4:     D      4     low          NA        NA
     5:     B      5 unknown         low   unknown
     6:     C      6    high        high      high
     7:     D      7     low         low       low
     8:     B      8 unknown     unknown   unknown
     9:     C      9    high        high      high
    10:     D     10 unknown         low   unknown
    11:     B     11 unknown     unknown   unknown
    12:     C     12     low        high      high
    13:     D     13 unknown     unknown   unknown
    14:     B     14    high     unknown      high
    15:     C     15    high         low      high
    16:     D     16 unknown     unknown   unknown

So quality_1 is basically what I want. If period X is 'low' and period X-1 is 'high' , reclassification to 'high' occurs, and everything else is carried over unchanged from quality . However, when quality_lag is NA , 'low' gets reclassified to NA in quality_1 . This is not a problem with 'high' or 'unknown' .

That is, the first four lines of foo should look like this:

       group period quality quality_lag quality_1
    1:     A      1 unknown          NA   unknown
    2:     B      2     low          NA       low
    3:     C      3    high          NA      high
    4:     D      4     low          NA       low

Any thoughts on what causes this?

+5

r data.table

thagzone Jan 29 '15 at 19:50

2 answers

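Your problem is that ifelse(NA, 1, 2) returns NA , and when you do NA == 'high' , the result is NA . In your code, quality == 'low' & quality_lag == 'high' therefore evaluates to TRUE & NA , which is NA , whenever quality is 'low' and quality_lag is NA ; for 'high' and 'unknown' it is FALSE & NA , which is FALSE , so those rows are unaffected. An easy fix is to pad with the string "NA" instead of a missing value in your lag function. Here is the working version of your code: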

    require(data.table)

    # lag takes a column and lags it by one period,
    # padding with the string "NA" (not a missing value)
    lag <- function(var) {
      lagged <- c("NA", var[1:(length(var)-1)])
      return(lagged)
    }

    set.seed(120)
    foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                      period = c(1:16),
                      quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))
    foo[, quality_lag := lag(quality), by = group]
    foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high', 'high', quality)]
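
The NA propagation behind the problem can be checked in plain base R, without data.table. The snippet below is only a small sketch; the variable names are made up for illustration, and the %in% variant at the end is an alternative that none of the posts here proposed.

```r
# Comparing a missing value with a string gives NA, not FALSE
cmp <- NA == "high"          # NA

# Logical AND: TRUE & NA is NA, but FALSE & NA is FALSE
low_row   <- TRUE & NA       # NA    (quality is 'low', lag is missing)
other_row <- FALSE & NA      # FALSE (quality is 'high' or 'unknown')

# ifelse() returns NA wherever its test is NA
res <- ifelse(c(TRUE, FALSE, NA), "yes", "no")

# Assumption, not from the posts above: %in% never returns NA,
# so a missing lag simply counts as "not 'high'"
safe <- NA %in% "high"       # FALSE

print(list(cmp = cmp, low_row = low_row, other_row = other_row,
           res = res, safe = safe))
```

This is why only the 'low' rows with a missing lag end up as NA in quality_1 .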

-3

thalesmello Jan 29 '15 at 20:18

David Arenburg · Accepted Answer · 2015-01-29T20:51:27+0000

Firstly, the development version on GitHub already has an efficient shift function, which can be used as both a lag and a lead (and has some additional functionality, see ?shift ).

Also take a look here , as there are tons of other new features now present in v >= 1.9.5

So, with v >= 1.9.5 we could just do

 foo[, quality_lag := shift(quality), by = group]

Although even with v < 1.9.5 you can use .N instead of creating this function, as follows

 foo[, quality_lag2 := c(NA, quality[-.N]), by = group]
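
As a rough sketch of how the two lagging approaches compare, the snippet below runs both on a small table. The table dt and its contents are made up for illustration, and shift assumes a data.table release that includes it (1.9.6 or later on CRAN).

```r
library(data.table)

# Hypothetical toy data: three groups, one of them with a single row
dt <- data.table(
  group   = c("A", "B", "B", "B", "C", "C"),
  quality = c("unknown", "low", "high", "low", "high", "unknown")
)

# Lag within each group using shift() (defaults: lag by 1, pad with NA)
dt[, lag_shift := shift(quality), by = group]

# Same idea without shift(): drop the last element (.N) and prepend NA
dt[, lag_manual := c(NA, quality[-.N]), by = group]

print(dt)
```

Both columns should hold the same values, with NA in the first row of every group.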

As for your second question, I would recommend avoiding ifelse altogether, for many reasons mentioned here

A possible alternative would be simple indexing, as in

 foo[, quality_1 := quality][quality == 'low' & quality_lag == 'high', quality_1 := "high"]

This solution has some overhead from calling [.data.table twice, but it will still be much more efficient and safer than the ifelse solution.
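
The difference between the two approaches can be seen on a small hand-made table. This is only a sketch: dt and its values are hypothetical, and it relies on data.table treating NA in a logical i expression as FALSE, so rows with a missing lag are simply not updated.

```r
library(data.table)

# Hypothetical toy data: two groups of two periods each
dt <- data.table(
  group   = c("B", "B", "C", "C"),
  period  = 1:4,
  quality = c("low", "high", "high", "low")
)
dt[, quality_lag := c(NA, quality[-.N]), by = group]

# ifelse(): the 'low' row with a missing lag becomes NA
via_ifelse <- dt[, ifelse(quality == "low" & quality_lag == "high",
                          "high", quality)]

# Indexing: copy quality first, then overwrite only the matching rows;
# rows where the condition is NA are left untouched
dt[, quality_1 := quality]
dt[quality == "low" & quality_lag == "high", quality_1 := "high"]

print(via_ifelse)
print(dt)
```

Here the first row keeps 'low' in quality_1 , while the ifelse version turns it into NA .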
