Can dplyr be used for conditional mutation?

Can mutate be used when the mutation is conditional (depending on the values of certain columns)?

This example helps show what I mean.

    structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 2, 6, 7, 2, 6),
                   c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 5, 3, 7, 2, 6),
                   e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 2, 2, 7, 5, 2)),
              .Names = c("a", "b", "c", "d", "e", "f"),
              row.names = c(NA, 8L), class = "data.frame")

      a b c d e f
    1 1 1 6 6 1 2
    2 3 3 3 2 2 3
    3 4 4 6 4 4 4
    4 6 2 5 5 5 2
    5 3 6 3 3 6 2
    6 2 7 6 7 7 7
    7 5 2 5 2 6 5
    8 1 6 3 6 3 2

I was hoping to find a solution to my problem with the dplyr package (and yes, I know that this is not code that should work, but I think this makes the goal clear) to create a new column g:

    library(dplyr)
    df <- mutate(df,
                 if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
                 if (a == 0 | a == 1 | a == 4 | a == 3 | c == 4) {g = 3})

For this particular example, the code I'm looking for should produce this result:

      a b c d e f  g
    1 1 1 6 6 1 2  3
    2 3 3 3 2 2 3  3
    3 4 4 6 4 4 4  3
    4 6 2 5 5 5 2 NA
    5 3 6 3 3 6 2 NA
    6 2 7 6 7 7 7  2
    7 5 2 5 2 6 5  2
    8 1 6 3 6 3 2  3

Does anyone have an idea how to do this in dplyr? This data frame is just an example; the data frames I deal with are much larger. I tried dplyr because of its speed, but maybe there are other, more efficient ways to solve this problem?

+124
r if-statement dplyr case-when mutate
Jun 27 '14 at 19:48
5 answers

Use ifelse

    df %>%
      mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
                 ifelse(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA)))

Added - if_else: dplyr 0.5 introduced an if_else function, so an alternative is to replace ifelse with if_else. Note that if_else is stricter than ifelse (both legs of the condition must have the same type), so NA in this case needs to be replaced with NA_real_.

    df %>%
      mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
                 if_else(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA_real_)))

Added - case_when: since this question was posted, dplyr has added case_when, so another alternative would be the following:

    df %>%
      mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                           a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3,
                           TRUE ~ NA_real_))

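As a quick sanity check, the three variants can be run side by side on the question's data. This is only a minimal sketch, assuming dplyr 0.7.0 or later is installed; the helper vectors `cond2` and `cond3` are names introduced here for readability and are not part of the original answer.

```r
library(dplyr)

# Columns a, b and c of the example data (the only ones the conditions use)
df <- data.frame(
  a = c(1, 3, 4, 6, 3, 2, 5, 1),
  b = c(1, 3, 4, 2, 6, 7, 2, 6),
  c = c(6, 3, 6, 5, 3, 6, 5, 3)
)

# Hypothetical helper vectors holding the two conditions
cond2 <- with(df, a == 2 | a == 5 | a == 7 | (a == 1 & b == 4))
cond3 <- with(df, a == 0 | a == 1 | a == 4 | a == 3 | c == 4)

res <- df %>% mutate(
  g_ifelse    = ifelse(cond2, 2, ifelse(cond3, 3, NA)),
  g_if_else   = if_else(cond2, 2, if_else(cond3, 3, NA_real_)),
  g_case_when = case_when(cond2 ~ 2, cond3 ~ 3, TRUE ~ NA_real_)
)

# All three columns hold the same values
stopifnot(identical(res$g_ifelse, res$g_if_else),
          identical(res$g_ifelse, res$g_case_when))
```

Which one to pick is mostly a readability question: with two conditions the nested calls are still manageable, while case_when keeps each rule on its own line as more rules are added.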
+151
Jun 27 '14 at 19:59

Since you ask for other ways to solve this problem, here is another one, using data.table:

    require(data.table) ## 1.9.2+
    setDT(df)
    df[a %in% c(0,1,3,4) | c == 4, g := 3L]
    df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]

Note that the order of the conditional statements is reversed so that g comes out correctly. No copy of g is made, even during the second assignment; it is replaced in place.
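To see why the order is reversed, here is a small sketch on made-up data (the three rows below are hypothetical, chosen so that the first row matches both rules). It assumes data.table is installed; the later assignment overwrites the earlier one, which gives the same priority as a nested ifelse where the first test wins.

```r
library(data.table)

# Hypothetical rows: row 1 matches both rules, row 2 only the second
# rule, row 3 neither
DT <- data.table(a = c(2, 1, 6), b = c(1, 1, 1), c = c(4, 6, 5))

# Lower-priority rule first, higher-priority rule second: a row that
# matches both rules ends up with the value assigned last (2)
DT[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]
DT[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]

# The same priority written as a nested ifelse, where the first test wins
expected <- with(DT, ifelse(a %in% c(2, 5, 7) | (a == 1 & b == 4), 2L,
                     ifelse(a %in% c(0, 1, 3, 4) | c == 4, 3L, NA_integer_)))

stopifnot(identical(DT$g, expected))
```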

On larger data this will perform better than a nested if-else, since a nested if-else can evaluate both the 'yes' and the 'no' cases, and nesting can be harder to read and maintain, IMHO.
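The point about both cases being evaluated can be illustrated in base R. In this sketch the counters and the two branch functions are hypothetical names used only to record how much work each branch does.

```r
# Hypothetical counters recording how many elements each branch computes
n_yes <- 0
n_no  <- 0

yes_branch <- function(x) { n_yes <<- n_yes + length(x); x * 2 }
no_branch  <- function(x) { n_no  <<- n_no  + length(x); x * 3 }

x <- 1:10
res <- ifelse(x > 5, yes_branch(x), no_branch(x))

# Each branch was computed for all 10 elements, although only 5 values
# from each were actually used in the result
stopifnot(n_yes == 10, n_no == 10)
```

With nested calls this extra work is repeated at every level, which is one reason the subset-and-assign approach can come out ahead on large inputs.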




Here's a benchmark on relatively larger data:

    # R version 3.1.0
    require(data.table) ## 1.9.2
    require(dplyr)
    DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
    setnames(DT, letters[1:6])
    # > dim(DT)
    # [1] 10000000        6
    DF <- as.data.frame(DT)

    DT_fun <- function(DT) {
        DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
        DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
    }

    DPLYR_fun <- function(DF) {
        mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }

    BASE_fun <- function(DF) { # R v3.1.0
        transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }

    system.time(ans1 <- DT_fun(DT))
    #   user  system elapsed
    #  2.659   0.420   3.107

    system.time(ans2 <- DPLYR_fun(DF))
    #   user  system elapsed
    # 11.822   1.075  12.976

    system.time(ans3 <- BASE_fun(DF))
    #   user  system elapsed
    # 11.676   1.530  13.319

    identical(as.data.frame(ans1), as.data.frame(ans2))
    # [1] TRUE

    identical(as.data.frame(ans1), as.data.frame(ans3))
    # [1] TRUE

Not sure if this is the alternative you requested, but I hope this helps.

+48
Jun 27 '14 at 20:21

dplyr now has a case_when function that offers a vectorized if. The syntax is a bit strange compared to mosaic:::derivedFactor, since you cannot access variables in the standard dplyr way and must declare the mode of NA, but it is much faster than mosaic:::derivedFactor.

    df %>%
      mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
                           a %in% c(0,1,3,4) | c == 4 ~ 3L,
                           TRUE ~ as.integer(NA)))

EDIT: If you use dplyr::case_when() from a version of the package before 0.7.0, you need to prefix variable names with ".$" (for example, write .$a == 1 inside case_when).

Benchmark: for the benchmark (reusing the functions from Arun's post) with a reduced sample size:

    require(data.table)
    require(mosaic)
    require(dplyr)
    require(microbenchmark)

    DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
    setnames(DT, letters[1:6])
    DF <- as.data.frame(DT)

    DPLYR_case_when <- function(DF) {
      DF %>%
        mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
                             a %in% c(0,1,3,4) | c==4 ~ 3L,
                             TRUE ~ as.integer(NA)))
    }

    DT_fun <- function(DT) {
      DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
      DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
    }

    DPLYR_fun <- function(DF) {
      mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                     ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }

    mosa_fun <- function(DF) {
      mutate(DF, g = derivedFactor(
        "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
        "3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
        .method = "first",
        .default = NA
      ))
    }

    microbenchmark(
      DT_fun(DT),
      DPLYR_fun(DF),
      DPLYR_case_when(DF),
      mosa_fun(DF),
      times = 20
    )

This gives:

                    expr        min         lq       mean     median         uq        max neval
              DT_fun(DT)   1.503589   1.626971   2.054825   1.755860   2.292157   3.426192    20
           DPLYR_fun(DF)   2.420798   2.596476   3.617092   3.484567   4.184260   6.235367    20
     DPLYR_case_when(DF)   2.153481   2.252134   6.124249   2.365763   3.119575  72.344114    20
            mosa_fun(DF) 396.344113 407.649356 413.743179 412.412634 416.515742 459.974969    20

+29
Oct 08 '16 at 18:22

The derivedFactor function from the mosaic package seems to be designed for this. Applied to this example, it looks like this:

    library(dplyr)
    library(mosaic)

    df <- mutate(df, g = derivedFactor(
      "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
      "3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
      .method = "first",
      .default = NA
    ))

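(If you want the result to be numeric rather than a factor, you can wrap derivedFactor in as.numeric(as.character(...)); a bare as.numeric call on a factor returns the internal level codes rather than the labels.)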

derivedFactor can also be used for an arbitrary number of conditionals.
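One caveat when turning a factor like this into numbers in base R: as.numeric on a factor gives the level codes, not the labels. The sketch below uses a hand-built factor with hypothetical values rather than actual derivedFactor output, so it runs without the mosaic package.

```r
# A factor with levels "2" and "3", similar in shape to a factor-valued
# result column (hypothetical values)
f <- factor(c("3", "2", NA, "3"), levels = c("2", "3"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the labels read as numbers

stopifnot(identical(codes,  c(2, 1, NA, 2)),
          identical(values, c(3, 2, NA, 3)))
```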

+13
Oct 22 '15 at 19:59

case_when is now a pretty clean implementation of the SQL-style case when:

    structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 2, 6, 7, 2, 6),
                   c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 5, 3, 7, 2, 6),
                   e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 2, 2, 7, 5, 2)),
              .Names = c("a", "b", "c", "d", "e", "f"),
              row.names = c(NA, 8L), class = "data.frame") -> df

    df %>%
      mutate(g = case_when(
        a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
        a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3
      ))

Using dplyr 0.7.4

Manual: http://dplyr.tidyverse.org/reference/case_when.html
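Two properties of case_when are worth keeping in mind when leaving out a final TRUE ~ ... clause: the first matching condition wins, and values that match no condition become NA. A minimal sketch on a made-up vector (the labels "small" and "medium" are purely illustrative), assuming dplyr is installed:

```r
library(dplyr)

x <- c(1, 5, 10, 20)

res <- case_when(
  x < 3  ~ "small",
  x < 15 ~ "medium"   # also true for x = 1, but the first match wins
)

# 20 matches neither condition, so it becomes NA
stopifnot(identical(res, c("small", "medium", "medium", NA)))
```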

+8
Oct 12 '17 at 11:03