Can dplyr be used for conditional mutation?

Question


Can mutate be used when the mutation is conditional (depending on the values of certain columns)?

This example helps show what I mean.

    structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1),
                   b = c(1, 3, 4, 2, 6, 7, 2, 6),
                   c = c(6, 3, 6, 5, 3, 6, 5, 3),
                   d = c(6, 2, 4, 5, 3, 7, 2, 6),
                   e = c(1, 2, 4, 5, 6, 7, 6, 3),
                   f = c(2, 3, 4, 2, 2, 7, 5, 2)),
              .Names = c("a", "b", "c", "d", "e", "f"),
              row.names = c(NA, 8L), class = "data.frame")

      a b c d e f
    1 1 1 6 6 1 2
    2 3 3 3 2 2 3
    3 4 4 6 4 4 4
    4 6 2 5 5 5 2
    5 3 6 3 3 6 2
    6 2 7 6 7 7 7
    7 5 2 5 2 6 5
    8 1 6 3 6 3 2

I was hoping to find a solution to my problem with the dplyr package (and yes, I know that this is not code that should work, but I think this makes the goal clear) to create a new column g:

    library(dplyr)
    df <- mutate(df,
                 if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
                 if (a == 0 | a == 1 | a == 4 | a == 3 | c == 4) {g = 3})

For this particular example, the result I'm looking for is:

      a b c d e f  g
    1 1 1 6 6 1 2  3
    2 3 3 3 2 2 3  3
    3 4 4 6 4 4 4  3
    4 6 2 5 5 5 2 NA
    5 3 6 3 3 6 2 NA
    6 2 7 6 7 7 7  2
    7 5 2 5 2 6 5  2
    8 1 6 3 6 3 2  3

Does anyone have an idea on how to do this in dplyr? This data frame is just an example, the data frames I deal with are much larger. Due to its speed, I tried to use dplyr, but maybe there are other, more efficient ways to solve this problem?

+124

r if-statement dplyr case-when mutate

rdatasculptor Jun 27 '14 at 19:48


5 answers

Since you ask for other ways to solve the problem, here is another way, using data.table:

    require(data.table) ## 1.9.2+
    setDT(df)
    df[a %in% c(0,1,3,4) | c == 4, g := 3L]
    df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]

Note that the order of the conditional statements is reversed to get g right. No copy of g is made, even during the second assignment; it is replaced in place.
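To see why the order matters, here is a minimal sketch on a made-up three-row table (the data is hypothetical, chosen so that the first row satisfies both conditions). Each := overwrites g wherever its condition holds, so the rule that should take priority has to run last:

```r
library(data.table)

# Hypothetical toy data: row 1 matches both rules, row 2 only the "3" rule,
# row 3 matches neither.
dt <- data.table(a = c(2, 1, 6), b = c(1, 1, 1), c = c(4, 1, 1))

# Lower-priority rule first ...
dt[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]
# ... then the higher-priority rule, which overwrites g where both match.
dt[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]

print(dt$g)
```

Row 1 ends up with 2 because the second assignment wins; swapping the two statements would leave it at 3.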

On larger data this should perform better than a nested if-else, since a nested if-else can evaluate both the yes and the no cases, and nesting can get harder to read and maintain, IMHO.

Here's a benchmark on relatively bigger data:
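The point about evaluating both cases can be checked with a small base R sketch. The helper functions yes_branch and no_branch are made up for illustration; they only record that they were called and on how many elements:

```r
# Record which branch expressions get evaluated, and on how many elements.
evaluated <- character(0)

yes_branch <- function(x) {
  evaluated <<- c(evaluated, paste("yes on", length(x), "elements"))
  x * 2
}

no_branch <- function(x) {
  evaluated <<- c(evaluated, paste("no on", length(x), "elements"))
  x * 3
}

x <- c(1, 2, 3, 4)

# The test is TRUE for only two elements, yet each branch expression is
# computed once on the full vector before ifelse() picks values from it.
res <- ifelse(x > 2, yes_branch(x), no_branch(x))

print(res)
print(evaluated)
```

When the test is mixed, both branches are computed for all four elements, which is the extra work a nested ifelse repeats at every level.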

    # R version 3.1.0
    require(data.table) ## 1.9.2
    require(dplyr)
    DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
    setnames(DT, letters[1:6])
    # > dim(DT)
    # [1] 10000000        6
    DF <- as.data.frame(DT)

    DT_fun <- function(DT) {
        DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
        DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
    }

    DPLYR_fun <- function(DF) {
        mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                       ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }

    BASE_fun <- function(DF) { # R v3.1.0
        transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                          ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }

    system.time(ans1 <- DT_fun(DT))
    #   user  system elapsed
    #  2.659   0.420   3.107

    system.time(ans2 <- DPLYR_fun(DF))
    #   user  system elapsed
    # 11.822   1.075  12.976

    system.time(ans3 <- BASE_fun(DF))
    #   user  system elapsed
    # 11.676   1.530  13.319

    identical(as.data.frame(ans1), as.data.frame(ans2))
    # [1] TRUE

    identical(as.data.frame(ans1), as.data.frame(ans3))
    # [1] TRUE

Not sure if this is the kind of alternative you asked for, but I hope it helps.

+48

Arun Jun 27 '14 at 20:21


dplyr now has a case_when function that offers a vectorized if. The syntax is a little strange compared to mosaic:::derivedFactor, since you cannot access variables in the standard dplyr way and you need to declare the mode of NA, but it is considerably faster than mosaic:::derivedFactor.

    df %>%
      mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
                           a %in% c(0,1,3,4) | c == 4 ~ 3L,
                           TRUE~as.integer(NA)))

EDIT: If you use dplyr::case_when() from before version 0.7.0 of the package, you need to prefix variable names with ".$" (for example, write .$a == 1 inside case_when).

Benchmark: the code below reuses the functions from Arun's post, with a reduced sample size:

    require(data.table)
    require(mosaic)
    require(dplyr)
    require(microbenchmark)

    DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
    setnames(DT, letters[1:6])
    DF <- as.data.frame(DT)

    DPLYR_case_when <- function(DF) {
      DF %>%
        mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
                             a %in% c(0,1,3,4) | c==4 ~ 3L,
                             TRUE~as.integer(NA)))
    }

    DT_fun <- function(DT) {
      DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
      DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
    }

    DPLYR_fun <- function(DF) {
      mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                     ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
    }

    mosa_fun <- function(DF) {
      mutate(DF, g = derivedFactor(
        "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
        "3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
        .method = "first",
        .default = NA
      ))
    }

    microbenchmark(
      DT_fun(DT),
      DPLYR_fun(DF),
      DPLYR_case_when(DF),
      mosa_fun(DF),
      times=20
    )

This gives:

                    expr        min         lq       mean     median         uq        max neval
              DT_fun(DT)   1.503589   1.626971   2.054825   1.755860   2.292157   3.426192    20
           DPLYR_fun(DF)   2.420798   2.596476   3.617092   3.484567   4.184260   6.235367    20
     DPLYR_case_when(DF)   2.153481   2.252134   6.124249   2.365763   3.119575  72.344114    20
            mosa_fun(DF) 396.344113 407.649356 413.743179 412.412634 416.515742 459.974969    20

+29

Matifou Oct 08 '16 at 18:22


The derivedFactor function from the mosaic package seems to be designed to handle this. Using this example, it will look like this:

    library(dplyr)
    library(mosaic)

    df <- mutate(df, g = derivedFactor(
      "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
      "3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
      .method = "first",
      .default = NA
    ))

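(If you want the result to be numeric rather than a factor, you can wrap derivedFactor in an as.numeric call. Since as.numeric on a factor returns the internal level codes, go through as.character first to get the labels 2 and 3.)

derivedFactor can also be used for an arbitrary number of conditionals.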
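One caveat when converting a factor to numbers, sketched in base R on a hand-made factor (a stand-in for the derivedFactor result, not its actual output): as.numeric applied directly to a factor returns the internal level codes, so the labels have to go through as.character first.

```r
# A stand-in factor with the same labels the derivedFactor call would produce.
f <- factor(c("2", "3", "2"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the labels parsed as numbers

print(codes)
print(values)
```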

+13

Jake Fisher Oct 22 '15 at 19:59


case_when is now a pretty clean implementation of the SQL-style case when:

    df <- structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1),
                         b = c(1, 3, 4, 2, 6, 7, 2, 6),
                         c = c(6, 3, 6, 5, 3, 6, 5, 3),
                         d = c(6, 2, 4, 5, 3, 7, 2, 6),
                         e = c(1, 2, 4, 5, 6, 7, 6, 3),
                         f = c(2, 3, 4, 2, 2, 7, 5, 2)),
                    .Names = c("a", "b", "c", "d", "e", "f"),
                    row.names = c(NA, 8L), class = "data.frame")

    df %>%
      mutate(g = case_when(
        a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
        a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3
      ))

Using dplyr 0.7.4

Manual: http://dplyr.tidyverse.org/reference/case_when.html

+8

Rasmus Larsen Oct 12 '17 at 11:03

G. Grothendieck · Accepted Answer · 2014-06-27 19:59

Use ifelse

    df %>%
      mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
                 ifelse(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA)))

Added - if_else: Note that dplyr 0.5 has an if_else function, so an alternative is to replace ifelse with if_else. Because if_else is stricter than ifelse (both legs of the condition must have the same type), the NA in this case needs to be replaced with NA_real_.

    df %>%
      mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
                 if_else(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA_real_)))
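The difference in strictness can be sketched with a deliberately mismatched pair of branches (a made-up example, not taken from the question). Whether a bare NA is still rejected depends on the dplyr version, and I believe recent releases accept it, but mixing double and character should be refused in any version:

```r
library(dplyr)

cond <- c(TRUE, FALSE)

# Base ifelse() silently coerces mixed types to a common one (character here).
loose <- ifelse(cond, 1, "a")

# dplyr::if_else() refuses to mix double and character and raises an error,
# which is captured here so the script keeps running.
strict <- tryCatch(if_else(cond, 1, "a"), error = function(e) e)

print(loose)
print(class(strict))
```

The silent coercion in ifelse is what makes a typed NA such as NA_real_ unnecessary there, and it is also how type mistakes can go unnoticed.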

Added - case_when . Since this question was submitted, dplyr added case_when , so another alternative would be the following:

    df %>%
      mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                           a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3,
                           TRUE ~ NA_real_))
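As a self-contained check, here is a sketch that runs the case_when version on the question's data (only columns a, b and c are rebuilt, since the conditions use nothing else; the %in% shorthand is borrowed from the other answers and is assumed equivalent to the chained == tests):

```r
library(dplyr)

# Columns a, b and c from the question's example data frame.
df <- data.frame(
  a = c(1, 3, 4, 6, 3, 2, 5, 1),
  b = c(1, 3, 4, 2, 6, 7, 2, 6),
  c = c(6, 3, 6, 5, 3, 6, 5, 3)
)

out <- df %>%
  mutate(g = case_when(
    a %in% c(2, 5, 7) | (a == 1 & b == 4) ~ 2,   # first match wins
    a %in% c(0, 1, 3, 4) | c == 4         ~ 3,
    TRUE                                  ~ NA_real_
  ))

print(out$g)
```

Rows are matched top to bottom, so a row satisfying both conditions gets 2. Row 5 has a == 3, so under the conditions as written it receives 3.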
