Creating indicator variable columns in dplyr chain

Question

Update: Apologies to those who answered. In my original example I overlooked the fact that data.frame() created var as a factor rather than as a character vector, as I had expected. I have fixed the example, which will break at least one of the answers.

- original -

I have a data frame on which I perform a series of dplyr and tidyr operations, and I would like to add columns for indicator variables encoded as 0 or 1, and do this within the dplyr chain. Each level of the variable (currently stored as a character vector) should be encoded in a separate column, and the column names should be the concatenation of a fixed prefix with the level. For example, where var has level a, the new var_a column will be 1, and all other var_a rows will be 0.

The following minimal example using base R gives exactly the results that I want (thanks to this blog post), but I would like to fold all of this into the dplyr chain and cannot figure out how to do it.

    library(dplyr)

    df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE),
                     stringsAsFactors = FALSE)

    for (level in unique(df$var)) {
      df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
    }

Note that the actual data set contains several columns, none of which should be changed or dropped when creating the indicator variables, except for the var column, which may be converted to a factor.

+8

r dplyr tidyr

Tom Mar 11 '16 at 15:02


4 answers

The only requirements for a function that should be part of the dplyr pipeline are that it takes a data frame as input and returns a data frame as output. So using model.matrix :

    make_inds <- function(df, cols = names(df)) {
      # do each variable separately to get around model.matrix dropping aliased columns
      # list(df) keeps the data frame whole, so cbind returns a data frame rather than a matrix
      do.call(cbind, c(list(df), lapply(cols, function(n) {
        x <- df[[n]]
        mm <- model.matrix(~ x - 1)
        colnames(mm) <- gsub("^x", paste(n, "_", sep = ""), colnames(mm))
        mm
      })))
    }

    # insert into pipeline
    data %>% ... %>% make_inds %>% ...
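To see what model.matrix contributes for a single variable, here is a minimal base R sketch. The toy data frame and its column names are made up for illustration; this is not part of the answer above. Removing the intercept with `- 1` is what gives every level its own column. With several factors in one formula, model.matrix would drop a level from all but the first, which is presumably why the function above handles one variable at a time.

```r
# Toy data; `df` and its columns are hypothetical.
df <- data.frame(var = c("a", "b", "a", "c"), n = 1:4,
                 stringsAsFactors = FALSE)

# `~ var - 1` removes the intercept so each level gets its own 0/1 column.
mm <- model.matrix(~ var - 1, data = df)

# model.matrix names the columns "vara", "varb", ...; insert the underscore.
colnames(mm) <- sub("^var", "var_", colnames(mm))

# cbind() with a data frame as first argument returns a data frame
# and leaves the existing columns untouched.
out <- cbind(df, mm)
```

The indicator columns come back as doubles; wrap them in as.integer() if integer storage matters.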

+3

Hong ooi Mar 11 '16 at 17:10


This is possible without writing a function, although it requires lapply. If var is a factor, you can work with its levels: bind columns onto the data with bind_cols, using lapply to iterate over the levels of var and create the indicator values, setNames to name them, and as_data_frame to convert the result to a tbl_df.

    df %>%
      bind_cols(as_data_frame(setNames(
        lapply(levels(df$var), function(x) { as.integer(df$var == x) }),
        paste0('var2_', levels(df$var))
      )))

returns

    Source: local data frame [10 x 5]

          var var_d var_c var2_c var2_d
       (fctr) (dbl) (dbl)  (int)  (int)
    1       d     1     0      0      1
    2       c     0     1      1      0
    3       c     0     1      1      0
    4       c     0     1      1      0
    5       d     1     0      0      1
    6       d     1     0      0      1
    7       c     0     1      1      0
    8       c     0     1      1      0
    9       d     1     0      0      1
    10      c     0     1      1      0

If var is a character vector, not a factor, you can do the same, but using unique instead of levels :

    df %>%
      bind_cols(as_data_frame(setNames(
        lapply(unique(df$var), function(x) { as.integer(df$var == x) }),
        paste0('var2_', unique(df$var))
      )))

Two notes:

The unique version will work regardless of data type, but will be slower. If your data is big enough for that to matter, it probably makes sense to store the data as a factor anyway, as it contains many repeated levels.
Both versions pull the data from df$var as it lives in the calling environment, not as it may exist partway through a larger chain, and so assume that var is unchanged in whatever is passed in. Referring to the dynamic value of var outside of dplyr's normal NSE is rather a pain, as far as I can tell.
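One way around the second note is to move the same lapply logic into a small function that reads the column from whatever data frame it is handed. The sketch below is only an illustration under that assumption: `add_indicators` and its `prefix` argument are hypothetical names, not something from dplyr or from this answer.

```r
library(dplyr)

# Hypothetical helper: builds the indicators from the data it receives,
# not from a `df` object in the calling environment.
add_indicators <- function(data, col, prefix = col) {
  values <- data[[col]]
  # Use levels for factors and unique values for anything else.
  lv <- if (is.factor(values)) levels(values) else unique(values)
  ind <- lapply(lv, function(x) as.integer(values == x))
  names(ind) <- paste(prefix, lv, sep = "_")
  bind_cols(data, as.data.frame(ind, check.names = FALSE))
}

# Toy data, made up for illustration.
df <- data.frame(var = c("a", "b", "a", "c"), n = 1:4,
                 stringsAsFactors = FALSE)

out <- df %>%
  filter(var != "c") %>%   # an earlier step that changes the rows
  add_indicators("var")    # indicators reflect the filtered data
```

Because the helper sees the filtered rows, no var_c column is created here. For a factor column, levels that no longer occur would still get an all-zero column unless the factor is passed through droplevels() first.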

Another alternative that is a bit simpler and factor-agnostic uses reshape2::dcast:

    library(reshape2)

    df %>%
      cbind(1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[, -1]))

It still pulls the version of df from the calling environment, so all the chain really determines is what you are joining to. Since cbind is used instead of bind_cols, the result will be a data.frame, not a tbl_df, so if you want to keep everything as a tbl_df (smart if the data is big) you need to replace cbind with bind_cols(as_data_frame( ... )); bind_cols doesn't seem to want to do the conversion for you.

Note, however, that although this version is simpler, it is comparatively slow. On factor data:

    Unit: microseconds
       expr      min        lq      mean    median       uq      max neval
     factor  358.889  384.0010  479.5746  427.9685  501.580 3995.951   100
     unique  547.249  585.4205  696.4709  633.4215  696.402 4528.099   100
      dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796   100

and on string data:

    Unit: microseconds
       expr      min       lq      mean    median        uq      max neval
     unique  307.190  336.422  414.1031  362.6485  419.3625 3693.340   100
      dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178   100

For small data this does not matter, but for big data the faster versions may be worth the extra complexity.

+2

alistaire Mar 11 '16 at 17:48


I originally landed on this question and its answers because I really wanted to put model.matrix into a magrittr pipe workflow, or to create equivalent output using only tidyverse functions (sorry, base R users).

Later, I found this solution, which elegantly uses functions that I had suspected were up to the task (but I did not come up with it myself):

    library(dplyr)
    library(tidyr)

    df <- data_frame(var = sample(x = letters[1:4], size = 10, replace = TRUE))

    df %>%
      mutate(unique_row_id = 1:n()) %>%  # The rows need to be unique for 'spread' to work.
      mutate(dummy = 1) %>%
      spread(var, dummy, fill = 0)

So I am adding an updated and modified version of the linked solution here, so that people who land on this page first do not have to keep looking (as I did).
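In current tidyr, spread is superseded by pivot_wider. The sketch below shows the same idea under the assumption that tidyr 1.0.0 or later is installed. The column names `row_id`, `level` and `dummy` are made up for illustration. Spreading a copy of var rather than var itself keeps the original column, and names_prefix supplies the fixed prefix the question asks for.

```r
library(dplyr)
library(tidyr)

# Toy data, made up for illustration.
df <- tibble(var = c("a", "b", "a", "c"))

out <- df %>%
  mutate(row_id = row_number(),  # rows must be uniquely identified
         level  = var,           # copy to spread, so `var` itself survives
         dummy  = 1L) %>%
  pivot_wider(names_from   = level,
              values_from  = dummy,
              names_prefix = "var_",
              values_fill  = list(dummy = 0L)) %>%
  select(-row_id)
```

The named-list form of values_fill is used here because it is accepted across tidyr versions that have pivot_wider.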

0

D. Woods Dec 10 '18 at 21:44


Mrflick · Accepted Answer · 2016-03-11T16:46:51+0000

It's not very pretty, but this function should work:

    dummy <- function(data, col) {
      for (c in col) {
        idx <- which(names(data) == c)
        v <- data[[idx]]
        stopifnot(class(v) == "factor")
        m <- matrix(0, nrow = nrow(data), ncol = nlevels(v))
        m[cbind(seq_along(v), as.integer(v))] <- 1
        colnames(m) <- paste(c, levels(v), sep = "_")
        r <- data.frame(m)
        if (idx > 1) {
          r <- cbind(data[1:(idx - 1)], r)
        }
        if (idx < ncol(data)) {
          r <- cbind(r, data[(idx + 1):ncol(data)])
        }
        data <- r
      }
      data
    }

Here's a sample data.frame

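    dd <- data.frame(
      a = runif(30),
      b = sample(letters[1:3], 30, replace = TRUE),
      c = rnorm(30),
      d = sample(letters[10:13], 30, replace = TRUE),
      stringsAsFactors = TRUE  # dummy() requires factors; R >= 4.0 no longer converts strings by default
    )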

You specify the columns you want to expand as a character vector. You can do

 dd %>% dummy("b")

or

 dd %>% dummy(c("b","d"))
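The core of the function is the line that indexes the matrix with a two-column matrix of (row, column) pairs. A minimal base R sketch of just that step, on a made-up factor `v`, may make it easier to follow:

```r
# Toy factor, made up for illustration; levels sort to a, b, c.
v <- factor(c("b", "a", "c", "a"))

# One row per observation, one column per level, all zeros to start.
m <- matrix(0, nrow = length(v), ncol = nlevels(v))

# cbind() builds (row, column) pairs: row i, column = integer code of v[i].
# Indexing a matrix with a two-column matrix sets exactly those cells.
m[cbind(seq_along(v), as.integer(v))] <- 1

colnames(m) <- paste("v", levels(v), sep = "_")
```

Each row of `m` ends up with a single 1 in the column for that observation's level. An NA in the factor would produce an NA index and make the assignment fail, so missing values need handling before this step.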
