R: create dummy variables based on a categorical variable of lists

I have a data frame with a categorical variable containing lists of strings of variable length (this is important, because otherwise this question would be a duplicate of this or this), for example:

    df <- data.frame(x = 1:5)
    df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
    df
      x       y
    1 1       A
    2 2    A, B
    3 3       C
    4 4 B, D, C
    5 5       E

And the desired output is a dummy variable for each unique string that appears anywhere in df$y, i.e.:

    data.frame(x = 1:5,
               A = c(1,1,0,0,0),
               B = c(0,1,0,1,0),
               C = c(0,0,1,1,0),
               D = c(0,0,0,1,0),
               E = c(0,0,0,0,1))
      x A B C D E
    1 1 1 0 0 0 0
    2 2 1 1 0 0 0
    3 3 0 0 1 0 0
    4 4 0 1 1 1 0
    5 5 0 0 0 0 1

This naive approach works:

    uniqueStrings <- unique(unlist(df$y))
    n <- ncol(df)
    for (i in 1:length(uniqueStrings)) {
      df[, n + i] <- sapply(df$y, function(x) ifelse(uniqueStrings[i] %in% x, 1, 0))
      colnames(df)[n + i] <- uniqueStrings[i]
    }

However, it is ugly, clunky and slow on large data frames.

Any suggestions? Perhaps something from the tidyverse?


UPDATE: I tested the 3 approaches below with system.time on my laptop (Windows 7, 32GB RAM) on a real data set of 1M rows, each row containing a list of 1 to 4 strings (out of ~350 unique string values), about 200MB on disk. The expected result is therefore a 1M x 350 data frame. The tidyverse (@Sotos) and base (@joel.wilson) approaches took so long that I had to restart R. The qdapTools (@akrun) approach, however, worked fantastically:

    > system.time(res1 <- mtabulate(varsLists))
       user  system elapsed
      47.05   10.27  116.82

So this is the answer I will mark as accepted.
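For reference, here is a minimal sketch of how a benchmark of this kind could be reproduced on simulated data. The simulation parameters and the object name varsLists are assumptions, not the actual data set used above:

    library(qdapTools)

    set.seed(1)
    uniqueStrings <- sprintf("V%03d", 1:350)          # ~350 unique string values
    # 1M rows, each holding a list of 1 to 4 strings
    varsLists <- replicate(1e6,
                           sample(uniqueStrings, sample(1:4, 1)),
                           simplify = FALSE)

    system.time(res1 <- mtabulate(varsLists))         # time the qdapTools approach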

Tags: list, r, tidyverse, dummy-variable
3 answers

We can use mtabulate

    library(qdapTools)
    cbind(df[1], mtabulate(df$y))
    #  x A B C D E
    #1 1 1 0 0 0 0
    #2 2 1 1 0 0 0
    #3 3 0 0 1 0 0
    #4 4 0 1 1 1 0
    #5 5 0 0 0 0 1

Another idea

    library(dplyr)
    library(tidyr)

    df %>%
      unnest(y) %>%
      mutate(new = 1) %>%
      spread(y, new, fill = 0)
    #  x A B C D E
    #1 1 1 0 0 0 0
    #2 2 1 1 0 0 0
    #3 3 0 0 1 0 0
    #4 4 0 1 1 1 0
    #5 5 0 0 0 0 1

To handle the edge cases mentioned in the comments (e.g., repeated strings within a row or duplicated x values), we can use dcast from reshape2, as it is more flexible than spread:

    df2 <- df %>%
      unnest(y) %>%
      group_by(x) %>%
      filter(!duplicated(y)) %>%
      ungroup()

    reshape2::dcast(df2, x ~ y, value.var = 'y', length)
    #  x A B C D E
    #1 1 1 0 0 0 0
    #2 2 1 1 0 0 0
    #3 3 0 0 1 0 0
    #4 4 0 1 1 1 0
    #5 5 0 0 0 0 1

    # or with df$x <- c(1, 1, 2, 2, 3)
    #  x A B C D E
    #1 1 1 1 0 0 0
    #2 2 0 1 1 1 0
    #3 3 0 0 0 0 1

    # or with df$x <- rep(1, 5)
    #  x A B C D E
    #1 1 1 1 1 1 1

Without any external packages:

    # thanks to Sotos for suggesting `unique(unlist(df$y))` instead of `LETTERS[1:5]`
    sapply(unique(unlist(df$y)), function(j) as.numeric(grepl(j, df$y)))
    #     A B C D E
    #[1,] 1 0 0 0 0
    #[2,] 1 1 0 0 0
    #[3,] 0 0 1 0 0
    #[4,] 0 1 1 1 0
    #[5,] 0 0 0 0 1
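One caveat worth noting (added here as an observation, not part of the original answer): grepl() does pattern matching on the character representation of each list element, so if one label were a substring of another (e.g. "A" and "AB"), it could produce false positives. A membership test avoids this, at the cost of an extra sapply():

    # substring-safe variant: test exact membership instead of pattern matching
    sapply(unique(unlist(df$y)),
           function(j) as.numeric(sapply(df$y, function(v) j %in% v)))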
