I have a data frame with a categorical variable containing lists of rows with a variable length (this is important because otherwise this question will duplicate this or this ), for example:
df <- data.frame(x = 1:5) df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E") df
xy 1 1 A 2 2 A, B 3 3 C 4 4 B, D, C 5 5 E
And the desired shape is a dummy variable for each unique line visible somewhere in df$y , i.e.:
data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))
x ABCDE 1 1 1 0 0 0 0 2 2 1 1 0 0 0 3 3 0 0 1 0 0 4 4 0 1 1 1 0 5 5 0 0 0 0 1
This naive approach works:
> uniqueStrings <- unique(unlist(df$y)) > n <- ncol(df) > for (i in 1:length(uniqueStrings)) { + df[, n + i] <- sapply(df$y, function(x) ifelse(uniqueStrings[i] %in% x, 1, 0)) + colnames(df)[n + i] <- uniqueStrings[i] + }
However, it is very ugly, lazy and slow with large frames of data.
Any suggestions? Something like tidyverse ?
UPDATE: I have 3 different approaches below. I tested them with system.time on my (Windows 7, 32GB RAM) laptop on a real data set consisting of 1M lines, each line contains a list of lengths from 1 to 4 lines (of ~ 350 unique string values), total 200MB on drive. Thus, the expected result is a 1M x 350 data frame. The tidyverse (@Sotos) and base (@ joel.wilson) approaches took so long that I had to restart the R. qdapTools (@akrun) approach, however, it worked fantastic :
> system.time(res1 <- mtabulate(varsLists)) user system elapsed 47.05 10.27 116.82
So this is the approach that I will designate as accepted.