Convert a list column to multiple columns in a data frame

I have a data frame with one column, which is a list, for example:

    > head(movies$genre_list)
    [[1]]
    [1] "drama"   "action"  "romance"

    [[2]]
    [1] "crime" "drama"

    [[3]]
    [1] "crime"   "drama"   "mystery"

    [[4]]
    [1] "thriller" "indie"

    [[5]]
    [1] "thriller"

    [[6]]
    [1] "drama"  "family"

I want to convert this one column into several binary columns, one for each unique item in the lists (in this case, each genre). I am looking for an elegant solution that does not involve first figuring out how many genres exist, then creating a column for each, and then checking each element of the list to fill in the genre columns. I tried unlist, but it does not work with the list column the way I want.

Thanks!

2 answers

Here are a few approaches, all using this sample data:

    movies <- data.frame(genre_list = I(list(
      c("drama", "action", "romance"),
      c("crime", "drama"),
      c("crime", "drama", "mystery"),
      c("thriller", "indie"),
      c("thriller"),
      c("drama", "family"))))

Update, years later ....

You can use the mtabulate function from "qdapTools" or the unexported charMat function from my "splitstackshape" package.

Syntax:

    library(qdapTools)
    mtabulate(movies$genre_list)
    #   action crime drama family indie mystery romance thriller
    # 1      1     0     1      0     0       0       1        0
    # 2      0     1     1      0     0       0       0        0
    # 3      0     1     1      0     0       1       0        0
    # 4      0     0     0      0     1       0       0        1
    # 5      0     0     0      0     0       0       0        1
    # 6      0     0     1      1     0       0       0        0

or

    splitstackshape:::charMat(movies$genre_list, fill = 0)
    #      action crime drama family indie mystery romance thriller
    # [1,]      1     0     1      0     0       0       1        0
    # [2,]      0     1     1      0     0       0       0        0
    # [3,]      0     1     1      0     0       1       0        0
    # [4,]      0     0     0      0     1       0       0        1
    # [5,]      0     0     0      0     0       0       0        1
    # [6,]      0     0     1      1     0       0       0        0

Update: Some More Direct Approaches

Improved option 1: use table somewhat directly:

    table(rep(1:nrow(movies), sapply(movies$genre_list, length)),
          unlist(movies$genre_list, use.names = FALSE))
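As a rough sketch of what the table call above is doing, the steps can be spelled out one at a time. The names row_id, genre and genre_df are illustrative only, and wrapping the row ids in a factor is an addition made here (on the assumption that a movie with no genres should still get an all-zero row).

```r
# Sample data, same as at the top of this answer
movies <- data.frame(genre_list = I(list(
  c("drama", "action", "romance"),
  c("crime", "drama"),
  c("crime", "drama", "mystery"),
  c("thriller", "indie"),
  c("thriller"),
  c("drama", "family"))))

# One row id per genre entry: row 1 repeated 3 times, row 2 twice, and so on.
# Making it a factor with all row numbers as levels keeps rows with no genres.
n_genres <- vapply(movies$genre_list, length, integer(1))
row_id <- factor(rep(seq_len(nrow(movies)), n_genres),
                 levels = seq_len(nrow(movies)))

# All genres flattened into one character vector, in the same order
genre <- unlist(movies$genre_list, use.names = FALSE)

# Cross-tabulate row id against genre, then convert to a plain data frame
genre_df <- as.data.frame.matrix(table(row_id, genre))
genre_df
```

If a genre can appear more than once within a single movie, the cells hold counts rather than 0/1 flags, so an extra step such as (genre_df > 0) + 0 would be needed to force binary values.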

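Improved option 2: use a for loop.

    # Unique genres become the column names of a zero-filled matrix
    x <- unique(unlist(movies$genre_list, use.names = FALSE))
    m <- matrix(0, ncol = length(x), nrow = nrow(movies),
                dimnames = list(NULL, x))
    # For each row, set the columns named by that row's genres to 1
    for (i in seq_len(nrow(m))) {
      m[i, movies$genre_list[[i]]] <- 1
    }
    m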
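The same fill can probably be done without an explicit loop by indexing the matrix with a two-column matrix of (row, column) positions. This is only a sketch of that idea, not part of the original answer; row_id and col_id are names made up for the illustration.

```r
# Sample data, same as at the top of this answer
movies <- data.frame(genre_list = I(list(
  c("drama", "action", "romance"),
  c("crime", "drama"),
  c("crime", "drama", "mystery"),
  c("thriller", "indie"),
  c("thriller"),
  c("drama", "family"))))

all_genres <- unlist(movies$genre_list, use.names = FALSE)
x <- unique(all_genres)
m <- matrix(0, ncol = length(x), nrow = nrow(movies),
            dimnames = list(NULL, x))

# Row position of every genre entry, and the column it belongs to
row_id <- rep(seq_len(nrow(movies)),
              vapply(movies$genre_list, length, integer(1)))
col_id <- match(all_genres, x)

# A two-column index matrix addresses individual cells of m
m[cbind(row_id, col_id)] <- 1
m
```

The columns come out in order of first appearance (as in the loop version), not alphabetically as with table.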

Below is the OLD answer

Convert the list to a list of tables (each in turn converted to a data.frame):

    tables <- lapply(seq_along(movies$genre_list), function(x) {
      temp <- as.data.frame.table(table(movies$genre_list[[x]]))
      names(temp) <- c("Genre", paste("Record", x, sep = "_"))
      temp
    })

Use Reduce to merge the resulting list. If I understand your final goal correctly, this gives a transposed form of the result you are interested in.

    merged_tables <- Reduce(function(x, y) merge(x, y, all = TRUE), tables)
    merged_tables
    #      Genre Record_1 Record_2 Record_3 Record_4 Record_5 Record_6
    # 1   action        1       NA       NA       NA       NA       NA
    # 2    drama        1        1        1       NA       NA        1
    # 3  romance        1       NA       NA       NA       NA       NA
    # 4    crime       NA        1        1       NA       NA       NA
    # 5  mystery       NA       NA        1       NA       NA       NA
    # 6    indie       NA       NA       NA        1       NA       NA
    # 7 thriller       NA       NA       NA        1        1       NA
    # 8   family       NA       NA       NA       NA       NA        1

Transposing and converting NA to 0 is quite simple. Just drop the first column and reuse it as the column names of the new data.frame:

    movie_genres <- setNames(data.frame(t(merged_tables[-1])), merged_tables[[1]])
    movie_genres[is.na(movie_genres)] <- 0
    movie_genres

Using the same input as the other answers, some alternatives are given:

1) factor / table / rbind

    > levs <- levels(factor(unlist(movies[[1]])))
    > as.data.frame(do.call(rbind, lapply(lapply(movies[[1]], factor, levs), table)))
      action crime drama family indie mystery romance thriller
    1      1     0     1      0     0       0       1        0
    2      0     1     1      0     0       0       0        0
    3      0     1     1      0     0       1       0        0
    4      0     0     0      0     1       0       0        1
    5      0     0     0      0     0       0       0        1
    6      0     0     1      1     0       0       0        0
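Alternative 1 works because every row's genres are turned into a factor with the same full set of levels, so table reports a zero for each genre the row lacks. A minimal sketch of that one step, using a shortened, made-up level set for illustration:

```r
# Hypothetical, shortened level set just for this illustration
levs <- c("action", "crime", "drama")

# Genres of a single movie, tabulated against the shared levels
counts <- table(factor(c("crime", "drama"), levels = levs))
counts
```

Because each per-row table has the same length and names, rbind can stack them into a matrix with one column per genre.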

2) make.groups / xtabs

    > library(lattice)
    > m <- do.call(make.groups, movies[[1]])
    > as.data.frame.matrix(xtabs(~ which + data, m))
                                    action crime drama family indie mystery romance thriller
    c("drama", "action", "romance")      1     0     1      0     0       0       1        0
    c("crime", "drama")                  0     1     1      0     0       0       0        0
    c("crime", "drama", "mystery")       0     1     1      0     0       1       0        0
    c("thriller", "indie")               0     0     0      0     1       0       0        1
    thriller                             0     0     0      0     0       0       0        1
    c("drama", "family")                 0     0     1      1     0       0       0        0

2a) make.groups / dcast. This is a variation of alternative 2 that uses dcast from reshape2 instead of as.data.frame.matrix and xtabs. The molten data frame m is the one created in alternative 2.

    library(reshape2)
    dcast(m, which ~ data, fun.aggregate = length, value.var = "which")

UPDATE: alternative 2 added.

UPDATE 2: Added alternative 2a.
