"unpacking" the list of factors from data.frame

I am new to R /, having the ability to easily reorganize data and hunt for a solution, but cannot find what I would like to do. Reshape2 melt / cast does not work, and I have not mastered plyr well enough to enable it here.

Basically, I have a data.frame with the structure described below, with a category column in which each element is a list of categories with a variable length (more compact, because # columns are much larger, and in fact I have several list_ categories that I would like to leave it separately):

>mydf ID category_list xval yval 1 ID1 cat1, cat2, cat3 xnum1 ynum1 2 ID2 cat2, cat3 xnum2 ynum2 3 ID3 cat1 xnum3 ynum3 

I want to do category manipulations as factors (and their associated values, i.e. 3/4 columns), so I think I need something like this at the end, where the identifiers and x / y / other column values ​​are duplicated according to the length of the list of categories:

  ID category xval yval 1 ID1 cat1 xnum1 ynum1 2 ID1 cat2 xnum1 ynum1 3 ID1 cat3 xnum1 ynum1 4 ID2 cat2 xnum2 ynum2 5 ID2 cat3 xnum2 ynum2 6 ID3 cat3 xnum2 ynum2 

If there is another solution for factor / facet in category_list, this would be a simpler solution, but I have not come across methods that support this, for example the following gives an error

 >ggplot(mydf, aes(x=x, y=y)) + geom_point() + facet_grid(~cat_list) 

Mistake in layout_base (data, cols, drop = drop): At least one layer must contain all the variables used for faceting

Thanks!

+6
source share
6 answers

The answer will depend on the format of category_list . If in fact it is a list for each row

Sort of

 mydf <- data.frame(ID = paste0('ID',1:3), category_list = I(list(c('cat1','cat2','cat3'), c('cat2','cat3'), c('cat1'))), xval = 1:3, yval = 1:3) 

or

 library(data.table) mydf <- as.data.frame(data.table(ID = paste0('ID',1:3), category_list = list(c('cat1','cat2','cat3'), c('cat2','cat3'), c('cat1')), xval = 1:3, yval = 1:3) ) 

Then you can use plyr and merge to create your long form data

  newdf <- merge(mydf, ddply(mydf, .(ID), summarize, cat_list = unlist(category_list)), by = 'ID') ID category_list xval yval cat_list 1 ID1 cat1, cat2, cat3 1 1 cat1 2 ID1 cat1, cat2, cat3 1 1 cat2 3 ID1 cat1, cat2, cat3 1 1 cat3 4 ID2 cat2, cat3 2 2 cat2 5 ID2 cat2, cat3 2 2 cat3 6 ID3 cat1 3 3 cat1 

or non-plyr approach that does not require merge

  do.call(rbind,lapply(split(mydf, mydf$ID), transform, cat_list = unlist(category_list))) 
+9
source

A tight but seemingly reliable solution:

 ## Some example data df <- as.data.frame(cbind(ID = paste0("ID", 1:2), category_list = list(4:1, 2:3), xvar = 8:9, yvar = 10:9)) ## Calculate number of times each row of df will be repeated nn <- sapply(df$category_list, length) ii <- rep(seq_along(nn), times=nn) ## Reshape data.frame transform(df[ii,], category = unlist(df$category_list), category_list = NULL, row.names = NULL) # ID xvar yvar category # 1 ID1 8 10 4 # 2 ID1 8 10 3 # 3 ID1 8 10 2 # 4 ID1 8 10 1 # 5 ID2 9 9 2 # 6 ID2 9 9 3 
+5
source

Opportunity:

 x <- read.table(textConnection(' ID category_list xval yval ID1 "cat1, cat2, cat3" xnum1 ynum1 ID2 "cat2, cat3" xnum2 ynum2 ID3 "cat1" xnum3 ynum3'), header=TRUE,stringsAsFactors=FALSE) library(plyr) ddply(x,"ID",transform,category=strsplit(category_list,",")[[1]]) ## ID category_list xval yval category ## 1 ID1 cat1, cat2, cat3 xnum1 ynum1 cat1 ## 2 ID1 cat1, cat2, cat3 xnum1 ynum1 cat2 ## 3 ID1 cat1, cat2, cat3 xnum1 ynum1 cat3 ## 4 ID2 cat2, cat3 xnum2 ynum2 cat2 ## 5 ID2 cat2, cat3 xnum2 ynum2 cat3 
+2
source

This will be the non-plyr approach:

 cbind( x[ rep(1:nrow(x), times=sapply(x$category_list, function(xx) sapply( strsplit(xx, ","), length) ) ), -2], # to get rid of the old category column new_cats = unlist( strsplit(x$category_list, ",") ) ) # this used Bolker example. If these are factor will need to add `as.character` ID xval yval new_cats 1 ID1 xnum1 ynum1 cat1 1.1 ID1 xnum1 ynum1 cat2 1.2 ID1 xnum1 ynum1 cat3 2 ID2 xnum2 ynum2 cat2 2.1 ID2 xnum2 ynum2 cat3 3 ID3 xnum3 ynum3 cat1 
0
source

Another basic R feature using by :

 do.call(rbind, by(mydf, mydf$ID, function(x) { data.frame( ID=x$ID, category_list = unlist(strsplit(x$category_list,",")), xval=x$xval, yval=x$yval ) } ) ) 

Result:

  ID category_list xval yval ID1.1 ID1 cat1 xnum1 ynum1 ID1.2 ID1 cat2 xnum1 ynum1 ID1.3 ID1 cat3 xnum1 ynum1 ID2.1 ID2 cat2 xnum2 ynum2 ID2.2 ID2 cat3 xnum2 ynum2 ID3 ID3 cat1 xnum3 ynum3 
0
source

Note. The original answer was deleted, as my answer was based on a different data structure than the OP actually.


Scenario 1: list column

Using @mnel fetch data:

 mydf <- data.frame(ID = paste0('ID',1:3), category_list = I(list(c('cat1','cat2','cat3'), c('cat2','cat3'), c('cat1'))), xval = 1:3, yval = 1:3) 

Using listCol_l from my splitstackshape package

 library(splitstackshape) listCol_l(mydf, "category_list") # ID xval yval category_list_ul # 1: ID1 1 1 cat1 # 2: ID1 1 1 cat2 # 3: ID1 1 1 cat3 # 4: ID2 2 2 cat2 # 5: ID2 2 2 cat3 # 6: ID3 3 3 cat1 

Using unnest from the unnest package

 library(tidyr) unnest(mydf, "category_list") # ID category_list xval yval # 1 ID1 cat1 1 1 # 2 ID1 cat2 1 1 # 3 ID1 cat3 1 1 # 4 ID2 cat2 2 2 # 5 ID2 cat3 2 2 # 6 ID3 cat1 3 3 

Scenario 2: Column is a concatenated row

Using @BenBolker example data:

 x <- read.table(textConnection(' ID category_list xval yval ID1 "cat1, cat2, cat3" xnum1 ynum1 ID2 "cat2, cat3" xnum2 ynum2 ID3 "cat1" xnum3 ynum3'), header=TRUE,stringsAsFactors=FALSE) 

Using cSplit from my splitstackshape package

 library(splitstackshape) cSplit(x, "category_list", ",", "long") # ID category_list xval yval # 1: ID1 cat1 xnum1 ynum1 # 2: ID1 cat2 xnum1 ynum1 # 3: ID1 cat3 xnum1 ynum1 # 4: ID2 cat2 xnum2 ynum2 # 5: ID2 cat3 xnum2 ynum2 # 6: ID3 cat1 xnum3 ynum3 
0
source

All Articles