Flatten list with complex nested structure

I have a list with the following sample structure:

    > dput(test)
    structure(list(id = 1, var1 = 2, var3 = 4,
      section1 = structure(list(var1 = 1, var2 = 2, var3 = 3),
        .Names = c("var1", "var2", "var3")),
      section2 = structure(list(
        row = structure(list(var1 = 1, var2 = 2, var3 = 3),
          .Names = c("var1", "var2", "var3")),
        row = structure(list(var1 = 4, var2 = 5, var3 = 6),
          .Names = c("var1", "var2", "var3")),
        row = structure(list(var1 = 7, var2 = 8, var3 = 9),
          .Names = c("var1", "var2", "var3"))),
        .Names = c("row", "row", "row"))),
      .Names = c("id", "var1", "var3", "section1", "section2"))
    > str(test)
    List of 5
     $ id      : num 1
     $ var1    : num 2
     $ var3    : num 4
     $ section1:List of 3
      ..$ var1: num 1
      ..$ var2: num 2
      ..$ var3: num 3
     $ section2:List of 3
      ..$ row:List of 3
      .. ..$ var1: num 1
      .. ..$ var2: num 2
      .. ..$ var3: num 3
      ..$ row:List of 3
      .. ..$ var1: num 4
      .. ..$ var2: num 5
      .. ..$ var3: num 6
      ..$ row:List of 3
      .. ..$ var1: num 7
      .. ..$ var2: num 8
      .. ..$ var3: num 9

Note that the section2 list contains several elements, all named row. So I have a nested list in which some items sit at the root level and others are multiple nested entries for the same observation. I would like to get the following result in data.frame format:

    > desired
      id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
    1  1    2    4             1             2             3             1             4             7
    2 NA   NA   NA            NA            NA            NA             2             5             8
    3 NA   NA   NA            NA            NA            NA             3             6             9

Root level elements must fill the first row, and row elements must have their own rows. As an added complication, the number of variables in row entries can vary.

4 answers

Here is a general approach. It does not assume that you will have only three rows; it works with any number of rows. And if a value is missing from the nested structure (for example, var1 does not exist for some sub-lists in section2), the code correctly returns NA for that cell.

For example, suppose we use the following data:

    test <- structure(list(id = 1, var1 = 2, var3 = 4,
      section1 = structure(list(var1 = 1, var2 = 2, var3 = 3),
        .Names = c("var1", "var2", "var3")),
      section2 = structure(list(
        row = structure(list(var1 = 1, var2 = 2), .Names = c("var1", "var2")),
        row = structure(list(var1 = 4, var2 = 5), .Names = c("var1", "var2")),
        row = structure(list(var2 = 8, var3 = 9), .Names = c("var2", "var3"))),
        .Names = c("row", "row", "row"))),
      .Names = c("id", "var1", "var3", "section1", "section2"))

The general approach is to use melt to create a data frame that records the nesting structure, and then dcast to reshape it into the desired format.

    library("reshape2")

    flat <- unlist(test, recursive = FALSE)

    ## keep track of rows by adding an ID (section2.row becomes section2.var1, section2.var2, ...)
    row_idx <- grep("row", names(flat))
    names(flat)[row_idx] <- gsub("row", "var", paste0(names(flat)[row_idx], seq_along(row_idx)))

    ul <- melt(unlist(flat))

    ## split the names into their component parts
    split <- strsplit(rownames(ul), split = ".", fixed = TRUE)
    max <- max(unlist(lapply(split, FUN = length)))
    pad <- function(a) {
      c(a, rep(NA, max - length(a)))
    }

    ## get the nesting structure
    levels <- matrix(unlist(lapply(split, FUN = pad)), ncol = max, byrow = TRUE)
    nested <- data.frame(levels, ul)
    nested$X3[is.na(nested$X3)] <- levels(as.factor(nested$X3))[[1]]

    desired <- dcast(nested, X3 ~ X1 + X2)
    names(desired) <- gsub("_", "\\.", gsub("_NA", "", names(desired)))
    desired <- desired[, names(flat)]

    ## output when run on the original test from the question
    > desired
    ##   id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
    ## 1  1    2    4             1             2             3             1             4             7
    ## 2 NA   NA   NA            NA            NA            NA             2             5             8
    ## 3 NA   NA   NA            NA            NA            NA             3             6             9
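To see why the row entries need an ID appended to their names, it may help to inspect what the one-level flattening step produces. The sketch below uses base R only and rebuilds the question's list with plain list() calls, which is an assumption made to keep the example self-contained.

```r
# Compact rebuild of the question's list (plain list() calls instead of structure())
test <- list(
  id = 1, var1 = 2, var3 = 4,
  section1 = list(var1 = 1, var2 = 2, var3 = 3),
  section2 = list(
    row = list(var1 = 1, var2 = 2, var3 = 3),
    row = list(var1 = 4, var2 = 5, var3 = 6),
    row = list(var1 = 7, var2 = 8, var3 = 9)
  )
)

# Flatten one level only: the values of section1 are promoted to the top level,
# while each row of section2 stays a list
flat <- unlist(test, recursive = FALSE)

names(flat)             # the three rows all share the name "section2.row"
sapply(flat, is.list)   # TRUE only for the three row entries
```

Because the three row entries end up with identical names, they cannot be told apart later unless a counter is added, which is what the renaming step in the answer does.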

The central idea of this solution is to flatten all sub-lists except those named "row". This can be done by creating a unique identifier for each element of the list (stored in z), and then requiring that all elements in the same row share the same identifier (stored in z2; this needs a recursive function to traverse the nested list). Then z2 can be used to group the elements that belong to the same row. The resulting list can be converted to matrix form using stri_list2matrix from the stringi package, and then converted to a data frame.

    utest <- unlist(test)
    z <- relist(seq_along(utest), test)

    recurse <- function(L) {
      if (!is.list(L)) return(L)
      b <- names(L) == 'row'
      Lb <- lapply(L[b], function(k) relist(rep(k[[1]], length(k)), k))
      L.nb <- lapply(L[!b], recurse)
      c(Lb, L.nb)
    }
    z2 <- unlist(recurse(z))

    library(stringi)
    desired <- as.data.frame(stri_list2matrix(split(utest, z2)))
    names(desired) <- names(z2)[unique(z2)]
    desired
    #     id var1 var3 section1.var1 section1.var2 section1.var3 section2.row.var1
    # 1    1    2    4             1             2             3                 1
    # 2 <NA> <NA> <NA>          <NA>          <NA>          <NA>                 2
    # 3 <NA> <NA> <NA>          <NA>          <NA>          <NA>                 3
    #   section2.row.var1 section2.row.var1
    # 1                 4                 7
    # 2                 5                 8
    # 3                 6                 9
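The identifier step may be easier to follow on a tiny input. The sketch below uses a hypothetical miniature list (not the question's data) to show how relist puts the positions of the flattened values back into the original nested shape.

```r
# Hypothetical miniature list, used only for illustration
x <- list(id = 1, section = list(a = 10, b = 20))

ux  <- unlist(x)                  # named vector with names id, section.a, section.b
ids <- relist(seq_along(ux), x)   # same nested shape as x, holding the positions 1, 2, 3

str(ids)
```

Each leaf of ids now holds the position of the matching value in ux, so overwriting the leaves of one row with a single shared number is enough to mark them as belonging together.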

Your problem is not well defined when the rows are complex structures (i.e., if each row in test itself contains a list like test, how should the rows be bound together? And what if rows in the same table have different structures?), so the following solution relies on each row being a list of values.

I therefore assume that, in the general case, your test list contains values, lists of values, or lists of rows (where rows are lists of values). The solution also still works if the rows are not always named "row".

    temp <- lapply(test, function(x) {
      if (!is.list(x))
        # x is a value
        return(x)
      # x is a list of rows or values
      out <- do.call(cbind, x)
      if (nrow(out) > 1) {
        # x is a list of rows
        colnames(out) <- paste0(colnames(out), '.', rownames(out))
        rownames(out) <- rep_len(NA, nrow(out))
      }
      return(out)
    })

    # a function that extends a matrix to a fixed number of rows (N)
    # by appending rows of NA
    rowExtend <- function(x, N) {
      if ((!is.matrix(x))) {
        out <- do.call(rbind, c(list(x), as.list(rep_len(NA, N - 1))))
        colnames(out) <- ""
        out
      } else if (nrow(x) < N)
        do.call(rbind, c(list(x), as.list(rep_len(NA, N - nrow(x)))))
      else
        x
    }

    # calculate the maximum number of rows
    .nrows <- sapply(temp, nrow)
    .nrows <- max(unlist(.nrows[!sapply(.nrows, is.null)]))

    # extend the shorter rows
    (temp2 <- lapply(temp, rowExtend, .nrows))

    # calculate new column names
    newColNames <- mapply(function(x, y) {
      if (nzchar(y)[1L])
        paste0(x, '.', y)
      else
        x
    }, names(temp2), lapply(temp2, colnames))

    do.call(cbind, mapply(`colnames<-`, temp2, newColNames))
    #> id var1 var3 section1.var1 section1.var2 section1.var3 section2.row.var1 section2.row.var2 section2.row.var3
    #>  1    2    4             1             2             3                 1                 4                 7
    #> NA   NA   NA            NA            NA            NA                 2                 5                 8
    #> NA   NA   NA            NA            NA            NA                 3                 6                 9

This starts like Tiffany's answer, but then diverges a bit.

    library(data.table)

    # flatten the first level
    flat = unlist(test, recursive = FALSE)

    # compute max length
    N = max(sapply(flat, length))

    # pad NA and convert to data.table (at this point it will *look* like the right answer)
    dt = as.data.table(lapply(flat, function(l) c(l, rep(NA, N - length(l)))))

    # but in reality some of the columns are lists - check by running sapply(dt, class)
    # so unlist them
    dt = dt[, lapply(.SD, unlist)]
    #   id var1 var3 section1.var1 section1.var2 section1.var3 section2.row section2.row section2.row
    #1:  1    2    4             1             2             3            1            4            7
    #2: NA   NA   NA            NA            NA            NA            2            5            8
    #3: NA   NA   NA            NA            NA            NA            3            6            9
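The reason some columns end up as lists can be reproduced in base R without data.table. The sketch below pads a hypothetical two-value row (an assumption for illustration, not taken from the question) the same way the answer does and checks the type of the result.

```r
# Hypothetical two-value row that needs padding to length 3
row <- list(var1 = 1, var2 = 2)

# Same padding expression as in the answer: c() applied to a list returns a list
padded <- c(row, rep(NA, 3 - length(row)))

is.list(padded)   # TRUE, so a column built from this would be a list column
unlist(padded)    # collapses the padded list into an atomic vector
```

This is why the final unlist pass over the columns is needed: padding alone keeps the row entries as lists, even though the printed table already looks correct.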

Source: https://habr.com/ru/post/1211953/

