Bind data.frames in R without making copies

I have a large list of data.frames that need to be bound pairwise by columns and then by rows before being fed into a predictive model. Since no values will be modified, I would like the final data.frame to point to the original data.frames in my list rather than copy them.

For example:

    library(pryr)

    # individual data frames
    df1 <- data.frame(a=1:1e6+0, b=1:1e6+1)
    df2 <- data.frame(a=1:1e6+2, b=1:1e6+3)
    df3 <- data.frame(a=1:1e6+4, b=1:1e6+5)

    # each occupies 16 MB
    object_size(df1) # 16 MB
    object_size(df2) # 16 MB
    object_size(df3) # 16 MB
    object_size(df1, df2, df3) # 48 MB

    # will be in a named list
    dfs <- list(df1=df1, df2=df2, df3=df3)

    # putting them into a list doesn't create a copy
    object_size(df1, df2, df3, dfs) # 48 MB

The final data.frame will have this layout (each unique pair of data frames bound by columns, then the pairs bound by rows):

    df1, df2
    df1, df3
    df2, df3

This is how I currently implement it:

    # generate unique df combinations
    df_names <- names(dfs)
    pairs <- combn(df_names, 2, simplify=FALSE)

    # bind dfs by columns
    combo_dfs <- lapply(pairs, function(x) cbind(dfs[[x[1]]], dfs[[x[2]]]))

    # no copies created yet
    object_size(dfs, combo_dfs) # 48 MB

    # bind dfs by rows
    combo_df <- do.call(rbind, combo_dfs)

    # now the data gets copied
    object_size(combo_df) # 96 MB
    object_size(dfs, combo_df) # 144 MB

How can I avoid copying my data, but still achieve the same end result?

1 answer

The savings you are hoping for would require R to compress the data in the data frame. I do not believe that data frames support compression.
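To see why the row-bound result is twice the size of the inputs, here is a rough back-of-the-envelope sketch. It assumes every column is a double vector at 8 bytes per value and ignores the small per-object overhead; the variable names are only for illustration.

```r
# Sizes taken from the question's example
n_dfs  <- 3     # data frames in the list
n_rows <- 1e6   # rows per data frame
n_cols <- 2     # columns per data frame
bytes_per_double <- 8

# Number of unique pairs produced by combn(df_names, 2)
n_pairs <- choose(n_dfs, 2)

# Memory held by the original data frames
original_bytes <- n_dfs * n_rows * n_cols * bytes_per_double

# Each pair is n_rows long and 2 * n_cols wide after cbind;
# rbind stacks all pairs into new, contiguous column vectors
combined_bytes <- n_pairs * n_rows * (2 * n_cols) * bytes_per_double

original_bytes / 1e6   # 48 (MB)
combined_bytes / 1e6   # 96 (MB)
```

Each data frame appears in two of the three pairs, so the stacked result holds every value twice. Because each column of a data frame is a single vector, the stacked columns cannot simply point at the three originals.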

If your motivation for storing the data this way is that it is hard to fit in memory, you could try the ff package. This would allow you to store it more compactly on disk. It seems that the ffdf class has the properties you need:

By default, creating an ffdf object will NOT create new ff files; existing files are referenced instead. This differs from data.frame, which always creates copies of the input objects, most notably in data.frame(matrix()), where the input matrix is converted to separate columns. ffdf, by contrast, will store the input matrix physically as the same matrix and virtually map it to columns.

In addition, the ff package is optimized for fast access.

Note that I have not used this package myself, so I cannot guarantee that it will solve your problem.
