Sum of mixed-type data frames depending on a set of conditions in R

This is a more complicated continuation of my previous question. The answer there was to use a matrix, but that does not work for data frames whose columns have different modes (character and numeric).

I want to combine data frames of different sizes, each with a character column and integer columns, and calculate their sum depending on several conditions.

Conditions

  • sums are calculated only across rows that share the same "Name" value
  • sums are calculated only across matching column names
  • if the cell in df4 is neither 0 nor NA, the sum should be df3 + df4
  • otherwise the sum should be df1 + df2 + df3
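
The rule for a single cell can be sketched in base R as follows. This is only an illustration: cell_sum and sum_or_na are hypothetical helper names, and treating a cell that is absent from a data frame as NA (skipped in the sum, with NA returned only when every input is NA) is an assumption inferred from the expected result further down.

```r
# Sum that skips NA, but returns NA when every input is NA (assumed behaviour).
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# d1..d4 are the values of one (Name, column) cell in df1..df4,
# with NA standing in for a cell that does not exist in that data frame.
cell_sum <- function(d1, d2, d3, d4) {
  use_df4 <- !is.na(d4) && d4 != 0
  if (use_df4) sum_or_na(c(d3, d4)) else sum_or_na(c(d1, d2, d3))
}

lee_x1 <- cell_sum(5, NA, 1, 6)     # df4 is 6, so df3 + df4 = 7
ann_x1 <- cell_sum(1, 4, NA, 0)     # df4 is 0, so df1 + df2 + df3 = 5
ken_x4 <- cell_sum(NA, NA, NA, NA)  # nothing to add, so NA
```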

Example

    > df1 <- data.frame(Name=c("Joe","Ann","Lee","Dan"), "1"=c(0,1,5,2), "2"=c(3,1,0,0), "3"=c(2,0,2,2), "4"=c(2,1,3,4))
    > df1
      Name X1 X2 X3 X4
    1  Joe  0  3  2  2
    2  Ann  1  1  0  1
    3  Lee  5  0  2  3
    4  Dan  2  0  2  4

    > df2 <- data.frame(Name=c("Joe","Ann","Ken"), "1"=c(3,4,1), "2"=c(2,3,0), "3"=c(2,4,3))
    > df2
      Name X1 X2 X3
    1  Joe  3  2  2
    2  Ann  4  3  4
    3  Ken  1  0  3

    > df3 <- data.frame(Name=c("Lee","Ben"), "1"=c(1,3), "2"=c(3,4), "3"=c(4,3))
    > df3
      Name X1 X2 X3
    1  Lee  1  3  4
    2  Ben  3  4  3

The condition depends on this frame:

    > df4 <- data.frame(Name=c("Lee","Ann","Dan"), "1"=c(6,0,NA), "2"=c(0,0,4), "3"=c(0,NA,0))
    > df4
      Name X1 X2 X3
    1  Lee  6  0  0
    2  Ann  0  0 NA
    3  Dan NA  4  0

With the examples above, this is the expected result (values marked * depend on df4):

    > dfsum
      Name  X1  X2  X3  X4
    1  Joe   3   5   4   2
    2  Ann   5   4   4   1
    3  Lee   7*  3   6   3
    4  Dan   2   4*  2   4
    5  Ken   1   0   3  NA
    6  Ben   3   4   3  NA

Possible steps?

First, expand df1, df2, df3 and df4 to 5 columns and 6 rows, filling the missing cells with NA.

Then for each data frame:

  • sort the rows by "Name"
  • separate the "Name" column from "X1" ... "X4"
  • convert columns "X1" ... "X4" to a matrix
  • calculate the sums of the matrices, as in the answer to my other question, but with the additional conditions
  • convert the result matrix back into a data frame
  • cbind the "Name" column with the result data frame
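
The steps above can be sketched in base R roughly like this. It is one possible reading rather than a definitive implementation: to_matrix and na_sum are hypothetical helpers, the columns are named X1 ... X4 directly, and NA is assumed to be skipped in a sum unless every input is NA. Rows are indexed by name rather than sorted, which keeps the order of first appearance.

```r
df1 <- data.frame(Name = c("Joe", "Ann", "Lee", "Dan"),
                  X1 = c(0, 1, 5, 2), X2 = c(3, 1, 0, 0),
                  X3 = c(2, 0, 2, 2), X4 = c(2, 1, 3, 4))
df2 <- data.frame(Name = c("Joe", "Ann", "Ken"),
                  X1 = c(3, 4, 1), X2 = c(2, 3, 0), X3 = c(2, 4, 3))
df3 <- data.frame(Name = c("Lee", "Ben"),
                  X1 = c(1, 3), X2 = c(3, 4), X3 = c(4, 3))
df4 <- data.frame(Name = c("Lee", "Ann", "Dan"),
                  X1 = c(6, 0, NA), X2 = c(0, 0, 4), X3 = c(0, NA, 0))

dfs <- list(df1, df2, df3, df4)
all_names <- unique(unlist(lapply(dfs, function(d) as.character(d$Name))))
all_cols <- setdiff(unique(unlist(lapply(dfs, names))), "Name")

# Expand one data frame to a full Name x column matrix, padded with NA.
to_matrix <- function(d) {
  m <- matrix(NA_real_, nrow = length(all_names), ncol = length(all_cols),
              dimnames = list(all_names, all_cols))
  cols <- setdiff(names(d), "Name")
  m[as.character(d$Name), cols] <- as.matrix(d[, cols, drop = FALSE])
  m
}

# Element-wise sum that skips NA; a cell is NA only if it is NA everywhere.
na_sum <- function(...) {
  ms <- list(...)
  zeroed <- lapply(ms, function(m) { m[is.na(m)] <- 0; m })
  total <- Reduce(`+`, zeroed)
  total[Reduce(`&`, lapply(ms, is.na))] <- NA
  total
}

m <- lapply(dfs, to_matrix)
result <- na_sum(m[[1]], m[[2]], m[[3]])   # default: df1 + df2 + df3
use_df4 <- !is.na(m[[4]]) & m[[4]] != 0    # cells where df4 decides
result[use_df4] <- na_sum(m[[3]], m[[4]])[use_df4]

# Back to a data frame with the Name column in front.
dfsum <- data.frame(Name = all_names, result)
rownames(dfsum) <- NULL
dfsum
```

Keeping every table as a matrix with identical dimnames is what makes the condition a single logical index (use_df4) instead of a loop over names and columns.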

How can this be done in R?


Solution

@Ricardo Saporta's solution works with minor changes:

Add padValue=NA as an extra argument to each of the four addCols() calls.

As described here, replace the definitions of sumD3D4 and dtsum with:

    plus <- function(x) {
      if (all(is.na(x))) {
        c(x[0], NA)
      } else {
        sum(x, na.rm = TRUE)
      }
    }

    sumD3D4 <- setkey(rbind(dt3, dt4)[, lapply(.SD, plus), by = Name], "Name")
    dtsum   <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, plus), by = Name], "Name")
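
A small base R check may help show why plus is used in place of sum here. The vectors are made-up inputs for illustration; the point is that sum() with na.rm = TRUE turns an all-NA group into 0, whereas plus keeps it as NA (as needed for the X4 cells of Ken and Ben).

```r
plus <- function(x) {
  if (all(is.na(x))) {
    c(x[0], NA)            # NA of the same type as x
  } else {
    sum(x, na.rm = TRUE)
  }
}

all_missing <- c(NA_real_, NA_real_)
partly_missing <- c(1, NA, 4)

plain_sum_all_na <- sum(all_missing, na.rm = TRUE)  # 0
plus_all_na <- plus(all_missing)                    # NA
plus_partial <- plus(partly_missing)                # 5
```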
1 answer

If you use data.table instead of data.frame, you can use its by= argument to sum by Name. The code below should give you the expected results.

Note that I pad the data.tables with the missing columns. However, condTrue is computed before that padding, so it only refers to cells that actually exist in dt4.
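
As a minimal illustration of the grouping idea, assuming the data.table package is installed, two small made-up tables (a and b, not part of the question) can be stacked with rbind and summed per Name:

```r
library(data.table)

a <- data.table(Name = c("Joe", "Ann"), X1 = c(0, 1), X2 = c(3, 1))
b <- data.table(Name = c("Joe", "Ken"), X1 = c(3, 1), X2 = c(2, 0))

# Stack the rows, then apply sum to every non-grouping column within each Name.
stacked <- rbind(a, b)
totals <- stacked[, lapply(.SD, sum), by = Name]
totals
```

Here .SD stands for the subset of columns other than the grouping column, so lapply(.SD, sum) produces one total per remaining column and group.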

    library(data.table)

    dt1 <- data.table(df1)
    dt2 <- data.table(df2)
    dt3 <- data.table(df3)
    dt4 <- data.table(df4)

    # make sure all dt have the same columns
    #-----------------------------------------#

    # identify which cells of dt4 satisfy the condition
    condTrue <- as.data.table(which(!(is.na(dt4) | dt4==0), arr.ind=TRUE))

    # ignore column "Name" from dt4
    condTrue <- condTrue[col>1]

    # convert from (row, col) index to ("Name", columnName)
    condTrue <- data.table(Name=dt4[condTrue$row, Name], colm=names(dt4)[condTrue$col], key="Name")

    # First make a list of all the unique column names
    allColumnNames <- unique(c(names(dt1), names(dt2), names(dt3), names(dt4)))

    # add columns as necessary, using addCols
    # (defined below; run that definition first)
    addCols(dt1, allColumnNames)
    addCols(dt2, allColumnNames)
    addCols(dt3, allColumnNames)
    addCols(dt4, allColumnNames)

    sumD3D4 <- setkey(rbind(dt3, dt4)[, lapply(.SD, sum), by=Name], "Name")
    dtsum   <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, sum), by=Name], "Name")

    # overwrite the cells flagged in condTrue with df3 + df4
    for (Nam in condTrue$Name) {
      colsRepl <- condTrue[.(Nam)]$colm
      valsRepl <- unlist(sumD3D4[.(Nam), c(colsRepl), with=FALSE])
      dtsum[.(Nam), c(colsRepl) := as.list(valsRepl)]
    }

    dtsum
    #    Name 1 2 3 4
    # 1:  Ann 5 4 4 1
    # 2:  Ben 3 4 3 0
    # 3:  Dan 2 4 2 4
    # 4:  Joe 3 5 4 2
    # 5:  Ken 1 0 3 0
    # 6:  Lee 7 3 6 3

    addCols <- function(x, cols, padValue=0) {
      # adds to x any columns that are in cols but not in x
      # Returns TRUE if columns were added
      #         FALSE if no columns added

      colsMissing <- setdiff(cols, names(x))

      # grab the actual DT name that was passed to function
      dtName <- as.character(match.call()[2])

      if (length(colsMissing)) {
        get(dtName, envir=parent.frame(1))[, c(colsMissing) := padValue]
        return(TRUE)
      }
      return(FALSE)
    }
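
Because := modifies a data.table by reference, the lookup by name through match.call() and get() may not be needed for tables created with data.table(). The sketch below is a hypothetical variant, not the answer's function: add_missing_cols and pad_value are illustrative names, and it returns the table invisibly instead of TRUE/FALSE.

```r
library(data.table)

# Hypothetical variant of addCols: pads dt in place with any columns
# listed in cols that it does not have yet, and returns dt invisibly.
add_missing_cols <- function(dt, cols, pad_value = NA_real_) {
  missing_cols <- setdiff(cols, names(dt))
  if (length(missing_cols) > 0) {
    dt[, (missing_cols) := pad_value]  # := adds the columns by reference
  }
  invisible(dt)
}

dt <- data.table(Name = c("Lee", "Ben"), X1 = c(1, 3))
add_missing_cols(dt, c("Name", "X1", "X2", "X4"))
names(dt)
```

Defaulting pad_value to NA matches the padValue=NA change suggested in the question's update, so that cells with no data stay NA rather than becoming 0.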