Sum of hybrid data frames depending on multiple conditions in R

Question

This is a more complex follow-up to my previous question. The answer there was to use a matrix, but that does not work for data frames whose columns have different modes.

I want to combine data frames of different sizes that contain character and integer columns, and calculate their sum subject to several conditions.

Conditions

sums are calculated only for rows that have a matching "Name" value
sums are calculated only for matching column names
if the cell in df4 is neither 0 nor NA, the sum should be df3 + df4
otherwise the sum should be df1 + df2 + df3
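
The rule for a single cell can be sketched as below. This is only an illustration of the conditions as I read them: `cell_sum` is a hypothetical helper, and it assumes that a value missing from a frame is represented as NA and skipped in the sum.

```r
# Hypothetical helper: combine one cell from each frame under the stated rule.
# d1..d4 are the values of the same (Name, column) cell in df1..df4,
# with NA standing for "this frame has no such cell" (an assumption).
cell_sum <- function(d1, d2, d3, d4) {
  if (!is.na(d4) && d4 != 0) {
    # df4 is neither 0 nor NA: use df3 + df4
    sum(d3, d4, na.rm = TRUE)
  } else {
    # otherwise: df1 + df2 + df3, staying NA when nothing is available
    vals <- c(d1, d2, d3)
    if (all(is.na(vals))) NA_real_ else sum(vals, na.rm = TRUE)
  }
}

cell_sum(5, NA, 1, 6)   # Lee, X1: df4 is 6, so df3 + df4
cell_sum(1, 4, NA, 0)   # Ann, X1: df4 is 0, so df1 + df2 + df3
cell_sum(0, NA, NA, 4)  # Dan, X2: df4 is 4 and df3 has no Dan row
```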

Example

 > df1 <- data.frame(Name=c("Joe","Ann","Lee","Dan"), "1"=c(0,1,5,2), "2"=c(3,1,0,0), "3"=c(2,0,2,2), "4"=c(2,1,3,4))
 > df1
   Name X1 X2 X3 X4
 1  Joe  0  3  2  2
 2  Ann  1  1  0  1
 3  Lee  5  0  2  3
 4  Dan  2  0  2  4

 > df2 <- data.frame(Name=c("Joe","Ann","Ken"), "1"=c(3,4,1), "2"=c(2,3,0), "3"=c(2,4,3))
 > df2
   Name X1 X2 X3
 1  Joe  3  2  2
 2  Ann  4  3  4
 3  Ken  1  0  3

 > df3 <- data.frame(Name=c("Lee","Ben"), "1"=c(1,3), "2"=c(3,4), "3"=c(4,3))
 > df3
   Name X1 X2 X3
 1  Lee  1  3  4
 2  Ben  3  4  3

The condition depends on this frame:

 > df4 <- data.frame(Name=c("Lee","Ann","Dan"), "1"=c(6,0,NA), "2"=c(0,0,4), "3"=c(0,NA,0))
 > df4
   Name X1 X2 X3
 1  Lee  6  0  0
 2  Ann  0  0 NA
 3  Dan NA  4  0

With the examples above, this is the expected result (* values depend on df4):

 > dfsum
   Name X1 X2 X3 X4
 1  Joe  3  5  4  2
 2  Ann  5  4  4  1
 3  Lee 7*  3  6  3
 4  Dan  2 4*  2  4
 5  Ken  1  0  3 NA
 6  Ben  3  4  3 NA

Possible steps?

First expand df1, df2, df3 and df4 to 5 columns and 6 rows, filling the missing cells with NA.

Then for each data frame:

sort the rows by "Name"
separate the "Name" column from "X1" ... "X4"
convert columns "X1" ... "X4" to a matrix
calculate the sums of the matrices, as in the answer to my other question, but with the additional condition on df4
convert the result matrix into a data frame
cbind the "Name" column with the result data frame
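
The steps above can be sketched in base R roughly as follows. This is a minimal sketch, not a definitive implementation: `to_matrix` and `na_sum` are hypothetical helper names, the columns are written as X1 ... X4 directly, and an all-NA cell is assumed to stay NA in the result.

```r
df1 <- data.frame(Name = c("Joe", "Ann", "Lee", "Dan"),
                  X1 = c(0, 1, 5, 2), X2 = c(3, 1, 0, 0),
                  X3 = c(2, 0, 2, 2), X4 = c(2, 1, 3, 4))
df2 <- data.frame(Name = c("Joe", "Ann", "Ken"),
                  X1 = c(3, 4, 1), X2 = c(2, 3, 0), X3 = c(2, 4, 3))
df3 <- data.frame(Name = c("Lee", "Ben"),
                  X1 = c(1, 3), X2 = c(3, 4), X3 = c(4, 3))
df4 <- data.frame(Name = c("Lee", "Ann", "Dan"),
                  X1 = c(6, 0, NA), X2 = c(0, 0, 4), X3 = c(0, NA, 0))

dfs <- list(df1, df2, df3, df4)
all_names <- unique(unlist(lapply(dfs, function(d) as.character(d$Name))))
all_cols  <- setdiff(unique(unlist(lapply(dfs, names))), "Name")

# Hypothetical helper: expand one data frame to the full Name x column grid,
# leaving NA wherever the frame has no value.
to_matrix <- function(d) {
  m <- matrix(NA_real_, nrow = length(all_names), ncol = length(all_cols),
              dimnames = list(all_names, all_cols))
  vals <- as.matrix(d[, setdiff(names(d), "Name"), drop = FALSE])
  m[as.character(d$Name), colnames(vals)] <- vals
  m
}

# Hypothetical helper: element-wise sum that skips NA but keeps a cell NA
# when every input is NA there.
na_sum <- function(...) {
  ms <- list(...)
  present <- Reduce(`+`, lapply(ms, function(m) !is.na(m)))
  total   <- Reduce(`+`, lapply(ms, function(m) replace(m, is.na(m), 0)))
  total[present == 0] <- NA
  total
}

m1 <- to_matrix(df1); m2 <- to_matrix(df2)
m3 <- to_matrix(df3); m4 <- to_matrix(df4)

total   <- na_sum(m1, m2, m3)        # default: df1 + df2 + df3
alt     <- na_sum(m3, m4)            # used where df4 is neither 0 nor NA
use_alt <- !is.na(m4) & m4 != 0
total[use_alt] <- alt[use_alt]

dfsum <- data.frame(Name = rownames(total), total)
rownames(dfsum) <- NULL
dfsum
```

Rows come out in order of first appearance (Joe, Ann, Lee, Dan, Ken, Ben) rather than sorted, which happens to match the expected result in the example.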

How can this be done in R?

Solution

@Ricardo Saporta's solution works with minor changes:

Add padValue=NA to each of the four addCols() calls.

As suggested here, replace the definitions of sumD3D4 and dtsum with:

 plus <- function(x) {
   if (all(is.na(x))) {
     c(x[0], NA)
   } else {
     sum(x, na.rm = TRUE)
   }
 }
 sumD3D4 <- setkey(rbind(dt3, dt4)[, lapply(.SD, plus), by = Name], "Name")
 dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, plus), by = Name], "Name")
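
A short illustration of why `plus` is used in place of `sum` once the padding value is NA: `sum(..., na.rm = TRUE)` returns 0 for an all-NA group, which would turn a missing cell into a 0. The snippet repeats the definition of `plus` so that it runs on its own.

```r
plus <- function(x) {
  # keep NA (of the same type as x) when there is nothing to add up
  if (all(is.na(x))) c(x[0], NA) else sum(x, na.rm = TRUE)
}

all_missing <- c(NA_real_, NA_real_)
partly_missing <- c(1, NA, 3)

sum(all_missing, na.rm = TRUE)  # 0: the missing cell silently becomes zero
plus(all_missing)               # NA: the cell stays missing
plus(partly_missing)            # 4: NA values are skipped
```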

r dataframe

R-obert Feb 23 '13 at 18:59

1 answer

Ricardo Saporta · Accepted Answer · 2013-02-23T21:09:55+0000

If you use data.table instead of data.frame, you can use its by= argument to aggregate by name. The code below should give you the expected results.

Note that I pad the data.tables with additional empty columns so that they all share the same columns. However, condTrue is computed before that padding.
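
As a small illustration of the by= grouping on its own, assuming the data.table package is installed, two toy tables (hypothetical names `a` and `b`) that already share the same columns can be stacked and summed per name like this:

```r
library(data.table)

# Toy tables for illustration only; both already have the same columns.
a <- data.table(Name = c("Joe", "Ann"), X1 = c(0, 1), X2 = c(3, 1))
b <- data.table(Name = c("Joe", "Ken"), X1 = c(3, 1), X2 = c(2, 0))

# Stack the rows, then sum every non-grouping column within each Name.
stacked <- rbind(a, b)
totals <- stacked[, lapply(.SD, sum), by = Name]

# Keying by Name sorts the result and allows lookups such as totals[.("Joe")].
setkey(totals, "Name")
totals
```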

 library(data.table)

 dt1 <- data.table(df1)
 dt2 <- data.table(df2)
 dt3 <- data.table(df3)
 dt4 <- data.table(df4)

 # make sure all dt have the same columns
 #-----------------------------------------#

 # identify which cells of dt4 satisfy the condition
 condTrue <- as.data.table(which(!(is.na(dt4) | dt4==0), arr.ind=TRUE))

 # ignore column "Name" from dt4
 condTrue <- condTrue[col>1]

 # convert from (row, col) index to ("Name", columnName)
 condTrue <- data.table(Name=dt4[condTrue$row, Name], colm=names(dt4)[condTrue$col], key="Name")

 # First make a list of all the unique column names
 allColumnNames <- unique(c(names(dt1), names(dt2), names(dt3), names(dt4)))

 # add columns as necessary, using addCols (defined below; define it before running this part)
 addCols(dt1, allColumnNames)
 addCols(dt2, allColumnNames)
 addCols(dt3, allColumnNames)
 addCols(dt4, allColumnNames)

 sumD3D4 <- setkey(rbind(dt3, dt4)[, lapply(.SD, sum), by=Name], "Name")
 dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, sum), by=Name], "Name")

 for (Nam in condTrue$Name) {
   colsRepl <- condTrue[.(Nam)]$colm
   valsRepl <- unlist(sumD3D4[.(Nam), c(colsRepl), with=FALSE])
   dtsum[.(Nam), c(colsRepl) := as.list(valsRepl)]
 }

 dtsum
 #    Name 1 2 3 4
 # 1:  Ann 5 4 4 1
 # 2:  Ben 3 4 3 0
 # 3:  Dan 2 4 2 4
 # 4:  Joe 3 5 4 2
 # 5:  Ken 1 0 3 0
 # 6:  Lee 7 3 6 3

 addCols <- function(x, cols, padValue=0) {
   # adds to x any columns that are in cols but not in x
   # Returns TRUE if columns were added
   #         FALSE if no columns added

   colsMissing <- setdiff(cols, names(x))

   # grab the actual DT name that was passed to function
   dtName <- as.character(match.call()[2])

   if (length(colsMissing)) {
     get(dtName, envir=parent.frame(1))[, c(colsMissing) := padValue]
     return(TRUE)
   }
   return(FALSE)
 }
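
The condTrue step in the code above relies on which(..., arr.ind = TRUE), which returns the (row, col) positions of the TRUE cells. A base R sketch of the same idea on a plain matrix holding the numeric part of df4 (the names `m4`, `hits` and `cond` are only for illustration):

```r
# Numeric part of df4 as a matrix, filled column by column.
m4 <- matrix(c(6, 0, NA,   # X1
               0, 0, 4,    # X2
               0, NA, 0),  # X3
             nrow = 3,
             dimnames = list(c("Lee", "Ann", "Dan"), c("X1", "X2", "X3")))

# Positions of the cells that are neither NA nor 0.
hits <- which(!is.na(m4) & m4 != 0, arr.ind = TRUE)

# Translate (row, col) indices into (Name, column name) pairs.
cond <- data.frame(Name = rownames(m4)[hits[, "row"]],
                   colm = colnames(m4)[hits[, "col"]])
cond
```

For the example data this yields the two cells marked with * in the expected result: (Lee, X1) and (Dan, X2).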
