An elegant way to solve a ddply problem with aggregate (hoping for better performance)

I would like to aggregate a data.frame over an identifier variable called ensg. The data frame looks like this:

    chromosome probeset               ensg symbol   XXA_00   XXA_36   XXB_00
  1          X  4938842 ENSMUSG00000000003   Pbsn 4.796123 4.737717 5.326664

I want to calculate the average value of each numeric column over rows with the same ensg value. The catch is that I would like to keep the remaining annotation variables, chromosome and symbol, intact, since they are the same for rows with the same ensg.

In the end, I would like a data.frame with the identifier columns chromosome, ensg, and symbol, and the numeric columns averaged over rows with the same identifier. I implemented this with ddply, but it is very slow compared to aggregate:

  spec.mean <- function(eset.piece) {
    cbind(eset.piece[1, -numeric.columns],
          t(colMeans(eset.piece[, numeric.columns])))
  }
  mean.eset <- ddply(eset.consensus.grand, .(ensg), spec.mean, .progress = "tk")

My first aggregate implementation looks like this:

  mean.eset <- aggregate(eset[, numeric.columns],
                         by = list(eset$ensg),
                         FUN = mean, na.rm = TRUE)

It is much faster, but the problem with aggregate is that I then have to re-bind the describing variables. I haven't figured out how to use my custom function with aggregate, since aggregate does not pass data frames to the function, only vectors.
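For concreteness, the re-binding I have in mind looks something like this (a sketch only; numeric.columns is the index vector from the code above, and the annotation column names are taken from the example data frame):

  # Average the numeric columns per ensg, then merge the annotation
  # columns back in
  means <- aggregate(eset[, numeric.columns],
                     by = list(ensg = eset$ensg),
                     FUN = mean, na.rm = TRUE)

  # One annotation row per ensg (chromosome and symbol are constant
  # within an identifier)
  annot <- unique(eset[, c("ensg", "chromosome", "symbol")])

  mean.eset <- merge(annot, means, by = "ensg")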

Is there an elegant way to do this with aggregate? Or a faster way to do it with ddply?

2 answers

First we define a toy example:

  df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
                   probeset   = gl(3, 10, labels = c('X', 'Y', 'Z')),
                   ensg       = gl(3, 10, labels = c('E1', 'E2', 'E3')),
                   symbol     = gl(3, 10, labels = c('S1', 'S2', 'S3')),
                   XXA_00 = rnorm(30),
                   XXA_36 = rnorm(30),
                   XXB_00 = rnorm(30))

And then we use aggregate with the formula interface, grouping on all three annotation columns at once; since chromosome and symbol are constant within each ensg, this adds no extra groups and keeps them in the result:

  df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
                   data = df, FUN = mean)
  > df1
    ensg chromosome symbol      XXA_00      XXA_36      XXB_00
  1   E1          A     S1 -0.02533499 -0.06150447 -0.01234508
  2   E2          B     S2 -0.25165987  0.02494902 -0.01116426
  3   E3          C     S3  0.09454154 -0.48468517 -0.25644569

If speed is a primary concern, you should take a look at the data.table package. When the number of rows or grouping variables is large, data.table really shines. The package's wiki is here and has links to other good introductory documents.

Here's how you would do this aggregation using data.table:

  library(data.table)

  # Turn the data.frame above into a data.table
  dt <- data.table(df)

  # Aggregation; .Internal(mean(...)) calls the internal mean directly,
  # skipping S3 dispatch for speed
  dt[, list(XXA_00 = .Internal(mean(XXA_00)),
            XXA_36 = .Internal(mean(XXA_36)),
            XXB_00 = .Internal(mean(XXB_00))),
     by = c("ensg", "chromosome", "symbol")]

This gives us:

       ensg chromosome symbol      XXA_00      XXA_36    XXB_00
  [1,]   E1          A     S1  0.18026869  0.13118997 0.6558433
  [2,]   E2          B     S2 -0.48830539  0.24235537 0.5971377
  [3,]   E3          C     S3 -0.04786984 -0.03139901 0.5618208
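As an aside, .Internal(mean(...)) is only a trick to avoid the dispatch overhead of the mean() generic. An equivalent, more generic formulation (a sketch, assuming a reasonably current data.table version) uses .SD with .SDcols, so the numeric columns don't have to be written out one by one:

  # Apply mean to every column named in .SDcols, within each group
  dt[, lapply(.SD, mean),
     by = c("ensg", "chromosome", "symbol"),
     .SDcols = c("XXA_00", "XXA_36", "XXB_00")]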

Comparing the two with the rbenchmark package, the aggregate solution presented above holds up quite well on the 30-row data.frame. However, when the data.frame contains 3e5 rows, data.table emerges as the clear winner. Here's the output:

  benchmark(fag(), fdt(), replications = 10)

      test replications elapsed  relative user.self sys.self
  1  fag()           10   12.71 23.98113     12.40     0.31
  2  fdt()           10    0.53  1.00000      0.48     0.05
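The fag() and fdt() wrappers are not shown in the answer; they presumably just wrap the two calls above, applied to the enlarged data. A hypothetical reconstruction:

  library(rbenchmark)

  # Hypothetical wrappers, assuming df has been grown to 3e5 rows and
  # dt <- data.table(df) has been rebuilt from it
  fag <- function() aggregate(cbind(XXA_00, XXA_36, XXB_00) ~
                                ensg + chromosome + symbol,
                              data = df, FUN = mean)
  fdt <- function() dt[, list(XXA_00 = .Internal(mean(XXA_00)),
                              XXA_36 = .Internal(mean(XXA_36)),
                              XXB_00 = .Internal(mean(XXB_00))),
                       by = c("ensg", "chromosome", "symbol")]

  benchmark(fag(), fdt(), replications = 10)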
