Pivot tables of data from wide data.frames

Question

Pivot tables of data from wide data.frames

I am trying to find lazy / easy ways to create pivot tables / data.frames from wide data.frames . Suppose the following data.frame, but with many columns, so specifying column names takes a lot of time:

 set.seed(2) x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10,6), rep(20,6)), pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2), Var1 = rnorm(12, 5,2), Var2 = c(rnorm(6,4,1), rnorm(6,3,5)), Var3 = rt(12, 20)) x[1:3] <- as.data.frame(apply(x[1:3], 2, function(x) as.factor(x)))

Now I can calculate the final statistics with plyr :

 (mu <- ddply(x, .(Temp, pH), numcolwise(mean))) (std <- ddply(x, .(Temp, pH), numcolwise(sd))) (n <- ddply(x, .(Temp, pH), numcolwise(length)))

But I can’t understand how to apply all these functions at the same time:

 ddply(x, .(Temp, pH), numcolwise(mean, sd, length))

I could, of course, combine the various summary data.tables, but that would not be a "lazy / easy" way to do this. I am looking for something in common that I can apply in many cases. Something like this, except that it can be created with a single function:

 xx <- merge(mu, std, by = c("Temp", "pH"), sotr = F) colnames(xx) <- gsub("x", "mean", colnames(xx)) colnames(xx) <- gsub("y", "sd", colnames(xx)) xx <- merge(xx, n, by = c("Temp", "pH"), sotr = F) colnames(xx)[(ncol(xx)-2):ncol(xx)] <- paste0(colnames(xx)[(ncol(xx)-2):ncol(xx)], ".length") xx <- xx[c("Temp", "pH", grep("Var1", colnames(xx), value = T), grep("Var2", colnames(xx), value = T), grep("Var3", colnames(xx), value = T))] xx Temp pH Var1.mean Var1.sd Var1.length Var2.mean Var2.sd Var2.length Var3.mean Var3.sd Var3.length 1 10 7.6 4.281195 1.352194 3 3.534447 1.652884 3 0.1529616 1.076276 3 2 10 8.1 5.583853 2.491672 3 4.116622 1.478286 3 1.1611944 1.081301 3 3 20 7.6 5.840411 1.120549 3 6.907273 8.628021 3 0.1301949 1.764201 3 4 20 8.1 6.635154 2.232262 3 8.893188 4.208087 3 0.5509202 1.187431 3

Is it possible to do this in R ? Any advice would be greatly appreciated.

+1

r dataframe plyr summary

Mikko Feb 07 '13 at 10:47

source share

2 answers

One way to do this is with reshape2 and plyr . But you get results with variables in rows instead of columns:

 library(reshape2) library(plyr) md <- melt(x[,-1], id.vars=c("Temp","pH")) ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))

What gives:

  Temp pH variable mean sd 1 10 7.6 Var1 4.2811952 1.352194 2 10 7.6 Var2 3.5344474 1.652884 3 10 7.6 Var3 0.1529616 1.076276 4 10 8.1 Var1 5.5838533 2.491672 5 10 8.1 Var2 4.1166215 1.478286 6 10 8.1 Var3 1.1611944 1.081301 7 20 7.6 Var1 5.8404110 1.120549 8 20 7.6 Var2 6.9072734 8.628021 9 20 7.6 Var3 0.1301949 1.764201 10 20 8.1 Var1 6.6351538 2.232262 11 20 8.1 Var2 8.8931884 4.208087 12 20 8.1 Var3 0.5509202 1.187431

If you want your results to be wide format, you can use reshape :

 md <- melt(x[,-1], id.vars=c("Temp","pH")) result <- ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value)) reshape(result, idvar=c("Temp","pH"), timevar="variable",direction="wide") Temp pH mean.Var1 sd.Var1 mean.Var2 sd.Var2 mean.Var3 sd.Var3 1 10 7.6 4.281195 1.352194 3.534447 1.652884 0.1529616 1.076276 4 10 8.1 5.583853 2.491672 4.116622 1.478286 1.1611944 1.081301 7 20 7.6 5.840411 1.120549 6.907273 8.628021 0.1301949 1.764201 10 20 8.1 6.635154 2.232262 8.893188 4.208087 0.5509202 1.187431

+5

juba Feb 07 '13 at 10:55

source share

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-02-07T11:13:41+0000

The R aggregate base can handle this, but in a weird way:

 (temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y)))) # Temp pH Rep.1 Rep.2 Rep.3 Var1.1 Var1.2 Var1.3 Var2.1 Var2.2 Var2.3 # 1 10 7.6 2 1 3 4.281195 1.352194 3.000000 3.534447 1.652884 3.000000 # 2 20 7.6 2 1 3 5.840411 1.120549 3.000000 6.907273 8.628021 3.000000 # 3 10 8.1 2 1 3 5.583853 2.491672 3.000000 4.116622 1.478286 3.000000 # 4 20 8.1 2 1 3 6.635154 2.232262 3.000000 8.893188 4.208087 3.000000 # Var3.1 Var3.2 Var3.3 # 1 0.1529616 1.0762763 3.0000000 # 2 0.1301949 1.7642008 3.0000000 # 3 1.1611944 1.0813007 3.0000000 # 4 0.5509202 1.1874306 3.0000000 str(temp) # 'data.frame': 4 obs. of 6 variables: # $ Temp: Factor w/ 2 levels "10","20": 1 2 1 2 # $ pH : Factor w/ 2 levels "7.6","8.1": 1 1 2 2 # $ Rep : num [1:4, 1:3] 2 2 2 2 1 1 1 1 3 3 ... # $ Var1: num [1:4, 1:3] 4.28 5.84 5.58 6.64 1.35 ... # $ Var2: num [1:4, 1:3] 3.53 6.91 4.12 8.89 1.65 ... # $ Var3: num [1:4, 1:3] 0.153 0.13 1.161 0.551 1.076 ...

Note that when we look at the structure of the output, we find that "Rep", "Var1", etc. are actually matrices. That way you can extract them and cbind them. But this is a little tiring.

I needed to do something similar some time ago, and I ended up just writing a basic wrapper around aggregate that looks like this.

 aggregate2 <- function(data, aggs, ids, funs = NULL, ...) { if (identical(aggs, ".")) aggs <- setdiff(names(data), ids) if (identical(ids, ".")) ids <- setdiff(names(data), aggs) if (is.null(funs)) stop("Aggregation function missing") myformula <- as.formula( paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")), " ~ ", paste(ids, collapse = " + "))) temp <- aggregate( formula = eval(myformula), data = data, FUN = function(x) sapply(seq_along(funs), function(z) eval(call(funs[z], quote(x)))), ...) temp1 <- do.call(cbind, lapply(temp[-c(1:length(ids))], as.data.frame)) names(temp1) <- paste(rep(aggs, each = length(funs)), funs, sep = ".") cbind(temp[1:length(ids)], temp1) }

Here you apply it to the sample data.

 (temp2 <- aggregate2(x, ".", c("Temp", "pH"), c("mean", "sd", "length"))) # Temp pH Rep.mean Rep.sd Rep.length Var1.mean Var1.sd Var1.length Var2.mean # 1 10 7.6 2 1 3 4.281195 1.352194 3 3.534447 # 2 20 7.6 2 1 3 5.840411 1.120549 3 6.907273 # 3 10 8.1 2 1 3 5.583853 2.491672 3 4.116622 # 4 20 8.1 2 1 3 6.635154 2.232262 3 8.893188 # Var2.sd Var2.length Var3.mean Var3.sd Var3.length # 1 1.652884 3 0.1529616 1.076276 3 # 2 8.628021 3 0.1301949 1.764201 3 # 3 1.478286 3 1.1611944 1.081301 3 # 4 4.208087 3 0.5509202 1.187431 3

And, structure is what we expect.

 str(temp2) # 'data.frame': 4 obs. of 14 variables: # $ Temp : Factor w/ 2 levels "10","20": 1 2 1 2 # $ pH : Factor w/ 2 levels "7.6","8.1": 1 1 2 2 # $ Rep.mean : num 2 2 2 2 # $ Rep.sd : num 1 1 1 1 # $ Rep.length : num 3 3 3 3 # $ Var1.mean : num 4.28 5.84 5.58 6.64 # $ Var1.sd : num 1.35 1.12 2.49 2.23 # $ Var1.length: num 3 3 3 3 # $ Var2.mean : num 3.53 6.91 4.12 8.89 # $ Var2.sd : num 1.65 8.63 1.48 4.21 # $ Var2.length: num 3 3 3 3 # $ Var3.mean : num 0.153 0.13 1.161 0.551 # $ Var3.sd : num 1.08 1.76 1.08 1.19 # $ Var3.length: num 3 3 3 3

If you don't want to use this function, this is the part that specifically deals with working with aggregate output, as applied to the temp object that we created at the beginning of this answer:

 temp1 <- do.call(cbind, lapply(temp[-c(1:2)], as.data.frame)) funs <- c("mean", "sd", "length") names(temp1) <- paste(rep(setdiff(names(temp), c("pH", "Temp")), each = length(funs)), funs, sep = ".") cbind(temp[1:2], temp1)

Update: simpler solution

It turns out that you can really do:

 do.call(data.frame, aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))

The downside here is that the names are less descriptive than the aggregate2 function I shared, but this can be fixed with a fairly simple call to names .

Pivot tables of data from wide data.frames

Update: simpler solution

More articles: