What is the ggplot2 / plyr method for calculating statistical tests between two subgroups?

Question


I am a fairly novice R user and have come to appreciate the elegance of ggplot2 and plyr. I am now trying to analyze a large dataset that I cannot share here, so I have reconstructed my problem with the diamonds dataset (subsetted for brevity and convenience). Without further ado:

    library(ggplot2)
    diam <- diamonds[diamonds$cut == "Fair" | diamonds$cut == "Ideal", ]
    boxplots <- ggplot(diam, aes(x = cut, y = price)) +
        geom_boxplot(aes(fill = cut)) +
        facet_wrap(~ color)
    print(boxplots)

The plot shows a set of boxplots comparing the price of the two cuts, "Fair" and "Ideal", with one panel per color.

Now I would very much like to go on to a statistical comparison of the two cuts within each color subgroup (D, E, F, ..., J), using either t.test or wilcox.test.

How can I implement this as elegantly as the ggplot2 syntax? I assume that I will use ddply from the plyr package, but I could not figure out how to pass two subgroups to a function that calculates the relevant statistics.

+8

r ggplot2 plyr

Michael Sep 26 '12 at 14:27


2 answers

ddply returns a data frame as output and, assuming I am reading your question correctly, that is not what you are looking for. I believe you want to run a series of t-tests on a series of subsets of the data, so the only real task is to compile a list of those subsets. Once you have them, you can use lapply() to run a t-test on each subset in the list. I'm sure this is not the most elegant solution, but one way is to build a list of the unique pairs of your colors with the following function:

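    get.pairs <- function(v) {
        l <- length(v)
        n <- choose(l, 2)          # number of unique pairs
        a <- vector("list", n)
        j <- 1                     # index of the first element of the pair
        k <- 2                     # index of the second element of the pair
        for (i in seq_len(n)) {
            a[[i]] <- c(v[j], v[k])
            if (k < l) {
                k <- k + 1
            } else {
                j <- j + 1
                k <- j + 1
            }
        }
        return(a)
    }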

Now you can use this function to get a list of unique color pairs:

    > (color.pairs <- get.pairs(levels(diam$color)))
    [[1]]
    [1] "D" "E"

    [[2]]
    [1] "D" "F"

    ...

    [[21]]
    [1] "I" "J"
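As a cross-check, base R's combn() can build the same list of pairs without a hand-written loop. This is a minimal sketch, not part of the original answer; the vector color.levels is a stand-in (an assumption) for levels(diam$color) so that the snippet runs without loading ggplot2.

```r
# Stand-in for levels(diam$color); assumed here so the snippet is self-contained.
color.levels <- c("D", "E", "F", "G", "H", "I", "J")

# combn() enumerates all 2-element combinations; simplify = FALSE returns a
# list of character vectors instead of a matrix.
color.pairs <- combn(color.levels, 2, simplify = FALSE)

length(color.pairs)   # number of unique pairs
color.pairs[[1]]      # first pair
```

The pairs come out in the same order as the loop-based function produces them, so either list can be passed to the later steps.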

Now you can use each of these pairs to run t.test (or any other test) on the corresponding subset of your data frame, for example:

    > t.test(price ~ cut, data = diam[diam$color %in% color.pairs[[1]], ])

        Welch Two Sample t-test

    data:  price by cut
    t = 8.1594, df = 427.272, p-value = 3.801e-15
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     1008.014 1647.768
    sample estimates:
     mean in group Fair mean in group Ideal
               3938.711            2610.820

Now use lapply() to run the test for each subset in the list of color pairs:

    > lapply(color.pairs, function(x) t.test(price ~ cut, data = diam[diam$color %in% x, ]))
    [[1]]

        Welch Two Sample t-test

    data:  price by cut
    t = 8.1594, df = 427.272, p-value = 3.801e-15
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     1008.014 1647.768
    sample estimates:
     mean in group Fair mean in group Ideal
               3938.711            2610.820

    ...

    [[21]]

        Welch Two Sample t-test

    data:  price by cut
    t = 0.8813, df = 375.996, p-value = 0.3787
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -260.0170  682.3882
    sample estimates:
     mean in group Fair mean in group Ideal
               4802.912            4591.726
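The printed list is long, so it may help to pull the key numbers out of each test object into one data frame. The following is a rough sketch under a few assumptions: combn() stands in for get.pairs() so the block runs on its own, and the names tests and pair.summary are made up for illustration.

```r
library(ggplot2)  # only needed for the diamonds data

# Same subset as in the question: keep the "Fair" and "Ideal" cuts.
diam <- diamonds[diamonds$cut %in% c("Fair", "Ideal"), ]

# All unique pairs of colors (stand-in for get.pairs(levels(diam$color))).
color.pairs <- combn(levels(diam$color), 2, simplify = FALSE)

# One Welch t-test of price by cut per pair of colors.
tests <- lapply(color.pairs, function(x) {
  t.test(price ~ cut, data = diam[diam$color %in% x, ])
})

# Extract the statistic and p-value from each "htest" object.
pair.summary <- data.frame(
  pair      = sapply(color.pairs, paste, collapse = "-"),
  statistic = sapply(tests, function(tt) unname(tt$statistic)),
  p.value   = sapply(tests, function(tt) tt$p.value)
)

head(pair.summary)
```

Each element of tests is an ordinary "htest" list, so other components such as conf.int or estimate can be extracted the same way.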

+2

user1327089 Sep 26 '12 at 16:37


Ben Bolker · Accepted Answer · 2012-09-26T16:50:24+0000

I think you are looking for:

    library(plyr)
    ddply(diam, "color", function(x) {
        w <- wilcox.test(price ~ cut, data = x)
        with(w, data.frame(statistic, p.value))
    })

(Substituting t.test for wilcox.test seems to work fine too.)

results:

      color statistic      p.value
    1     D  339753.5 4.232833e-24
    2     E  591104.5 6.789386e-19
    3     F  731767.5 2.955504e-11
    4     G  950008.0 1.176953e-12
    5     H  611157.5 2.055857e-17
    6     I  213019.0 3.299365e-04
    7     J   56870.0 2.364026e-01
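The note about substituting t.test can be sketched along the same lines. This is one possible variant, not the answerer's code: the result name t.by.color and the extra columns (df, mean.fair, mean.ideal) are illustrative choices, and the as.data.frame() call is a precaution I am assuming is harmless, since recent ggplot2 versions ship diamonds as a tibble.

```r
library(ggplot2)  # for the diamonds data
library(plyr)

# Same subset as in the question, converted to a plain data frame.
diam <- as.data.frame(diamonds)
diam <- diam[diam$cut %in% c("Fair", "Ideal"), ]

# One Welch t-test of price by cut within each color; ddply() row-binds the
# one-row data frames and adds the color column.
t.by.color <- ddply(diam, "color", function(x) {
  tt <- t.test(price ~ cut, data = x)
  data.frame(
    statistic  = unname(tt$statistic),
    df         = unname(tt$parameter),
    p.value    = tt$p.value,
    mean.fair  = unname(tt$estimate[1]),  # "mean in group Fair"
    mean.ideal = unname(tt$estimate[2])   # "mean in group Ideal"
  )
})

t.by.color
```

The result has one row per color, which is the same shape as the wilcox.test table above, so it can be merged with or plotted alongside the boxplot data.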
