How can I generate summary statistics for groups if my grouping variable is a factor?

Question

Suppose I want summary statistics for the mtcars data set (part of the base installation of R 2.12.1). Below I group the cars by the number of engine cylinders they have and take the per-group means of the remaining variables in mtcars.

    > str(mtcars)
    'data.frame':   32 obs. of  11 variables:
     $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
     $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
     $ disp: num  160 160 108 258 360 ...
     $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
     $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
     $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
     $ qsec: num  16.5 17 18.6 19.4 17 ...
     $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
     $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
     $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
     $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
    > ddply(mtcars, .(cyl), mean)
           mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear
    1 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909
    2 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143
    3 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714
          carb
    1 1.545455
    2 3.428571
    3 3.500000

But if my grouping variable is a factor, things get trickier. ddply() issues a warning for each level of the factor, since the mean() of a factor cannot be taken.

    > mtcars$cyl <- as.factor(mtcars$cyl)
    > str(mtcars)
    'data.frame':   32 obs. of  11 variables:
     $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
     $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
     $ disp: num  160 160 108 258 360 ...
     $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
     $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
     $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
     $ qsec: num  16.5 17 18.6 19.4 17 ...
     $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
     $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
     $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
     $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
    > ddply(mtcars, .(cyl), mean)
           mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear
    1 26.66364  NA 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909
    2 19.74286  NA 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143
    3 15.10000  NA 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714
          carb
    1 1.545455
    2 3.428571
    3 3.500000
    Warning messages:
    1: In mean.default(X[[2L]], ...) :
      argument is not numeric or logical: returning NA
    2: In mean.default(X[[2L]], ...) :
      argument is not numeric or logical: returning NA
    3: In mean.default(X[[2L]], ...) :
      argument is not numeric or logical: returning NA
    >

So I'm wondering whether I'm going about generating summary statistics the wrong way.

How do people typically build data structures of group-wise summary statistics (e.g., means, standard deviations, etc.)? Should I be using something other than ddply()? If I can use ddply(), what can I do to avoid the warnings that occur when it tries to take the mean of my grouping factor?

+6

r apply plyr reshape

briandk Jan 29 '11 at 3:26

2 answers

Not an answer, but an observation: this is not a ddply() problem as such. Both of the following work fine for creating a table of means:

    aggregate(mtcars, by=list(mtcars$cyl), mean)
    apply(mtcars, 2, function(col) tapply(col, INDEX=mtcars$cyl, FUN=mean))

But after mtcars$cyl <- as.factor(mtcars$cyl), neither of the above works, because R does not know how to take the mean of a factor column. We can avoid this by dropping that column ("cyl", column 2) from what gets passed to mean():

    aggregate(mtcars[ , -2], by=list(mtcars$cyl), mean)
    apply(mtcars[ , -2], 2, function(col) tapply(col, INDEX=mtcars$cyl, FUN=mean))

But this is pretty awkward.
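A possibly less awkward base-R route is the formula interface to aggregate(), where the grouping factor is named on the right-hand side and so is never passed to mean(). This is only a sketch; the names `cars` and `means_by_cyl` are made up for illustration and are not from the original post.

```r
# Work on a copy so the built-in data set is left untouched.
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# "." on the left-hand side means "every column not named on the right",
# so the factor cyl is used only for grouping and never passed to mean().
means_by_cyl <- aggregate(. ~ cyl, data = cars, FUN = mean)
print(means_by_cyl)
```

One thing to keep in mind: the formula method drops rows containing NA by default (na.action = na.omit). That makes no difference for mtcars, which has no missing values, but it can matter on other data.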

+2

J. Win. Jan 29 '11 at 4:06

Prasad chalasani · Accepted Answer · 2011-01-29T03:33:49+0000

Use numcolwise(mean): numcolwise() converts its argument (a function) into a function that operates only on numeric columns (and ignores categorical/factor columns).

    > ddply(mtcars, .(cyl), numcolwise(mean))
      cyl      mpg     disp        hp     drat       wt     qsec        vs
    1   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
    2   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
    3   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
             am     gear     carb
    1 0.7272727 4.090909 1.545455
    2 0.4285714 3.857143 3.428571
    3 0.1428571 3.285714 3.500000
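To make the idea behind numcolwise() concrete, here is a rough base-R sketch of the same pattern: wrap a summary function so that it only ever sees numeric columns, then apply it to each group. The helper `numeric_only()` is hypothetical and is not part of plyr; plyr's actual implementation may well differ. The names `cars`, `pieces`, and `means_by_cyl` are likewise just for illustration.

```r
# Hypothetical helper (not part of plyr): wrap a summary function so it is
# applied only to the numeric columns of a data frame.
numeric_only <- function(f) {
  function(df, ...) {
    is_num <- vapply(df, is.numeric, logical(1))
    as.data.frame(lapply(df[is_num], f, ...))
  }
}

cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Split by the factor, summarise each piece, and stack the one-row results.
pieces <- split(cars, cars$cyl)
means_by_cyl <- do.call(rbind, lapply(pieces, numeric_only(mean)))
print(means_by_cyl)
```

Unlike the ddply() result above, this sketch does not return cyl as a column; the group labels end up as row names instead. In practice numcolwise() with ddply() is the more convenient choice, and the sketch is only meant to show why the factor column stops causing warnings.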
