How to reduce a data frame by group maximum while keeping the other columns from the same rows

I am trying to reduce a data frame by applying the max function to one column. I would like to keep the other columns, but with the values from the same rows in which each maximum was found. An example should make this clearer.

Suppose we have the following data frame:

 dframe <- data.frame(list(BENCH=sort(rep(letters[1:4], 4)), CFG=rep(1:4, 4), VALUE=runif(4 * 4) )) 

This gives me:

    BENCH CFG      VALUE
 1      a   1 0.98828096
 2      a   2 0.19630597
 3      a   3 0.83539540
 4      a   4 0.90988296
 5      b   1 0.01191147
 6      b   2 0.35164194
 7      b   3 0.55094787
 8      b   4 0.20744004
 9      c   1 0.49864470
 10     c   2 0.77845408
 11     c   3 0.25278871
 12     c   4 0.23440847
 13     d   1 0.29795494
 14     d   2 0.91766057
 15     d   3 0.68044728
 16     d   4 0.18448748

Now I want to reduce the data by selecting the maximum VALUE for each BENCH:

 aggregate(VALUE ~ BENCH, dframe, FUN=max) 

This gives me the expected result:

   BENCH     VALUE
 1     a 0.9882810
 2     b 0.5509479
 3     c 0.7784541
 4     d 0.9176606

Next, I tried to keep the other columns:

 aggregate(cbind(VALUE, CFG) ~ BENCH, dframe, FUN=max) 

This reduction returns:

   BENCH     VALUE CFG
 1     a 0.9882810   4
 2     b 0.5509479   4
 3     c 0.7784541   4
 4     d 0.9176606   4

Both VALUE and CFG are reduced with the max function, which is not what I want. In this example, I would like to get:

   BENCH     VALUE CFG
 1     a 0.9882810   1
 2     b 0.5509479   3
 3     c 0.7784541   2
 4     d 0.9176606   2

where CFG is not reduced; it simply holds the value from the row that has the maximum VALUE for each BENCH.

How can I change my reduction to get this last result?

3 answers

Here's a base R solution:

 do.call(rbind, by(dframe, dframe$BENCH, FUN=function(X) X[which.max(X$VALUE),]))
 #   BENCH CFG     VALUE
 # a     a   1 0.9882810
 # b     b   3 0.5509479
 # c     c   2 0.7784541
 # d     d   2 0.9176606
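
Another base R option, closer to the aggregate call in the question, is to compute the per-group maxima and then merge them back onto the original data frame to recover CFG. This is only a minimal sketch: it rebuilds dframe from the values printed in the question so the result is reproducible, and the names `maxima` and `result` are my own. Note that if a BENCH has tied maxima, the merge keeps every tied row.

```r
# Rebuild the example data from the values printed in the question
dframe <- data.frame(
  BENCH = rep(letters[1:4], each = 4),
  CFG   = rep(1:4, times = 4),
  VALUE = c(0.98828096, 0.19630597, 0.83539540, 0.90988296,
            0.01191147, 0.35164194, 0.55094787, 0.20744004,
            0.49864470, 0.77845408, 0.25278871, 0.23440847,
            0.29795494, 0.91766057, 0.68044728, 0.18448748)
)

# Step 1: per-group maximum, exactly as in the question
maxima <- aggregate(VALUE ~ BENCH, dframe, FUN = max)

# Step 2: merge on the shared columns (BENCH and VALUE) to bring CFG back
result <- merge(maxima, dframe)
print(result)
```

With the question's data, each BENCH ends up with the CFG of its maximum row (1, 3, 2, 2).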

If your problem scales to big data (millions or tens of millions of rows and groups), then the data.table package might be of interest. Here is the relevant syntax:

 require(data.table)
 dtable <- data.table(dframe)
 dtable[, .SD[which.max(VALUE),], by = BENCH]
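
With very many groups, subsetting `.SD` inside each group can be slower than necessary, depending on the data.table version. A commonly used alternative is to collect the winning row numbers with `.I` and subset the table once. This is a sketch that assumes the data.table package is installed; `idx` and `result` are my own names, and the table is rebuilt from the values printed in the question.

```r
library(data.table)

# Rebuild the example data from the values printed in the question
dtable <- data.table(
  BENCH = rep(letters[1:4], each = 4),
  CFG   = rep(1:4, times = 4),
  VALUE = c(0.98828096, 0.19630597, 0.83539540, 0.90988296,
            0.01191147, 0.35164194, 0.55094787, 0.20744004,
            0.49864470, 0.77845408, 0.25278871, 0.23440847,
            0.29795494, 0.91766057, 0.68044728, 0.18448748)
)

# .I holds row numbers in the full table; keep the one at each group's maximum
idx <- dtable[, .I[which.max(VALUE)], by = BENCH]$V1

# Subset the table once using the collected row numbers
result <- dtable[idx]
print(result)
```

Like the `.SD` version, this keeps only the first row per group when maxima are tied.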

You can also use ddply from the plyr package:

 library(plyr)
 ddply(dframe, .(BENCH), function(df) df[df$VALUE == max(df$VALUE), ])
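
One thing to be aware of: filtering with `== max(...)` returns every row that ties for the maximum, whereas `which.max` returns only the first. A small base R illustration on a made-up group (the data frame `tied` is hypothetical, not taken from the question):

```r
# Hypothetical group in which two rows share the maximum VALUE
tied <- data.frame(BENCH = "a", CFG = 1:3, VALUE = c(0.5, 0.9, 0.9))

# Equality filter: keeps both tied rows (CFG 2 and 3)
all_max <- tied[tied$VALUE == max(tied$VALUE), ]

# which.max: keeps only the first row at the maximum (CFG 2)
first_max <- tied[which.max(tied$VALUE), ]

print(all_max)
print(first_max)
```

Which behaviour you want depends on whether ties should produce one row or several per BENCH.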