Adding a non-aggregated column to an aggregated dataset based on aggregation of another column

Question

Is it possible to use the aggregate function to add another column from the original data frame, without actually using this column to aggregate the data?

Here is a very simplified version of the data (call it data) to illustrate my question:

 name    result.1 result.2    replicate day data.for.mean
 "obj.1" 1        "good"      1         1   5
 "obj.1" 1        "good"      2         1   7
 "obj.1" 1        "great"     1         2   6
 "obj.1" 1        "good"      2         2   9
 "obj.1" 2        "bad"       1         1   10
 "obj.1" 2        "not good"  2         1   6
 "obj.1" 2        "bad"       1         2   5
 "obj.1" 2        "not good"  2         2   3
 "obj.2" 1        "excellent" 1         1   14
 "obj.2" 1        "good"      2         1   10
 "obj.2" 1        "good"      1         2   11
 "obj.2" 1        "not bad"   2         2   7
 "obj.2" 2        "bad"       1         1   4
 "obj.2" 2        "bad"       2         1   3
 "obj.2" 2        "horrible"  1         2   2
 "obj.2" 2        "dismal"    2         2   1

You will notice that result.1 and result.2 are related: if result.1 == 1, result.2 is good / great, and if result.1 == 2, result.2 is bad / not good. I need both of these columns in the aggregated dataset, and it does not matter which value of result.2 is selected when aggregating the data. I just need enough information to tell whether result.1 is good or bad, and likewise for result.2. So, for example, result.2 could be "dismal" for every row where result.1 is 2.

The problem is that since result.2 uses different names to define good / bad, I cannot use it as a column for aggregation.

Currently, my aggregate function looks like this:

 aggregated.data <- aggregate(data["data.for.mean"], by = data[c("name", "result.1", "day")], FUN = mean)

which would give one line of output, such as ...

 name    result.1 day data.for.mean
 "obj.1" 1        1   6

(All replicates for obj.1 with result.1 == 1 on day 1 were averaged. They were 5 and 7, the first two rows of my example dataset.)
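For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example table and runs the aggregation. The way the data frame is constructed (rep() calls, stringsAsFactors = FALSE) is my own assumption about how the data might be entered; only the column names and values come from the table above.

```r
# Rebuild the example data (construction is an assumption; values are from the table).
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(c(1, 2), each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(c(1, 2), times = 8),
  day = rep(rep(c(1, 2), each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3, 14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Average data.for.mean over the replicates within each name / result.1 / day group.
aggregated.data <- aggregate(data["data.for.mean"],
                             by = data[c("name", "result.1", "day")],
                             FUN = mean)

# Look up the group discussed in the text: obj.1, result.1 == 1, day 1.
first.group <- subset(aggregated.data,
                      name == "obj.1" & result.1 == 1 & day == 1)
print(first.group)
```

The result has one row per name / result.1 / day combination and no result.2 column, which is the gap the question is about.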

What I would like to produce is an output row such as

 name    result.1 result.2 day data.for.mean
 "obj.1" 1        "good"   1   6

Again, "good" could be replaced by any other result.2 value that goes with a result.1 value of 1, such as "great" or "excellent".

What would be the best way to capture the information from result.2 and add it to aggregated.data (the output of the aggregate function)?

Thanks.

+3

r aggregate

Dashanimal Jan 28 '14 at 4:49

2 answers

How about this with dplyr:

 library(dplyr)

 # Group by the aggregation keys, average the numeric column,
 # and keep the first result.2 label found in each group.
 group_by(data, name, result.1, day) %>%
   summarise(mean = mean(data.for.mean), result.2 = result.2[1])

 #Source: local data frame [8 x 5]
 #Groups: name, result.1
 #   name result.1 day mean  result.2
 #1 obj.2        1   2  9.0      good
 #2 obj.2        1   1 12.0 excellent
 #3 obj.1        1   1  6.0      good
 #4 obj.1        1   2  7.5     great
 #5 obj.1        2   2  4.0       bad
 #6 obj.1        2   1  8.0       bad
 #7 obj.2        2   2  1.5  horrible
 #8 obj.2        2   1  3.5       bad

+1

Troy Jan 28 '14 at 5:51

Matthew Lundberg · Accepted Answer · 2014-01-29T02:01:15+0000

Here's a solution in base R that uses merge, and then another aggregate:

 agg.2 <- merge(aggregated.data, data[, names(data) != 'data.for.mean'])
 aggregate(result.2 ~ name+result.1+day+data.for.mean, data=agg.2, FUN=sample, size=1)
 ##    name result.1 day data.for.mean  result.2
 ## 1 obj.2        2   2           1.5    dismal
 ## 2 obj.2        2   1           3.5       bad
 ## 3 obj.1        2   2           4.0       bad
 ## 4 obj.1        1   1           6.0      good
 ## 5 obj.1        1   2           7.5     great
 ## 6 obj.1        2   1           8.0  not good
 ## 7 obj.2        1   2           9.0   not bad
 ## 8 obj.2        1   1          12.0 excellent

Here's how it works:

The merge adds the result.2 values, but it creates several rows per group wherever the group has several such values. Then aggregate is used to select one of these rows.
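To make the two steps concrete, here is a self-contained sketch that counts rows at each stage. The data frame construction and the set.seed() call are my own additions for reproducibility, not part of the original question or answer.

```r
# Example data rebuilt from the question (construction is an assumption).
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(c(1, 2), each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(c(1, 2), times = 8),
  day = rep(rep(c(1, 2), each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3, 14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Step 0: the group means (one row per name / result.1 / day).
aggregated.data <- aggregate(data["data.for.mean"],
                             by = data[c("name", "result.1", "day")],
                             FUN = mean)

# Step 1: merge on the shared columns (name, result.1, day).
# Every group mean is repeated once per original row in that group.
agg.2 <- merge(aggregated.data, data[, names(data) != "data.for.mean"])

# Step 2: collapse back to one row per group, keeping a single label.
set.seed(1)  # only so the random pick is repeatable
picked <- aggregate(result.2 ~ name + result.1 + day + data.for.mean,
                    data = agg.2, FUN = sample, size = 1)

cat("rows after aggregate:", nrow(aggregated.data), "\n")
cat("rows after merge:    ", nrow(agg.2), "\n")
cat("rows after picking:  ", nrow(picked), "\n")
```

With two replicates per group, the merge should double the row count before the second aggregate brings it back down to one row per group.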

Since you say you don't care which of the corresponding result.2 labels you get, I pick one at random with sample.

To return the first label in result.2, use head with n=1 instead:

 aggregate(result.2 ~ name+result.1+day+data.for.mean, data=agg.2, FUN=head, n=1)

Similarly, to get the last such label, use tail with n=1.
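As a rough sketch of the head and tail variants side by side (again with the example data rebuilt by hand, which is an assumption on my part), the following runs both and checks that every label picked really belongs to its group. Which label counts as "first" or "last" depends on the row order that merge returns, so I would not rely on a particular one.

```r
# Example data rebuilt from the question (construction is an assumption).
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(c(1, 2), each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(c(1, 2), times = 8),
  day = rep(rep(c(1, 2), each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3, 14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

aggregated.data <- aggregate(data["data.for.mean"],
                             by = data[c("name", "result.1", "day")],
                             FUN = mean)
agg.2 <- merge(aggregated.data, data[, names(data) != "data.for.mean"])

# Keep the first label per group, then the last label per group.
first.label <- aggregate(result.2 ~ name + result.1 + day + data.for.mean,
                         data = agg.2, FUN = head, n = 1)
last.label <- aggregate(result.2 ~ name + result.1 + day + data.for.mean,
                        data = agg.2, FUN = tail, n = 1)

# Helper (hypothetical name): TRUE for each row whose label occurs
# in the original data for the same name / result.1 / day group.
label.in.group <- function(res) {
  mapply(function(n, r, d, lab) {
    lab %in% data$result.2[data$name == n & data$result.1 == r & data$day == d]
  }, res$name, res$result.1, res$day, as.character(res$result.2))
}

print(all(label.in.group(first.label)))
print(all(label.in.group(last.label)))
```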
