Filter data.table by the number of groups

Say I have a data.table like this:

    sample <- data.table(
      id   = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
      name = c("apple", "apple", "orange", "orange", "pear", "pear", "pear", "banana", "banana"),
      atr  = c("pretty", "ugly", "bruised", "delicious", "pear-shaped", "bruised", "infested", "too-ripe", "perfect"),
      N    = c(10, 9, 15, 4, 5, 7, 7, 4, 12)
    )

I want to return essentially unique(sample[,list(id, name)]), except that I also need the atr column for the row with the largest N in each group. Where there is a tie for the highest N, I don't care which of the tied rows is chosen, but I want only one to be selected.

This almost works: merge(sample[,list(N=max(N)),by=list(id,name)], sample,by=c("id","name","N")), but since pear has two atr values that tie for the max, it returns two pears. Besides not giving the expected result, I also assume (or hope) that there is a way to do this that doesn't involve a join.
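To make the problem concrete, here is a small sketch of the merge approach run on the sample data above. The intermediate names top and joined are only illustrative; the point is that the join keeps every row that ties for the maximum.

```r
library(data.table)

sample <- data.table(
  id   = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
  name = c("apple", "apple", "orange", "orange", "pear", "pear", "pear", "banana", "banana"),
  atr  = c("pretty", "ugly", "bruised", "delicious", "pear-shaped", "bruised", "infested", "too-ripe", "perfect"),
  N    = c(10, 9, 15, 4, 5, 7, 7, 4, 12)
)

# One row per group holding the group's maximum N
top <- sample[, .(N = max(N)), by = .(id, name)]

# Joining back on (id, name, N) matches every row that reaches the maximum,
# so the two pear rows with N == 7 both survive
joined <- merge(top, sample, by = c("id", "name", "N"))
print(joined)
```

With four groups one would hope for four rows, but joined has five because of the tie.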

2 answers

I would just use order:

    > unique(sample[order(-N), .(id, name, atr)], by = c("id", "name"))
       id   name     atr
    1:  2 orange bruised
    2:  4 banana perfect
    3:  1  apple  pretty
    4:  3   pear bruised

If you want to keep the overall sort order by group, just use order(id, name, -N).
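As a sketch of that variant on the question's data (res is just an illustrative name), sorting by the grouping columns first and by descending N second keeps the groups in their original order while still putting each group's largest N on top:

```r
library(data.table)

sample <- data.table(
  id   = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
  name = c("apple", "apple", "orange", "orange", "pear", "pear", "pear", "banana", "banana"),
  atr  = c("pretty", "ugly", "bruised", "delicious", "pear-shaped", "bruised", "infested", "too-ripe", "perfect"),
  N    = c(10, 9, 15, 4, 5, 7, 7, 4, 12)
)

# Sort by group, then by descending N within each group;
# unique(..., by = ) then keeps the first row of every (id, name) group
res <- unique(sample[order(id, name, -N), .(id, name, atr)], by = c("id", "name"))
print(res)
```

Because the i expression returns a sorted copy, sample itself is left untouched here.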

You can also split this into two lines:

    setorder(sample, -N) # done by reference, as with all set* functions in data.table
    unique(sample[, .(id, name, atr)], by = c("id", "name"))

Or perhaps better depending on your ultimate goal:

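    setkey(setorder(sample, -N), id, name)
    # since data.table 1.9.8, unique() compares all columns by default,
    # so the key columns are passed explicitly
    unique(sample[, .(id, name, atr)], by = c("id", "name"))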

(Note: the order of operations is critical in the latter case. setorder removes any existing key, so it has to be called first, with setkey applied afterwards.)
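A minimal way to check this, assuming the behaviour of the data.table releases I am familiar with, is to inspect key() before and after the call. The toy table dt and the variables key_before and key_after are made up for the illustration:

```r
library(data.table)

# Toy table, not the question's data
dt <- data.table(id = c(2, 1, 1), N = c(5, 3, 9))

setkey(dt, id)
key_before <- key(dt)   # the key set above

# Reordering by descending N breaks the sort order the key promised,
# so data.table drops the key
setorder(dt, -N)
key_after <- key(dt)

print(key_before)
print(key_after)
```

Calling setkey after setorder, as in the snippet above, avoids the problem because setkey sorts stably and so preserves the descending N order inside each group.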


You can use atr[N == max(N)][1] to return only the first value in case of a tie, for example:

    library(data.table)
    sample[, .(atr = atr[N == max(N)][1]), by = .(id, name)]
    #    id   name     atr
    # 1:  1  apple  pretty
    # 2:  2 orange bruised
    # 3:  3   pear bruised
    # 4:  4 banana perfect

Note: As Frank points out, atr[N == max(N)][1] is equivalent to atr[which.max(N)].
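A short sketch of the which.max form on the question's data follows (res is an illustrative name). which.max returns the index of the first maximum, which is why exactly one row per group comes back even when there is a tie:

```r
library(data.table)

sample <- data.table(
  id   = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
  name = c("apple", "apple", "orange", "orange", "pear", "pear", "pear", "banana", "banana"),
  atr  = c("pretty", "ugly", "bruised", "delicious", "pear-shaped", "bruised", "infested", "too-ripe", "perfect"),
  N    = c(10, 9, 15, 4, 5, 7, 7, 4, 12)
)

# which.max(N) is the position of the first maximum within each group,
# so ties resolve to the earliest row
res <- sample[, .(atr = atr[which.max(N)]), by = .(id, name)]
print(res)
```

For pear, both "bruised" and "infested" have N equal to 7, and the earlier row ("bruised") is the one selected.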
