R Get unique records in a data frame based on secondary field conditions.

Question

R Get unique records in a data frame based on secondary field conditions.

UPDATED AND SIMPLIFIED

I have a really big table (~ 7 million records) that has the following structure.

temp <- read.table(header = TRUE, stringsAsFactors=FALSE, text = "Website Datetime Rating A 2007-12-06T14:53:07Z 1 A 2006-07-28T03:52:26Z 4 B 2006-11-02T11:06:25Z 2 C 2007-06-19T06:56:08Z 5 C 2009-11-28T22:27:58Z 2 C 2009-11-28T22:28:13Z 2")

What I want to get are unique sites with maximum website ratings:

 Website Rating A 4 B 2 C 5

I tried using a for loop, but it was too slow. Is there any other way to achieve this.

+4

r

Michael Dec 15 '12 at 8:35

source share

2 answers

  do.call( rbind, lapply( split(temp, temp$Website) , function(d) d[ which.max(d$Rating), ] ) ) Website Datetime Rating AA 2006-07-28T03:52:26Z 4 BB 2006-11-02T11:06:25Z 2 CC 2007-06-19T06:56:08Z 5

Since your 'Datetime' variable is not yet a Date or datetime object, you must first convert the Date object.

which.max will select the first element that is the maximum.

 > which.max(c(1,1,2,2)) [1] 3

So Ananda cannot be right in his warning in this regard. Of course, data-based methods will be faster and can also succeed if the machine's memory is modest. The method above can make several copies along this path, and the data.table functions do not need so many copies.

+3

42- Dec 15 '12 at 10:08

source share

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2012-12-15T09:27:23+0000

I would most likely be exploring the data.table package, although without any details, the following data.table solution will most likely not be what you need. I mention this because, in particular, there can be more than one “Rating” entry for each group that corresponds to max ; how would you like to deal with these cases?

 library(data.table) temp <- read.table(header = TRUE, stringsAsFactors=FALSE, text = "Website Datetime Rating A 2012-10-9 10 A 2012-11-10 12 B 2011-10-9 5") DT <- data.table(temp, key="Website") DT # Website Datetime Rating # 1: A 2012-10-9 10 # 2: A 2012-11-10 12 # 3: B 2011-10-9 5 DT[, list(Datetime = Datetime[which.max(Rating)], Rating = max(Rating)), by = key(DT)] # Website Datetime Rating # 1: A 2012-11-10 12 # 2: B 2011-10-9 5

I would recommend that for better answers you might want to include information such as how your datetime variable can affect your aggregation, or it is possible that there can be more than one "max" value for each group.

If you want all the lines that match max, the fix is easy:

 DT[, list(Time = Times[Rating == max(Rating)], Rating = max(Rating)), by = key(DT)]

If you need a Rating column, there are many ways around this. Following the same steps as above to convert to data.table , try:

 DT[, list(Datetime = max(Rating)), by = key(DT)] Website Datetime # 1: A 4 # 2: B 2 # 3: C 5

Or, keeping the original “pace” of data.frame , try aggregate() :

 aggregate(Rating ~ Website, temp, max) Website Rating # 1 A 4 # 2 B 2 # 3 C 5

Another approach using ave :

 temp[with(temp, Rating == ave(Rating, Website, FUN=max)), ]

R Get unique records in a data frame based on secondary field conditions.

More articles: