R Get unique records in a data frame based on secondary field conditions.

UPDATED AND SIMPLIFIED

I have a really big table (~ 7 million records) that has the following structure.

temp <- read.table(header = TRUE, stringsAsFactors=FALSE, text = "Website Datetime Rating A 2007-12-06T14:53:07Z 1 A 2006-07-28T03:52:26Z 4 B 2006-11-02T11:06:25Z 2 C 2007-06-19T06:56:08Z 5 C 2009-11-28T22:27:58Z 2 C 2009-11-28T22:28:13Z 2") 

What I want to get are unique sites with maximum website ratings:

 Website Rating A 4 B 2 C 5 

I tried using a for loop, but it was too slow. Is there any other way to achieve this.

+4
source share
2 answers

I would most likely be exploring the data.table package, although without any details, the following data.table solution will most likely not be what you need. I mention this because, in particular, there can be more than one β€œRating” entry for each group that corresponds to max ; how would you like to deal with these cases?

 library(data.table) temp <- read.table(header = TRUE, stringsAsFactors=FALSE, text = "Website Datetime Rating A 2012-10-9 10 A 2012-11-10 12 B 2011-10-9 5") DT <- data.table(temp, key="Website") DT # Website Datetime Rating # 1: A 2012-10-9 10 # 2: A 2012-11-10 12 # 3: B 2011-10-9 5 DT[, list(Datetime = Datetime[which.max(Rating)], Rating = max(Rating)), by = key(DT)] # Website Datetime Rating # 1: A 2012-11-10 12 # 2: B 2011-10-9 5 

I would recommend that for better answers you might want to include information such as how your datetime variable can affect your aggregation, or it is possible that there can be more than one "max" value for each group.

If you want all the lines that match max, the fix is ​​easy:

 DT[, list(Time = Times[Rating == max(Rating)], Rating = max(Rating)), by = key(DT)] 

If you need a Rating column, there are many ways around this. Following the same steps as above to convert to data.table , try:

 DT[, list(Datetime = max(Rating)), by = key(DT)] Website Datetime # 1: A 4 # 2: B 2 # 3: C 5 

Or, keeping the original β€œpace” of data.frame , try aggregate() :

 aggregate(Rating ~ Website, temp, max) Website Rating # 1 A 4 # 2 B 2 # 3 C 5 

Another approach using ave :

 temp[with(temp, Rating == ave(Rating, Website, FUN=max)), ] 
+2
source
  do.call( rbind, lapply( split(temp, temp$Website) , function(d) d[ which.max(d$Rating), ] ) ) Website Datetime Rating AA 2006-07-28T03:52:26Z 4 BB 2006-11-02T11:06:25Z 2 CC 2007-06-19T06:56:08Z 5 

Since your 'Datetime' variable is not yet a Date or datetime object, you must first convert the Date object.

which.max will select the first element that is the maximum.

 > which.max(c(1,1,2,2)) [1] 3 

So Ananda cannot be right in his warning in this regard. Of course, data-based methods will be faster and can also succeed if the machine's memory is modest. The method above can make several copies along this path, and the data.table functions do not need so many copies.

+3
source

All Articles