R group by and aggregate - returns relative rank within groups using plyr

Question

UPDATE: I have a data frame that looks like this:

    session_id seller_feedback_score
 1           1                282470
 2           1                275258
 3           1                275258
 4           1                275258
 5           1                 37831
 6           1                282470
 7           1                    26
 8           1                138351
 9           1                321350
 10          1                   841
 11          1                138351
 12          1                 17263
 13          1                282470
 14          1                396900
 15          1                282470
 16          1                282470
 17          1                321350
 18          1                321350
 19          1                321350
 20          1                     0
 21          1                  1596
 22          7                282505
 23          7                275283
 24          7                275283
 25          7                275283
 26          7                 37834
 27          7                282505
 28          7                    26
 29          7                138359
 30          7                321360

and this code (using the plyr package), which should compute the rank of "seller_feedback_score" within each session_id group:

  test <- test %>% group_by(session_id) %>% mutate(seller_feedback_score_rank = dense_rank(-seller_feedback_score))

However, what actually happens is that R ranks across the entire data frame, ignoring the groups (session_id):

    session_id seller_feedback_score seller_feedback_score_rank_2
 1           1                282470                            5
 2           1                275258                            7
 3           1                275258                            7
 4           1                275258                            7
 5           1                 37831                           11
 6           1                282470                            5
 7           1                    26                           15
 8           1                138351                            9
 9           1                321350                            3
 10          1                   841                           14
 11          1                138351                            9
 12          1                 17263                           12
 13          1                282470                            5
 14          1                396900                            1
 15          1                282470                            5
 16          1                282470                            5
 17          1                321350                            3
 18          1                321350                            3
 19          1                321350                            3
 20          1                     0                           16
 21          1                  1596                           13
 22          7                282505                            4
 23          7                275283                            6
 24          7                275283                            6
 25          7                275283                            6
 26          7                 37834                           10
 27          7                282505                            4
 28          7                    26                           15
 29          7                138359                            8
 30          7                321360                            2

I checked this by looking at the unique values of "seller_feedback_score_rank"; not surprisingly, the number of unique values equals the highest rank value. I would appreciate it if someone could reproduce this and help. Thanks.

r group-by aggregate dplyr plyr

user3628777 Jan 13 '15 at 13:26

2 answers

From data.table 1.9.5 on, the frank() function is exported (for fast ranking). Its interface is similar to base::rank, but it implements dense ranking in addition to all of the base::rank ties methods, and it also works on lists in addition to vectors. You can install it by following the instructions here.

 require(data.table) ## 1.9.5+
 setDT(df)[, rank := frank(-seller_feedback_score, ties.method="dense"), by=session_id]

As @David points out, maybe you want ties.method = "first" or "min" instead? Not sure...

 setDT(df)[, rank := frank(-seller_feedback_score, ties.method="first"), ## or "min" or "max"
           by=session_id]
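To see how the tie-breaking choice changes the result, here is a minimal sketch that uses base::rank rather than frank() (the answer states that the interfaces are similar). The scores are a few values from session_id 1 in the question; the variable names are illustrative, and the dense variant is emulated with match() because base::rank has no "dense" method.

```r
# A few scores from session_id 1 in the question (note the tie at 275258).
score <- c(282470, 275258, 275258, 37831)

# Negate so that the highest score receives rank 1.
rank_first <- rank(-score, ties.method = "first")  # ties broken by position
rank_min   <- rank(-score, ties.method = "min")    # tied values share the lowest rank
rank_max   <- rank(-score, ties.method = "max")    # tied values share the highest rank

# base::rank has no "dense" method; emulate it by matching each value
# against the sorted unique values, so no rank is skipped after a tie.
rank_dense <- match(-score, sort(unique(-score)))

print(rbind(rank_first, rank_min, rank_max, rank_dense))
```

Only the tied pair and the value after it differ between methods, which is usually enough to decide which one fits the use case.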

In any case, it should be a lot faster. Here's a benchmark against base R:

 require(data.table)
 set.seed(45L)
 val = sample(1e4, 1e7, TRUE)
 system.time(ans1 <- rank(val, ties.method = "min"))
 #    user  system elapsed
 #  16.771   0.199  17.035
 system.time(ans2 <- frank(val, ties.method = "min"))
 #    user  system elapsed
 #   0.532   0.013   0.550
 identical(ans1, ans2)
 # [1] TRUE

Arun Jan 13 '15 at 13:41

docendo discimus · Accepted Answer · 2015-01-13T13:31:43+0000

One option:

 library(dplyr)
 df %>%
   group_by(session_id) %>%
   mutate(rank = dense_rank(-seller_feedback_score))

dense_rank is "like min_rank, but with no gaps between ranks", so I negated the seller_feedback_score column to turn it into something like max_rank (which does not exist in dplyr).

If you want ranks with gaps, so that you reach 21 for the lowest in your case, you can use min_rank instead of dense_rank:

 library(dplyr)
 df %>%
   group_by(session_id) %>%
   mutate(rank = min_rank(-seller_feedback_score))
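The pipelines above are essentially the same call as in the question, so if the ranks still come out computed over the whole data frame, one plausible cause is that plyr was attached after dplyr, in which case a bare mutate() resolves to plyr::mutate, which ignores grouping. This is an assumption: the question does not show its library() calls. The sketch below sidesteps the issue by qualifying every verb with dplyr::, using a small subset of the question's data and illustrative column names.

```r
library(dplyr)

# Subset of the question's data: four rows from each session_id.
df <- data.frame(
  session_id = c(1, 1, 1, 1, 7, 7, 7, 7),
  seller_feedback_score = c(282470, 275258, 275258, 37831,
                            282505, 275283, 275283, 37834)
)

# Explicit dplyr:: prefixes guarantee the grouped versions are used,
# even if another attached package also exports mutate().
ranked <- df %>%
  dplyr::group_by(session_id) %>%
  dplyr::mutate(
    rank_dense = dplyr::dense_rank(-seller_feedback_score),  # no gaps after ties
    rank_min   = dplyr::min_rank(-seller_feedback_score)     # gaps after ties
  ) %>%
  dplyr::ungroup()

print(as.data.frame(ranked))
```

Each session_id restarts at rank 1 here, which is a quick way to confirm that the grouping was respected.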
