Dplyr + group_by and avoid sorting alphabetically

Question


I have the following data:

    data <- structure(list(
      user = c(1234L, 1234L, 1234L, 1234L, 1234L, 1234L, 1234L, 1234L, 1234L,
               1234L, 1234L, 4758L, 4758L, 9584L, 9584L, 9584L, 9584L, 9584L,
               9584L),
      time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 5L, 6L, 1L, 2L,
               3L, 4L, 5L, 6L),
      fruit = structure(c(1L, 6L, 1L, 1L, 6L, 5L, 5L, 3L, 4L, 1L, 2L, 4L, 2L,
                          1L, 6L, 5L, 5L, 3L, 2L),
                        .Label = c("apple", "banana", "lemon", "lime",
                                   "orange", "pear"),
                        class = "factor"),
      count = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                1L, 1L, 1L, 1L),
      cum_sum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 1L, 2L, 1L,
                  2L, 3L, 4L, 5L, 6L)),
      .Names = c("user", "time", "fruit", "count", "cum_sum"),
      row.names = c(NA, -19L), class = "data.frame")

For each user in this set, I want to look at the sequence of fruits over time. However, some fruits appear several times in a row:

      user time  fruit count cum_sum
    1 1234    1  apple     1       1
    2 1234    2   pear     1       2
    3 1234    3  apple     1       3
    4 1234    4  apple     1       4
    5 1234    5   pear     1       5
    6 1234    6 orange     1       6
    7 1234    7 orange     1       7

What I want instead is the sequence over time with consecutive repeats of the same fruit collapsed into a single entry.

The problem is that if I group by user and fruit and then summarise, dplyr returns the groups sorted by fruit (alphabetically here) rather than in time order:

    data %>%
      group_by(user, fruit) %>%
      summarise(temp_var = 1) %>%
      mutate(cum_sum = cumsum(temp_var))

For user 1234 above, for example, I want the fruits listed in time order with consecutive duplicates removed. So where we currently see apple > pear > apple > apple > pear > orange > orange, we should instead see apple > pear > apple > pear > orange.

+5

r dplyr

Marc tulla Jun 25 '15 at 17:05


3 answers

Using the rleid function from the latest version of data.table on CRAN, you can simply do the following (although I'm not sure about your exact desired output):

    library(data.table) ## v >= 1.9.6
    res <- setDT(data)[, .(fruit = fruit[1L]), by = .(user, indx = rleid(fruit))
                       ][, cum_sum := seq_len(.N), by = user
                         ][, indx := NULL]
    res
    #     user  fruit cum_sum
    #  1: 1234  apple       1
    #  2: 1234   pear       2
    #  3: 1234  apple       3
    #  4: 1234   pear       4
    #  5: 1234 orange       5
    #  6: 1234  lemon       6
    #  7: 1234   lime       7
    #  8: 1234  apple       8
    #  9: 1234 banana       9
    # 10: 4758   lime       1
    # 11: 4758 banana       2
    # 12: 9584  apple       1
    # 13: 9584   pear       2
    # 14: 9584 orange       3
    # 15: 9584  lemon       4
    # 16: 9584 banana       5
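To make the idea of a run-length id concrete, here is a minimal base R sketch built on rle(). It only illustrates the concept and is not how data.table implements rleid(); the toy vector is the first seven fruits of user 1234 from the question, and the names run_id and collapsed are made up for this example.

```r
# Toy sequence: the first seven fruits of user 1234 from the question.
# rle() needs a plain atomic vector, so a factor would first have to be
# converted with as.character().
fruit <- c("apple", "pear", "apple", "apple", "pear", "orange", "orange")

# Run-length encoding: the value and length of each run of identical items.
runs <- rle(fruit)

# One id per run, repeated for the length of that run (hypothetical name).
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
run_id
# 1 2 3 3 4 5 5

# The value of each run is the de-duplicated sequence, still in time order.
collapsed <- runs$values
collapsed
# "apple" "pear" "apple" "pear" "orange"
```

Grouping by such an id and taking the first row per group is what collapses back-to-back repeats while leaving later reappearances of the same fruit alone.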

+6

David Arenburg Jun 25 '15 at 17:56


You can use group_indices to handle this case:

    data %>%
      filter(group_indices_(., .dots = c("user", "fruit")) !=
               lag(group_indices_(., .dots = c("user", "fruit")), default = 0)) %>%
      group_by(user) %>%
      mutate(cum_sum = row_number())

Similar in spirit to rleid, group_indices_() assigns an identifier to each (user, fruit) group. Using lag(), you then keep only the rows whose identifier differs from that of the previous row.

    #Source: local data frame [16 x 3]
    #Groups: user
    #
    #   user  fruit cum_sum
    #1  1234  apple       1
    #2  1234   pear       2
    #3  1234  apple       3
    #4  1234   pear       4
    #5  1234 orange       5
    #6  1234  lemon       6
    #7  1234   lime       7
    #8  1234  apple       8
    #9  1234 banana       9
    #10 4758   lime       1
    #11 4758 banana       2
    #12 9584  apple       1
    #13 9584   pear       2
    #14 9584 orange       3
    #15 9584  lemon       4
    #16 9584 banana       5
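The underscore-suffixed verbs such as group_indices_() were later deprecated in dplyr, so the same "differs from the previous row" filter may be easier to write with lag() inside a per-user group. The following is a sketch under that assumption, not the answer's original code; the toy data frame and the name deduped are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: user 1 repeats "apple" back to back,
# user 2 has only repeats of "apple".
toy <- data.frame(
  user  = c(1L, 1L, 1L, 1L, 2L, 2L),
  fruit = c("apple", "apple", "pear", "apple", "apple", "apple"),
  stringsAsFactors = FALSE
)

deduped <- toy %>%
  group_by(user) %>%
  # Keep the first row of each user, plus every row whose fruit differs
  # from the previous row. lag() yields NA for the first row, and
  # TRUE | NA evaluates to TRUE, so that row is kept.
  filter(row_number() == 1L | fruit != lag(fruit)) %>%
  # Renumber the surviving rows within each user.
  mutate(cum_sum = row_number()) %>%
  ungroup()

deduped
```

Because the comparison runs inside group_by(user), a fruit that ends one user's rows and starts the next user's rows is not treated as a repeat.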

+3

Steven beaupré Jun 25 '15 at 10:04


Pierre lafortune · Accepted Answer · 2015-06-25T17:41:22+0000

Based on your examples, this may help:

    data %>%
      group_by(user) %>%
      filter(c(TRUE, fruit[-1L] != fruit[-length(fruit)])) %>%
      mutate(cum_sum = cumsum(count), time = seq_along(count))
    # Source: local data frame [16 x 5]
    # Groups: user
    #
    #    user time  fruit count cum_sum
    # 1  1234    1  apple     1       1
    # 2  1234    2   pear     1       2
    # 3  1234    3  apple     1       3
    # 4  1234    4   pear     1       4
    # 5  1234    5 orange     1       5
    # 6  1234    6  lemon     1       6
    # 7  1234    7   lime     1       7
    # 8  1234    8  apple     1       8
    # 9  1234    9 banana     1       9
    # 10 4758    1   lime     1       1
    # 11 4758    2 banana     1       2
    # 12 9584    1  apple     1       1
    # 13 9584    2   pear     1       2
    # 14 9584    3 orange     1       3
    # 15 9584    4  lemon     1       4
    # 16 9584    5 banana     1       5
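The filter condition in this answer compares the vector with a copy of itself shifted by one position. A small base R sketch on a plain character vector (the first seven fruits of user 1234; the names changed and keep are illustrative) shows what that logical vector looks like:

```r
fruit <- c("apple", "pear", "apple", "apple", "pear", "orange", "orange")

# Elements 2..n compared with elements 1..(n-1): TRUE where the fruit changes.
changed <- fruit[-1L] != fruit[-length(fruit)]

# The first element has no predecessor, so it is always kept.
keep <- c(TRUE, changed)
keep
# TRUE TRUE TRUE FALSE TRUE TRUE FALSE

fruit[keep]
# "apple" "pear" "apple" "pear" "orange"
```

This reproduces the sequence asked for in the question: only back-to-back repeats are dropped, and the original time order is preserved.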
