Rearrange data.frame to get consistent product order

I have a dataframe of the following form:

    df <- data.frame(
      client = c("client1", "client1", "client2", "client3", "client3"),
      product = c("A", "B", "A", "D", "A"),
      purchase_Date = c("2010-03-22", "2010-02-02", "2009-03-02", "2011-04-05", "2012-11-01")
    )
    df$purchase_Date <- as.Date(df$purchase_Date, format = "%Y-%m-%d")

which is as follows:

       client product purchase_Date
    1 client1       A    2010-03-22
    2 client1       B    2010-02-02
    3 client2       A    2009-03-02
    4 client3       D    2011-04-05
    5 client3       A    2012-11-01

which I would like to change as follows:

       client purchase1 purchase2
    1 client1         B         A
    2 client2         A      <NA>
    3 client3         D         A

In other words, I would like to know which product each client bought first, second, third, etc., ordered by the date of purchase. I can easily get each one individually using data.table:

    library(data.table)
    # earliest purchase per client (ascending date order)
    setDT(df)[, .SD[order(purchase_Date), product][1], by = client]

for the first one, but I have no idea how to get the full desired result efficiently.

3 answers

Here is a possible data.table solution. (If any client has ten or more purchases, I would recommend avoiding paste0 and just using indx := seq_len(.N) instead: as strings, labels such as "purchase10" sort before "purchase2", which could scramble the column order in the result.)

    library(data.table)
    # number each client's purchases in date order, then reshape to wide
    setDT(df)[order(purchase_Date), indx := paste0("purchase", seq_len(.N)), by = client]
    dcast(df, client ~ indx, value.var = "product")
    #     client purchase1 purchase2
    # 1: client1         B         A
    # 2: client2         A        NA
    # 3: client3         D         A
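The caveat about paste0 comes down to string sorting. Below is a minimal sketch, assuming a made-up client with eleven purchases (the data and the names `labels`, `dt` and `wide` are hypothetical, not from the question). It shows that character labels sort alphabetically, and one possible way to keep the numeric order: cast on the integer index and rename the columns afterwards.

```r
library(data.table)

# Character labels sort alphabetically, so "purchase10" lands before "purchase2"
labels <- paste0("purchase", 1:11)
head(sort(labels), 3)

# Hypothetical data: one client with eleven purchases on consecutive days
dt <- data.table(
  client = "client1",
  product = LETTERS[1:11],
  purchase_Date = as.Date("2010-01-01") + 0:10
)

# Integer index per client, assigned in date order
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# Cast on the integer index (columns come out in numeric order), then rename
wide <- dcast(dt, client ~ indx, value.var = "product")
setnames(wide, names(wide)[-1], paste0("purchase", names(wide)[-1]))
names(wide)
```

Renaming after the cast keeps the readable "purchaseN" column names without relying on how those names sort.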

Here is a timing comparison of the order() and frank() approaches for creating the index column:

    require(data.table)
    set.seed(45L)
    dt = data.table(client = sample(paste("client", 1:1e4, sep=""), 1e6, TRUE))
    dt[, `:=`(product = sample(paste("p", 1:200, sep=""), .N, FALSE),
              purchase_Date = as.Date(sample(14610:16586, .N, FALSE), origin = "1970-01-01")),
       by=client]

    system.time(dt[order(purchase_Date), indx := seq_len(.N), by = client])
    #  user  system elapsed
    #  0.19    0.02    0.20

    system.time(dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client])
    #  user  system elapsed
    #  3.94    0.00    3.98

A dplyr / tidyr alternative:

    library(dplyr)
    library(tidyr)

    df %>%
      group_by(client) %>%
      mutate(purch_rank = dense_rank(purchase_Date)) %>%
      select(-purchase_Date) %>%
      spread(purch_rank, product)
    # Source: local data frame [3 x 3]
    #
    #    client 1  2
    # 1 client1 B  A
    # 2 client2 A NA
    # 3 client3 D  A

And a possible data.table approach:

    library(data.table) # v1.9.5+ (currently from GitHub) for "frank"
    setDT(df)[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client]
    dcast(df, client ~ purch_rank, value.var = "product")
    #     client 1  2
    # 1: client1 B  A
    # 2: client2 A NA
    # 3: client3 D  A
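One thing worth checking before choosing a dense rank over a running index is how ties behave. The sketch below assumes a hypothetical client with two purchases on the same day (the `dates`, `dense` and `running` names are made up for illustration) and compares the two numbering schemes.

```r
library(data.table)

# Hypothetical client with two purchases on the same day
dates <- as.Date(c("2010-02-02", "2010-02-02", "2010-03-22"))

# Dense rank: tied dates share one rank
dense <- frank(dates, ties.method = "dense")

# Running index: every row gets its own slot
running <- seq_along(dates)

dense
running
```

With the dense rank, both same-day products map to the same cell when reshaping. As far as I know, dcast then falls back to an aggregate (counting rows) instead of keeping both products, so the running index may be the safer choice if same-day purchases can occur.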

Here is my solution with dplyr and tidyr:

    library(dplyr)
    library(tidyr)

    # purchases are numbered in row order within each client
    df %>%
      group_by(client) %>%
      select(-purchase_Date) %>%
      mutate(purchase = seq_along(product)) %>%
      spread(purchase, product)
    # Source: local data frame [3 x 3]
    #
    #    client 1  2
    # 1 client1 A  B
    # 2 client2 A NA
    # 3 client3 D  A

A slightly different approach, with a different output shape, uses the reshape2 package. Reuse the pipeline above, but replace its last line (the spread() call) with this:

    dcast(client ~ product)
    # Using purchase as value column: use value.var to override.
    #    client A  B  D
    # 1 client1 1  2 NA
    # 2 client2 1 NA NA
    # 3 client3 2 NA  1
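The seq_along() pipeline in this answer numbers purchases in row order, which is why client1 comes out as A, B rather than the B, A the question asks for. A possible variant, sketched under the assumption that tidyr 1.0.0 or later is available (for pivot_wider) and using a hypothetical `wide` result name, sorts by date before numbering:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  client = c("client1", "client1", "client2", "client3", "client3"),
  product = c("A", "B", "A", "D", "A"),
  purchase_Date = as.Date(c("2010-03-22", "2010-02-02", "2009-03-02",
                            "2011-04-05", "2012-11-01")),
  stringsAsFactors = FALSE
)

wide <- df %>%
  arrange(client, purchase_Date) %>%                      # date order within each client
  group_by(client) %>%
  mutate(purchase = paste0("purchase", row_number())) %>% # purchase1, purchase2, ...
  ungroup() %>%
  select(-purchase_Date) %>%
  pivot_wider(names_from = purchase, values_from = product)

wide
```

Sorting first means the running number reflects purchase order, so purchase1 holds each client's earliest product.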