Efficient use of R data.table and unique()

Is there a more efficient query than the following?

DT[, list(length(unique(OrderNo)) ),customerID] 

To clarify: the table is in LONG format, with customer identifiers, order numbers, and product items. That means there are duplicate rows with the same order identifier whenever a customer bought more than one item in a transaction.

I'm trying to count unique purchases. length() alone counts all order identifiers per customer ID, including duplicates; I only want the number of unique ones.

Edit here:

Here is some dummy code. Ideally, what I'm looking for is the result of the first query using unique().

    library(data.table)

    df <- data.frame(
      customerID = as.factor(c(rep("A", 3), rep("B", 4))),
      product    = as.factor(c(rep("widget", 2), rep("otherstuff", 5))),
      orderID    = as.factor(c("xyz", "xyz", "abd", "qwe", "rty", "yui", "poi")),
      OrderDate  = as.Date(c("2013-07-01", "2013-07-01", "2013-07-03", "2013-06-01",
                             "2013-06-02", "2013-06-03", "2013-07-01"))
    )
    DT.eg <- as.data.table(df)

    # Gives unique order counts
    DT.eg[, list(orderlength = length(unique(orderID))), customerID]

    # Gives counts of all orders by customer
    DT.eg[, .SD, keyby = list(orderID, customerID)][, .N, by = customerID]
    #       ^
    #       |
    #       This should be .N, not .SD ~ RS
r data.table
2 answers

If you are trying to count the number of unique purchases per customer, use

    DT[, .N, keyby = list(customerID, OrderNo)][, .N, by = customerID]
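To see why the chained query works, it may help to look at the intermediate table. The sketch below assumes a cut-down version of the question's dummy data (only the `customerID` and `orderID` columns); the names `per_order` and `unique_orders` are made up for illustration.

```r
library(data.table)

# Toy data mirroring the question's example (illustrative values only)
DT.eg <- data.table(
  customerID = c("A", "A", "A", "B", "B", "B", "B"),
  orderID    = c("xyz", "xyz", "abd", "qwe", "rty", "yui", "poi")
)

# Step 1: collapse to one row per (customerID, orderID) pair.
# N here is the number of item rows in each order.
per_order <- DT.eg[, .N, keyby = .(customerID, orderID)]

# Step 2: count the collapsed rows per customer,
# which is the number of distinct orders for that customer.
unique_orders <- per_order[, .N, by = customerID]
print(unique_orders)
```

The first grouping removes the duplicate order rows, so the second `.N` never needs to call `unique()`.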

As of version 1.9.6 (on CRAN since 19 Sep 2015), data.table provides the helper function uniqueN(), which is equivalent to length(unique(x)) but much faster (according to the data.table NEWS).

Thus, both

 DT.eg[, list(orderlength = length(unique(orderID)) ),customerID] 

and

 DT.eg[,.N, keyby=list(orderID, customerID)][, .N, by=customerID] 

can be rewritten as

 DT.eg[, .(orderlength = uniqueN(orderID)), customerID] 
       customerID orderlength
    1:          A           2
    2:          B           4
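As a quick sanity check, the two forms can be compared on the dummy data. This is only a sketch: it assumes data.table 1.9.6 or later is installed, rebuilds a reduced copy of the question's table, and uses the made-up names `old_way` and `new_way`.

```r
library(data.table)

# Reduced copy of the question's dummy data (illustrative values only)
DT.eg <- data.table(
  customerID = c("A", "A", "A", "B", "B", "B", "B"),
  orderID    = c("xyz", "xyz", "abd", "qwe", "rty", "yui", "poi")
)

# Original approach: base R length(unique(x)) inside j
old_way <- DT.eg[, .(orderlength = length(unique(orderID))), by = customerID]

# uniqueN() approach from this answer
new_way <- DT.eg[, .(orderlength = uniqueN(orderID)), by = customerID]

# Both should give the same per-customer counts
stopifnot(isTRUE(all.equal(old_way, new_way)))
print(new_way)
```

Any speed difference will not be visible on seven rows; timing it on a table of realistic size would be the way to confirm the NEWS claim for a given workload.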