An efficient way to incrementally count unique data points in a data frame

I am trying to find a more efficient way to incrementally count unique data points in a data frame.

For example, I have the following code:

    df = matrix(c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4))
    count = matrix(nrow = nrow(df), ncol = 1)
    for (i in 1:nrow(df)) {
      count[i, 1] = length(which(df[1:i, 1] == df[i, 1]))
    }

The purpose of the code is to incrementally count each occurrence of a given value. For example, the count column will contain the following result:

 1,1,1,2,1,1,2,2,2,3. 

The code I have written so far does the job, but the example df above contains only 10 values. The real data frame I need to run this on contains 52,118 values, and the loop takes a huge amount of time.

Does anyone know a more efficient way to execute the code above?

+7
r count dataframe
3 answers

data.table solution

    library(data.table)
    set.seed(20)
    dat <- data.frame(values = sample(1:3, 50000, replace = TRUE))
    setDT(dat)[, runningCount := 1:.N, values]

           values runningCount
        1:      3            1
        2:      3            2
        3:      1            1
        4:      2            1
        5:      3            3
       ---
    49996:      1        16674
    49997:      2        16516
    49998:      2        16517
    49999:      2        16518
    50000:      2        16519
+9

Here's a quick approach with the dplyr package:

    library(dplyr)
    # Fake data
    set.seed(20)
    dat = data.frame(values = sample(1:3, 50000, replace = TRUE))
    dat %>% group_by(values) %>% mutate(runningCount = 1:n())

       values runningCount
    1       2            1
    2       3            1
    3       1            1
    4       3            2
    5       1            2
    6       3            3
    7       3            4
    .. ...          ...

Timing (in milliseconds):

         min       lq     mean   median       uq      max neval
    2.003755 2.134762 2.198161 2.186214 2.231662 3.665328   100

Timing for all answers so far (using the data I created):

                   median
    dplyr:           2.11
    data.table:      1.24
    lapply/Reduce:  11.61
    ave:             9.93

So data.table is the fastest.
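The `ave` entry in the timings above has no code shown on this page; it presumably refers to the standard base-R idiom along these lines (a hypothetical reconstruction, not the original answer):

```r
# Hypothetical reconstruction of the `ave` approach timed above.
# ave(x, g, FUN = f) applies f within each group g and puts the results
# back in the original positions; seq_along turns each group of equal
# values into 1, 2, 3, ... in order of appearance.
df <- matrix(c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4))
ave(df[, 1], df[, 1], FUN = seq_along)
# [1] 1 1 1 2 1 1 2 2 2 3
```

It is slower than data.table or dplyr on large vectors, but needs no packages.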

+6

One base R approach:

    Reduce(`+`, lapply(unique(c(df)), function(u) {
      b = c(df) == u                  # logical mask marking positions of value u
      b[b == T] = cumsum(b[b == T])   # replace TRUEs with a running count
      b
    }))
    #[1] 1 1 1 2 1 1 2 2 2 3
+6
