How can I rank observations in a group faster?

Question


I have a really simple problem, but I'm probably not thinking in a vectorized enough way to solve it efficiently. I tried two different approaches, and they have been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.

group observations

I have long data (many rows per person, one row per person-observation), and I basically want a variable that tells me how often a person has already been observed.

I have the first two columns and want the third:

    person  wave  obs
    pers1   1999  1
    pers1   2000  2
    pers1   2003  3
    pers2   1998  1
    pers2   2001  2

So far I have tried two approaches. Both are painfully slow (150k rows). I am sure I am missing something, but my searches have not helped yet (the problem is hard to phrase).

Thanks for any pointers!

    # ordered dataset by persnr and year of observation
    person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]
    person.obs$n.obs = 0

    # first approach: loop through people and assign range
    unp = unique(person.obs$PERSNR)
    unplength = length(unp)
    for (i in 1:unplength) {
      print(unp[i])
      person.obs[which(person.obs$PERSNR == unp[i]), ]$n.obs =
        1:length(person.obs[which(person.obs$PERSNR == unp[i]), ]$n.obs)
      i = i + 1
      gc()
    }

    # second approach: loop through rows and reset counter at new person
    pnr = 0
    for (i in 1:length(person.obs[, 2])) {
      if (pnr != person.obs[i, ]$PERSNR) {
        pnr = person.obs[i, ]$PERSNR
        e = 0
      }
      e = e + 1
      person.obs[i, ]$n.obs = e
      i = i + 1
      gc()
    }

+8

optimization r

Ruben May 28, '11 at 15:48


4 answers

The answer from Marek to this question has proven very useful in the past. I wrote it down and use it almost daily, since it is fast and efficient. We will use ave() and seq_along().

    foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)),
                      year = c(1999, 2000, 2003, 1998, 2011))
    foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
    foo
    #   person year obs
    # 1  pers1 1999   1
    # 2  pers1 2000   2
    # 3  pers1 2003   3
    # 4  pers2 1998   1
    # 5  pers2 2011   2
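As a rough sketch of how the same ave() idea might be applied to the layout from the question: the column names PERSNR and wave come from the question, but the data below is simulated and is only an assumption about what the real data looks like. The rows are sorted first so that the counter follows the order of observation.

```r
# Simulated stand-in for the asker's data: the column names PERSNR and wave
# are taken from the question, the values are made up for illustration.
set.seed(1)
person.obs <- data.frame(
  PERSNR = sample(1:500, 5000, replace = TRUE),
  wave   = sample(1990:2010, 5000, replace = TRUE)
)

# Sort by person and wave so the counter follows observation order.
person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]

# ave() splits the index vector by person, applies seq_along() to each piece
# and writes the results back to the original row positions.
person.obs$n.obs <- ave(seq_len(nrow(person.obs)), person.obs$PERSNR,
                        FUN = seq_along)

head(person.obs)
```

Because the work is done once per group rather than once per row, this avoids the repeated data frame subsetting and gc() calls of the two loops in the question.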

Another option using plyr

    library(plyr)
    ddply(foo, "person", transform, obs2 = seq_along(person))
    #   person year obs obs2
    # 1  pers1 1999   1    1
    # 2  pers1 2000   2    2
    # 3  pers1 2003   3    3
    # 4  pers2 1998   1    1
    # 5  pers2 2011   2    2

+14

Chase May 28 '11 at 16:35


Would by() do the trick?

    > foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)),
    +                   year = c(1999, 2000, 2003, 1998, 2011),
    +                   obs = c(1, 2, 3, 1, 2))
    > foo
      person year obs
    1  pers1 1999   1
    2  pers1 2000   2
    3  pers1 2003   3
    4  pers2 1998   1
    5  pers2 2011   2
    > by(foo, foo$person, nrow)
    foo$person: pers1
    [1] 3
    ------------------------------------------------------------
    foo$person: pers2
    [1] 2
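Note that by(foo, foo$person, nrow) returns the total number of rows per person, not a running counter. One possible way to get from group sizes to the counter the question asks for is base R's sequence(), sketched here under the assumption that the rows are already sorted by person:

```r
foo <- data.frame(
  person = c(rep("pers1", 3), rep("pers2", 2)),
  year   = c(1999, 2000, 2003, 1998, 2011)
)

# rle() gives the length of each run of identical person values; this
# assumes the rows are already sorted by person, as in the question.
run_lengths <- rle(as.character(foo$person))$lengths

# sequence() concatenates 1:n for every n, i.e. a counter that restarts
# at each new person.
foo$obs <- sequence(run_lengths)
foo
```

If the data is not sorted by person, the runs found by rle() no longer correspond to whole groups, so sorting first is part of the assumption.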

+2

lindelof May 28 '11 at 16:03


Another option using aggregate and rank in base R:

    foo$obs <- unlist(aggregate(. ~ person, foo, rank)[, 2])
    #   person year obs
    # 1  pers1 1999   1
    # 2  pers1 2000   2
    # 3  pers1 2003   3
    # 4  pers2 1998   1
    # 5  pers2 2011   2

0

989 May 11, '17 at 15:14


Jaap · Accepted Answer · 2016-02-17T08:48:05+0000

Several alternatives with the data.table and dplyr packages.

data.table:

    library(data.table)
    # setDT(foo) is needed to convert to a data.table
    setDT(foo)[, rn := 1:.N, by = person]

Or using the new rowid function (v1.9.7+, which at the time of writing is only available as the development version):

 setDT(foo)[, rn := rowid(person)]

Both give:

    > foo
       person year rn
    1:  pers1 1999  1
    2:  pers1 2000  2
    3:  pers1 2003  3
    4:  pers2 1998  1
    5:  pers2 2011  2

If you need a true rank, you should use the frank function:

 setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
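To illustrate why a row counter and a true rank can differ, here is a small base R sketch with a tie. It does not use frank itself; the helper dense_rank_base and the data frame bar are hypothetical names introduced only for this example, and the tied year is made up.

```r
# Hypothetical data: pers1 has the year 1999 twice, which creates a tie.
bar <- data.frame(
  person = c("pers1", "pers1", "pers1", "pers2", "pers2"),
  year   = c(1999, 1999, 2003, 1998, 2011)
)

# Row counter: increases on every row, tied or not.
bar$rn <- ave(seq_len(nrow(bar)), bar$person, FUN = seq_along)

# Dense rank written in base R (illustrative helper, not data.table's frank):
# equal values share a rank and no rank is skipped afterwards.
dense_rank_base <- function(x) match(x, sort(unique(x)))
bar$dense <- ave(bar$year, bar$person, FUN = dense_rank_base)

bar
# rn    is 1 2 3 1 2
# dense is 1 1 2 1 2
```

With no repeated years inside a person, as in the question's example, the two columns coincide, so the choice only matters when ties are possible.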

dplyr:

    library(dplyr)

    # method 1
    foo <- foo %>% group_by(person) %>% mutate(rn = row_number())

    # method 2
    foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())

both give a similar result:

    > foo
    Source: local data frame [5 x 3]
    Groups: person [2]

      person  year    rn
      (fctr) (dbl) (int)
    1  pers1  1999     1
    2  pers1  2000     2
    3  pers1  2003     3
    4  pers2  1998     1
    5  pers2  2011     2
