How to order percentile data for each identifier in R dataframe [r]

Question


I have a dataframe that contains 70-80 rows of response time (RT) data for each of 228 people, each with a unique identifier (not everyone has the same number of rows). I want to sort each person's RTs into 5 bins. I want the 1st bin to hold their fastest 20 percent of RTs, the 2nd bin their next fastest 20 percent, and so on. Each bin should contain the same number of trials (unless the total number of trials does not divide evenly).

My current dataframe looks like this:

    id    RT
    7000  225
    7000  250
    7000  253
    7001  189
    7001  201
    7001  225

I want my new dataframe to look like this:

    id    RT   Bin
    7000  225  1
    7000  250  1

After my data looks like this, I will aggregate by id and bin.

The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get the breakpoints for the different bins, and assign a bin value (1-5) to every response time. This seems very convoluted (and it would be difficult for me). I am a bit stuck, and I would really appreciate any help on how to streamline this process. Thank you.

+4

r dataframe percentile

Matt Oct 6 '11 at 2:25


3 answers

Here's a reproducible example using the plyr package and the cut function:

    library(plyr)
    dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
    ddply(dat, "id", transform, hists = cut(value, breaks = 5))

        id       value            hists
    1    1 -1.82080027    (-1.94,-1.41]
    2    1  0.11035796    (-0.36,0.166]
    3    1 -0.57487134   (-0.886,-0.36]
    4    1 -0.99455189  (-1.41,-0.886]
    ....
    96  10 -0.03376074   (-0.233,0.386]
    97  10 -0.71879488  (-0.853,-0.233]
    98  10 -0.17533570   (-0.233,0.386]
    99  10 -1.07668282   (-1.47,-0.853]
    100 10 -1.45170078   (-1.47,-0.853]

Pass labels = FALSE to cut if you want simple integer values returned instead of the bin intervals.
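To illustrate the labels = FALSE remark, here is a minimal sketch on a single vector. The seed and the data are made up for the example and are not from the question.

```r
# Illustrative data only; the seed and values are assumptions for this sketch.
set.seed(1)
x <- rnorm(10)

# Default: a factor whose levels are interval labels
f <- cut(x, breaks = 5)

# labels = FALSE: plain integer codes 1..5 for the same equal-width intervals
codes <- cut(x, breaks = 5, labels = FALSE)

# The integer codes are the factor's underlying level codes
stopifnot(identical(codes, as.integer(f)))
```

Note that breaks = 5 gives intervals of equal width, so the number of observations per bin can differ.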

+3

Chase Oct 6 '11 at 2:40


Here's the answer in plain old R.

    # make up some data
    df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], rep(20)))
    # and this is all there is to it
    df <- df[order(df$id, df$rt), ]
    df$bin <- rep(unlist(tapply(df$rt, df$id, quantile)), each = 4)

You will notice that the quantile call can be given any set of probabilities. Its default, probs = seq(0, 1, 0.25), returns five values per id, which is what produces five groups here. If you want deciles instead, use

 quantile(x, seq(0, 1, 0.1))

in the call above.

The answer above is a little fragile. It requires an equal number of RTs per id, and I did not tell you how to arrive at the magic number 4. But it will also run very quickly on a large dataset. If you want a more robust solution (it uses cut2 from the Hmisc package alongside base R):

    library('Hmisc')
    df <- df[order(df$id), ]
    df$bin <- unlist(lapply(unique(df$id), function(x) {
      cut2(df$rt[df$id == x], g = 5)
    }))

This is much more robust than the first solution, but it is not as fast. For small datasets you will not notice the difference.
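As an alternative that needs no extra package, here is a rough sketch that ranks the RTs within each id and maps the ranks onto five bins. This is not the method from either snippet above; it swaps in ave and rank, and the data (three ids with unequal row counts) is invented for the example.

```r
# Made-up data with unequal numbers of rows per id (an assumption for this sketch)
set.seed(42)
df <- data.frame(id = rep(c("a", "b", "c"), times = c(20, 17, 23)),
                 rt = rnorm(60))

# Rank each RT within its id, then map ranks onto 5 roughly equal-count bins.
# ties.method = "first" keeps ranks unique so tied RTs do not distort bin sizes.
df$bin <- ave(df$rt, df$id, FUN = function(x) {
  ceiling(5 * rank(x, ties.method = "first") / length(x))
})

# Counts per id and bin; sizes differ by at most one within an id
table(df$id, df$bin)
```

Bin 1 holds the smallest (fastest) RTs for each id. The rows do not need to be sorted first, because ave returns its results in the original row order.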

0

John Oct 6 '11 at 4:10


Brian Diggs · Accepted Answer · 2011-10-06T15:43:08+0000

@Chase's answer splits the range into 5 groups of equal width (equal difference between endpoints). What I think you want are pentiles (quintiles), that is, 5 groups with the same number of observations in each. For that, you need the cut2 function from Hmisc:

    library("plyr")
    library("Hmisc")
    dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
    tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))

tmp now has what you want:

    > tmp
         id       value hists
    1     1  0.19016791     3
    2     1  0.27795226     4
    3     1  0.74350982     5
    4     1  0.43459571     4
    5     1 -2.72263322     1
    ....
    95   10 -0.10111905     3
    96   10 -0.28251991     2
    97   10 -0.19308950     2
    98   10  0.32827137     4
    99   10 -0.01993215     4
    100  10 -1.04100991     1

with the same number of observations in each hists bin for each id:

    > table(tmp$id, tmp$hists)

         1 2 3 4 5
      1  2 2 2 2 2
      2  2 2 2 2 2
      3  2 2 2 2 2
      4  2 2 2 2 2
      5  2 2 2 2 2
      6  2 2 2 2 2
      7  2 2 2 2 2
      8  2 2 2 2 2
      9  2 2 2 2 2
      10 2 2 2 2 2
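The question mentions aggregating by id and bin afterwards. A minimal sketch of that step follows, assuming a mean per bin is what is wanted (the question does not say which summary to use). The name tmp2 is hypothetical, and its bins are built with base R only so that the snippet runs on its own.

```r
# Illustrative data in the shape of tmp above: id, value, and a bin column (hists).
# tmp2 is a hypothetical name; bins are built with base R so the sketch is standalone.
set.seed(1)
tmp2 <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp2$hists <- ave(tmp2$value, tmp2$id, FUN = function(x) {
  ceiling(5 * rank(x, ties.method = "first") / length(x))
})

# Mean value per id and bin: one row per (id, hists) combination
means <- aggregate(value ~ id + hists, data = tmp2, FUN = mean)
head(means)
```

Any other summary function, such as median, can be passed as FUN in the same way.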
