Ranking multiple data frames and summing the ranks in R

I have 10 data frames with two columns each. I call the data frames a, b, c, d, e, f, g, h, i and j.

The first column in each data frame is called s, for sequences, and the second is p, for the p-values corresponding to each sequence. Column s contains the same sequences in all 10 data frames; the only difference is in the p-values. Below is a short version of data frame a, which has 600,000 rows.

    s      p
    gtcg   0.06
    gtcgg  0.05
    gggaa  0.07
    cttg   0.05

I want to rank each data frame by p-value: the smallest p should get rank 1, and equal p-values should get the same rank. Each resulting data frame should be in this format:

    s      p_rank_a
    gtcg   2
    gtcgg  1
    gggaa  3
    cttg   1

This is what I used:

    r <- rank(a$p)
    cbind(a$s, r)

but I am not very familiar with loops, and I don't know how to do this automatically for all data frames. In the end, I would like the final output to have a column s and, next to it, a column with the sum of the ranks across all data frames for each particular sequence. So basically this:

    s      ranksum_P_a-j
    gtcg   34
    gtcgg  5
    gggaa  5009093
    cttg   499

Please help and thanks!

2 answers

I would put all the data.frames in a list and then use lapply and transform as follows:

    my_l <- list(a, b, c)  # all your data.frames

    # you can use rank, but it'll give you the average in case of ties
    # lapply(my_l, function(x) transform(x, rank_p = rank(p)))

    # I prefer this method instead
    my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))

    # now bind them into a single data.frame
    my_o <- do.call(rbind, my_o)

    # now paste them
    aggregate(data = my_o, p ~ s, function(x) paste(x, collapse = ","))
    #       s     p
    # 1  cttg 1,1,1
    # 2 gggaa 3,3,3
    # 3  gtcg 2,2,2
    # 4 gtcgg 1,1,1
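The question ultimately asks for the sum of the ranks per sequence rather than a pasted string. A minimal sketch of that last step, assuming three small made-up data frames (the values, and the name c2 used to avoid masking c, are illustrative and not the asker's data):

```r
# Toy data (hypothetical): same sequences, different p-values
s  <- c("gtcg", "gtcgg", "gggaa", "cttg")
a  <- data.frame(s = s, p = c(0.06, 0.05, 0.07, 0.05))
b  <- data.frame(s = s, p = c(0.01, 0.02, 0.02, 0.03))
c2 <- data.frame(s = s, p = c(0.50, 0.40, 0.30, 0.20))

my_l <- list(a, b, c2)

# Dense rank within each data frame: smallest p gets 1, ties share a rank
my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))

# Stack the data frames and sum the ranks per sequence
my_o <- do.call(rbind, my_o)
ranksum <- aggregate(p ~ s, data = my_o, FUN = sum)
names(ranksum)[2] <- "ranksum_p"
ranksum  # e.g. gtcg: 2 + 1 + 4 = 7
```

Only the aggregating function changes compared with the pasted version: sum instead of paste.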

Edit: since you asked for a faster solution (because of the large data), I would suggest, like @Ricardo, a data.table solution:

    require(data.table)

    # bind all your data.frames together; idcol records which data.frame each row came from
    dt <- rbindlist(my_l, idcol = "id")  # my_l is your list of data.frames

    # replace the p-values with their "rank", computed within each original data.frame
    dt[, p := as.numeric(factor(p)), by = id]

    # set key
    setkey(dt, "s")

    # combine them using `,`
    dt[, list(p_ranks = paste(p, collapse = ",")), by = s]
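If the goal is the rank sum the question asks for, the same data.table approach can sum instead of paste. A rough sketch on made-up toy data (the values and the column names id, p_rank and ranksum_p are assumptions chosen for illustration); the ranks are computed per source data frame by grouping on an id column:

```r
library(data.table)

# Toy data (hypothetical): same sequences, different p-values
s <- c("gtcg", "gtcgg", "gggaa", "cttg")
my_l <- list(
  data.frame(s = s, p = c(0.06, 0.05, 0.07, 0.05)),
  data.frame(s = s, p = c(0.01, 0.02, 0.02, 0.03))
)

# idcol records which data frame each row came from
dt <- rbindlist(my_l, idcol = "id")

# Dense rank within each original data frame
dt[, p_rank := as.numeric(factor(p)), by = id]

# Sum the ranks per sequence
ranksum <- dt[, list(ranksum_p = sum(p_rank)), by = s]
ranksum
```

data.table also provides frank(p, ties.method = "dense"), which should give the same dense ranks without going through a factor.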



For a single data frame, you can do this in one line, as shown below (credit to @Arun for suggesting as.numeric(factor(p))):

    library(data.table)
    aDT <- data.table(a)[, p_rank := as.numeric(factor(p))]
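To see why as.numeric(factor(p)) is used here rather than rank(), it may help to compare how each treats ties. A small base R illustration using the four p-values from the question's sample (the variable names are just for this sketch):

```r
p <- c(0.06, 0.05, 0.07, 0.05)

r_avg   <- rank(p)                       # ties get the average rank: 3.0 1.5 4.0 1.5
r_min   <- rank(p, ties.method = "min")  # ties share the lowest rank, later ranks skip: 3 1 4 1
r_dense <- as.numeric(factor(p))         # dense ranks, no gaps: 2 1 3 1
```

Only the dense version reproduces the p_rank_a column shown in the question (2, 1, 3, 1).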

I would suggest storing all the data.frames in a single list so you can easily iterate over them. Since your data.frames are named with the first ten letters of the alphabet, they are easy to collect:

    # collect them all
    allOfThem <- lapply(letters[1:10], get, envir = .GlobalEnv)
    # keep in mind you named an object `c`

    # convert to DT and create the ranks
    allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])
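From a list like this, one way to get both the per-frame rank columns (p_rank_a, p_rank_b, ...) and their sum is to merge on s. A base R sketch, assuming a hypothetical named list called frames holding toy data; with the asker's objects, something like mget(letters[1:10], envir = .GlobalEnv) could build such a named list:

```r
# Toy data (hypothetical); b lists the sequences in a different row order
s <- c("gtcg", "gtcgg", "gggaa", "cttg")
frames <- list(
  a = data.frame(s = s,      p = c(0.06, 0.05, 0.07, 0.05)),
  b = data.frame(s = rev(s), p = c(0.03, 0.02, 0.02, 0.01))
)

# One rank column per data frame, named p_rank_<name>
ranked <- lapply(names(frames), function(nm) {
  x <- frames[[nm]]
  out <- data.frame(s = x$s, r = as.numeric(factor(x$p)))
  names(out)[2] <- paste0("p_rank_", nm)
  out
})

# Merge on s so the row order of the individual data frames does not matter
wide <- Reduce(function(x, y) merge(x, y, by = "s"), ranked)

# Sum across all rank columns
wide$ranksum <- rowSums(wide[, -1, drop = FALSE])
wide
```

Merging by s is slower than stacking and summing, but it keeps the individual ranks visible next to the total.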

On a separate note: it is a good habit to avoid naming objects "c" or after other common R functions. Otherwise you will start to run into many "inexplicable" behaviors and, after beating your head against the wall for an hour trying to debug them, realize that you have overwritten the name of a function. Not that it ever happened to me :)
