Create an "index" for each group element with data.table

My data is grouped by identifiers in V6 and ordered by position (V1:V3):

    dt
          V1      V2      V3 V4 V5                 V6
     1: chr1 3054233 3054733  .  + ENSMUSG00000090025
     2: chr1 3102016 3102125  .  + ENSMUSG00000064842
     3: chr1 3205901 3207317  .  - ENSMUSG00000051951
     4: chr1 3206523 3207317  .  - ENSMUSG00000051951
     5: chr1 3213439 3215632  .  - ENSMUSG00000051951
     6: chr1 3213609 3216344  .  - ENSMUSG00000051951
     7: chr1 3214482 3216968  .  - ENSMUSG00000051951
     8: chr1 3421702 3421901  .  - ENSMUSG00000051951
     9: chr1 3466587 3466687  .  + ENSMUSG00000089699
    10: chr1 3513405 3513553  .  + ENSMUSG00000089699

What I would like to do is add a column with an index by position, i.e. for each group in V6, the first element will be "1", the second "2", and so on. I can achieve this with ddply and a custom function:

    rankExons <- function(x) {
      if (unique(x$V5) == "+") {
        x$index <- seq_len(nrow(x))
      } else {
        x$index <- rev(seq_len(nrow(x)))
      }
      x
    }
    indexed <- ddply(dt, .(V6), rankExons)
    indexed
         V1      V2      V3 V4 V5                 V6 index
    1  chr1 3205901 3207317  .  - ENSMUSG00000051951     6
    2  chr1 3206523 3207317  .  - ENSMUSG00000051951     5
    3  chr1 3213439 3215632  .  - ENSMUSG00000051951     4
    4  chr1 3213609 3216344  .  - ENSMUSG00000051951     3
    5  chr1 3214482 3216968  .  - ENSMUSG00000051951     2
    6  chr1 3421702 3421901  .  - ENSMUSG00000051951     1
    7  chr1 3102016 3102125  .  + ENSMUSG00000064842     1
    8  chr1 3466587 3466687  .  + ENSMUSG00000089699     1
    9  chr1 3513405 3513553  .  + ENSMUSG00000089699     2
    10 chr1 3054233 3054733  .  + ENSMUSG00000090025     1

Unfortunately, this is very slow on the full data set (~620 thousand rows), and running it in parallel crashes:

    library(doMC)
    registerDoMC(cores=6)
    indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE)
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Warning message:
    In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
      all scheduled cores encountered errors in user code

So I turned to data.table, but could not get it to work. Here is what I tried:

    setkey(dt, "V6")
    dt[,index:=rankExons(dt), by=V6]
    dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")]

Both attempts failed. How can I recreate my ddply call using data.table?

    dput(dt)
    structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1",
    "chr1", "chr1", "chr1", "chr1", "chr1"), V2 = c(3054233L, 3102016L,
    3205901L, 3206523L, 3213439L, 3213609L, 3214482L, 3421702L, 3466587L,
    3513405L), V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L,
    3216344L, 3216968L, 3421901L, 3466687L, 3513553L), V4 = c(".", ".",
    ".", ".", ".", ".", ".", ".", ".", "."), V5 = c("+", "+", "-", "-",
    "-", "-", "-", "-", "+", "+"), V6 = c("ENSMUSG00000090025",
    "ENSMUSG00000064842", "ENSMUSG00000051951", "ENSMUSG00000051951",
    "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951",
    "ENSMUSG00000051951", "ENSMUSG00000089699", "ENSMUSG00000089699"
    )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), class = c("data.table",
    "data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x1de6a88>)
r indexing data.table plyr
2 answers

As a bioinformatician, I come across this operation often. This is where I love data.table's ability to modify a subset of rows by reference!

I would do it like this:

    dt[V5 == "+", index := 1:.N, by=V6]
    dt[V5 == "-", index := .N:1, by=V6]

No function required. This is slightly more efficient because it avoids checking for "+" or "-" once per group. Instead, you first subset all rows with "+" in a single step, then group by V6 and modify only those rows in place.

Then you do the same for "-". Hope this helps.

Note: .N is a special variable containing the number of observations per group.
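To make the two-step update concrete, here is a minimal sketch on a toy table. The gene names and positions are made up for illustration; only the column names V2, V5 and V6 follow the question. It uses seq_len(.N) and rev(seq_len(.N)), which give the same result as 1:.N and .N:1 for non-empty groups.

```r
library(data.table)

# Hypothetical toy data: V2 = start position, V5 = strand, V6 = gene ID
dt <- data.table(
  V2 = c(100L, 200L, 300L, 400L, 500L, 600L),
  V5 = c("+", "-", "-", "-", "+", "+"),
  V6 = c("geneA", "geneB", "geneB", "geneB", "geneC", "geneC")
)

# Plus strand: subset once, then count up within each gene
dt[V5 == "+", index := seq_len(.N), by = V6]
# Minus strand: subset once, then count down within each gene
dt[V5 == "-", index := rev(seq_len(.N)), by = V6]

print(dt)
```

Both assignments update dt in place, so no copy of the table is made and nothing needs to be reassigned.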


First, I'll load your sample data into R (currently you can't use dput() with a data.table):

    df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
         V1      V2      V3 V4 V5                 V6
    1  chr1 3205901 3207317  .  - ENSMUSG00000051951
    2  chr1 3206523 3207317  .  - ENSMUSG00000051951
    3  chr1 3213439 3215632  .  - ENSMUSG00000051951
    4  chr1 3213609 3216344  .  - ENSMUSG00000051951
    5  chr1 3214482 3216968  .  - ENSMUSG00000051951
    6  chr1 3421702 3421901  .  - ENSMUSG00000051951
    7  chr1 3102016 3102125  .  + ENSMUSG00000064842
    8  chr1 3466587 3466687  .  + ENSMUSG00000089699
    9  chr1 3513405 3513553  .  + ENSMUSG00000089699
    10 chr1 3054233 3054733  .  + ENSMUSG00000090025")

You can almost solve your problem elegantly with dplyr:

    library(dplyr)
    df %>%
      group_by(V6, V5) %>%
      mutate(index = row_number(V2))

(I assume V2 is the variable you want to index by. I think it's better to be explicit than to rely on row order.)

But you need a different summary for different subsets, which is currently not easy with dplyr. One approach would be to split and then combine again:

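    # bind_rows() supersedes the deprecated rbind_list().
    # Each strand subset is grouped by gene (V6) so the index restarts per gene.
    bind_rows(
      df %>% filter(V5 == "+") %>% group_by(V6) %>% mutate(index = row_number(V2)),
      df %>% filter(V5 == "-") %>% group_by(V6) %>% mutate(index = row_number(desc(V2)))
    )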

But this will be relatively slow since you need to make two copies of the data.

Another approach would be to use if inside the mutate() call:

    df %>%
      group_by(V6, V5) %>%
      mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2)))
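As a quick sanity check of the if-inside-mutate approach, here is a small sketch on a toy data frame. The gene names and positions are invented for illustration; only the column names V2, V5 and V6 come from the question.

```r
library(dplyr)

# Hypothetical toy data: V2 = start position, V5 = strand, V6 = gene ID
toy <- data.frame(
  V2 = c(100L, 200L, 300L, 400L, 500L, 600L),
  V5 = c("+", "-", "-", "-", "+", "+"),
  V6 = c("geneA", "geneB", "geneB", "geneB", "geneC", "geneC"),
  stringsAsFactors = FALSE
)

indexed <- toy %>%
  group_by(V6, V5) %>%
  # Rank ascending by start on the plus strand, descending on the minus strand
  mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2))) %>%
  ungroup()

print(indexed)
```

Because the strand is constant within each (V6, V5) group, testing only V5[1] is enough to pick the direction for the whole group.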
