Create an “index” for each group element with data.table

Question

My data is grouped by identifiers in V6 and ordered by position (V1: V3):

    dt
          V1      V2      V3 V4 V5                 V6
     1: chr1 3054233 3054733  .  + ENSMUSG00000090025
     2: chr1 3102016 3102125  .  + ENSMUSG00000064842
     3: chr1 3205901 3207317  .  - ENSMUSG00000051951
     4: chr1 3206523 3207317  .  - ENSMUSG00000051951
     5: chr1 3213439 3215632  .  - ENSMUSG00000051951
     6: chr1 3213609 3216344  .  - ENSMUSG00000051951
     7: chr1 3214482 3216968  .  - ENSMUSG00000051951
     8: chr1 3421702 3421901  .  - ENSMUSG00000051951
     9: chr1 3466587 3466687  .  + ENSMUSG00000089699
    10: chr1 3513405 3513553  .  + ENSMUSG00000089699

What I would like to do is add a column with an index by position, i.e. for each group in V6, the first element will be “1”, the second “2” and so on. For groups on the “-” strand (V5), the numbering runs in reverse. I can achieve this with ddply and a custom function:

    library(plyr)

    # Number the rows of one group: ascending for "+", descending for "-"
    rankExons <- function(x) {
      if (unique(x$V5) == "+") {
        x$index <- seq_len(nrow(x))
      } else {
        x$index <- rev(seq_len(nrow(x)))
      }
      x
    }

    indexed <- ddply(dt, .(V6), rankExons)
    indexed
         V1      V2      V3 V4 V5                 V6 index
    1  chr1 3205901 3207317  .  - ENSMUSG00000051951     6
    2  chr1 3206523 3207317  .  - ENSMUSG00000051951     5
    3  chr1 3213439 3215632  .  - ENSMUSG00000051951     4
    4  chr1 3213609 3216344  .  - ENSMUSG00000051951     3
    5  chr1 3214482 3216968  .  - ENSMUSG00000051951     2
    6  chr1 3421702 3421901  .  - ENSMUSG00000051951     1
    7  chr1 3102016 3102125  .  + ENSMUSG00000064842     1
    8  chr1 3466587 3466687  .  + ENSMUSG00000089699     1
    9  chr1 3513405 3513553  .  + ENSMUSG00000089699     2
    10 chr1 3054233 3054733  .  + ENSMUSG00000090025     1

Unfortunately, it is very slow on a full data set (~ 620 thousand rows), and in parallel operation it crashes and burns:

    library(doMC)
    registerDoMC(cores=6)
    indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE)
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Error: serialization is too large to store in a raw vector
    Warning message:
    In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
      all scheduled cores encountered errors in user code

So I turned to data.table, but could not get it to work. Here is what I tried:

    setkey(dt, "V6")
    dt[,index:=rankExons(dt), by=V6]
    dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")]

And both failed. How can I recreate my ddply using data.table?

    dput(dt)
    structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1",
    "chr1", "chr1", "chr1", "chr1", "chr1"), V2 = c(3054233L, 3102016L,
    3205901L, 3206523L, 3213439L, 3213609L, 3214482L, 3421702L, 3466587L,
    3513405L), V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L,
    3216344L, 3216968L, 3421901L, 3466687L, 3513553L), V4 = c(".", ".",
    ".", ".", ".", ".", ".", ".", ".", "."), V5 = c("+", "+", "-", "-",
    "-", "-", "-", "-", "+", "+"), V6 = c("ENSMUSG00000090025",
    "ENSMUSG00000064842", "ENSMUSG00000051951", "ENSMUSG00000051951",
    "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951",
    "ENSMUSG00000051951", "ENSMUSG00000089699", "ENSMUSG00000089699"
    )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), class = c("data.table",
    "data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x1de6a88>)

r indexing data.table plyr

fridaymeetssunday Feb 09 '14 at 11:22

2 answers

First I will load your sample data into R (the dput() output of a data.table currently cannot be read back in directly):

    df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
         V1      V2      V3 V4 V5                 V6
    1  chr1 3205901 3207317  .  - ENSMUSG00000051951
    2  chr1 3206523 3207317  .  - ENSMUSG00000051951
    3  chr1 3213439 3215632  .  - ENSMUSG00000051951
    4  chr1 3213609 3216344  .  - ENSMUSG00000051951
    5  chr1 3214482 3216968  .  - ENSMUSG00000051951
    6  chr1 3421702 3421901  .  - ENSMUSG00000051951
    7  chr1 3102016 3102125  .  + ENSMUSG00000064842
    8  chr1 3466587 3466687  .  + ENSMUSG00000089699
    9  chr1 3513405 3513553  .  + ENSMUSG00000089699
    10 chr1 3054233 3054733  .  + ENSMUSG00000090025")

You can almost solve your problem elegantly with dplyr:

    library(dplyr)

    df %>%
      group_by(V6, V5) %>%
      mutate(index = row_number(V2))

(I assume V2 is the variable you want to index by. I think it's better to be explicit, rather than relying on the order of the rows.)

But you need a different summary for different subsets, which is currently not easy with dplyr. One approach would be to split and then combine again:

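    # Group by gene within each strand so the ranks restart for every V6
    # (newer dplyr releases replace rbind_list() with bind_rows())
    rbind_list(
      df %>% filter(V5 == "+") %>% group_by(V6) %>% mutate(index = row_number(V2)),
      df %>% filter(V5 == "-") %>% group_by(V6) %>% mutate(index = row_number(desc(V2)))
    )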

But this will be relatively slow since you need to make two copies of the data.

Another approach would be to use if inside the summary:

    df %>%
      group_by(V6, V5) %>%
      mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2)))

hadley Feb 09 '14 at 18:12

Arun · Accepted Answer · 2014-02-09T11:33:18+0000

As a bioinformatician myself, I come across this operation often. This is where I love data.table's ability to modify a subset of rows by reference!

I would do it like this:

    dt[V5 == "+", index := 1:.N, by=V6]
    dt[V5 == "-", index := .N:1, by=V6]

No functions required. This is a bit more advantageous because it avoids checking == "+" or "-" once for every group! Instead, you first subset all the rows with + once, then group by V6 and modify just those rows in place!

Similarly, you do it again for "-" . Hope this helps.

Note: .N is a special variable containing the number of observations per group.
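To make the two-step update concrete, here is a minimal, self-contained sketch on the ten sample rows from the question. It assumes the rows within each gene are already sorted by position and that every gene sits on a single strand. The index2 column is a hypothetical addition for comparison only, not part of the answer above: it does the same job in one pass by checking the strand once per group, which is exactly the per-group check the two-step version avoids.

```r
library(data.table)

# Sample rows from the question (only the columns the indexing needs).
dt <- data.table(
  V2 = c(3054233L, 3102016L, 3205901L, 3206523L, 3213439L,
         3213609L, 3214482L, 3421702L, 3466587L, 3513405L),
  V5 = c("+", "+", "-", "-", "-", "-", "-", "-", "+", "+"),
  V6 = c("ENSMUSG00000090025", "ENSMUSG00000064842",
         rep("ENSMUSG00000051951", 6),
         rep("ENSMUSG00000089699", 2))
)

# Two-step version: subset by strand once, then number within each gene.
# seq_len(.N) is used instead of 1:.N purely as a defensive habit.
dt[V5 == "+", index := seq_len(.N), by = V6]       # count upwards
dt[V5 == "-", index := rev(seq_len(.N)), by = V6]  # count downwards

# One-pass comparison (index2 is a hypothetical column name):
# the strand is inspected once per group instead of once per subset.
dt[, index2 := if (V5[1L] == "+") seq_len(.N) else rev(seq_len(.N)), by = V6]

print(dt)
```

On this sample both columns agree, and the six rows of ENSMUSG00000051951 are numbered 6 down to 1, matching the ddply result in the question. On a large table the two-step form should do less work per group, though the difference is worth measuring on your own data.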
