Extracting data.table segments

Question

I have a data.table, and I need to extract segments of equal length, starting at different row locations. What is the easiest way to do this? For instance:

    x <- data.table(a=sample(1:1000,100), b=sample(1:1000,100))
    r <- c(1,2,10,20,44)
    idx <- lapply(r, function(i) {j <- which(x$a == i); if (length(j)>0) {return(j)} })
    y <- lapply(idx, function(i) {if (!is.null(i)) x[i:(i+5)]})
    do.call(rbind, y)
         a   b
    1:  44  63
    2:  96 730
    3: 901 617
    4: 446 370
    5: 195 341
    6: 298 411

This, of course, is not the data.table way of doing things, so I was hoping there is a better way.

EDIT: Following the comments below, I am editing this to make clear that the values in a are not necessarily adjacent and do not correspond to row numbers.

+4

r data.table rbind do.call

Alex Sep 2 '12 at 15:06

2 answers

I think the following is the best approach: according to the 10-minute introduction to data.table, a join on a keyed column uses binary search and is therefore preferred.

    library(data.table)
    x <- data.table(a=1:100, b=1:100, key="a")
    r <- c(1,2,10,20,44)
    vec <- numeric()
    for (elem in r) {
      vec <- c(vec, seq(from=elem, by=1, length.out=6))
    }
    x[data.table(vec)]
       a b
    1: 1 1
    2: 2 2
    3: 3 3
    4: 4 4
    5: 5 5
    6: 6 6
    7: 2 2
    ...

Note that I first set column a as the key, and then build a data.table inline to join against that key. Building vec in a loop is probably not the most elegant way, but it should not be a bottleneck.
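
As a possible alternative to the loop, vec can be built in one vectorized step. This is only a sketch: it uses integer values for r (an assumption, so that the types match the integer key), the name seg_len is introduced here for illustration, and it shares the answer's assumption that consecutive values of a exist in the table.

```r
library(data.table)

x <- data.table(a = 1:100, b = 1:100, key = "a")
r <- c(1L, 2L, 10L, 20L, 44L)
seg_len <- 6L  # hypothetical name for the segment length

# Repeat each start seg_len times, then add the offsets 0..5,
# which R recycles across the repeated starts.
vec <- rep(r, each = seg_len) + seq_len(seg_len) - 1L

# Keyed join on column a, as in the loop version.
res <- x[J(vec)]
```

The result should contain the same rows, in the same order, as the loop-based version.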

+1

Christoph_J Sep 2 '12 at 15:47

Matt dowle · Accepted Answer · 2012-09-02T22:07:19+0000

Not sure whether you already know the row positions or want to search for them. Either way, this should cover both.

    require(data.table)
    set.seed(1)
    DT = data.table(a=sample(1:1000,20), b=sample(1:1000,20))
    setkey(DT,a)
    DT
    #       a   b
    #  1:  62 338
    #  2: 175 593
    #  3: 201 267
    #  4: 204 478
    #  5: 266 935
    #  6: 372 212
    #  7: 374 711
    #  8: 380 184
    #  9: 491 659
    # 10: 572 651
    # 11: 625 863
    # 12: 657 380
    # 13: 679 488
    # 14: 707 782
    # 15: 760 816
    # 16: 763 404
    # 17: 894 385
    # 18: 906 126
    # 19: 940  14
    # 20: 976 107

    r = c(201,380,760)
    starts = DT[J(r),which=TRUE]  # binary search for items
                                  # skip if the starting row numbers are known
    starts
    # [1]  3  8 15

Option 1: make a sequence of row numbers for each start, concatenate them, and do a single lookup in DT (no key or binary search is needed when selecting by row number):

    DT[unlist(lapply(starts,seq.int,length=5))]
    #       a   b
    #  1: 201 267
    #  2: 204 478
    #  3: 266 935
    #  4: 372 212
    #  5: 374 711
    #  6: 380 184
    #  7: 491 659
    #  8: 572 651
    #  9: 625 863
    # 10: 657 380
    # 11: 760 816
    # 12: 763 404
    # 13: 894 385
    # 14: 906 126
    # 15: 940  14
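
Two edge cases are not covered above: a value in r that does not occur in a, and a segment that would run past the last row. The following is a minimal sketch of one way to guard against both, using a small made-up table; the names len, rows and seg are introduced here for illustration only.

```r
library(data.table)

# Small illustrative table (not the data from the answer)
DT <- data.table(a = c(10L, 20L, 30L, 40L, 50L, 60L), b = 1:6, key = "a")
r <- c(20L, 35L, 50L)  # 35 does not occur in DT$a
len <- 3L              # hypothetical segment length

# nomatch = 0L drops values of r that have no matching row,
# instead of returning NA for them.
starts <- DT[J(r), which = TRUE, nomatch = 0L]

# Build the row numbers for every segment, then discard any
# that fall beyond the end of the table.
rows <- unlist(lapply(starts, seq.int, length.out = len))
rows <- rows[rows <= nrow(DT)]

seg <- DT[rows]
```

Here the segment starting at 50 is simply truncated at the last row; whether truncating or dropping such a segment is the right behaviour depends on the use case.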

Option 2: make a list of data.table subsets, and then rbind them together. This is less efficient than option 1, but is included for completeness:

    L = lapply(starts,function(i)DT[seq.int(i,i+4)])
    L
    # [[1]]
    #      a   b
    # 1: 201 267
    # 2: 204 478
    # 3: 266 935
    # 4: 372 212
    # 5: 374 711
    #
    # [[2]]
    #      a   b
    # 1: 380 184
    # 2: 491 659
    # 3: 572 651
    # 4: 625 863
    # 5: 657 380
    #
    # [[3]]
    #      a   b
    # 1: 760 816
    # 2: 763 404
    # 3: 894 385
    # 4: 906 126
    # 5: 940  14

    rbindlist(L)  # more efficient than do.call("rbind",L). See ?rbindlist.
    #       a   b
    #  1: 201 267
    #  2: 204 478
    #  3: 266 935
    #  4: 372 212
    #  5: 374 711
    #  6: 380 184
    #  7: 491 659
    #  8: 572 651
    #  9: 625 863
    # 10: 657 380
    # 11: 760 816
    # 12: 763 404
    # 13: 894 385
    # 14: 906 126
    # 15: 940  14
