Row operations in data.table using `by = .I`

There is a good explanation on SO of row operations in data.table.

One option that came to my mind is to use a unique id for each row, and then apply the function using the by argument. Like this:

    library(data.table)

    dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

    # create a column with row positions
    dt[, rowpos := .I]

    # calculate standard deviation by row
    dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = rowpos ]

Questions:

  • Is there a good reason not to use this approach? Are there other, more efficient alternatives?

  • Why doesn't using by = .I work the same way?

    dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = .I ]

r data.table
1 answer

1) One reason not to use it, at least for the rowSums example, is performance, plus the creation of an unnecessary column. Compare with f2 below, which is roughly 3.5 times faster in this benchmark and does not need a rowpos column:

    dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

    f1 <- function(dt){
      dt[, rowpos := .I]
      dt[ , sdd := rowSums(.SD[, 2:4, with=FALSE]), by = rowpos ]
    }
    f2 <- function(dt){dt[, sdd := rowSums(dt[, 2:4, with=FALSE])]}

    library(microbenchmark)
    microbenchmark(f1(dt), f2(dt))
    # Unit: milliseconds
    #    expr      min       lq     mean   median       uq      max neval cld
    #  f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
    #  f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a
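As a quick check that the grouped and the vectorised versions compute the same thing, here is a minimal sketch that runs both on copies of the same toy table and compares the results. The names `by_row` and `vectorised` are illustrative and not part of the original answer.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# Grouped version: one group per row via a helper column
by_row <- copy(dt)
by_row[, rowpos := .I]
by_row[, sdd := rowSums(.SD[, 2:4, with = FALSE]), by = rowpos]

# Vectorised version: a single call to rowSums, no helper column
vectorised <- copy(dt)
vectorised[, sdd := rowSums(vectorised[, 2:4, with = FALSE])]

# Compare the row sums produced by the two approaches
all.equal(by_row$sdd, vectorised$sdd)
```

`copy()` is used so that each version modifies its own table, since `:=` updates by reference.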

2) On the second question: although dt[, sdd := sum(.SD[, 2:4, with=FALSE]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4, with=FALSE]), by = 1:NROW(dt)] works fine. Since ?data.table says that ".I is an integer vector equal to seq_len(nrow(x))", one might expect the two to be equivalent. The difference is that .I is meant for use in j, not in by: its value is produced by the by grouping, rather than being evaluated in advance.
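The distinction can be seen on a small table. This is only a sketch, assuming the toy table from the question: in j, `.I` yields row numbers, while an explicit sequence in by puts every row in its own group. `seq_len(nrow(dt))` is used here in place of `1:NROW(dt)`, and the column name `total` and the use of `.SDcols` are choices made for this sketch.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# In j, .I holds the row numbers of dt
dt[, .I]

# An explicit sequence in by creates one group per row,
# so sum() is applied to each row's V1, V2 and V3 separately
dt[, total := sum(.SD), by = seq_len(nrow(dt)), .SDcols = c("V1", "V2", "V3")]
dt$total
```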

One might also expect (see @eddi's comment on the question above) that by = .I would simply throw an error. It does not, because loading the data.table package makes an object named .I, exported from the data.table namespace, visible from the global environment, and its value is NULL. You can verify this by typing .I at the command line. (The same applies to .SD, .EACHI, .N, .GRP and .BY.)

    .I
    # Error: object '.I' not found
    library(data.table)
    .I
    # NULL
    data.table::.I
    # NULL

As a result, by = .I behaves the same as by = NULL.

3) We already saw in part 1 that for rowSums, which already loops efficiently over rows, there is a much faster way than creating a rowpos column. But what about looping when there is no fast row-wise function?

Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here: the loop is faster than either of the by = approaches:

    f.rowpos <- function(){
      dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1 = 1:5, V2 = 3:7, V3 = 5:1)
      dt[, rowpos := .I]
      dt[ , sdd := sum(.SD[, 2:4, with=FALSE]), by = rowpos ][]
    }

    f.nrow <- function(){
      dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1 = 1:5, V2 = 3:7, V3 = 5:1)
      dt[, sdd := sum(.SD[, 2:4, with=FALSE]), by = 1:NROW(dt) ][]
    }

    f.forset <- function(){
      dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1 = 1:5, V2 = 3:7, V3 = 5:1)
      dt[, sdd := 0L]
      for (i in 1L:NROW(dt)) {
        set(dt, i, 5L, sum(dt[i, 2:4]))
      }
      dt
    }

    microbenchmark(f.rowpos(), f.nrow(), f.forset(), times = 5)
    # Unit: seconds
    #        expr      min       lq     mean   median       uq      max neval cld
    #  f.rowpos() 4.465371 4.503614 4.510916 4.505922 4.521629 4.558042     5   b
    #    f.nrow() 4.499120 4.499920 4.541131 4.558701 4.571267 4.576647     5   b
    #  f.forset() 2.540556 2.603505 2.654036 2.606108 2.750719 2.769292     5  a
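For readers who want to see the loop on its own, here is a minimal sketch of the set() pattern on the five-row toy table. It departs from f.forset() in two small ways that are choices made for this sketch, not taken from the answer: the target column is referred to by name instead of position 5L, and the row values are gathered with `unlist()` from columns listed in the illustrative vector `cols`.

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1, 1, 2, 2, 3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
cols <- c("V1", "V2", "V3")   # columns to combine per row (illustrative)

dt[, sdd := 0L]               # pre-allocate the integer result column
for (i in seq_len(nrow(dt))) {
  # sum the i-th row's values and write the result in place, without by =
  row_total <- sum(unlist(dt[i, cols, with = FALSE]))
  set(dt, i = i, j = "sdd", value = row_total)
}
dt$sdd
```

Referring to the column by name keeps the loop working if columns are later added or reordered.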

So, in conclusion, even when there is no optimized function such as rowSums that already operates row-wise, there are alternatives to the rowpos column that are faster and do not require creating an extra column.
