Merging not-that-big data.tables immediately leads to death of R

I have 32 GB of RAM on this machine, but I can crash R faster than anyone ;)

Example

The goal here is to rbind() two data.tables using functions that take advantage of data.table's efficiency.

input:

    rm(list=ls())
    gc()

output:

              used (Mb) gc trigger   (Mb)  max used   (Mb)
    Ncells 1604987 85.8    2403845  128.4   2251281  120.3
    Vcells 3019405 23.1  537019062 4097.2 468553954 3574.8

input:

    tmp.table <- data.table(X1=sample(1:7, 4096000, replace=TRUE),
                            X2=as.factor(sample(1:2, 4096000, replace=TRUE)),
                            X3=sample(1:1000, 4096000, replace=TRUE),
                            X4=sample(1:256, 4096000, replace=TRUE),
                            X5=sample(1:16, 4096000, replace=TRUE),
                            X6=rnorm(4096000))
    setkey(tmp.table, X1, X2, X3, X4, X5, X6)
    join.table <- data.table(X1=integer(), X2=factor(), X3=integer(),
                             X4=integer(), X5=integer(), X6=numeric())
    setkey(join.table, X1, X2, X3, X4, X5, X6)
    tables()

output:

         NAME            NROW  MB COLS              KEY
    [1,] join.table         0   1 X1,X2,X3,X4,X5,X6 X1,X2,X3,X4,X5,X6
    [2,] tmp.table  4,096,000 110 X1,X2,X3,X4,X5,X6 X1,X2,X3,X4,X5,X6
    Total: 111MB

input:

    join.table <- merge(join.table, tmp.table, all.y=TRUE)

output:

Ha! Nope. RStudio restarts the session.

Question

What's going on here? Explicitly defining the factor levels in join.table had no effect. Using rbind() instead of merge() didn't help either - exactly the same behavior. I have done considerably more complex and cumbersome things with this data without any problems.
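Not from the original post, but one way to sidestep the empty-table case is to skip the merge/rbind entirely when the accumulator has no rows yet. A minimal sketch (the table names mirror the question; the small columns are hypothetical stand-ins for the real data):

```r
library(data.table)

# Hypothetical workaround: never rbind/merge against a zero-row keyed
# data.table with a factor column; copy the first batch instead.
tmp.table  <- data.table(X1 = 1:3, X2 = factor(c("a", "b", "a")))
join.table <- data.table(X1 = integer(), X2 = factor())

if (nrow(join.table) == 0L) {
  join.table <- copy(tmp.table)                        # first batch: plain copy
} else {
  join.table <- rbindlist(list(join.table, tmp.table)) # later batches: append
}
print(nrow(join.table))  # 3
```

This trades a one-line guard for never exercising the empty-table code path at all, which is often acceptable while waiting for an upstream fix.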

version information

    $platform
    [1] "x86_64-pc-linux-gnu"
    $arch
    [1] "x86_64"
    $os
    [1] "linux-gnu"
    $system
    [1] "x86_64, linux-gnu"
    $version.string
    [1] "R version 3.0.2 (2013-09-25)"
    $nickname
    [1] "Frisbee Sailing"

    > rstudio::versionInfo()
    $version
    [1] '99.9.9'
    $mode
    [1] "server"

data.table version 1.8.11.

1 answer

Update: this has been fixed in commit 1123 of v1.8.11. From NEWS:

o rbindlist of data.tables where at least one has a factor column and at least one is an empty data.table led to a segfault (or, on linux/mac, a reported hash table error). Now fixed, #5355. Thanks to Trevor Alexander for reporting on SO ("Merging not-that-big data.tables immediately leads to death of R") and to mnel for filing the bug report.


This can be reproduced with a one-row data.table with a factor column and a zero-row data.table with a factor column.

    library(data.table)
    A <- data.table(x=factor(1), key='x')
    B <- data.table(x=factor(), key='x')
    merge(B, A, all.y=TRUE)
    # RStudio -> R encountered fatal error
    # R GUI   -> R for windoze GUI has stopped working

Using debugonce(data.table:::merge.data.table), this can be traced to the rbind(dt, yy) call, which is equivalent to

 rbind(B,A) 

which, if you run it, will give the same error.
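Consistent with the NEWS entry above, the factor column is essential to the crash; with a plain integer column the same rbind pattern behaves normally. A sketch (not from the answer, and assuming a data.table version where the bug is fixed or avoided):

```r
library(data.table)

# Same shape as the repro, but with an integer column instead of a factor:
A <- data.table(x = 1L, key = "x")
B <- data.table(x = integer(), key = "x")  # zero-row, non-factor

res <- rbind(B, A)  # no segfault: a one-row data.table
print(nrow(res))
```

This isolates the trigger to the (empty table, factor column) combination rather than to rbind-ing an empty data.table in general.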

This was reported to the package authors as issue #5355.

