R - Why does adding 1 column to a data.table nearly double the peak memory?

After receiving help from two kind gentlemen, I managed to switch from data.frame + plyr to data.table.

Situation and my questions

While working, I noticed that peak memory usage almost doubled, from 3.5 GB to 6.8 GB (according to the Windows Task Manager), when I added 1 new column using := to my dataset of ~200K rows x 2.5K columns.

Then I tried 200M rows x 25 columns; the increase was from 6 GB to 7.6 GB, dropping back to 7.25 GB after gc().

In particular, regarding adding new columns, Matt Dowle himself mentioned here that:

With the := operator you can:

  • Add columns by reference
  • Modify subsets of existing columns by reference, and by group by reference
  • Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once.
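
For readers unfamiliar with :=, here is a minimal toy sketch of those three kinds of by-reference operations (the table and column names are made up for illustration, not from my dataset):

library(data.table)

toy <- data.table(id = 1:6, grp = rep(c("a", "b"), 3), x = rnorm(6))

toy[, y := x * 2]                     # add a column by reference
toy[grp == "a", x := 0]               # modify a subset of an existing column by reference
toy[, grp_mean := mean(x), by = grp]  # assign by group, by reference
toy[, y := NULL]                      # delete a column by reference

None of these lines should copy toy as a whole, which is exactly why the peak I observed surprised me.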

Question 1: Why does adding one column of NAs to a DT with 2.5K columns double the peak memory, if the data.table is not copied at all?

Question 2: Why does the doubling not occur when the DT is 200M x 25? I did not capture print screens for this, but feel free to adapt the code and try it yourself.
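
For anyone who wants to reproduce the wide-vs-narrow comparison at a smaller scale first, here is a minimal sketch (the sizes are deliberately scaled down, and gc()'s "max used" column is only a rough in-R stand-in for Task Manager):

library(data.table)

peak_after_add <- function(n_rows, n_cols) {
  # Build a table of the requested shape, all numeric columns
  DT <- as.data.table(replicate(n_cols, runif(n_rows), simplify = FALSE))
  gc(reset = TRUE)        # reset the "max used" counters
  DT[, New_Col := NA]     # add one column by reference
  gc()[2, 6]              # Vcells "max used" in MB since the reset
}

peak_after_add(n_rows = 1e4, n_cols = 2500)  # wide and short
peak_after_add(n_rows = 1e6, n_cols = 25)    # narrow and long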

Print screens of memory usage from the test code

  • Clean reboot, RStudio & MS Word open - 103 MB used

  • After running the DT creation code, but before adding the column - 3.5 GB

  • After adding 1 column filled with NA, but before running gc() - 6.8 GB

  • After running gc() - 3.5 GB

Test code

To investigate, I ran the following test code, which closely mimics my dataset:

library(data.table)
set.seed(1)

# Credit: Dirk Eddelbuettel's answer in
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
RandDate <- function(N, st = "2000/01/01", et = "2014/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit = "sec"))
  ev <- runif(N, 0, dt)
  rt <- as.character(strptime(st + ev, "%Y-%m-%d"))
}

# Create sample data: 200K rows x 2,500 columns of mixed types
TotalNoCol   <- 2500
TotalCharCol <- 3
TotalDateCol <- 1
TotalIntCol  <- 600
TotalNumCol  <- TotalNoCol - TotalCharCol - TotalDateCol - TotalIntCol
nrow         <- 200000
ColNames     <- paste0("C", 1:TotalNoCol)

dt <- as.data.table(
  setNames(
    c(
      replicate(TotalCharCol, sample(state.name, nrow, replace = T), simplify = F),
      replicate(TotalDateCol, RandDate(nrow), simplify = F),
      replicate(TotalNumCol,  round(runif(nrow, 1, 30), 2), simplify = F),
      replicate(TotalIntCol,  sample(1:10, nrow, replace = T), simplify = F)
    ),
    ColNames
  )
)

gc()

# Add new column, to be run separately
dt[, New_Col := NA]   # additional col; uses excessive memory?

Research done so far

I did not find many discussions on memory usage for DTs with many columns, only this one, and even then it does not specifically concern memory.

Most discussions about large datasets + memory usage concern DTs with a very large number of rows but relatively few columns.

My system

Intel i7-4700 with 4 cores / 8 threads; 16 GB DDR3-12800 RAM; Windows 8.1 64-bit; 500 GB 7200 rpm HDD; 64-bit R; data.table ver 1.9.4

Disclaimers

Please forgive me for using a "non-R" method (i.e. Task Manager) to measure the memory used. Measuring/profiling memory within R is something I still do not understand well.
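
That said, for anyone who prefers to stay inside R, a few functions I am aware of give at least a rough picture (a sketch, not a full profiling workflow; memory.size() is Windows-only):

object.size(dt)           # approximate size of the data.table object itself
data.table::tables()      # lists all data.tables in memory, including their size in MB
memory.size(max = TRUE)   # Windows-only: maximum memory obtained from the OS, in MB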


Edit 1: After updating data.table to version 1.9.5 and restarting, the problem unfortunately persists.


1 answer

(I take no credit, as the great DT minds (Arun) worked on this and found that it is related to print.data.table; I am just closing the loop for other SO users.)

It seems that the memory spike when adding a column to a data.table with := was resolved as of R version 3.2, as indicated in: https://github.com/Rdatatable/data.table/issues/1062

[Quoting @Arun from GitHub issue 1062 ...]

Fixed in R v3.2, IIUC, with this item from NEWS:

Auto-printing no longer duplicates objects when printing is dispatched to a method.

Therefore, others with this problem should look at upgrading to R 3.2.
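
A quick sanity check before re-running the test above (a sketch; both functions simply report versions):

getRversion()                  # the auto-printing fix is in R >= 3.2.0
packageVersion("data.table")   # the question above was on 1.9.4 / 1.9.5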
