After receiving help from two kind gentlemen, I managed to switch from data frames + plyr to data.table.
Situation and my questions
While working with it, I noticed that peak memory usage almost doubled from 3.5 GB to 6.8 GB (according to the Windows Task Manager) when I added one new column using := to my dataset of ~200K rows by ~2.5K columns.
I then tried 200M rows by 25 columns; there the increase was from 6 GB to 7.6 GB, dropping back to 7.25 GB after gc().
In particular, regarding the addition of new columns, Matt Dowle himself mentioned here that:
Using the := operator you can:

Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference

None of these operations copy the (potentially large) data.table at all, not even once.
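For context, here is a minimal sketch of what those three by-reference operations look like (the table and column names below are made up for illustration, not taken from my dataset):

library(data.table)
DT <- data.table(id = 1:5, val = rnorm(5))   # small toy table
DT[, dbl := val * 2]      # add a column by reference
DT[id > 3, dbl := 0]      # modify a subset of an existing column by reference
DT[, dbl := NULL]         # delete a column by reference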
Question 1: Why does adding a single column of NA to a DT with 2.5K columns double the peak memory usage, if the data.table is not copied at all?
Question 2: Why does the doubling not occur when the DT is 200M x 25? I did not capture print screens for this, but feel free to change the test code and try it.
Print screens of memory usage from the test code

Fresh reboot, with RStudio and MS Word open - 103 MB used
After running the DT creation code, but before adding the column - 3.5 GB
After adding 1 column filled with NA, but before running gc() - 6.8 GB
After running gc() - 3.5 GB
Test code
To investigate, I ran the following test code, which closely mimics my dataset:
library(data.table)
set.seed(1)

# Credit: Dirk Eddelbuettel's answer in
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
RandDate <- function(N, st = "2000/01/01", et = "2014/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit = "sec"))
  ev <- runif(N, 0, dt)
  rt <- as.character(strptime(st + ev, "%Y-%m-%d"))
}

# Create sample data
TotalNoCol   <- 2500
TotalCharCol <- 3
TotalDateCol <- 1
TotalIntCol  <- 600
TotalNumCol  <- TotalNoCol - TotalCharCol - TotalDateCol - TotalIntCol
nrow         <- 200000
ColNames     <- paste0("C", 1:TotalNoCol)

dt <- as.data.table(setNames(c(
    replicate(TotalCharCol, sample(state.name, nrow, replace = T), simplify = F),
    replicate(TotalDateCol, RandDate(nrow), simplify = F),
    replicate(TotalNumCol,  round(runif(nrow, 1, 30), 2), simplify = F),
    replicate(TotalIntCol,  sample(1:10, nrow, replace = T), simplify = F)
  ), ColNames))

gc()

# Add new column, to be run separately
dt[, New_Col := NA]  # Additional col; uses excessive memory?
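As a side check related to Question 1, one could also verify whether the top-level dt object gets reallocated by the column addition; below is a sketch I would run right after the creation code, using base R's tracemem() and data.table's address() (New_Col2 is just a made-up name). I am not claiming this captures internal temporary copies, only whether dt itself is replaced:

tracemem(dt)                       # base R: prints a message if dt is duplicated
old_addr <- address(dt)            # data.table helper: memory address of dt
dt[, New_Col2 := NA]               # add another column by reference
identical(old_addr, address(dt))   # TRUE suggests dt was not reallocated; FALSE would itself be telling
untracemem(dt)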
Research done so far
I did not find many discussions on memory usage for a DT with many columns, only this, and even that one is not really about memory.
Most discussions of large datasets + memory usage concern DTs with a very large number of rows but relatively few columns.
My system
Intel i7-4700 with 4 cores / 8 threads; 16 GB DDR3-12800 RAM; Windows 8.1 64-bit; 500 GB 7200 rpm HDD; 64-bit R; data.table ver 1.9.4
Disclaimers
Please forgive me for using a "non-R" method (i.e. Task Manager) to measure the memory used. Measuring/profiling memory within R is something I still have not figured out.
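For what it is worth, the closest in-R approach I have come across is sketched below; this assumes Windows (memory.size() is Windows-only), and I am not sure how closely its numbers track what Task Manager reports. It would be run after the test code above:

gc(reset = TRUE)          # reset the "max used" statistics of the R heap
dt[, New_Col := NA]       # the step being profiled
gc()                      # the "max used" column now reflects the peak R heap since the reset
memory.size(max = TRUE)   # Windows-only: maximum Mb of RAM obtained from the OS by this R session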
Edit 1: After updating to data.table version 1.9.5 and rerunning, the problem unfortunately persists.
