Ongoing Experiments:
Create a temporary working file:
    n <- 1e7
    dd <- data.frame(time = 1:n, id = rep("a", n), x = 1:n, y = 1:n)
    fn <- tempfile()
    write.table(dd, file = fn, sep = "\t", row.names = FALSE, col.names = FALSE)
Run 10 repetitions with read.table (with and without colClasses) and scan:

edit: fixed the scan call in response to a comment; updated results:
    library(rbenchmark)
    (b1 <- benchmark(
        read.table(fn, col.names = c("time", "id", "x", "y"),
                   colClasses = c("integer", "NULL", "NULL", "NULL")),
        read.table(fn, col.names = c("time", "id", "x", "y")),
        scan(fn, what = list(integer(), NULL, NULL, NULL)),
        replications = 10))
Results:
                                                                                                           test
    2                                                     read.table(fn, col.names = c("time", "id", "x", "y"))
    1 read.table(fn, col.names = c("time", "id", "x", "y"), colClasses = c("integer", "NULL", "NULL", "NULL"))
    3                                                         scan(fn, what = list(integer(), NULL, NULL, NULL))
      replications elapsed relative user.self sys.self
    2           10 278.064 1.857016   232.786   30.722
    1           10 149.737 1.011801   141.365    2.388
    3           10 143.118 1.000000   140.617    2.105
(Warning: these values are slightly massaged/inconsistent, because I re-ran the control test and combined the results ... but the qualitative result should be OK.)
read.table without colClasses is the slowest (which is not surprising), but only (?) about 85% slower than scan for this example. scan is only slightly faster than read.table with colClasses specified.
Using scan or read.table, you could write a "chunked" version that uses the skip and nrows (read.table) or n (scan) arguments to read pieces of the file at a time, then paste them together at the end. I don't know how much this would slow the process down, but it would allow txtProgressBar to be called between pieces ...
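A minimal sketch of that chunked approach, using read.table's skip/nrows arguments (the helper name read_chunked and the chunk_size default are my own inventions, not base-R functions; the file is assumed to have no header):

```r
## Read a delimited file in pieces, updating a txtProgressBar between chunks.
read_chunked <- function(file, total_rows, chunk_size = 1e6, ...) {
    n_chunks <- ceiling(total_rows / chunk_size)
    pb <- txtProgressBar(min = 0, max = n_chunks, style = 3)
    pieces <- vector("list", n_chunks)
    for (i in seq_len(n_chunks)) {
        skip  <- (i - 1) * chunk_size                 # rows already read
        nrows <- min(chunk_size, total_rows - skip)   # rows left in this chunk
        pieces[[i]] <- read.table(file, skip = skip, nrows = nrows,
                                  header = FALSE, ...)
        setTxtProgressBar(pb, i)
    }
    close(pb)
    do.call(rbind, pieces)  # paste the chunks together at the end
}
```

Note that you would want to pass colClasses through ... here: if each chunk guesses its own column types, two chunks could disagree and the rbind at the end would give inconsistent columns.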