Monitoring R data load using read.table

Question

I have found many answers for other kinds of data loading, but none that show progress while R reads data with read.table(...). I have a simple command:

 data = read.table(file=filename, sep="\t", col.names=c("time","id","x","y"), colClasses=c("integer","NULL","NULL","NULL"))

This loads a large amount of data in about 30 seconds, but a progress bar would be really nice :-D

r progress-bar read.table

Hamy Jun 15 '11 at 19:39

1 answer

Ben Bolker · Answer 1 · 2011-06-15T20:49:57+0000

Experiments in progress:

Create a temporary working file:

 n <- 1e7
 dd <- data.frame(time=1:n,id=rep("a",n),x=1:n,y=1:n)
 fn <- tempfile()
 write.table(dd,file=fn,sep="\t",row.names=FALSE,col.names=FALSE)

Run 10 replications each of read.table (with and without colClasses) and scan:

Edit: fixed the scan call in response to a comment and updated the results:

 library(rbenchmark)
 (b1 <- benchmark(read.table(fn,
                             col.names=c("time","id","x","y"),
                             colClasses=c("integer",
                                          "NULL","NULL","NULL")),
                  read.table(fn,
                             col.names=c("time","id","x","y")),
                  scan(fn,
                       what=list(integer(),NULL,NULL,NULL)),
                  replications=10))

Results:

 2 read.table(fn, col.names = c("time", "id", "x", "y"))
 1 read.table(fn, col.names = c("time", "id", "x", "y"), colClasses = c("integer", "NULL", "NULL", "NULL"))
 3 scan(fn, what = list(integer(), NULL, NULL, NULL))
   replications elapsed relative user.self sys.self
 2           10 278.064 1.857016   232.786   30.722
 1           10 149.737 1.011801   141.365    2.388
 3           10 143.118 1.000000   140.617    2.105

(Warning: these values are slightly cooked/inconsistent, because I re-ran the benchmark and merged the results ... but the qualitative result should be OK.)

read.table without colClasses is the slowest (which is not surprising), but only (?) about 85% slower than scan for this example. scan is only slightly faster than read.table with colClasses specified.

Using scan or read.table, you could write a "chunked" version that uses the skip and nrows (read.table) or n (scan) options to read pieces of the file at a time and then pastes them together at the end. I don't know how much this would slow the process down, but it would allow txtProgressBar to be called between chunks ...
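
A minimal sketch of the chunked idea, using read.table with skip and nrows. The helper names count_lines and read_table_chunked and the chunk_rows parameter are hypothetical (they are not part of base R or of the answer above), and the sketch assumes a file with no header line, no blank lines, and no comment lines, so that line counts and row counts agree. Passing colClasses is advisable so that every chunk gets the same column types.

```r
# Hypothetical helper: count the lines of a file in blocks, without
# holding the whole file in memory. This costs one extra pass over the file.
count_lines <- function(file, block = 1e5) {
  con <- file(file, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    k <- length(readLines(con, n = block))
    if (k == 0) break
    total <- total + k
  }
  total
}

# Hypothetical helper: read a header-less file chunk by chunk with
# read.table(), updating a text progress bar after each chunk.
# Extra arguments (sep, col.names, colClasses, ...) go to read.table().
read_table_chunked <- function(file, chunk_rows = 1e5, ...) {
  total <- count_lines(file)
  stopifnot(total > 0)

  pb <- txtProgressBar(min = 0, max = total, style = 3)
  on.exit(close(pb))

  # Zero-based line offsets at which each chunk starts.
  starts <- seq(0, total - 1, by = chunk_rows)
  chunks <- vector("list", length(starts))

  for (i in seq_along(starts)) {
    chunks[[i]] <- read.table(file, skip = starts[i],
                              nrows = chunk_rows, ...)
    setTxtProgressBar(pb, min(starts[i] + chunk_rows, total))
  }

  # Paste the pieces together at the end.
  do.call(rbind, chunks)
}
```

With the command from the question, a call might look like this:

```r
data <- read_table_chunked(filename, chunk_rows = 1e5, sep = "\t",
                           col.names = c("time", "id", "x", "y"),
                           colClasses = c("integer", "NULL", "NULL", "NULL"))
```

Each read.table call has to skip over the earlier lines again, so small chunks will probably add noticeable overhead; the chunk size trades progress-bar granularity against speed.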
