Like @Dominic Comtois, I would also recommend using SQL.
R can handle quite big data: there is a nice benchmark on 2 billion rows in which it beats Python. But since R works mostly in memory, you need a good machine to make that work. Still, your case never needs to load more than one 4.5 GB file at a time, so it should be perfectly doable on a personal computer; see the second approach for a quick solution without a database.
You can use R to load the data into an SQL database and then query it from there. If you do not know SQL, you may want to start with a simple database. The easiest option from R is RSQLite (unfortunately, since v1.1 it is no longer that lightweight). You do not need to install or manage any external dependency: the RSQLite package ships with an embedded database engine.
```r
library(DBI)
library(RSQLite)
library(data.table)
conn <- dbConnect(SQLite(), dbname = "mydbfile.db")
monthfiles <- c("month1", "month2") # ...
# write data
for (monthfile in monthfiles) {
  dbWriteTable(conn, "mytablename", fread(monthfile), append = TRUE)
  cat("data for", monthfile, "loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data I would recommend to do below
setDT(df)
dbDisconnect(conn)
```
That's all. You use SQL without the high overhead typically associated with databases.
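Once the data is in the database, you can also let SQLite do more of the work. The following is only a minimal sketch on a toy in-memory table: the `amount` column and the index name `idx_customerid` are made up for illustration, so adapt them to your real columns. It shows an index on the lookup column, a parameterised query, and an aggregation done inside the database.

```r
library(DBI)
library(RSQLite)

# in-memory database, used here only so the example is self-contained
conn <- dbConnect(SQLite(), dbname = ":memory:")

# toy stand-in for the monthly data ("amount" is a hypothetical column)
dbWriteTable(conn, "mytablename",
             data.frame(customerid = c(1L, 1L, 2L), amount = c(10, 20, 30)))

# an index on the lookup column avoids scanning the whole table per query
dbExecute(conn, "CREATE INDEX idx_customerid ON mytablename (customerid)")

# bind the customer id as a parameter instead of pasting it into the SQL string
one_customer <- dbGetQuery(conn,
  "SELECT * FROM mytablename WHERE customerid = ?", params = list(1L))

# aggregate inside the database and pull back only the summary
totals <- dbGetQuery(conn,
  "SELECT customerid, SUM(amount) AS total
   FROM mytablename GROUP BY customerid ORDER BY customerid")

dbDisconnect(conn)
```

With many customers, creating the index once after loading all months is likely cheaper than maintaining it during the inserts.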
If you prefer to stick with the approach from your post, I think you can significantly speed it up by running write.csv by group inside a data.table aggregation.
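```r
library(data.table)
monthfiles <- c("month1", "month2") # ...
# write data
for (monthfile in monthfiles) {
  fread(monthfile)[, {
    f <- paste0(CustomerID, ".csv")
    old <- file.exists(f)
    # write.csv() ignores append=, so use write.table() to add to existing files
    write.table(.SD, file = f, sep = ",", row.names = FALSE,
                col.names = !old, append = old)
  }, by = CustomerID]
  cat("data for", monthfile, "written to csv\n")
}
```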
This way you rely on the fast unique from data.table and do the subsetting while grouping, which is also very fast. The following is a working example of the approach.
```r
library(data.table)
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]
```
Update 2016-12-05:
Starting with data.table 1.9.8, you can replace write.csv with fwrite; see the example in this answer.
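A minimal sketch of the fwrite variant, assuming a data.table version that provides `fwrite` with an `append` argument. The two small tables stand in for the monthly files, and the `amount` column and the `per_customer` output folder are made up for illustration.

```r
library(data.table)

# hypothetical output folder inside the session's temp directory
outdir <- file.path(tempdir(), "per_customer")
unlink(outdir, recursive = TRUE)
dir.create(outdir)

# two small stand-ins for the monthly files
months <- list(
  data.table(CustomerID = c(1L, 1L, 2L), amount = c(10, 20, 30)),
  data.table(CustomerID = c(1L, 2L, 2L), amount = c(40, 50, 60))
)

for (dt in months) {
  # one file per customer; append = TRUE adds later months to the same file,
  # and the header is written only when the file does not exist yet
  dt[, fwrite(.SD, file.path(outdir, paste0(CustomerID, ".csv")), append = TRUE),
     by = CustomerID]
}

# read one customer's file back to check the result
customer1 <- fread(file.path(outdir, "1.csv"))
customer2 <- fread(file.path(outdir, "2.csv"))
```

As with write.csv by group, `.SD` does not include the grouping column, so each file holds only the remaining columns.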