Manipulating Large Files in R

I have 15 data files, each about 4.5 GB. Each file contains monthly data for approximately 17,000 customers, so together the files cover those 17,000 customers over 15 months. I want to reformat the data so that, instead of 15 files each representing a month, I have 17,000 files, one per customer, each holding all of that customer's data. I wrote a script for this:

    library(data.table)

    # the variable 'files' is a vector of locations of the 15 month files
    exists = NULL  # keeps track of customers who already have a file created for them
    for (w in 1:15) {                                      # for each of the 15 month files
      month = fread(files[w], select = c(2, 3, 6, 16))     # read in the columns I want
      custlist = unique(month$CustomerID)                  # all customers in this month file
      for (i in 1:length(custlist)) {                      # for each customer in this month file
        curcust = custlist[i]                              # the current customer
        newchunk = subset(month, CustomerID == curcust)    # all the data for this customer
        filename = sprintf("cust%s", curcust)              # the filename for this customer
        if (curcust %in% exists) {
          # a file already exists for this customer: read it, add to it, and write it back
          custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv", filename)))
          custfile$V1 = NULL                               # drop the row-name column written by write.csv
          custfile = rbind(custfile, newchunk)             # combine the existing data with the new data
          write.csv(custfile, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
        } else {
          # no file yet for this customer: write newchunk to a new csv
          write.csv(newchunk, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
          exists = rbind(exists, curcust, deparse.level = 0)  # record that this customer now has a file
        }
      }
    }

The script works (at least I'm fairly sure it does). The problem is that it is incredibly slow. At the rate it is going, it will take a week or more to finish, and I do not have that kind of time. Does anyone know a better, faster way to do this in R? Should I try doing this in something like SQL? I have never used SQL before; could anyone show me how that would be done? Any input is welcome.

+8
sql r data.table
2 answers

Like @Dominic Comtois, I would also recommend using SQL.
R can handle pretty big data (there is a well-known benchmark on 2 billion rows where it beats Python), but since R works mostly in memory, you need a good machine for that to work. Your case, however, never needs more than one 4.5 GB file loaded at a time, so it should be perfectly feasible on a personal computer; see the second approach below for a quick solution without a database.
You can use R to load the data into an SQL database and then query it from the database. If you do not know SQL, you may want to use some simple database. The easiest one to use from R is RSQLite (although since v1.1 it is not so "lite" any more). You do not need to install or manage any external dependency: the RSQLite package ships with the database engine embedded.

    library(RSQLite)
    library(data.table)

    conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")
    monthfiles <- c("month1", "month2")  # ... the 15 month files

    # write data
    for (monthfile in monthfiles) {
      dbWriteTable(conn, "mytablename", fread(monthfile), append = TRUE)
      cat("data for", monthfile, "loaded to db\n")
    }

    # query data
    df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
    # when working with bigger sets of data I would recommend setDT for further processing
    setDT(df)

    dbDisconnect(conn)

That's all. You use SQL without the high overhead typically associated with databases.
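
If the end goal is still one file per customer, you can also let the database do the splitting. Here is a minimal sketch under the same assumptions as above (a table named mytablename with a CustomerID column, numeric IDs, and the C:/custFiles output folder from the question):

    library(RSQLite)

    conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")

    # fetch the distinct customer ids, then export each customer's rows to its own csv
    ids <- dbGetQuery(conn, "select distinct CustomerID from mytablename")$CustomerID
    for (id in ids) {
      res <- dbGetQuery(conn, sprintf("select * from mytablename where CustomerID = %s", id))
      write.csv(res, file = sprintf("C:/custFiles/cust%s.csv", id), row.names = FALSE)
    }

    dbDisconnect(conn)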

If you prefer to stay with the approach from your post, I think you can speed it up considerably by doing the write.csv calls per group inside data.table's grouping.

    library(data.table)

    monthfiles <- c("month1", "month2")  # ... the 15 month files

    # write data, one csv per customer per month file
    # (note: write.csv ignores append = TRUE with a warning, so later months overwrite
    #  earlier ones; the fwrite update below handles appending properly)
    for (monthfile in monthfiles) {
      fread(monthfile)[, write.csv(.SD, file = paste0(CustomerID, ".csv"), append = TRUE),
                       by = CustomerID]
      cat("data for", monthfile, "written to csv\n")
    }

This way you use data.table's fast unique and do the subsetting during grouping, which is also very fast. Below is a working example of the approach.

    library(data.table)

    data.table(a = 1:4, b = 5:6)[, write.csv(.SD, file = paste0(b, ".csv")), by = b]

Update 2016-12-05:
Since data.table 1.9.8 you can replace write.csv with fwrite, as shown for example in this answer.
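
For reference, a minimal sketch of the fwrite variant under the same assumptions as above (hypothetical month1/month2 file names and a CustomerID column). When append = TRUE, fwrite skips the header row, so the header is written only when the file does not exist yet:

    library(data.table)

    monthfiles <- c("month1", "month2")  # ... the 15 month files

    for (monthfile in monthfiles) {
      # append to an existing per-customer csv (no header), or create it with a header
      fread(monthfile)[, fwrite(.SD, file = paste0(CustomerID, ".csv"),
                                append = file.exists(paste0(CustomerID, ".csv"))),
                       by = CustomerID]
      cat("data for", monthfile, "written to csv\n")
    }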

+16

I think you already have your answer. But to reinforce it, see the official manual

R Data Import/Export

which says:

In general, statistical systems like R are not particularly well suited to manipulations of large-scale data. Some other systems are better than R at this, and part of the thrust of this manual is that rather than duplicating functionality in R we can make another system do the work! (For example, Therneau and Grambsch (2000) commented that they preferred to do data manipulation in SAS and then use the survival package in S for the analysis.) Database manipulation systems are often very suitable for manipulating and extracting data: several packages to interact with DBMSs are discussed here.

So storing massive data is not R's main strength, but it provides interfaces to several tools that specialize in it. In my own work, a lightweight SQLite solution is enough, though that is partly a matter of preference. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.

You should find the SQLite documentation pretty smooth to follow. If you have enough programming experience, working through a tutorial or two should get you up to speed on the SQL front quickly. I don't see anything overly complicated going on in your code, so the most common and basic statements, such as CREATE TABLE and SELECT ... WHERE, will most likely meet all your needs.
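
To make that concrete, here is a minimal sketch of those two statements through RSQLite; the database name, table name, and columns below are only illustrative, not taken from the question:

    library(RSQLite)

    conn <- dbConnect(dbDriver("SQLite"), dbname = "custdata.db")

    # CREATE TABLE: declare the (illustrative) schema once
    dbExecute(conn, "CREATE TABLE IF NOT EXISTS monthly (
                       CustomerID INTEGER,
                       SaleDate   TEXT,
                       Amount     REAL)")

    # SELECT ... WHERE: pull back a single customer's rows
    one_cust <- dbGetQuery(conn, "SELECT * FROM monthly WHERE CustomerID = 42")

    dbDisconnect(conn)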

Edit

Another advantage of using a DBMS that I did not mention is that you can have views, which give you easy access to other organizations of the same data, so to speak. By creating views, you can go back to a "by month" view of the data without rewriting any table or duplicating any data, as in the sketch below.
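
For instance, a minimal sketch of such a view, assuming the combined table is the hypothetical monthly table above with a SaleDate column stored as ISO dates:

    library(RSQLite)

    conn <- dbConnect(dbDriver("SQLite"), dbname = "custdata.db")

    # a view exposing only January 2015; no rows are copied or duplicated
    dbExecute(conn, "CREATE VIEW IF NOT EXISTS jan2015 AS
                     SELECT * FROM monthly
                     WHERE SaleDate BETWEEN '2015-01-01' AND '2015-01-31'")

    jan <- dbGetQuery(conn, "SELECT * FROM jan2015 WHERE CustomerID = 42")

    dbDisconnect(conn)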

+5
