How to work with a 50 GB CSV file in R?

I'm relatively new to big data processing, and I hope to find some recommendations on how to work with a 50 GB CSV file. The current problem is as follows:

The table looks like this:

 ID,Address,City,States,...    (50 more fields describing each house)
 1,1,1st street,Chicago,IL,... # the leading 1 is an index column that write.csv added to the file

I would like to find all rows belonging to San Francisco, California. This should be an easy problem, but the CSV is too big to load into memory.

I know of two ways to do this in R, plus a third way that uses a database:

(1) Using the R ff package (read.csv.ffdf):

The file was last saved with write.csv, and it contains columns of several different types.

 all <- read.csv.ffdf(file = "<path of large file>",
                      sep = ",",
                      header = TRUE,
                      VERBOSE = TRUE,
                      first.rows = 10000,
                      next.rows = 50000)

The console gives me the following error:

 Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented 

Searching online, I found several answers that did not fit my case, and I cannot figure out how to convert the "character" columns to the "factor" type, as they suggested.

Then I tried read.table.ffdf, which was even more of a disaster. I cannot find reliable guidance on it.
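From what I gathered, the suggested workaround is to read the text columns as factors, which ff does support. Something like the sketch below is what I think they mean (the file path and the use of ffbase's subset() for ffdf objects are my guesses, not something I have verified on the full file):

 # Sketch only: read every column as a factor so ff never sees vmode 'character'.
 # colClasses is recycled across all the columns; specify it per column if needed.
 library(ff)
 library(ffbase)

 all <- read.csv.ffdf(file = "houses.csv",        # placeholder path
                      header = TRUE,
                      colClasses = "factor",
                      first.rows = 10000,
                      next.rows = 50000,
                      VERBOSE = TRUE)

 # ffbase provides subset() for ffdf objects, so the filter runs without
 # pulling the whole table into RAM
 sf <- subset(all, City == "San Francisco" & States == "CA")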

(2) Using R readLines (reading the file in chunks):

I know this is another reasonable way, but I cannot figure out an efficient way to do it.
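Something along these lines is what I have in mind, though I'm not sure it is efficient (the chunk size and file name are placeholders, and it assumes "San Francisco" only shows up in the rows I actually want):

 # Stream the file 50,000 lines at a time and keep only candidate lines
 con <- file("houses.csv", open = "r")
 header <- readLines(con, n = 1)
 matches <- character(0)
 repeat {
   chunk <- readLines(con, n = 50000)
   if (length(chunk) == 0) break
   matches <- c(matches, chunk[grepl("San Francisco", chunk, fixed = TRUE)])
 }
 close(con)

 # Parse the surviving lines and filter precisely on the City/States columns
 sf <- read.csv(text = c(header, matches), header = TRUE)
 sf <- sf[sf$City == "San Francisco" & sf$States == "CA", ]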

(3) Using SQL:

I'm not sure how to import the file into a SQL database, or how to work with it once it is there; if there is a good guide, I would like to try that route. Overall, though, I would prefer to stick with R.

Thanks in advance for any answers and help!

+6
2 answers

You can use R with SQLite behind the scenes via the sqldf package. The read.csv.sql function in sqldf loads the file into a temporary SQLite database and runs your query against it, returning only the smaller data frame you ask for.

Example from the docs:

 library(sqldf)
 iris2 <- read.csv.sql("iris.csv",
                       sql = "select * from file where Species = 'setosa'")

I used this library on VERY large CSV files with good results.
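Adapted to the question (the column names City and States come from the sample header and may need adjusting, and the file path is a placeholder), it would look something like this; only the matching rows ever come back into R:

 library(sqldf)
 sf <- read.csv.sql("houses.csv",
                    sql = "select * from file
                           where City = 'San Francisco' and States = 'CA'")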

+8

This is too long for a comment.

R, in its basic configuration, loads data into memory. Memory is cheap, but 50 GB is still not a typical configuration (and you would need more than that to load the data and then do anything with it). If you are really good at R, you may be able to find another mechanism. If you have access to a cluster, you could use some parallel version of R or Spark.

You could also load the data into a database. A database is very well suited to this problem, R connects easily to almost any database, and you may find one very useful for what you want to do.
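As a sketch of that route (a chunked import through DBI/RSQLite; the file path, table name, chunk size, and column names are placeholders, and it assumes the header line contains no quoted commas):

 library(DBI)
 library(RSQLite)

 con <- dbConnect(SQLite(), "houses.sqlite")

 input <- file("houses.csv", open = "r")
 hdr <- strsplit(readLines(input, n = 1), ",")[[1]]   # column names from the header line

 repeat {
   chunk <- tryCatch(
     read.csv(input, header = FALSE, nrows = 50000, col.names = hdr),
     error = function(e) NULL)          # read.csv errors once the connection is exhausted
   if (is.null(chunk) || nrow(chunk) == 0) break
   dbWriteTable(con, "houses", chunk, append = TRUE)
 }
 close(input)

 sf <- dbGetQuery(con, "select * from houses
                        where City = 'San Francisco' and States = 'CA'")
 dbDisconnect(con)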

Or you can simply process the text file in place. Command-line tools such as awk, grep, and perl are very well suited to this task. I would recommend that approach for a one-time effort, and a database if you want to keep the data around for analytical purposes.
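If you want to stay inside R while letting a command-line tool do the heavy filtering, a pipe() connection works; this sketch assumes a Unix-like system with grep available, and the file path and column names are placeholders:

 # grep pre-filters the 50 GB file, so only candidate lines ever reach read.csv
 hdr <- strsplit(readLines("houses.csv", n = 1), ",")[[1]]
 sf <- read.csv(pipe("grep 'San Francisco' houses.csv"),
                header = FALSE, col.names = hdr)

 # grep matches the string anywhere on the line, so tighten the filter afterwards
 sf <- sf[sf$City == "San Francisco" & sf$States == "CA", ]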

+3