I'm relatively new to "big data" processing and am hoping to find some recommendations on how to work with a 50 GB CSV file. My current problem is as follows:
The table looks like this:
ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1st street,Chicago,IL,...
I would like to find all rows belonging to San Francisco, California. This should be an easy problem, but the CSV is too big to load into memory at once.
I know of two ways to do this in R, plus a third that uses a database:
(1) Using R's ff/ffdf packages:
The file was last saved with write.csv, so it contains columns of all the different types.
library(ff)
all <- read.csv.ffdf(file = "<path of large file>", sep = ",", header = TRUE, VERBOSE = TRUE, first.rows = 10000, next.rows = 50000)
The console gives me the following error:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented
Searching online, I found several answers that did not fit my case, and I cannot figure out how to convert the "character" columns to the "factor" type, as they suggested.
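For reference, this is the kind of call I pieced together from those answers; the colClasses = "factor" part is just my guess at how to force the text columns to be read as factors instead of characters, so I may be using it wrong:

library(ff)

# Guess: force every column to be read as a factor so ff never sees a
# "character" column (this probably mangles the numeric columns too).
all <- read.csv.ffdf(
  file       = "<path of large file>",
  sep        = ",",
  header     = TRUE,
  VERBOSE    = TRUE,
  first.rows = 10000,
  next.rows  = 50000,
  colClasses = "factor"
)

Is this the right idea, or is there a cleaner way to spell out the column types?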
Then I tried to use read.table.ffdf, and that was even more of a disaster; I cannot find reliable guidance for it.
(2) Using R's readLines:
I know this is another workable approach, reading the file in chunks and keeping only the matching rows, but I cannot find an effective way to do it. A rough sketch of what I have in mind is below.
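This is roughly the loop I mean; the 100000-line chunk size and the crude grepl() string match are just placeholders I made up, and I suspect properly parsing the City column would be safer:

con     <- file("<path of large file>", open = "r")
header  <- readLines(con, n = 1)
matches <- character(0)
repeat {
  chunk <- readLines(con, n = 100000)   # read 100k lines at a time
  if (length(chunk) == 0) break
  # keep lines that mention San Francisco (crude; could match the wrong column)
  matches <- c(matches, chunk[grepl("San Francisco", chunk, fixed = TRUE)])
}
close(con)
result <- read.csv(text = c(header, matches), header = TRUE)

Is something like this reasonable for 50 GB, or will the repeated c() calls kill the performance?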
(3) Using SQL:
I'm not sure how to import the file into a database, or how to query it once it's there; if there is a good guide, I would like to try it. Overall, though, I would prefer to stick with R.
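From what I've read, the sqldf package might let me do the database route without leaving R: read.csv.sql() apparently loads the CSV into a temporary SQLite database and returns only the rows matching a query. This is my untested understanding of it; the column names City and States and the value 'CA' come from my sample header and may need adjusting:

library(sqldf)

# Untested: filter while importing, so only the San Francisco rows ever
# reach R's memory. The file is referred to as "file" in the SQL statement.
sf <- read.csv.sql(
  "<path of large file>",
  sql    = "select * from file where City = 'San Francisco' and States = 'CA'",
  header = TRUE,
  sep    = ","
)

Would this be a sensible way to go, or is loading everything into a proper database first the better option?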
Thanks in advance for any answers and help!