R: read.csv.sql from sqldf can read one csv successfully, but not another

I have a data set about 20 GB in size, so I cannot read it into an R data frame without running out of memory. After reading some posts here, I decided to use read.csv.sql to read the data directly into an SQLite database. The code I use is:

read.csv.sql("jobs.csv", sql = "CREATE TABLE Jobs2 AS SELECT * FROM file", dbname = "Test1.sqlite")

When I run the following:

sqldf("select * from Jobs2", dbname = "Test1.sqlite")

I get the column headers, but no values: <0 rows> (or 0-length row.names)

But when I try the same thing with a csv created from the iris dataset, everything works fine.

What am I missing here?

Thanks in advance.

1 answer

sqldf is primarily intended for processing data frames, so it creates databases and database tables transparently and deletes them once the SQL completes. Thus your first statement cannot work as you intended: sqldf deletes the database after the statement finishes.

If the database or table is created by the SQL itself, rather than by sqldf, then sqldf does not know about it and will not delete it. Here we create the database with attach and the table with create table inside the SQL, which keeps sqldf from cleaning them up. In the last line sqldf does not delete the database because it already existed before that line ran, and sqldf never deletes objects it did not create:

 library(sqldf)
 read.csv.sql("jobs.csv",
              sql = c("attach 'test1.sqlite' as new",
                      "create table new.jobs2 as select * from file"))
 sqldf("select * from jobs2", dbname = "test1.sqlite")

Another thing that can go wrong is line endings. Normally sqldf can figure them out, but if it cannot, you may need to specify the eol character. This can become necessary, for example, when reading a file created on one operating system from another operating system. See FAQ 11, "Why am I having difficulty reading a data file using SQLite in sqldf?", in the sqldf README.
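For instance, here is a minimal self-contained sketch of passing eol explicitly; the file jobs_crlf.csv and its contents are made up here purely to demonstrate reading a Windows-style (CRLF) file:

 # build a tiny csv with Windows (CRLF) line endings, as a stand-in
 # for a file produced on another operating system
 con <- file("jobs_crlf.csv", "wb")
 writeLines(c("a,b", "1,x", "2,y"), con, sep = "\r\n")
 close(con)

 library(sqldf)
 # tell sqldf the end-of-line marker explicitly
 df <- read.csv.sql("jobs_crlf.csv", sql = "select * from file", eol = "\r\n")

If sqldf guessed the line endings correctly on its own, specifying eol simply has no additional effect.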

Note: read.csv.sql is usually used to read just a portion of the data. For example, the following skips the first 100 rows and then reads columns a and b from the next 1000 rows, but the query can be arbitrarily complex since all of SQLite's SQL is available:

 read.csv.sql("jobs.csv", sql = "select a, b from file limit 1000 offset 100") 

The entire file is read into a temporary SQLite database, but only the requested portion is ever read into R, so the file itself can be larger than R can handle.

Normally, if you want persistence, you would use RSQLite directly rather than sqldf.
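As a rough sketch of that approach (table and file names are illustrative; iris stands in below for one chunk of the 20 GB jobs.csv, which in practice you would load chunk by chunk with append = TRUE rather than in one read.csv call):

 library(DBI)
 # connect; this creates Test1.sqlite on disk if it does not exist
 con <- dbConnect(RSQLite::SQLite(), "Test1.sqlite")
 # iris stands in for one chunk of jobs.csv; for a 20 GB file, loop over
 # read.csv(..., skip = , nrows = ) chunks and dbWriteTable(..., append = TRUE)
 dbWriteTable(con, "jobs2", iris, overwrite = TRUE)
 n <- dbGetQuery(con, "select count(*) as n from jobs2")$n
 dbDisconnect(con)   # the table persists in Test1.sqlite after disconnecting

Because RSQLite never deletes the database behind your back, the table is still there the next time you connect.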
