sqldf: filtering data by date range

I am reading from a huge text file with dates in the format '%d/%m/%Y'. I want to use read.csv.sql from sqldf to read and filter the data by date at the same time. This is to save memory and run time by skipping the many dates that do not interest me. I know how to do this with dplyr and lubridate, but I want to try sqldf for the reason above. Although I am fairly familiar with SQL syntax, it still trips me up most of the time, and sqldf is no exception.

Running a command such as the following returned a data.frame with 0 rows:

    first_date <- "2001-11-1"
    second_date <- "2003-11-1"
    query <- "select * from file WHERE strftime('%d/%m/%Y', Date, 'unixepoch', 'localtime') between '$first_date' AND '$second_date'"
    df <- read.csv.sql(data_file, sql = query, stringsAsFactors = FALSE, sep = ";", header = TRUE)

So, as a small reproducible example, I tried using the sqldf function as follows:

    first_date <- "2001-11-1"
    second_date <- "2003-11-1"
    df2 <- data.frame(Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))
    sqldf("SELECT * FROM df2 WHERE strftime('%d/%m/%Y', Date, 'unixepoch') BETWEEN '$first-date' AND '$second_date' ")
    # Expect:
    #        Date
    # 1 1-11-2001
    # 2 1-12-2002
    # 3 1-11-2003
1 answer

strftime. strftime with percent codes is used to convert an object that SQLite already regards as a datetime into something else, but you want the reverse, so the approach in the question won't work. For example, here we convert the current time to a dd-mm-yyyy string:

    library(sqldf)
    sqldf("select strftime('%d-%m-%Y', 'now') now")
    ##          now
    ## 1 07-09-2014
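
To see why the query in the question matches nothing, it may help to feed strftime a string in the question's d/m/yyyy format directly. The sketch below assumes the sqldf package with its default SQLite backend; the column aliases `parsed` and `iso` are made up for illustration. A string that SQLite does not recognize as a time value yields NULL (shown as NA in R), and a NULL never satisfies BETWEEN, whereas an ISO yyyy-mm-dd string is understood.

```r
library(sqldf)

# '1/11/2001' is not a time value SQLite understands, so strftime gives NULL.
# '2001-11-01' is an ISO date, so strftime can reformat it.
res <- sqldf("select strftime('%d/%m/%Y', '1/11/2001')  as parsed,
                     strftime('%d/%m/%Y', '2001-11-01') as iso")
print(res)
```

Note also that plain sqldf passes '$first_date' through as literal text; nothing substitutes the R variable unless fn$ is prefixed to the call.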

Discussion. Since SQLite does not have date types, this problem is a bit cumbersome to solve, especially with the non-standard 1- or 2-digit day and month, but if you really want to use SQLite we can do it by tediously parsing the date strings. Using fn$ from the gsubfn package for string interpolation makes this a little easier.
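
As a minimal sketch of the interpolation mentioned above: prefixing a function with fn$ replaces each backquoted expression inside its character arguments with the result of evaluating that expression in R. The helper `iso` and the column name `d` are hypothetical names used only for this illustration, and fn$identity is used so that the expanded string can be inspected without running any SQL.

```r
library(gsubfn)

first_date <- "2001-11-1"

# Hypothetical helper: normalize a date string to zero-padded yyyy-mm-dd.
iso <- function(x) format(as.Date(x))

# The backquoted R expression is evaluated and spliced into the string.
template <- "select * from df2 where d >= '`iso(first_date)`'"
expanded <- fn$identity(template)
cat(expanded, "\n")
```

Backquotes are used here rather than the $name form because the variable name contains an underscore.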

Code. Below, zero2d outputs SQL code that prepends a zero character to its input if it is a single digit. rmSlash outputs SQL code that removes any slashes from its argument. Year, Month and Day each output SQL code that takes a character string representing a date in the format under discussion and extracts the indicated component, reformatting it as a 2-digit zero-padded character string in the case of Month and Day. fmtDate takes a character string of the form shown in the question for first_date and second_date and outputs a yyyy-mm-dd character string.

    library(sqldf)
    library(gsubfn)

    zero2d <- function(x) sprintf("substr('0' || %s, -2)", x)

    rmSlash <- function(x) sprintf("replace(%s, '/', '')", x)

    Year <- function(x) sprintf("substr(%s, -4)", x)

    Month <- function(x) {
      y <- sprintf("substr(%s, instr(%s, '/') + 1, 2)", x, x)
      zero2d(rmSlash(y))
    }

    Day <- function(x) {
      y <- sprintf("substr(%s, 1, 2)", x)
      zero2d(rmSlash(y))
    }

    fmtDate <- function(x) format(as.Date(x))

    sql <- "select * from df2 where
      `Year('Date')` || '-' ||
      `Month('Date')` || '-' ||
      `Day('Date')`
      between '`fmtDate(first_date)`' and '`fmtDate(second_date)`'"
    fn$sqldf(sql)

giving:

           Date
    1 1/11/2001
    2 1/12/2002
    3 1/11/2003
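
As a sanity check on this small example, the same rows can be recovered in base R by parsing the strings with as.Date. This is only a cross-check that assumes df2 is already in memory; it does not replace filtering during the read, which is the point of the question. The names `d`, `keep` and `expected` are introduced here for illustration.

```r
# Same test data as in the question.
df2 <- data.frame(Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))

# Parse d/m/yyyy strings into Date objects and filter on the closed range.
d <- as.Date(df2$Date, format = "%d/%m/%Y")
keep <- d >= as.Date("2001-11-01") & d <= as.Date("2003-11-01")
expected <- df2[keep, , drop = FALSE]
print(expected)
```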

Notes

1) SQLite. The instr, replace and substr functions used are core SQLite functions.

2) SQL. The actual SQL statement that is run after fn$ performs the substitutions is as follows (slightly reformatted to fit):

    > cat(fn$identity(sql), "\n")
    select * from df2 where
      substr(Date, -4) || '-' ||
      substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2) || '-' ||
      substr('0' || replace(substr(Date, 1, 2), '/', ''), -2)
      between '2001-11-01' and '2003-11-01'

3) Source of complications. The main complication is the non-standard 1- or 2-digit day and month. If they were consistently 2 digits, the query would reduce to this:

    first_date <- "2001-11-01"
    second_date <- "2003-11-01"
    fn$sqldf("select Date from df2 where
      substr(Date, -4) || '-' ||
      substr(Date, 4, 2) || '-' ||
      substr(Date, 1, 2)
      between '`first_date`' and '`second_date`' ")

4) H2. Here is an H2 solution. H2 does have a datetime type, which simplifies the solution substantially compared with SQLite. We assume the data is in a file called mydata.dat. Note that read.csv.sql does not support H2, since H2 already has a built-in csvread SQL function for this:

    library(RH2)
    library(sqldf)

    first_date <- "2001-11-01"
    second_date <- "2003-11-01"

    fn$sqldf(c("CREATE TABLE t(DATE TIMESTAMP) AS
      SELECT parsedatetime(DATE, 'd/M/y') as DATE
      FROM CSVREAD('mydata.dat')",
      "SELECT DATE FROM t WHERE DATE between '`first_date`' and '`second_date`'"))

Note that the first RH2 query in a session will be slow, since it loads Java. After that you can try it out to see whether the performance is adequate.
