Improve performance of data.table date + time conversion?

I'm not sure whether this is the right place to ask this question; let me know if I should post it somewhere else.

I have a data table with 1e6 rows having this structure:

            V1       V2     V3
 1: 03/09/2011 08:05:40 1145.0
 2: 03/09/2011 08:06:01 1207.3
 3: 03/09/2011 08:06:17 1198.8
 4: 03/09/2011 08:06:20 1158.4
 5: 03/09/2011 08:06:40 1112.2
 6: 03/09/2011 08:06:59 1199.3

I convert the variables V1 and V2 to a unique datetime variable using this code:

 system.time(DT[, `:=`(index = as.POSIXct(paste(V1, V2), format = '%d/%m/%Y %H:%M:%S'),
                       V1 = NULL, V2 = NULL)])
 #   user  system elapsed
 #  47.47    0.16   50.27

Is there any method to improve the performance of this conversion?

Here is dput(head(DT)):

 DT <- structure(list(V1 = c("03/09/2011", "03/09/2011", "03/09/2011",
                             "03/09/2011", "03/09/2011", "03/09/2011"),
                      V2 = c("08:05:40", "08:06:01", "08:06:17",
                             "08:06:20", "08:06:40", "08:06:59"),
                      V3 = c(1145, 1207.3, 1198.8, 1158.4, 1112.2, 1199.3)),
                 .Names = c("V1", "V2", "V3"),
                 class = c("data.table", "data.frame"),
                 row.names = c(NA, -6L),
                 .internal.selfref = <pointer: 0x00000000002a0788>)
2 answers

This approach, which appears to be roughly 40X faster than the OP's, takes advantage of lookup tables and of extremely fast data.table joins. It also exploits the fact that, although there may be up to 1e6 date/time combinations in the data, there can be no more than 86,400 unique times, and probably far fewer unique dates. Finally, it avoids using paste(...) altogether.

 library(data.table)
 library(stringr)

 # create a dataset with 1MM rows
 set.seed(1)
 x  <- 1000*sample(1:1e5, 1e6, replace=T)
 dt <- data.table(id=1:1e6,
                  V1=format(as.POSIXct(x, origin="2011-01-01"), "%d/%m/%Y"),
                  V2=format(as.POSIXct(x, origin="2011-01-01"), "%H:%M:%S"),
                  V3=x)
 DT <- dt

 index.date <- function(dt) {
   # Edit: this change processes only times from the dataset; slightly more efficient
   V2      <- unique(dt$V2)
   dt.time <- data.table(char.time=V2,
                         int.time=as.integer(substr(V2,7,8)) +
                                  60*(as.integer(substr(V2,4,5)) +
                                      60*as.integer(substr(V2,1,2))))
   setkey(dt.time, char.time)
   # all dates from dataset
   dt.date <- data.table(char.date=unique(dt$V1),
                         int.date=as.integer(as.POSIXct(unique(dt$V1), format="%d/%m/%Y")))
   setkey(dt.date, char.date)
   # join the dates
   setkey(dt, V1)
   dt <- dt[dt.date]
   # join the times
   setkey(dt, V2)
   dt <- dt[dt.time, nomatch=0]
   # numerical index
   dt[, int.index := int.date + int.time]
   # POSIX date index
   dt[, index := as.POSIXct(int.index, origin='1970-01-01')]
   # get back original order
   setkey(dt, id)
   return(dt)
 }

 # new approach
 system.time(dt <- index.date(dt))
 #  user  system elapsed
 #  2.26    0.00    2.26

 # original approach
 DT <- dt
 system.time(DT[, `:=`(index = as.POSIXct(paste(V1, V2), format='%d/%m/%Y %H:%M:%S'),
                       V1=NULL, V2=NULL)])
 #  user  system elapsed
 # 84.33    0.06   84.52

Please note that performance depends on the number of unique dates. In the test case, there were 1200 unique dates.
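A quick way to gauge how favourable your own data is for this approach (this snippet is illustrative and not part of the original answer) is to compare the row count with the number of unique dates and times; small unique counts relative to the row count mean small lookup tables and larger savings:

 # illustrative check: how many unique dates/times does the data contain?
 dt[, .(rows         = .N,
        unique.dates = length(unique(V1)),
        unique.times = length(unique(V2)))]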

EDIT: per the suggestion, here is the function rewritten in more idiomatic data.table syntax, avoiding "$" for subsetting:

 index.date <- function(dt, fmt="%d/%m/%Y") {
   dt.time <- data.table(char.time = dt[, unique(V2)], key='char.time')
   dt.time[, int.time := as.integer(substr(char.time,7,8)) +
                         60*(as.integer(substr(char.time,4,5)) +
                             60*as.integer(substr(char.time,1,2)))]
   # all dates from dataset
   dt.date <- data.table(char.date = dt[, unique(V1)], key='char.date')
   dt.date[, int.date := as.integer(as.POSIXct(char.date, format=fmt))]
   # join the dates
   setkey(dt, V1)
   dt <- dt[dt.date]
   # join the times
   setkey(dt, V2)
   dt <- dt[dt.time, nomatch=0]
   # numerical index
   dt[, int.index := int.date + int.time]
   # POSIX date index
   dt[, index := as.POSIXct.numeric(int.index, origin='1970-01-01')]
   # remove extra/temporary variables
   dt[, `:=`(int.index=NULL, int.date=NULL, int.time=NULL)]
 }
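A minimal usage sketch for this revised function (assuming the same dt built above; note that this version no longer reorders by id or drops the character columns, so do that afterwards if you need to):

 dt <- index.date(dt)            # adds a POSIXct 'index' column
 setkey(dt, id)                  # restore the original row order
 dt[, c("V1", "V2") := NULL]     # drop the character columns if no longer needed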

If there are a lot of repeated time stamps in your data, you can try adding ,by=list(V1, V2) (see the sketch below), but there needs to be enough repetition to pay for the cost of splitting into groups.
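A minimal sketch of what that might look like against the OP's DT; the V1[1]/V2[1] indexing is my addition so that the conversion runs once per unique (date, time) pair rather than once per row, and the column removal has to happen in a separate step:

 # convert once per group; the single POSIXct value is recycled across
 # all rows of that group
 DT[, index := as.POSIXct(paste(V1[1], V2[1]), format = '%d/%m/%Y %H:%M:%S'),
    by = list(V1, V2)]
 DT[, c("V1", "V2") := NULL]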

The bottleneck here is the paste and the conversion, which makes me think the answer is no (unless you use an alternative method of converting to POSIXct; one option is sketched below).
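One alternative parser worth trying (my suggestion, not part of the original answer) is lubridate's fast_strptime(), which skips the locale machinery of strptime(); with lt = FALSE it returns POSIXct directly:

 library(lubridate)
 # assumes the timestamps are effectively UTC; adjust tz if that is not the case
 DT[, index := fast_strptime(paste(V1, V2),
                             format = '%d/%m/%Y %H:%M:%S',
                             tz = 'UTC', lt = FALSE)]
 DT[, c("V1", "V2") := NULL]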

