Speed Improvement for Applying a Function over a POSIX Sequence

I iterate over a POSIX sequence to count the number of concurrent events at each point in time, using the method described in this question and its answer:

How to count the number of concurrent users using time interval data?

My problem is that my tinterval sequence, in minutes, spans a year, which means it has 523,025 entries. I am also thinking about a resolution of seconds, which would make matters even worse.

Is there anything I can do to improve this code (for example, is the order of the date intervals in the input (tdata) relevant?), or do I have to accept the performance if I want a solution in R?

3 answers

You can try data.table's new function foverlaps. With the data from the other question:

library(data.table)
setDT(tdata)
setkey(tdata, start, end)
minutes <- data.table(start = seq(trunc(min(tdata[["start"]]), "mins"), 
                                  round(max(tdata[["end"]]), "mins"), by="min"))
minutes[, end := start+59]
setkey(minutes, start, end)
DT <- foverlaps(tdata, minutes, type="any")
counts <- DT[, .N, by=start]
plot(N~start, data=counts, type="s")

resulting plot

I have not tested this on huge data. Try it yourself.
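
To see what foverlaps returns, it can help to run it on a few hand-made intervals first. The following is only a minimal sketch: the three sessions and the names sessions, bins, and ov are made up for illustration and are not the tdata from the question.

```r
library(data.table)

# Hypothetical toy data: three sessions, offsets in seconds from t0
t0 <- as.POSIXct("2014-01-01 00:00:00", tz = "UTC")
sessions <- data.table(start = t0 + c(10, 70, 200),
                       end   = t0 + c(130, 100, 230))

# One row per minute bin, covering [start, start + 59 seconds]
bins <- data.table(start = t0 + 60 * (0:3))
bins[, end := start + 59]
setkey(bins, start, end)

# Each output row pairs one session with one bin it overlaps;
# the bin columns keep their names, the session columns get an "i." prefix
ov <- foverlaps(sessions, bins, type = "any")

# Number of sessions touching each minute bin
counts <- ov[, .N, by = start][order(start)]
print(counts)
```

Note that a minute which no session touches does not appear in counts at all, so for sparse data you may want to join counts back onto the full minute grid before plotting.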


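Here is another approach using data.table and lubridate. It counts the sessions that start and end in each minute (0 for minutes without any), and the difference of the cumulative sums gives the number of concurrent users: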

library(data.table)
library(lubridate)

td <- data.table(start=floor_date(tdata$start, "minute"),
                 end=ceiling_date(tdata$end, "minute"))

# create vector of all minutes from start to end
# about 530K for a whole year
time.grid <- seq(from=min(td$start), to=max(td$end), by="min")
users <- data.table(time=time.grid, key="time")

# count the sessions starting in each minute; by=.EACHI evaluates j once
# per row of users, and .N is 0 for minutes without a match
setkey(td, start)
users <- td[users, list(started=.N), by=.EACHI]

# count the sessions ending in each minute, essentially the same procedure
setkey(td, end)
users <- td[users, list(started=i.started, ended=.N), by=.EACHI]

# fix timestamp column name
setnames(users, "end", "time")

# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]

# last step, take difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]

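The cumulative-sum step can be checked by hand on a toy example in base R. This is a minimal sketch with made-up data: start_min and end_min are hypothetical minute indices, not derived from the question's tdata.

```r
# Hypothetical sessions, given as minute indices 1..6
start_min <- c(1L, 2L, 2L, 5L)  # minute in which each session starts
end_min   <- c(3L, 4L, 6L, 6L)  # minute in which each session ends
n_min     <- 6L

# Count the sessions starting and ending in each minute
started <- tabulate(start_min, nbins = n_min)
ended   <- tabulate(end_min,   nbins = n_min)

# Sessions open in each minute; a session that ends in minute m
# is no longer counted in m, matching the convention used above
concurrent <- cumsum(started) - cumsum(ended)
print(concurrent)
```

If a session should still count during the minute in which it ends, subtract the cumulative sum of ended shifted by one minute instead.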


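R is an interpreted language, so an explicit for loop over hundreds of thousands of time steps tends to be slow.

  • If that is not fast enough, the loop can be written in C/C++ using Rcpp.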
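
One way to avoid an explicit for loop without leaving R is to sort the start and end times once and look up every grid point with findInterval. The sketch below assumes numeric times in seconds (use as.numeric() on POSIXct values) and closed intervals [start, end]; count_active is a hypothetical helper name, not something taken from the answers above.

```r
# Number of intervals [start, end] that contain each time in grid.
# Sorting happens inside, so the order of the input intervals is irrelevant.
count_active <- function(start, end, grid) {
  # how many intervals have start <= t
  n_started <- findInterval(grid, sort(start))
  # how many intervals have end < t (left.open = TRUE makes the test strict)
  n_ended <- findInterval(grid, sort(end), left.open = TRUE)
  n_started - n_ended
}

# Made-up example: three intervals, queried at six points in time
start <- c(10, 70, 200)
end   <- c(130, 100, 230)
grid  <- c(0, 10, 100, 130, 131, 215)
active <- count_active(start, end, grid)
print(active)
```

Because both lookups are binary searches on sorted vectors, the cost grows roughly with the grid length times the logarithm of the number of intervals, which should make a grid at one-second resolution feasible as long as the grid itself fits in memory.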