I have a dataset with 500k appointments lasting from 5 to 60 minutes.
tdata <- structure(list(Start = structure(c(1325493000, 1325493600, 1325494200, 1325494800, 1325494800, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325497500, 1325497500, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300), class = c("POSIXct", "POSIXt"), tzone = "GMT"), End = structure(c(1325493600, 1325494200, 1325494500, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325496900, 1325496900, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300, 1325499600, 1325499600), class = c("POSIXct", "POSIXt"), tzone = "GMT"), Location = c("LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB"), Room = c("RoomA", "RoomA", "RoomA", "RoomA", "RoomB", "RoomB", "RoomB", "RoomB", "RoomB", "RoomB", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA")), .Names = c("Start", "End", "Location", "Room"), row.names = c(NA, 20L), class = "data.frame")
> head(tdata) Start End Location Room 1 2012-01-02 08:30:00 2012-01-02 08:40:00 LocationA RoomA 2 2012-01-02 08:40:00 2012-01-02 08:50:00 LocationA RoomA 3 2012-01-02 08:50:00 2012-01-02 08:55:00 LocationA RoomA 4 2012-01-02 09:00:00 2012-01-02 09:10:00 LocationA RoomA 5 2012-01-02 09:00:00 2012-01-02 09:10:00 LocationA RoomB 6 2012-01-02 09:10:00 2012-01-02 09:20:00 LocationA RoomB
I would like to calculate the number of simultaneous meetings in general, for location and number (and several other factors in the original dataset).
I tried using the mysql package to make a left join, which works for a small data set, but forever for the entire data set:
# SQL Join. start.min <- min(tdata$Start, na.rm=T) end.max <- max(tdata$End, na.rm=T) tinterval <- seq.POSIXt(start.min, end.max, by = "mins") tinterval <- as.data.frame(tinterval) library(sqldf) system.time( output <- sqldf("SELECT * FROM tinterval LEFT JOIN tdata ON tinterval.tinterval >= tdata.Start AND tinterval.tinterval < tdata.End ")) head(output) tinterval Start End Location Room 1 2012-01-02 09:30:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 2 2012-01-02 09:31:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 3 2012-01-02 09:32:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 4 2012-01-02 09:33:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 5 2012-01-02 09:34:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 6 2012-01-02 09:35:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA
Creates a data frame in which all "active" destinations are listed for each minute. A large data set covers a full year (~ 525600 minutes). With an average meeting duration of 18 minutes, I expect the sql connection to create a data set with ~ 5 million rows, which I can use to create busy schedules for various factors (Location / Room, etc.).
Based on the sapply solution suggested in How to count the number of concurrent users I tried using data.table and data.table as follows:
require(snowfall) require(data.table) sfInit(par=T, cpu=4) sfLibrary(data.table) tdata <- data.table(tdata) tinterval <- seq.POSIXt(start.min, end.max, by = "mins") setkey(tdata, Start, End) sfExport("tdata")
This solution is fast, it takes ~ 18 seconds to calculate 1 day (about 2 hours throughout the year). The disadvantage is that I cannot create a subset of the number of simultaneous appointments for certain factors (Location, Room, etc.). I have a feeling that there should be a better way to do this .. any advice?
UPDATE : The final decision looks like this based on Jeffrey's answer. This example shows how employment can be determined for each location.
setkey(tdata, Location, Start, End) vecTime <- seq(from=tdata$Start[1],to=tdata$End[nrow(tdata)],by=60) res <- data.frame(time=vecTime) for(i in 1:length(unique(tdata$Location)) ) { addz <- array(0,length(vecTime)) remz <- array(0,length(vecTime)) tdata2 <- tdata[J(unique(tdata$Location)[i]),]