“smoothing” time data - can it be more efficient?

I have a data frame containing an identifier, a start date and an end date. The data is sorted by identifier, start, end (in that order).

Now I want all rows with the same identifier whose time intervals overlap (or whose start date falls on the day right after the end date of another row) to be combined.

Merging them means they collapse into a single row with the same identifier, min(start date) and max(end date) (I hope you see what I mean).

I wrote a function for this (it is not fully tested, but at the moment it looks fine). The problem is that since my data frame has almost 100,000 observations, the function is very slow.

Can you help me improve my function in terms of efficiency?

Here is the function:

    smoothingEpisodes <- function(theData) {
        theOutput <- data.frame()
        curId    <- theData[1, "ID"]
        curStart <- theData[1, "START"]
        curEnd   <- theData[1, "END"]
        for (i in 2:nrow(theData)) {
            nextId    <- theData[i, "ID"]
            nextStart <- theData[i, "START"]
            nextEnd   <- theData[i, "END"]
            if (curId != nextId | (curEnd + 1) < nextStart) {
                theOutput <- rbind(theOutput,
                                   data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
                curId    <- nextId
                curStart <- nextStart
                curEnd   <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }
        theOutput <- rbind(theOutput,
                           data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
        theOutput
    }

Thanks!

[edit]

test data:

      ID      START        END
    1  1 2000-01-01 2000-03-31
    2  1 2000-04-01 2000-05-31
    3  1 2000-04-15 2000-07-31
    4  1 2000-09-01 2000-10-31
    5  2 2000-01-15 2000-03-31
    6  2 2000-02-01 2000-03-15
    7  2 2000-04-01 2000-04-15
    8  3 2000-06-01 2000-06-15
    9  3 2000-07-01 2000-07-15

(START and END are of class "Date"; the identifier is a number.)

dput of the data:

    structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
                   START = structure(c(10957, 11048, 11062, 11201, 10971, 10988,
                                       11048, 11109, 11139), class = "Date"),
                   END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
                                     11062, 11123, 11153), class = "Date")),
              .Names = c("ID", "START", "END"), class = "data.frame",
              row.names = c(NA, 9L))
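To make the intended behaviour concrete, this is the merged result I expect for the test data above (constructed by hand):

```r
## Expected result for the test data, worked out by hand:
## - rows 1-3 of ID 1 merge (overlapping or adjacent intervals),
## - rows 5-7 of ID 2 merge (row 7 starts the day after row 5/6 end),
## - the two ID 3 rows stay separate (gap of more than one day).
expected <- data.frame(
    ID    = c(1L, 1L, 2L, 3L, 3L),
    START = as.Date(c("2000-01-01", "2000-09-01", "2000-01-15",
                      "2000-06-01", "2000-07-01")),
    END   = as.Date(c("2000-07-31", "2000-10-31", "2000-04-15",
                      "2000-06-15", "2000-07-15"))
)
expected
```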
3 answers

First of all (without thinking about what you are trying to do), I would suggest allocating storage for theOutput up front. At the moment you are growing theOutput at each iteration of the loop. In R that is an absolute no-no! It is something you never do unless you like terribly slow code: R has to copy the object and extend it at each iteration, and that is slow.

Looking at the code, we know that theOutput should have nrow(theData) - 1 rows and 3 columns. So create that before the loop starts:

 theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1)) 

then fill this object during the loop:

    theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)

for example.

It is not clear what START and END are. If they are numeric, then working with a matrix rather than a data frame can also improve efficiency.

Also, creating a data frame at each iteration will be slow. You can instead fill in just the entries you need, without calling data.frame() at each iteration:

    theOutput[i, "ID"] <- curId
    theOutput[i, "START"] <- curStart
    theOutput[i, "END"] <- curEnd

The best advice I can give you is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data; one large enough to take a little time, so you can collect useful profiling data without waiting an age for the run to complete. To profile in R, use Rprof():

    Rprof(filename = "my_fun_profile.Rprof")
    ## run your function call here on a subset of the data
    Rprof(NULL)

You can see the result with

 summaryRprof("my_fun_profile.Rprof") 

Hadley Wickham (@hadley) has a package to make this a little easier. It is called profr. And, as Dirk says in the comments, there is also Luke Tierney's proftools package.

Edit: since the OP provided some test data, I quickly knocked something up to show the speed-up achieved just from good loop practice:

    smoothingEpisodes2 <- function(theData) {
        curId    <- theData[1, "ID"]
        curStart <- theData[1, "START"]
        curEnd   <- theData[1, "END"]
        nr <- nrow(theData)
        out1 <- integer(length = nr)
        out2 <- out3 <- numeric(length = nr)
        for (i in 2:nrow(theData)) {
            nextId    <- theData[i, "ID"]
            nextStart <- theData[i, "START"]
            nextEnd   <- theData[i, "END"]
            if (curId != nextId | (curEnd + 1) < nextStart) {
                out1[i-1] <- curId
                out2[i-1] <- curStart
                out3[i-1] <- curEnd
                curId    <- nextId
                curStart <- nextStart
                curEnd   <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }
        out1[i] <- curId
        out2[i] <- curStart
        out3[i] <- curEnd
        theOutput <- data.frame(ID = out1,
                                START = as.Date(out2, origin = "1970-01-01"),
                                END = as.Date(out3, origin = "1970-01-01"))
        ## drop empty rows
        theOutput <- theOutput[-which(theOutput$ID == 0), ]
        theOutput
    }

Using the test dataset in the testData object, I get:

    > res1 <- smoothingEpisodes(testData)
    > system.time(replicate(100, smoothingEpisodes(testData)))
       user  system elapsed
      1.091   0.000   1.131
    > res2 <- smoothingEpisodes2(testData)
    > system.time(replicate(100, smoothingEpisodes2(testData)))
       user  system elapsed
      0.506   0.004   0.517

A 50% speed-up. Not dramatic, but easy to achieve just by not growing the object at each iteration.
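As a further thought, the loop can be avoided entirely. This is a sketch only (not fully tested), assuming the data are sorted by ID then START as stated in the question; the function name smoothingVec is mine:

```r
## Vectorised sketch: compute the running maximum END within each ID,
## mark where a new episode starts (ID change, or a gap of more than
## one day), then aggregate min(START) / max(END) per episode group.
smoothingVec <- function(theData) {
    n <- nrow(theData)
    if (n < 2L) return(theData)
    s <- as.numeric(theData$START)
    e <- as.numeric(theData$END)
    ## running maximum of END within each ID
    runEnd <- ave(e, theData$ID, FUN = cummax)
    ## a new episode starts where the ID changes or a gap > 1 day opens
    newEp <- c(TRUE, theData$ID[-1L] != theData$ID[-n] |
                     s[-1L] > runEnd[-n] + 1)
    grp <- cumsum(newEp)
    data.frame(ID    = as.vector(tapply(theData$ID, grp, `[`, 1L)),
               START = as.Date(as.vector(tapply(s, grp, min)),
                               origin = "1970-01-01"),
               END   = as.Date(as.vector(tapply(e, grp, max)),
                               origin = "1970-01-01"))
}
```

On the OP's test data this should give the same five merged episodes as the loop versions, without any per-row R-level work.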


I did it a little differently, so as not to have to delete empty rows at the end:

    smoothingEpisodes <- function(theData) {
        curId    <- theData[1, "ID"]
        curStart <- theData[1, "START"]
        curEnd   <- theData[1, "END"]
        theLength <- nrow(theData)
        out.1 <- integer(length = theLength)
        out.2 <- out.3 <- numeric(length = theLength)
        j <- 1
        for (i in 2:nrow(theData)) {
            nextId    <- theData[i, "ID"]
            nextStart <- theData[i, "START"]
            nextEnd   <- theData[i, "END"]
            if (curId != nextId | (curEnd + 1) < nextStart) {
                out.1[j] <- curId
                out.2[j] <- curStart
                out.3[j] <- curEnd
                j <- j + 1
                curId    <- nextId
                curStart <- nextStart
                curEnd   <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }
        out.1[j] <- curId
        out.2[j] <- curStart
        out.3[j] <- curEnd
        theOutput <- data.frame(ID = out.1[1:j],
                                START = as.Date(out.2[1:j], origin = "1970-01-01"),
                                END = as.Date(out.3[1:j], origin = "1970-01-01"))
        theOutput
    }

Quite a big improvement on my original version!


Marcel, I had a go at improving the code a bit further. The version below is about 30 times faster (from 3 seconds down to 0.1 seconds). The trick is to first extract the three columns into integer and double vectors.

As a side note: I try to use [[ where applicable, and I try to keep integers as integers by writing j <- j + 1L, etc. It doesn't make much difference here, but coercion between integers and doubles can sometimes take quite some time.
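A quick way to see the integer/double distinction (a minimal illustration):

```r
## 1L is an integer literal, 1 is a double literal; mixing them
## silently coerces the result to double ("numeric").
j <- 1L
class(j + 1L)  # "integer": stays integer
class(j + 1)   # "numeric": j is coerced to double
```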

    smoothingEpisodes3 <- function(theData) {
        theLength <- nrow(theData)
        if (theLength < 2L) return(theData)
        id    <- as.integer(theData[["ID"]])
        start <- as.numeric(theData[["START"]])
        end   <- as.numeric(theData[["END"]])
        curId    <- id[[1L]]
        curStart <- start[[1L]]
        curEnd   <- end[[1L]]
        out.1 <- integer(length = theLength)
        out.2 <- out.3 <- numeric(length = theLength)
        j <- 1L
        for (i in 2:nrow(theData)) {
            nextId    <- id[[i]]
            nextStart <- start[[i]]
            nextEnd   <- end[[i]]
            if (curId != nextId | (curEnd + 1) < nextStart) {
                out.1[[j]] <- curId
                out.2[[j]] <- curStart
                out.3[[j]] <- curEnd
                j <- j + 1L
                curId    <- nextId
                curStart <- nextStart
                curEnd   <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }
        out.1[[j]] <- curId
        out.2[[j]] <- curStart
        out.3[[j]] <- curEnd
        theOutput <- data.frame(ID = out.1[1:j],
                                START = as.Date(out.2[1:j], origin = "1970-01-01"),
                                END = as.Date(out.3[1:j], origin = "1970-01-01"))
        theOutput
    }

The following code shows the speed difference. I just took your data and replicated it 1000 times:

    x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
                        START = structure(c(10957, 11048, 11062, 11201, 10971,
                                            10988, 11048, 11109, 11139),
                                          class = "Date"),
                        END = structure(c(11047, 11108, 11169, 11261, 11047,
                                          11031, 11062, 11123, 11153),
                                        class = "Date")),
                   .Names = c("ID", "START", "END"),
                   class = "data.frame", row.names = c(NA, 9L))

    r <- 1000
    y <- data.frame(ID = rep(x$ID, r) + rep(1:r, each = nrow(x)) - 1,
                    START = rep(x$START, r),
                    END = rep(x$END, r))

    system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
    system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
    all.equal( a1, a2 )
