First of all (without considering what you are actually trying to do), I would suggest allocating the storage for theOutput up front. At the moment you are growing theOutput at each iteration of the loop, which is an absolute no-no in R! That is something you never do unless you like terribly slow code: R has to copy the object and extend it on each iteration, and that is slow.
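A toy comparison (not the OP's function; sizes are chosen purely for illustration) shows the difference between growing a vector inside a loop and pre-allocating it once:

```r
n <- 10000

grow <- function() {
    x <- numeric(0)
    for (i in 1:n) x <- c(x, i)  # copies and extends x on every iteration
    x
}

prealloc <- function() {
    x <- numeric(n)              # storage allocated once, up front
    for (i in 1:n) x[i] <- i     # fill in place
    x
}

identical(grow(), prealloc())    # same result; only the speed differs
system.time(grow())              # noticeably slower than prealloc()
system.time(prealloc())
```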
If we look at your code, we know that theOutput should have nrow(theData) - 1 rows and 3 columns. So create that before the loop starts:
theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))
then fill in this object during the loop, e.g.:

theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)
It isn't clear what START and END are. If these are numbers, then working with a matrix rather than a data frame could also improve efficiency.
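As a hypothetical sketch (assuming ID, START and END really are all numeric and can share one storage type), the output could be collected in a matrix instead of a data frame:

```r
## Stand-in size for nrow(theData) - 1; purely illustrative
nr <- 10

## Pre-allocate a numeric matrix with named columns
theOutput <- matrix(NA_real_, nrow = nr, ncol = 3,
                    dimnames = list(NULL, c("ID", "START", "END")))

## Inside the loop you would then fill one row at a time, e.g.:
i <- 1
theOutput[i, ] <- c(1, 100, 110)  # i.e. c(curId, curStart, curEnd)
```

Matrix indexing and assignment avoid the data-frame method dispatch overhead, which is where much of the per-iteration cost goes.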
In addition, creating a data frame on each iteration will be slow. I can't test this without spending a lot of time, but you could just fill in the bits you need directly, without calling data.frame() on each iteration:
theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
The best advice I can give you is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data; one whose size is sufficient to give you enough run-time to gather useful profiling data without having to wait for ages for the profiling run to complete. To profile in R, use Rprof():
Rprof(filename = "my_fun_profile.Rprof")
## run your function here
Rprof(NULL)  # turn profiling off again
You can then view the results with:
summaryRprof("my_fun_profile.Rprof")
Hadley Wickham (@hadley) has a package that makes this step a little easier: it is called profr. And as Dirk notes in the comments, there is also Luke Tierney's proftools package.
Edit: as the OP has provided some test data, I knocked up something quickly to show the speed-up achieved just by following good loop practice:
smoothingEpisodes2 <- function(theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
    theOutput
}
Using the test dataset in the testData object, I get:
> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
   user  system elapsed
  1.091   0.000   1.131
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
   user  system elapsed
  0.506   0.004   0.517
A 50% speed-up. Not dramatic, but easily achieved just by not growing the object at each iteration.