Conditional window binning with dplyr

Trying to convert the following R data.frame:

    structure(list(Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17",
                              "09:37:21", "09:37:28", "09:37:35", "09:37:51",
                              "09:42:11", "10:00:31"),
                   Price  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                   Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)),
              .Names = c("Time", "Price", "Volume"),
              row.names = c(NA, 10L), class = "data.frame")

           Time Price Volume
    1  09:30:01     1    100
    2  09:30:29     2    200
    3  09:35:56     3    300
    4  09:37:17     4    100
    5  09:37:21     5    200
    6  09:37:28     6    300
    7  09:37:35     7    100
    8  09:37:51     8    200
    9  09:42:11     9    600
    10 10:00:31    10    100

into this:

           Time Price Volume Bin
    1  09:30:01     1    100   1
    2  09:30:29     2    200   1
    3  09:35:56     3    200   1
    4  09:35:56     3    100   2
    5  09:37:17     4    100   2
    6  09:37:21     5    200   2
    7  09:37:28     6    100   2
    8  09:37:28     6    200   3
    9  09:37:35     7    100   3
    10 09:37:51     8    200   3
    11 09:42:11     9    500   4
    12 09:42:11     9    100   5
    13 10:00:31    10    100   5

Essentially, it computes the cumulative sum of Volume and starts a new bin each time 500 is crossed. Thus, Bin 1 is 100 + 200 + 200: the Volume of 300 at 09:35:56 is split into 200/100, a new row is inserted, and the bin counter is incremented.

This is relatively simple in base R, but I was wondering whether there is a more elegant and hopefully faster way with dplyr.
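
A base R version of this logic could look roughly like the following (an illustrative sketch with a made-up split_bins helper, not the original code):

    ## Sketch of the row-splitting logic in base R (split_bins is an
    ## illustrative name, not from the question)
    split_bins <- function(df, size = 500) {
      out <- list()
      bin <- 1L
      filled <- 0L                        # volume already placed in the current bin
      for (i in seq_len(nrow(df))) {
        vol <- df$Volume[i]
        while (vol > 0) {
          take <- min(vol, size - filled) # how much of this row still fits
          out[[length(out) + 1L]] <- data.frame(Time = df$Time[i],
                                                Price = df$Price[i],
                                                Volume = take, Bin = bin)
          vol    <- vol - take
          filled <- filled + take
          if (filled == size) {           # bin is full: open a new one
            bin    <- bin + 1L
            filled <- 0L
          }
        }
      }
      do.call(rbind, out)
    }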

Regards

Update:

Thanks @Frank and @AntoniosK.

To answer your question: the Volume values are positive integers from 1 to 10k.

I microbenchmarked both approaches on a data set similar to the one above with ~200 thousand rows; dplyr was a little faster, but not by much.
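
Such a comparison can be set up along these lines (a sketch only; dplyr_bins, datatable_bins, and big_df are placeholder names for wrappers around the two answers below and the ~200k-row data set, not real functions):

    ## Sketch of the benchmark setup; dplyr_bins / datatable_bins are
    ## hypothetical wrappers around the two answers, and big_df is a
    ## ~200k-row data.frame shaped like the example above
    library(microbenchmark)
    microbenchmark(
      dplyr      = dplyr_bins(big_df),
      data.table = datatable_bins(big_df),
      times = 10
    )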

I really appreciate the quick answers and help.

3 answers

Not sure if this is the best or fastest way, but it seems fast for these Volume values. The philosophy is simple: based on the value of Volume, create that many rows of Time and Price with Volume = 1. Then let cumsum number the rows and flag every time a new batch of 500 starts. Use those flags to create the Bin values.

    structure(list(Time   = c("09:30:01", "09:30:29", "09:35:56", "09:37:17",
                              "09:37:21", "09:37:28", "09:37:35", "09:37:51",
                              "09:42:11", "10:00:31"),
                   Price  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                   Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)),
              .Names = c("Time", "Price", "Volume"),
              row.names = c(NA, 10L), class = "data.frame") -> dt

    library(dplyr)

    dt %>%
      group_by(Time, Price) %>%                     ## for each Time and Price
      do(data.frame(Volume = rep(1, .$Volume))) %>% ## create as many rows, with Volume = 1, as the value of Volume
      ungroup() %>%                                 ## forget about the grouping
      mutate(CumSum = cumsum(Volume),               ## cumulative sums
             flag_500 = ifelse(CumSum %in% seq(501, sum(dt$Volume), by = 500), 1, 0), ## flag 500 batches (at 501, 1001, etc.)
             Bin = cumsum(flag_500) + 1) %>%        ## create Bin values
      group_by(Bin, Time, Price) %>%                ## for each Bin, Time and Price
      summarise(Volume = sum(Volume)) %>%           ## get new Volume values
      select(Time, Price, Volume, Bin) %>%          ## use only if you want to re-arrange column order
      ungroup()                                     ## use if you want to forget the grouping

    #        Time Price Volume   Bin
    #       (chr) (dbl)  (dbl) (dbl)
    # 1  09:30:01     1    100     1
    # 2  09:30:29     2    200     1
    # 3  09:35:56     3    200     1
    # 4  09:35:56     3    100     2
    # 5  09:37:17     4    100     2
    # 6  09:37:21     5    200     2
    # 7  09:37:28     6    100     2
    # 8  09:37:28     6    200     3
    # 9  09:37:35     7    100     3
    # 10 09:37:51     8    200     3
    # 11 09:42:11     9    500     4
    # 12 09:42:11     9    100     5
    # 13 10:00:31    10    100     5

This is hardly "straightforward," but here is an attempt with data.table. It still takes quite a few lines of code:

    library(data.table)
    setDT(DF)

    DF[, c("cV", "cVL") := shift(cumsum(Volume), 0:1, type = "lag", fill = 0)]
    DF[, end   := (cV %/% 500) - (cV %% 500 == 0)]
    DF[, start := shift(end, type = "lag", fill = -1) + (cVL %% 500 == 0)]

    badcols = c("Volume", "cV", "cVL", "start", "end")

    DF[, {
      V = if (start == end) Volume
          else c((start + 1) * 500 - cVL, rep(500, max(end - start - 2, 0)), cV - end * 500)
      c(.SD[, !badcols, with = FALSE], list(Volume = V, Bin = 1 + start:end))
    }, by = .(r = seq(nrow(DF)))][, !"r", with = FALSE]

which gives

            Time Price Volume Bin
     1: 09:30:01     1    100   1
     2: 09:30:29     2    200   1
     3: 09:35:56     3    200   1
     4: 09:35:56     3    100   2
     5: 09:37:17     4    100   2
     6: 09:37:21     5    200   2
     7: 09:37:28     6    100   2
     8: 09:37:28     6    200   3
     9: 09:37:35     7    100   3
    10: 09:37:51     8    200   3
    11: 09:42:11     9    500   4
    12: 09:42:11     9    100   5
    13: 10:00:31    10    100   5

Here is one way using data.table and its rolling joins:

    require(data.table) # v1.9.6+
    setDT(df)[, csum := cumsum(Volume)]
    ans = rbind(df, df[.(csum = 500 * seq_len(max(csum) %/% 500L)),
                       roll = -Inf, on = "csum"])
    setorder(ans, Price, csum)
    ans = ans[, `:=`(Volume = c(csum[1L], diff(csum)),
                     id = (csum - 1L) %/% 500L + 1L,
                     csum = NULL)][Volume > 0L]

The first step adds a new column with the cumulative sum of Volume.
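
For the example data, that column comes out as:

    setDT(df)[, csum := cumsum(Volume)]
    df$csum
    # [1]  100  300  600  700  900 1200 1300 1500 2100 2200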

The second step is perhaps the most important one. Look at the second argument of the join: for every multiple of 500 up to max(csum), it finds the first value in df$csum that is >= that multiple. This is a NOCB rolling join (next observation carried backward). With it we get:

    #        Time Price Volume csum
    # 1: 09:35:56     3    300  500
    # 2: 09:37:28     6    300 1000
    # 3: 09:37:51     8    200 1500
    # 4: 09:42:11     9    600 2000
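
If the roll = -Inf behaviour is unfamiliar, here is a tiny standalone illustration (toy data, separate from the question):

    ## Toy NOCB rolling join: each requested csum matches the first row
    ## whose csum is >= the requested value (data made up for illustration)
    library(data.table)
    toy <- data.table(id = 1:4, csum = c(100, 300, 600, 1200))
    toy[.(csum = c(500, 1000)), roll = -Inf, on = "csum"]
    #    id csum
    # 1:  3  500    # next csum >= 500 is 600, i.e. row 3
    # 2:  4 1000    # next csum >= 1000 is 1200, i.e. row 4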

These are the breakpoints that need to be added to the original data.table, which is what the rbind() does.

Then all we need to do is order by Price and csum, and regenerate the Volume column from csum. From there, the id column can be derived from csum, as shown.
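
As a quick sanity check, the id expression maps cumulative sums to bins like this:

    ## The bin id formula on a few hand-picked cumulative sums
    csum <- c(100, 300, 500, 501, 600, 1000, 1001)
    (csum - 1L) %/% 500L + 1L
    # [1] 1 1 1 2 2 2 3

Note that an exact multiple of 500 stays in the lower bin, which is why the 500-volume piece at 09:42:11 (csum = 2000) still closes Bin 4 in the expected output.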
