Multicriteria increment with reset to R

I have a very large (almost 6 m rows) dataframe called DF with the following structure:

CodeContract RelMonth AmtPmt A0001 10 0.00 A0001 11 15.00 A0002 12 4.55 A0003 4 0.00 ... ... ... 

RelMonth defined as the number of months since a specific static event associated with CodeContract .

This data is already sorted by CodeContract and RelMonth . Currently, the data frame has continuous RelMonth ; those. for any given CodeContract all intermediate RelMonth are populated, for example, for a given CodeContract , if I have Min RelMonth=5 and Max RelMonth=12 , then the data frame will include RelMonths 5:12 .

I would like to calculate another column named Mths_since_last_Pmt , which would count for a given CodeContract how many RelMonths it was, since this a CodeContract had AmtPmt > Amt_threshold .

It would work like this (assuming Amt_threshold=5 )

 CodeContract RelMonth AmtPmt Mths_since_last_Pmt A0001 1 0.00 1 A0001 2 3.00 2 A0001 3 0.00 3 A0001 4 10.00 0 A0001 5 0.00 1 A0002 1 10.00 0 A0002 2 12.00 0 A0002 3 0.00 1 A0002 4 0.00 2 

I currently have a working solution that uses a For loop, but it can only process 5000 lines / sec.

I am looking for a way to vectorize this calculation, and perhaps even without sorting the data first or with continuous RelMonths .

All the vectorized solutions that I tried to develop, usually with ddply that call seq_along , end up working with my RAM (24 GB). I am looking for a solution that will work under 2 GB of RAM. Maybe a custom function solution?

Any idea how to make this work?

Update @Roland

@Roland

I found a slightly different dataset that will lead to erroneous output with the code below. Input entered:

 DF <- read.table(text="CodeContract RelMonth AmtPmt Mths_since_last_Pmt A0001 1 0.00 1 A0001 2 3.00 2 A0001 3 0.00 3 A0001 4 10.00 0 A0001 5 0.00 1 A0002 1 1.00 0 A0002 2 14.00 0 A0002 3 14.00 1 A0002 4 14.00 2",header=TRUE) 

Corresponding output:

 CodeContract RelMonth AmtPmt Mths_since_last_Pmt Mths_since_last_Pmt2 1: A0001 1 0 1 1 2: A0001 2 3 2 2 3: A0001 3 0 3 3 4: A0001 4 10 0 0 5: A0001 5 0 1 1 6: A0002 1 1 0 1 7: A0002 2 14 0 0 8: A0002 3 14 1 -1 9: A0002 4 14 2 -2 

The negative numbers -1 and -2 in Mths_since_last_Pmt2 in the last lines are incorrect; they should be 0 as the threshold is exceeded. The algorithm seems to fail when the first element is a subgroup (here by changing the CodeContract ) below a threshold value sufficient to throw it away.

Is there a way we can apply to do this job?

+4
source share
2 answers

Well, I managed to find someone with a similar problem in SO and was able to adapt the answer to my problem. Kudos @ sven-hohenstein

The answer will look like this:

 require(data.table) DF<-as.data.table(DF) 

First I created a threshold test vector that returns 1 if AmtPmt is below the threshold:

 DF$trsh_test[DF$AmtPmt<trsh]<-1 DF$trsh_test[is.na(DF$trsh_test)]<-0 

Secondly, with the ave function along with seq_along

 DF[,Mths_since_last_Pmt2 := trsh_test * ave(trsh_test, c(0L, cumsum(diff(trsh_test) != 0)), FUN = seq_along) , by=CodeContract] 

You will get the following result, which is correct:

 CodeContract RelMonth AmtPmt Mths_since_last_Pmt trsh_test Mths_since_last_Pmt2 A0001 1 0 1 1 1 A0001 2 3 2 1 2 A0001 3 0 3 1 3 A0001 4 10 0 0 0 A0001 5 0 1 1 1 A0002 1 1 0 1 1 A0002 2 14 0 0 0 A0002 3 14 1 0 0 A0002 4 14 2 0 0 
+1
source

Try the following:

 DF <- read.table(text="CodeContract RelMonth AmtPmt Mths_since_last_Pmt A0001 1 0.00 1 A0001 2 3.00 2 A0001 3 0.00 3 A0001 4 10.00 0 A0001 5 0.00 1 A0002 1 10.00 0 A0002 2 12.00 0 A0002 3 0.00 1 A0002 4 0.00 2",header=TRUE) library(data.table) DT <- data.table(DF,key=c("CodeContract","RelMonth")) trsh <- 5 DT[,Mths_since_last_Pmt2 := cumsum(AmtPmt<=trsh)-cumsum(cumsum(AmtPmt<=trsh)*(AmtPmt>trsh)), by=CodeContract] # CodeContract RelMonth AmtPmt Mths_since_last_Pmt Mths_since_last_Pmt2 # 1: A0001 1 0 1 1 # 2: A0001 2 3 2 2 # 3: A0001 3 0 3 3 # 4: A0001 4 10 0 0 # 5: A0001 5 0 1 1 # 6: A0002 1 10 0 0 # 7: A0002 2 12 0 0 # 8: A0002 3 0 1 1 # 9: A0002 4 0 2 2 

We hope that assigning data.table by reference will keep you below the RAM limit.

+5
source

All Articles