I have a very large (almost 6 m rows) dataframe called DF with the following structure:
CodeContract RelMonth AmtPmt A0001 10 0.00 A0001 11 15.00 A0002 12 4.55 A0003 4 0.00 ... ... ...
RelMonth defined as the number of months since a specific static event associated with CodeContract .
This data is already sorted by CodeContract and RelMonth . Currently, the data frame has continuous RelMonth ; those. for any given CodeContract all intermediate RelMonth are populated, for example, for a given CodeContract , if I have Min RelMonth=5 and Max RelMonth=12 , then the data frame will include RelMonths 5:12 .
I would like to calculate another column named Mths_since_last_Pmt , which would count for a given CodeContract how many RelMonths it was, since this a CodeContract had AmtPmt > Amt_threshold .
It would work like this (assuming Amt_threshold=5 )
CodeContract RelMonth AmtPmt Mths_since_last_Pmt A0001 1 0.00 1 A0001 2 3.00 2 A0001 3 0.00 3 A0001 4 10.00 0 A0001 5 0.00 1 A0002 1 10.00 0 A0002 2 12.00 0 A0002 3 0.00 1 A0002 4 0.00 2
I currently have a working solution that uses a For loop, but it can only process 5000 lines / sec.
I am looking for a way to vectorize this calculation, and perhaps even without sorting the data first or with continuous RelMonths .
All the vectorized solutions that I tried to develop, usually with ddply that call seq_along , end up working with my RAM (24 GB). I am looking for a solution that will work under 2 GB of RAM. Maybe a custom function solution?
Any idea how to make this work?
Update @Roland
@Roland
I found a slightly different dataset that will lead to erroneous output with the code below. Input entered:
DF <- read.table(text="CodeContract RelMonth AmtPmt Mths_since_last_Pmt A0001 1 0.00 1 A0001 2 3.00 2 A0001 3 0.00 3 A0001 4 10.00 0 A0001 5 0.00 1 A0002 1 1.00 0 A0002 2 14.00 0 A0002 3 14.00 1 A0002 4 14.00 2",header=TRUE)
Corresponding output:
CodeContract RelMonth AmtPmt Mths_since_last_Pmt Mths_since_last_Pmt2 1: A0001 1 0 1 1 2: A0001 2 3 2 2 3: A0001 3 0 3 3 4: A0001 4 10 0 0 5: A0001 5 0 1 1 6: A0002 1 1 0 1 7: A0002 2 14 0 0 8: A0002 3 14 1 -1 9: A0002 4 14 2 -2
The negative numbers -1 and -2 in Mths_since_last_Pmt2 in the last lines are incorrect; they should be 0 as the threshold is exceeded. The algorithm seems to fail when the first element is a subgroup (here by changing the CodeContract ) below a threshold value sufficient to throw it away.
Is there a way we can apply to do this job?