Conditional Summation (R)

I am trying to create a conditional sum in order to calculate an average. The idea is that a function (or an apply call) checks whether a condition holds (for example, x > 0) and then sums all the values of x that are greater than zero. The final step would be to divide this sum by the number of values that are greater than zero. Searching for "conditional sum(ming)" did not turn up anything useful.

This is a piece of data:

    > tmpData
       Instrument TradeResult.Currency.
    1         JPM                    -3
    2         JPM                   264
    3         JPM                   284
    4         JPM                    69
    5         JPM                   283
    11        KFT                    -8
    12        KFT                   -48
    13        KFT                   125
    14        KFT                  -150
    15        KFT                  -206
    16        KFT                   107

Of the functions I tried, the most promising is the following:

    avgProfit <- function(x) {
      ifelse(x > 0, sum(x) / length(which(x > 0)), return(0))
    }

However, the output of this function is 0:

    > with(tmpData, tapply(TradeResult.Currency., Instrument, avgProfit))
    JPM KFT
      0   0
    > avgProfit(tmpData$TradeResult.Currency.)
    [1] 0
    > x
     [1] 1 1 2 1 2 3 3 3 4 4

(The values should be 225 for JPM, i.e. a total of 900 divided by the 4 values greater than zero, and 116 for KFT.)

Even though I calculate the sum of x inside the function (which, if I understand correctly, should be the sum of the individual values in the data.frame), the output of the variable "x" puzzles me. I can't figure out where these 1, 2, 3 and 4 come from.

How can I calculate this conditional sum? Also, do I need a function at all, or am I making this too complicated (maybe there is a built-in R function for this that I missed)?

Any thoughts are more than welcome.

Regards,

+4
6 answers

Perhaps an easy way is to drop the unwanted rows first and then aggregate what is left:

 aggregate(TradeResult.Currency.~Instrument, mean, data=subset(tmpData,TradeResult.Currency.>0)) 
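
For reference, here is a self-contained sketch of the same idea. It assumes the sample data from the question is rebuilt by hand; the data.frame() call and the names positive and res are my own additions, not part of the answer.

```r
# Sample data from the question, rebuilt by hand (assumption: same values,
# original row names dropped).
tmpData <- data.frame(
  Instrument = rep(c("JPM", "KFT"), times = c(5, 6)),
  TradeResult.Currency. = c(-3, 264, 284, 69, 283,
                            -8, -48, 125, -150, -206, 107)
)

# Keep only the profitable trades, then average them per instrument.
positive <- subset(tmpData, TradeResult.Currency. > 0)
res <- aggregate(TradeResult.Currency. ~ Instrument, data = positive, FUN = mean)
print(res)
```

The result is a data.frame with one row per instrument, which is convenient if it needs to be merged back into other tables.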
+9

You were almost there. I think ifelse was the wrong direction, since you need a single average, not an element-wise comparison.
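
A short sketch of why the function from the question returns 0, as far as I understand how ifelse evaluates its arguments. The names avgProfitOld and avgProfitNew are made up for this illustration.

```r
# The function from the question, renamed for illustration.
avgProfitOld <- function(x) {
  ifelse(x > 0, sum(x) / length(which(x > 0)), return(0))
}

# A fixed version: subset first, then take one mean of what is left.
avgProfitNew <- function(x) mean(x[x > 0])

jpm <- c(-3, 264, 284, 69, 283)

# ifelse() evaluates its `no` argument as soon as any element fails the test,
# and return(0) then exits avgProfitOld() itself.
avgProfitOld(jpm)        # 0

# With no failing element the result is a vector as long as x, not one number.
avgProfitOld(c(1, 2, 3)) # 2 2 2

avgProfitNew(jpm)        # 225
```

Note also that sum(x) in the original adds up all values, including the negative ones, so the numerator would be wrong even without the early return.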

You also need to consider whether you might run into missing values, so that you can handle them correctly.

    tmpData <- read.table(textConnection("
       Instrument TradeResult.Currency.
    1         JPM                    -3
    2         JPM                   264
    3         JPM                   284
    4         JPM                    69
    5         JPM                   283
    11        KFT                    -8
    12        KFT                   -48
    13        KFT                   125
    14        KFT                  -150
    15        KFT                  -206
    16        KFT                   107"))

    with(tmpData, tapply(TradeResult.Currency., Instrument, function(x) mean(x[x > 0])))

    JPM KFT
    225 116
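
On the missing-values point, here is a minimal sketch using a made-up vector with one NA. The comparison x > 0 is NA for that element and indexing by NA yields NA, so the plain mean becomes NA. Two possible ways around it are shown; which one is appropriate depends on what a missing trade result should mean in your data.

```r
x <- c(-3, 264, NA, 69, 283)   # made-up data with one missing trade result

mean(x[x > 0])                 # NA: the NA survives the logical subset

mean(x[x > 0], na.rm = TRUE)   # drop the NA inside mean()
mean(x[which(x > 0)])          # which() keeps only the indices that are TRUE
```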

+5

There have been quite a few of these aggregation / conditional data analysis questions recently. It is always interesting to see the different approaches, so I thought I would add one using plyr. I like the plyr functions because they provide a consistent syntax and let you specify the structure of both the input and the output. Here we use ddply, since we are passing in a data.frame and want a data.frame back. We use the summarise function to calculate the mean for each instrument over the values that are positive.

    library(plyr)
    ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))

Following up on @Joris's performance comparison, ddply seems to perform as well as, if not better than, the other approaches:

    > tmpData <- data.frame(
    +   instrument = rep(c("JPM","KFT"),each=10e6),
    +   TCurr = runif(20e6,-10,100)
    + )
    >
    > system.time(
    +   ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))
    + )
       user  system elapsed
       4.43    0.89    5.32
    >
    > avgProfit <- function(x) { mean(x[x>0])}
    >
    > system.time(
    +   with(tmpData,tapply(TCurr,instrument,avgProfit))
    + )
       user  system elapsed
       7.88    0.47    8.36
    >
    > system.time(
    +   aggregate(TCurr~instrument,mean,data=subset(tmpData,TCurr>0))
    + )
       user  system elapsed
      28.29    2.35   30.65
+4

aggregate is the easiest way, but I do not agree with "clean because you don't need to write a custom function". Readability improves when you define specific, clearly named functions, especially if you need this mean a couple of times in your scripts.

aggregate is quite a bit faster than your custom function, though, because you forgot about the indexing. What you wanted to do was this:

 avgProfit <- function(x){ mean(x[x>0]) } 

This is in turn faster than aggregate, because it has less overhead:

    > tmpData <- data.frame(
    +   instrument = rep(c("JPM","KFT"),each=10000),
    +   TCurr = runif(20000,-10,100)
    + )
    > system.time(
    +   with(tmpData,tapply(TCurr,instrument,avgProfit)))
       user  system elapsed
       0.02    0.00    0.02
    > system.time(
    +   aggregate(TCurr~instrument,mean,data=subset(tmpData,TCurr>0)))
       user  system elapsed
       0.09    0.00    0.10

In most cases you can simply ignore this difference. On huge data sets (n > 100,000) you will start to feel it, especially if you need to do this for a whole set of variables.
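
If the same conditional mean is needed for several variables at once, one possible sketch is to apply the helper to each column within each group. The helper name avgPositive, the second column Fees and all values below are invented purely for illustration.

```r
avgPositive <- function(x) mean(x[x > 0])   # same idea as avgProfit above

# Made-up data: two numeric columns and one grouping column.
d <- data.frame(
  instrument = rep(c("JPM", "KFT"), each = 3),
  TCurr = c(-3, 264, 284, -8, 125, 107),
  Fees  = c(2, -1, 4, 6, 3, -5)
)

# Split the numeric columns by instrument, then average each column per group.
# The result has one row per variable and one column per instrument.
res <- sapply(split(d[c("TCurr", "Fees")], d$instrument),
              function(g) sapply(g, avgPositive))
print(res)
```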

EDIT: I just saw that mdsummer had exactly the same solution, neatly hidden between the output :-). I'll leave this here as a timing reference.

+2

There is a very simple and fast data.table approach to this:

    library(data.table)
    # tmpData is the data frame from the question; setDT() converts it in place
    setDT(tmpData)[, .(avg = mean(TradeResult.Currency.[which(TradeResult.Currency. > 0)])),
                   by = Instrument]
    #    Instrument avg
    # 1:        JPM 225
    # 2:        KFT 116

Benchmark: Using the performance comparison of @Joris and @Chase, this solution is almost five times faster than the ddply approach and 40 times faster than the aggregate approach.

    library(plyr)
    library(data.table)

    tmpData <- data.frame(
      instrument = rep(c("JPM","KFT"), each=10e6),
      TCurr = runif(20e6,-10,100))

    system.time(
      ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))
    )
    #  user  system elapsed
    #  1.41    0.62    2.03

    system.time(
      setDT(tmpData)[, .(avg = mean(TCurr[which(TCurr > 0)])), by = instrument]
    )
    #  user  system elapsed
    #  0.36    0.18    0.43

    system.time(
      aggregate(TCurr~instrument, mean, data=subset(tmpData,TCurr>0))
    )
    #  user  system elapsed
    # 16.07    1.81   17.20
+1

I would probably just approach this in an iterative style. Have a local variable called "accumulator" or something like that, loop over all the items in the list and use an if statement, like

 if (x[index] > 0) accumulator = accumulator + x[index] 

and return the accumulator value when you are done.
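
Spelled out in R, the loop might look like the sketch below. The function name loopAvgProfit is hypothetical, and a counter is added because the question asks for an average rather than a plain sum. In R, the vectorised mean(x[x > 0]) is usually the more idiomatic choice.

```r
loopAvgProfit <- function(x) {
  accumulator <- 0   # running sum of the positive values
  n <- 0             # how many positive values were seen
  for (value in x) {
    # Skip missing values and anything that is not strictly positive.
    if (!is.na(value) && value > 0) {
      accumulator <- accumulator + value
      n <- n + 1
    }
  }
  if (n == 0) return(NaN)   # no positive values: mirrors mean(numeric(0))
  accumulator / n
}

loopAvgProfit(c(-3, 264, 284, 69, 283))   # 225
```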

-1
