Use aggregate with a function that uses data from two columns (e.g. cov or prod)

I have a long series of daily data with 101 columns. For each month, I would like to calculate the cov of each of the first 100 columns with the 101st column. This would give, for each of the 100 columns, a monthly covariance with the 101st column based on daily data. It seems that aggregate does what I want with functions that take a single vector, like mean, but I can't get it to work with cov (or prod).

Please let me know if dput output for several months would help.

    > library("zoo")
    > data <- read.zoo("100Size-BM.csv", header=TRUE, sep=",", format="%Y%m%d")
    > head(data[, c("R1", "R2", "R3", "R100", "Mkt.RF")])
                     R1       R2       R3     R100  Mkt.RF
    1963-07-01 -0.00212  0.00398 -0.00472 -0.00362 -0.0066
    1963-07-02 -0.00242  0.00678  0.00068 -0.00012  0.0078
    1963-07-03  0.00528  0.01078  0.00598  0.00338  0.0063
    1963-07-05  0.01738 -0.00932 -0.00072 -0.00012  0.0040
    1963-07-08  0.01048 -0.01262 -0.01332 -0.01392 -0.0062
    1963-07-09 -0.01052  0.01048  0.01738  0.01388  0.0045

mean works great and gives me the monthly data I want.

    > mean.temp <- aggregate(data[, 1:100], as.yearmon, mean)
    > head(mean.temp[, 1:3])
                        R1            R2            R3
    Jul 1963  0.0003845455  7.545455e-05  0.0004300000
    Aug 1963 -0.0006418182  2.412727e-03  0.0022263636
    Sep 1963  0.0016250000  1.025000e-03 -0.0002600000
    Oct 1963 -0.0007952174  2.226522e-03  0.0004873913
    Nov 1963  0.0006555556 -5.211111e-03 -0.0013888889
    Dec 1963 -0.0027066667 -1.249524e-03 -0.0005828571

But I cannot get a function that uses two different columns / vectors to work.

    > cov.temp <- aggregate(data[, 1:100], as.yearmon, cov(x, data[, "Mkt.RF"]))
    Error in inherits(x, "data.frame") : object 'x' not found

I can't get it to work with a cov wrapper.

    > f <- function(x) cov(x, data[, "Mkt.RF"])
    > cov.temp <- aggregate(data[, 1:100], as.yearmon, f)
    Error in cov(x, data[, "Mkt.RF"]) : incompatible dimensions

Should I do this with a for loop? I hope there is a more R-like way. Thanks!

4 answers

You forgot the function(x) declaration, and you need to make sure that you take the matching monthly subset of data (a name which, by the way, masks the data function). Try the following:

    > aggregate(data, as.yearmon, function(x) cov(x,data[index(x),"Mkt.RF"]))
                      R1         R2        R3        R100     Mkt.RF
    Jul 1963  1.3265e-05 2.0340e-05 3.464e-05  2.2575e-05  6.267e-05
    Aug 1963 -7.1295e-05 2.8875e-05 1.000e-06 -9.9700e-06 -2.608e-05

* Please note that I moved the last three observations in your sample data to August, so that the output covers more than one month.
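The role of data[index(x), "Mkt.RF"] can be illustrated in base R without zoo. This is only a minimal sketch with made-up vectors (x, y and first_month are placeholder names, not the question's data): a one-month slice of x paired with the full-length y raises an error, while subsetting both with the same rows works.

```r
set.seed(123)        # arbitrary seed, only for repeatability
x <- rnorm(10)       # stand-in for one return column
y <- rnorm(10)       # stand-in for the market column
first_month <- 1:5   # pretend the first five rows belong to one month

# Mismatched lengths: a 5-element slice of x against all 10 elements of y.
# cov() refuses this, so the error message is captured instead of stopping.
mismatch <- tryCatch(
  cov(x[first_month], y),
  error = function(e) conditionMessage(e)
)
print(mismatch)

# Aligned: both vectors are subset with the same rows, which is what
# indexing the market column by index(x) achieves for a zoo object.
aligned <- cov(x[first_month], y[first_month])
print(aligned)
```

The same reasoning applies per month: whatever rows aggregate hands to the function, the second argument of cov has to be cut down to those rows as well.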


You can use the approach that I described here, namely:

 tapply(1:nrow(data), data$group, function(s) cov(data$x[s], data$y[s])) 
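As a rough, self-contained illustration of this pattern, the sketch below builds a small made-up data frame (the columns group, x and y mirror the names in the one-liner; the dates and values are invented) and groups the row numbers by calendar month.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
dates <- seq(as.Date("1963-07-01"), by = "day", length.out = 92)
df <- data.frame(
  group = format(dates, "%Y-%m"),  # month label used as the grouping key
  x = rnorm(92),
  y = rnorm(92)
)

# tapply() splits the row numbers by month; inside the function, s holds the
# row numbers of one month, so x and y are subset with the same rows.
monthly_cov <- tapply(seq_len(nrow(df)), df$group,
                      function(s) cov(df$x[s], df$y[s]))
print(monthly_cov)
```

Because the function receives row numbers rather than values, it can reach into as many columns as it needs, which is what a two-argument function like cov requires.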

In aggregate(), as with many R functions that apply another R function to subsets of data, you name the function you want to apply, in this case by adding FUN = cov to your aggregate() call. You can then pass further arguments to that function through the special ... argument.
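As a small illustration of how ... forwards arguments, here is a sketch using base aggregate() on a toy data frame rather than a zoo object (df and its columns are invented for the example):

```r
# Toy data frame: a value column with one missing entry and a grouping column.
df <- data.frame(
  value = c(1, 2, NA, 4, 5, 6),
  month = c("Jul", "Jul", "Jul", "Aug", "Aug", "Aug")
)

# na.rm = TRUE is not an argument of aggregate() itself; it travels through
# `...` and is handed to mean() for every group.
res <- aggregate(df["value"], by = list(month = df$month),
                 FUN = mean, na.rm = TRUE)
print(res)
```

Note that an argument passed this way is handed over unchanged for every group; it is not split along with the data.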

You can pass data[, "Mkt.RF"] as the y argument to the cov() function, so something like this should work:

 cov.temp <- aggregate(data[, 1:100], as.yearmon, FUN = cov, y = data[, "Mkt.RF"]) 

However, that does not work in this case: you need to respect the zoo nature of the data and subset data[, "Mkt.RF"] in the same way that aggregate() breaks up the other data[, 1:100] columns. Otherwise y is the full-length column while x holds only one month, and cov() fails with incompatible dimensions. Therefore, an alternative is to specify an inline function, for example:

 cov.temp <- aggregate(data[, 1:100], as.yearmon, FUN = function(x) cov(x, y = data[index(x), "Mkt.RF"])) 

Here is an example that should work out of the box:

    library("zoo")
    dat <- data.frame(matrix(rnorm(365*10*6), ncol = 6))
    Dates <- seq.Date(from = as.Date("1963-07-01"), by = "days", length = 365*10)
    dat2 <- zoo(dat, order.by = Dates)

Which gives us:

    > head(dat2)
                        X1         X2         X3          X4         X5         X6
    1963-07-01  0.30910867  0.5539864  0.6433690  0.20608146 -1.7706003 -0.4607610
    1963-07-02 -0.02519616 -0.1856305  1.0419578  1.01319153  0.8671110  0.1196251
    1963-07-03  1.56464024  0.4980238  0.2976338  0.05654036  0.4984225 -1.4626501
    1963-07-04 -0.24028698 -1.4365257  0.5707873 -0.05851961 -0.7176343  0.1233137
    1963-07-05 -0.87770815 -0.5217949 -2.4875626 -0.08200408 -0.6121038 -0.3881126
    1963-07-06 -0.53660576 -1.1098966  2.7411511 -1.37106883 -0.5891641  1.6322411

Now let's assume that X6 is your "Mkt.RF" column, and we will aggregate over dat2[, 1:5]:

    cov.temp <- aggregate(dat2[, 1:5], as.yearmon, FUN = function(x) cov(x, y = dat2[index(x),"X6"]))
    head(cov.temp)

Which gives:

    > head(cov.temp)
                      X1          X2           X3          X4          X5
    Jul 1963 -0.30185387  0.09802210  0.019282934 -0.03621272  0.05332324
    Aug 1963  0.14739044  0.04276340  0.081847499 -0.35195736 -0.14680017
    Sep 1963  0.56698393 -0.08371676  0.003870935 -0.05948173  0.07550769
    Oct 1963  0.00711595 -0.07939798  0.118030943 -0.22065278 -0.12474052
    Nov 1963  0.06551982  0.22848268  0.231967655  0.02356194 -0.24272566
    Dec 1963  0.23866775  0.29464398 -0.034313793  0.09694199 -0.10481527

HTH


I ended up using aggregate to get the data into shape, but each cov calculation took about 50 minutes. On a whim, I tried a plyr solution, which turned out to be far faster.

    library("plyr")  # provides daply() and .()
    cov.fn <- function(x) nrow(x) * cov(x[, 1:100], x[, 101])
    temp <- zoo(daply(data, .(as.yearmon(index(data))), cov.fn),
                unique(as.yearmon(index(data))))

It takes about five seconds (600 times faster). I assume the speed-up comes from more efficient subsetting operations.
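For comparison, the same split-then-apply idea can be sketched in base R. This is an illustration under assumptions, not the code that was timed: the matrix m is random stand-in data with five return columns instead of 100, rows are grouped with format() rather than as.yearmon, and the nrow(x) scaling factor used above is left out.

```r
set.seed(42)  # arbitrary seed, only for repeatability
n_days <- 92
dates <- seq(as.Date("1963-07-01"), by = "day", length.out = n_days)

# Hypothetical stand-in for the real data: 5 return columns plus a market column.
m <- matrix(rnorm(n_days * 6), ncol = 6,
            dimnames = list(NULL, c(paste0("R", 1:5), "Mkt.RF")))

# Row numbers grouped by calendar month.
rows_by_month <- split(seq_len(n_days), format(dates, "%Y-%m"))

# One cov() call per month: cov(matrix, vector) returns a one-column matrix
# holding one covariance per matrix column, so no per-column loop is needed.
monthly_cov <- t(sapply(rows_by_month, function(s) {
  cov(m[s, 1:5, drop = FALSE], m[s, "Mkt.RF"])[, 1]
}))
print(monthly_cov)
```

The result has one row per month and one column per return series. Whether this is as fast as the plyr version on the full data set has not been measured here.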

Thank you all for your help. I learned a lot from this.
