Matrix of p-values of x and y variables from anova output

I have many X and Y variables (approximately 500 x 500). Here is a small example data set:

    yvars <- data.frame(Yv1 = rnorm(100, 5, 3),
                        Y2  = rnorm(100, 6, 4),
                        Yv3 = rnorm(100, 14, 3))
    xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = TRUE),
                        X2  = sample(c(1, 0, -1), 100, replace = TRUE),
                        Xv3 = sample(c(1, 0, -1), 100, replace = TRUE),
                        D   = sample(c(1, 0, -1), 100, replace = TRUE))

I want to extract the p-values and arrange them in a matrix like this:

          Yv1   Y2   Yv3
    Xv1
    X2
    Xv3
    D

Here is my attempt at a loop:

    prob = NULL
    anova.pmat <- function(x) {
      mydata <- data.frame(yvar = yvars[, x], xvars)
      for (i in seq(length(xvars))) {
        prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
      }
    }
    sapply(yvars, anova.pmat)
    Error in .subset(x, j) : only 0 may be mixed with negative subscripts

What could be the solution?
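The error most likely arises because sapply(yvars, ...) passes each column's values rather than its name, so yvars[, x] is subscripted with a numeric vector that mixes negative and positive values; the function also never returns prob. A minimal sketch of one possible fix, iterating over column names instead (it assumes the yvars and xvars defined above):

    anova.pmat <- function(col) {
      mydata <- data.frame(yvar = yvars[[col]], xvars)
      # one p-value per X variable, for the given Y column
      sapply(seq_along(xvars), function(i) {
        anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
      })
    }
    pmat <- sapply(names(yvars), anova.pmat)  # 4 x 3 matrix: columns = Y variables
    rownames(pmat) <- names(xvars)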

Edit:

For the first Y variable:

    prob <- NULL
    mydata <- data.frame(yvar = yvars[, 1], xvars)
    for (i in seq(length(xvars))) {
      prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
    }
    prob
    [1] 0.4995179 0.4067040 0.4181571 0.6291167

Changing it again:

    for (j in seq(length(yvars))) {
      prob <- NULL
      mydata <- data.frame(yvar = yvars[, j], xvars)
      for (i in seq(length(xvars))) {
        prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
      }
    }

This gives the same result as above!
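This is likely because prob is rebuilt from scratch on each iteration of j and never stored per Y variable, so only one set of p-values remains after the loop finishes. A minimal sketch that keeps every result in a pre-allocated matrix, again assuming the yvars and xvars defined above:

    pmat <- matrix(NA, nrow = length(xvars), ncol = length(yvars),
                   dimnames = list(names(xvars), names(yvars)))
    for (j in seq_along(yvars)) {
      mydata <- data.frame(yvar = yvars[, j], xvars)
      for (i in seq_along(xvars)) {
        # p-value for X variable i against Y variable j
        pmat[i, j] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
      }
    }
    pmat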
2 answers

Here is an approach using plyr to loop over the columns of a data frame (treated as a list) for each of xvars and yvars, returning the corresponding p-value for each pair and assembling the results into a matrix. Adding row/column names is optional.

 library("plyr") probs <- laply(xvars, function(x) { laply(yvars, function(y) { anova(lm(y~x))$`Pr(>F)`[1] }) }) rownames(probs) <- names(xvars) colnames(probs) <- names(yvars) 

Here is one solution: generate all combinations of the Y and X variables to be tested (we cannot use combn here) and fit a linear model for each pair:

    dfrm <- data.frame(y = gl(ncol(yvars), ncol(xvars), labels = names(yvars)),
                       x = gl(ncol(xvars), 1, labels = names(xvars)),
                       pval = NA)
    ## little helper function to create the formula on the fly
    fm <- function(x) as.formula(paste(unlist(x), collapse = "~"))
    ## merge both datasets
    full.df <- cbind.data.frame(yvars, xvars)
    ## apply our LM row-wise
    dfrm$pval <- apply(dfrm[, 1:2], 1,
                       function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
    ## arrange everything in a rectangular matrix of p-values
    res <- matrix(dfrm$pval, ncol = 3,
                  dimnames = list(levels(dfrm$x), levels(dfrm$y)))

Sidenote: with high-dimensional data sets, relying on the QR decomposition to compute the p-value of each linear regression takes a lot of time. It is faster to compute the Pearson correlation for each pairwise comparison and convert the r statistic to a Fisher-Snedecor F statistic using the relation F = ν_a · r² / (1 − r²), where the degrees of freedom are defined as ν_a = (n − 2) − #{i : x_i = NA or y_i = NA}, i.e. (n − 2) minus the number of pairwise missing values; with no missing values, r² is simply the usual R² coefficient from the regression.
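As a rough illustration of that shortcut (not part of the original answer), here is a minimal base-R sketch that assumes the yvars and xvars defined above, no missing values, and numeric x columns, as lm() treats them here:

    n <- nrow(yvars)
    r <- cor(as.matrix(xvars), as.matrix(yvars))   # 4 x 3 matrix of Pearson correlations
    Fstat <- (n - 2) * r^2 / (1 - r^2)             # F = nu * r^2 / (1 - r^2), with nu = n - 2
    pvals <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
    ## pvals should agree with the anova(lm(...)) p-values up to numerical precision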

