How can I simulate multivariate data for regression with a specified R-squared (with example code)?

I am trying to simulate a dataset with three variables so that I can fit linear regression models to it. "X1" and "X2" will be continuous independent variables (mean = 0, sd = 1), and "Y" will be a continuous dependent variable.

The variables should be related so that a regression model fitted to the data produces (approximately) the following coefficients: Y = 5 + 3*X1 - 2*X2

I would like to simulate this dataset so that the resulting regression model has an R-squared value of 0.2. How do I determine the value of "sd.value" so that the regression model has this R-squared?

 n <- 200
 set.seed(101)
 sd.value <- 1
 X1 <- rnorm(n, 0, 1)
 X2 <- rnorm(n, 0, 1)
 Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
 simdata <- data.frame(X1, X2, Y)
 summary(lm(Y ~ X1 + X2, data=simdata))
+8
4 answers

Take a look at this code, it should be close enough to what you want:

 simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
   stopifnot(length(beta) == 3)
   df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
   var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
   stopifnot(var.epsilon > 0)
   df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
   df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
   return(df)
 }

 get.R.sq <- function(desired) {
   model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
   return(summary(model)$r.squared)
 }

 df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
 df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
 plot(df)
 abline(a=0, b=1, col="red", lty=2)

Basically your question comes down to asking for an expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. You can then solve for var[eps] as a function of the R-squared: var[eps] = (b2^2 + b3^2) * (1 - R^2) / R^2.
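Plugging in the numbers from the question (beta = c(5, 3, -2), target R-squared 0.2), the formula above gives var.epsilon = 52; a quick arithmetic check:

```r
beta <- c(5, 3, -2)
R.sq <- 0.2
# The answer's formula: var[eps] = (b2^2 + b3^2) * (1 - R^2) / R^2
var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
var.epsilon  # (9 + 4) * 0.8 / 0.2 = 52

# Implied variance of y is 13 + 52 = 65, so the population R^2 is 13/65 = 0.2
(beta[2]^2 + beta[3]^2) / (beta[2]^2 + beta[3]^2 + var.epsilon)
```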

+6

So the formula for R^2 is 1 - var(residual)/var(total).

In this case, the variance of Y will be 3^2 + 2^2 + sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance will simply be sd.value^2.

So you can explicitly calculate rsquared with this function:

 rsq <- function(x){ 1 - x^2/(9 + 4 + x^2) }

With a little algebra, you can calculate the inverse of this function:

 rsqi<-function(x){sqrt(13)*sqrt((1-x)/x)} 

So setting sd.value<-rsqi(rsquared) should give you what you want.
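For the target R-squared of 0.2 from the question, that works out to sd.value = sqrt(52) ≈ 7.21. A quick check, repeating the two one-liners above so the snippet runs on its own:

```r
rsq  <- function(x){ 1 - x^2/(13 + x^2) }          # R^2 as a function of sd.value
rsqi <- function(x){ sqrt(13)*sqrt((1 - x)/x) }    # its inverse

sd.value <- rsqi(0.2)   # sqrt(13 * 0.8 / 0.2) = sqrt(52), about 7.21
rsq(sd.value)           # recovers 0.2
```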

We can verify this as follows:

 simrsq <- function(x){
   Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
   simdata <- data.frame(X1, X2, Y)
   summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
 }

 > meanrsq <- rep(0, 9)
 > for(i in 1:50)
 +   meanrsq <- meanrsq + Vectorize(simrsq)((1:9)/10)
 > meanrsq/50
 [1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
 [8] 0.7999349 0.8977187

So it looks right.

+2

Here's how I would do it (a blind iterative algorithm, useful if you don't know the closed-form solution and are just interested in "how to simulate this"):

 simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
   set.seed(seed)
   sd.value <- 1
   rsquare <- 1:nsim
   results <- 1:nsim
   for (i in 1:nsim) {
     # track iterations: abort this run after 3000 steps if we never hit the target
     iter <- 0
     while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.20 - tol)) {
       sd.value <- sd.value + 0.01
       rsquare[i] <- simulate.sd.iter(sd.value, n)
       iter <- iter + 1
       if (iter > 3000) { break }
     }
     results[i] <- sd.value  # store the current sd.value that is OK!
     sd.value <- 1
   }
   cbind(results, rsquare)
 }

 simulate.sd.iter <- function(sd.value, n=200) {
   # helper: takes sd.value, creates data, and returns the r-squared
   X1 <- rnorm(n, 0, 1)
   X2 <- rnorm(n, 0, 1)
   Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
   simdata <- data.frame(X1, X2, Y)
   return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
 }

 simulate.sd()

A few notes:

  • I re-draw X1 and X2 on every iteration, as this affects the sought-after sd.value .
  • The tolerance controls how accurate your estimate is: are you OK with an r-squared of ~0.19 or ~0.21? Here the tolerance is 0.01.
  • Note that too tight a tolerance may prevent the search from ever finding a result.
  • A starting value of 1 is a pretty poor choice, which makes this iterative algorithm pretty slow.

The resulting vector for 10 results:

 [1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55

which takes about 13 seconds on my machine.

The next step would be to start from 4.5, add 0.001 per iteration instead of 0.01, and possibly reduce the tolerance. Good luck!
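A possible speed-up along these lines (my own sketch, not part of this answer): since r-squared decreases monotonically in sd.value, a bisection search converges much faster than stepping by 0.01. The function name find.sd and the bracket [1, 20] are my own choices; note that each evaluation re-simulates the data, so the comparison at each step is noisy and the result is only accurate up to sampling noise.

```r
find.sd <- function(target = 0.2, lo = 1, hi = 20, n = 200, iters = 30) {
  # Simulate one dataset at sd = s and return (r-squared - target); noisy!
  f <- function(s) {
    X1 <- rnorm(n); X2 <- rnorm(n)
    Y  <- rnorm(n, 5 + 3*X1 - 2*X2, s)
    summary(lm(Y ~ X1 + X2))$r.squared - target
  }
  for (i in 1:iters) {
    mid <- (lo + hi) / 2
    # r-squared too high means the noise sd is too small, so move lo up
    if (f(mid) > 0) lo <- mid else hi <- mid
  }
  (lo + hi) / 2
}

set.seed(101)
find.sd()   # should land near sqrt(52), roughly 7.2, up to simulation noise
```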

Well, here are some summary statistics for nsim = 100 (taking 150 seconds), with increments of 0.001 and the tolerance still at 0.01:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.513   4.913   5.036   5.018   5.157   5.393

Why are you interested in this?

+2

Here is some other code for generating a multiple linear regression with normally distributed errors. (Oops, sorry OP: this code just does multiple regression, without controlling the R-squared.)

 sim.regression <- function(n.obs=10, coefficients=runif(10,-5,5), s.deviation=.1){
   n.var = length(coefficients)
   M = matrix(0, ncol=n.var, nrow=n.obs)
   beta = as.matrix(coefficients)
   for (i in 1:n.var){
     M[,i] = rnorm(n.obs, 0, 1)
   }
   y = M %*% beta + rnorm(n.obs, 0, s.deviation)
   return (list(x=M, y=y, coeff=coefficients))
 }
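A usage sketch (the seed, sample size, and coefficients below are my own choices, and the function is repeated so the snippet runs on its own): draw one dataset and check that lm roughly recovers the coefficients. Note there is no intercept term, since the function generates y = M %*% beta + noise.

```r
sim.regression <- function(n.obs=10, coefficients=runif(10,-5,5), s.deviation=.1){
  n.var = length(coefficients)
  M = matrix(0, ncol=n.var, nrow=n.obs)
  beta = as.matrix(coefficients)
  for (i in 1:n.var){
    M[,i] = rnorm(n.obs, 0, 1)
  }
  y = M %*% beta + rnorm(n.obs, 0, s.deviation)
  return (list(x=M, y=y, coeff=coefficients))
}

set.seed(1)
d <- sim.regression(n.obs = 500, coefficients = c(3, -2), s.deviation = 0.1)
fit <- lm(d$y ~ d$x + 0)   # "+ 0": fit without intercept, matching the generator
coef(fit)                  # close to 3 and -2
```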
-1
