How can I simulate multivariate data for regression with a specified R-squared (with example code)?

I am trying to simulate a dataset with three variables so that I can fit linear regression models to it. "X1" and "X2" will be continuous independent variables (mean = 0, sd = 1), and "Y" will be a continuous dependent variable.

The variables should be related so that a regression model fitted to the data produces (approximately) the following coefficients: Y = 5 + 3*X1 - 2*X2

I would like to simulate this dataset so that the resulting regression model has an R-squared value of 0.2. How do I determine the value of "sd.value" so that the regression model has this R-squared?

 n <- 200
 set.seed(101)
 sd.value <- 1
 X1 <- rnorm(n, 0, 1)
 X2 <- rnorm(n, 0, 1)
 Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
 simdata <- data.frame(X1, X2, Y)
 summary(lm(Y ~ X1 + X2, data=simdata))
+8
4 answers

Take a look at this code, it should be close enough to what you want:

 simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
   stopifnot(length(beta) == 3)
   df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
   var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
   stopifnot(var.epsilon > 0)
   df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
   df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
   return(df)
 }

 get.R.sq <- function(desired) {
   model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
   return(summary(model)$r.squared)
 }

 df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
 df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
 plot(df)
 abline(a=0, b=1, col="red", lty=2)

Basically your question comes down to asking for an expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. You can then solve for var[eps] as a function of the R-squared: var[eps] = (b2^2 + b3^2) * (1 - R^2) / R^2.
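Plugging in the numbers from the question (beta = c(5, 3, -2), target R-squared 0.2), the formula above gives var.epsilon = 52; a quick arithmetic check:

```r
beta <- c(5, 3, -2)
R.sq <- 0.2
# The answer's formula: var[eps] = (b2^2 + b3^2) * (1 - R^2) / R^2
var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
var.epsilon  # (9 + 4) * 0.8 / 0.2 = 52

# Implied variance of y is 13 + 52 = 65, so the population R^2 is 13/65 = 0.2
(beta[2]^2 + beta[3]^2) / (beta[2]^2 + beta[3]^2 + var.epsilon)
```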

+6

So the formula for R^2 is 1 - var(residual)/var(total).

In this case, the variance of Y will be 3^2 + 2^2 + sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance will simply be sd.value^2.

So you can explicitly calculate rsquared with this function:

 rsq <- function(x){ 1 - x^2/(9 + 4 + x^2) }

With a little algebra, you can calculate the inverse of this function:

 rsqi<-function(x){sqrt(13)*sqrt((1-x)/x)} 

So setting sd.value<-rsqi(rsquared) should give you what you want.
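For the target R-squared of 0.2 from the question, that works out to sd.value = sqrt(52) ≈ 7.21. A quick check, repeating the two one-liners above so the snippet runs on its own:

```r
rsq  <- function(x){ 1 - x^2/(13 + x^2) }          # R^2 as a function of sd.value
rsqi <- function(x){ sqrt(13)*sqrt((1 - x)/x) }    # its inverse

sd.value <- rsqi(0.2)   # sqrt(13 * 0.8 / 0.2) = sqrt(52), about 7.21
rsq(sd.value)           # recovers 0.2
```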

We can verify this as follows:

 simrsq <- function(x){
   Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
   simdata <- data.frame(X1, X2, Y)
   summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
 }

 > meanrsq <- rep(0, 9)
 > for(i in 1:50)
 +   meanrsq <- meanrsq + Vectorize(simrsq)((1:9)/10)
 > meanrsq/50
 [1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
 [8] 0.7999349 0.8977187

So it looks right.

+2

Here's how I would do it (a blind iterative algorithm, useful if you don't know the closed-form solution and are just interested in "how to simulate this"):

 simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
   set.seed(seed)
   sd.value <- 1
   rsquare <- 1:nsim
   results <- 1:nsim
   for (i in 1:nsim) {
     # track iterations: abort this run after 3000 steps if we never hit the target
     iter <- 0
     while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.20 - tol)) {
       sd.value <- sd.value + 0.01
       rsquare[i] <- simulate.sd.iter(sd.value, n)
       iter <- iter + 1
       if (iter > 3000) { break }
     }
     results[i] <- sd.value  # store the current sd.value that is OK!
     sd.value <- 1
   }
   cbind(results, rsquare)
 }

 simulate.sd.iter <- function(sd.value, n=200) {
   # helper: takes sd.value, creates data, and returns the r-squared
   X1 <- rnorm(n, 0, 1)
   X2 <- rnorm(n, 0, 1)
   Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
   simdata <- data.frame(X1, X2, Y)
   return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
 }

 simulate.sd()

A few notes:

  • I re-draw X1 and X2 on every iteration, as this affects the sought-after sd.value .
  • The tolerance controls how accurate your estimate is: are you OK with an r-squared of ~0.19 or ~0.21? Here the tolerance is 0.01.
  • Note that too tight a tolerance may prevent the search from ever finding a result.
  • A starting value of 1 is a pretty poor choice, which makes this iterative algorithm pretty slow.

The resulting vector for 10 results:

 [1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55

which takes about 13 seconds on my machine.

The next step would be to start from 4.5, add 0.001 per iteration instead of 0.01, and possibly reduce the tolerance. Good luck!
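A possible speed-up along these lines (my own sketch, not part of this answer): since r-squared decreases monotonically in sd.value, a bisection search converges much faster than stepping by 0.01. The function name find.sd and the bracket [1, 20] are my own choices; note that each evaluation re-simulates the data, so the comparison at each step is noisy and the result is only accurate up to sampling noise.

```r
find.sd <- function(target = 0.2, lo = 1, hi = 20, n = 200, iters = 30) {
  # Simulate one dataset at sd = s and return (r-squared - target); noisy!
  f <- function(s) {
    X1 <- rnorm(n); X2 <- rnorm(n)
    Y  <- rnorm(n, 5 + 3*X1 - 2*X2, s)
    summary(lm(Y ~ X1 + X2))$r.squared - target
  }
  for (i in 1:iters) {
    mid <- (lo + hi) / 2
    # r-squared too high means the noise sd is too small, so move lo up
    if (f(mid) > 0) lo <- mid else hi <- mid
  }
  (lo + hi) / 2
}

set.seed(101)
find.sd()   # should land near sqrt(52), roughly 7.2, up to simulation noise
```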

Well, here are some summary statistics for nsim = 100 (taking 150 seconds), with increments of 0.001 and the tolerance still at 0.01:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.513   4.913   5.036   5.018   5.157   5.393

Why are you interested in this?

+2

Here is some other code for generating a multiple linear regression with normally distributed errors. (Oops, sorry OP: this code just does multiple regression, without controlling the R-squared.)

 sim.regression <- function(n.obs=10, coefficients=runif(10,-5,5), s.deviation=.1){
   n.var = length(coefficients)
   M = matrix(0, ncol=n.var, nrow=n.obs)
   beta = as.matrix(coefficients)
   for (i in 1:n.var){
     M[,i] = rnorm(n.obs, 0, 1)
   }
   y = M %*% beta + rnorm(n.obs, 0, s.deviation)
   return (list(x=M, y=y, coeff=coefficients))
 }
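A usage sketch (the seed, sample size, and coefficients below are my own choices, and the function is repeated so the snippet runs on its own): draw one dataset and check that lm roughly recovers the coefficients. Note there is no intercept term, since the function generates y = M %*% beta + noise.

```r
sim.regression <- function(n.obs=10, coefficients=runif(10,-5,5), s.deviation=.1){
  n.var = length(coefficients)
  M = matrix(0, ncol=n.var, nrow=n.obs)
  beta = as.matrix(coefficients)
  for (i in 1:n.var){
    M[,i] = rnorm(n.obs, 0, 1)
  }
  y = M %*% beta + rnorm(n.obs, 0, s.deviation)
  return (list(x=M, y=y, coeff=coefficients))
}

set.seed(1)
d <- sim.regression(n.obs = 500, coefficients = c(3, -2), s.deviation = 0.1)
fit <- lm(d$y ~ d$x + 0)   # "+ 0": fit without intercept, matching the generator
coef(fit)                  # close to 3 and -2
```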
-1
