Linear regression and group in R

I want to do linear regression in R using the lm() function. My data is an annual time series with one field for a year (22 years) and another for the state (50 states). I want to match the regression for each state, so at the end I have a lm response vector. I can imagine how to make a loop for each state, and then perform a regression inside the loop and add the results of each regression to the vector. However, this is not like R-like. In SAS, I would do a "by" statement, and in SQL I would make a "group". What is the way R do it?

+62
r regression linear-regression lm
Jul 23 '09 at 4:00
source share
9 answers

Here is one way to use the lme4 package.

 > library(lme4) > d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)), + year=rep(1:10, 2), + response=c(rnorm(10), rnorm(10))) > xyplot(response ~ year, groups=state, data=d, type='l') > fits <- lmList(response ~ year | state, data=d) > fits Call: lmList(formula = response ~ year | state, data = d) Coefficients: (Intercept) year CA -1.34420990 0.17139963 NY 0.00196176 -0.01852429 Degrees of freedom: 20 total; 16 residual Residual standard error: 0.8201316 
+31
Jul 23 '09 at 4:55
source share

Here's an approach using the plyr package:

 d <- data.frame( state = rep(c('NY', 'CA'), 10), year = rep(1:10, 2), response= rnorm(20) ) library(plyr) # Break up d by state, then fit the specified model to each piece and # return a list models <- dlply(d, "state", function(df) lm(response ~ year, data = df)) # Apply coef to each model and return a data frame ldply(models, coef) # Print the summary of each model l_ply(models, summary, .print = TRUE) 
+49
Jul 31 '09 at 19:30
source share

In my opinion, a mixed linear model is the best approach for this kind of data. The code below contains a fixed effect of a general trend. Random effects show how the trend for each individual state differs from the global trend. The correlation structure takes into account temporal autocorrelation. Look at Pinheiro and Bates (mixed effects models in S and S-Plus).

 library(nlme) lme(response ~ year, random = ~year|state, correlation = corAR1(~year)) 
+20
Jul 24 '09 at 11:32
source share

dplyr been released dplyr , which actually provides a very good way to make such a grouping very similar to what SAS does.

 library(dplyr) d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)), year=rep(1:10, 2), response=c(rnorm(10), rnorm(10))) fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .)) # Source: local data frame [2 x 2] # Groups: <by row> # # state model # (fctr) (chr) # 1 CA <S3:lm> # 2 NY <S3:lm> fitted_models$model # [[1]] # # Call: # lm(formula = response ~ year, data = .) # # Coefficients: # (Intercept) year # -0.06354 0.02677 # # # [[2]] # # Call: # lm(formula = response ~ year, data = .) # # Coefficients: # (Intercept) year # -0.35136 0.09385 

To get the odds and Rsquared / p.value, you can use the broom package. This package provides:

three generic S3 files: neat, which summarizes the model statistical data, such as regression coefficients; an addition that adds columns to the source data, such as predictions, remnants, and cluster assignments; and a look that provides a one-line summary of statistics at the model level.

 library(broom) fitted_models %>% tidy(model) # Source: local data frame [4 x 6] # Groups: state [2] # # state term estimate std.error statistic p.value # (fctr) (chr) (dbl) (dbl) (dbl) (dbl) # 1 CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651 # 2 CA year 0.02677048 0.13515755 0.1980687 0.8479318 # 3 NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166 # 4 NY year 0.09385309 0.09686043 0.9689519 0.3609470 fitted_models %>% glance(model) # Source: local data frame [2 x 12] # Groups: state [2] # # state r.squared adj.r.squared sigma statistic p.value df # (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (int) # 1 CA 0.004879969 -0.119510035 1.2276294 0.0392312 0.8479318 2 # 2 NY 0.105032068 -0.006838924 0.8797785 0.9388678 0.3609470 2 # Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl), # df.residual (int) fitted_models %>% augment(model) # Source: local data frame [20 x 10] # Groups: state [2] # # state response year .fitted .se.fit .resid .hat # (fctr) (dbl) (int) (dbl) (dbl) (dbl) (dbl) # 1 CA 0.4547765 1 -0.036769875 0.7215439 0.4915464 0.3454545 # 2 CA 0.1217003 2 -0.009999399 0.6119518 0.1316997 0.2484848 # 3 CA -0.6153836 3 0.016771076 0.5146646 -0.6321546 0.1757576 # 4 CA -0.9978060 4 0.043541551 0.4379605 -1.0413476 0.1272727 # 5 CA 2.1385614 5 0.070312027 0.3940486 2.0682494 0.1030303 # 6 CA -0.3924598 6 0.097082502 0.3940486 -0.4895423 0.1030303 # 7 CA -0.5918738 7 0.123852977 0.4379605 -0.7157268 0.1272727 # 8 CA 0.4671346 8 0.150623453 0.5146646 0.3165112 0.1757576 # 9 CA -1.4958726 9 0.177393928 0.6119518 -1.6732666 0.2484848 # 10 CA 1.7481956 10 0.204164404 0.7215439 1.5440312 0.3454545 # 11 NY -0.6285230 1 -0.257504572 0.5170932 -0.3710185 0.3454545 # 12 NY 1.0566099 2 -0.163651479 0.4385542 1.2202614 0.2484848 # 13 NY -0.5274693 3 -0.069798386 0.3688335 -0.4576709 0.1757576 # 14 NY 0.6097983 4 0.024054706 0.3138637 0.5857436 0.1272727 # 15 NY -1.5511940 5 0.117907799 0.2823942 -1.6691018 0.1030303 # 16 NY 0.7440243 6 0.211760892 0.2823942 0.5322634 0.1030303 # 17 NY 0.1054719 7 0.305613984 0.3138637 -0.2001421 0.1272727 # 18 NY 0.7513057 8 0.399467077 0.3688335 0.3518387 0.1757576 # 19 NY -0.1271655 9 0.493320170 0.4385542 -0.6204857 0.2484848 # 20 NY 1.2154852 10 0.587173262 0.5170932 0.6283119 0.3454545 # Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl) 
+17
Nov 23 '15 at 11:37
source share

A good solution using data.table was posted here at CrossValidated by @Zach. I just add that you can iteratively get the regression coefficient r ^ 2:

 ## make fake data library(data.table) set.seed(1) dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50)) ##calculate the regression coefficient r^2 dat[,summary(lm(y~x))$r.squared,by=grp] grp V1 1: 1 0.01465726 2: 2 0.02256595 

as well as all other output from summary(lm) :

 dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp] grp r2 f 1: 1 0.01465726 0.714014 2: 2 0.02256595 1.108173 
+9
Nov 17 '15 at 10:04
source share
 ## make fake data > ngroups <- 2 > group <- 1:ngroups > nobs <- 100 > dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups)) > head(dta) group yx 1 1 0.6482007 0.5429575 2 1 -0.4637118 0.7052843 3 1 -0.5129840 0.7312955 4 1 -0.6612649 0.9028034 5 1 -0.5197448 0.1661308 6 1 0.4240346 0.8944253 > > ## function to extract the results of one model > foo <- function(z) { + ## coef and se in a data frame + mr <- data.frame(coef(summary(lm(y~x,data=z)))) + ## put row names (predictors/indep variables) + mr$predictor <- rownames(mr) + mr + } > ## see that it works > foo(subset(dta,group==1)) Estimate Std..Error t.value Pr...t.. predictor (Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept) x -0.3669890 0.3321875 -1.104765 0.2719666 x > ## one option: use command by > res <- by(dta,dta$group,foo) > res dta$group: 1 Estimate Std..Error t.value Pr...t.. predictor (Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept) x -0.3669890 0.3321875 -1.104765 0.2719666 x ------------------------------------------------------------ dta$group: 2 Estimate Std..Error t.value Pr...t.. predictor (Intercept) -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept) x 0.06286456 0.3020321 0.2081387 0.8355526 x > ## using package plyr is better > library(plyr) > res <- ddply(dta,"group",foo) > res group Estimate Std..Error t.value Pr...t.. predictor 1 1 0.21764767 0.1919140 1.1340897 0.2595235 (Intercept) 2 1 -0.36698898 0.3321875 -1.1047647 0.2719666 x 3 2 -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept) 4 2 0.06286456 0.3020321 0.2081387 0.8355526 x > 
+6
Jul 23 '09 at 4:22
source share

Now I will answer a little later, but I was looking for similar functionality. It would seem that the built-in "by" function in R can also easily group:

?, contains the following example, which is suitable for each group and extracts coefficients using sapply:

 require(stats) ## now suppose we want to extract the coefficients by group tmp <- with(warpbreaks, by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))) sapply(tmp, coef) 
+4
Feb 28 '17 at 2:53 on
source share

The lm() function above is a simple example. By the way, I suppose your database has columns, as in the following form:

year state var1 var2 y ...

From my point of view, you can use the following code:

 require(base) library(base) attach(data) # data = your data base #state is your label for the states column modell<-by(data, data$state, function(data) lm(y~I(1/var1)+I(1/var2))) summary(modell) 
+3
Aug 21 '12 at 23:25
source share

The question arises of how to call regression functions with formulas that are modified within the loop.

Here's how you can do it (using a diamond dataset):

 attach(ggplot2::diamonds) strCols = names(ggplot2::diamonds) formula <- list(); model <- list() for (i in 1:1) { formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i]) model[[i]] = glm(formula[[i]]) #then you can plot the results or anything else ... png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i])) par(mfrow = c(2, 2)) plot(model[[i]]) dev.off() } 
0
Mar 28 '17 at 17:26
source share



All Articles