How to minimize the size of an object of class "lm" without compromising its use with predict()

I want to run lm() on a large dataset with 50M+ observations and two predictors. The analysis runs on a remote server with 10 GB of memory available for data storage. I tested lm() on 10K cases sampled from the data, and the resulting object was 2 GB+ in size.

I need the object of class "lm" returned from lm() ONLY to produce summary statistics of the model ( summary(lm_object) ) and to make predictions ( predict(lm_object) ).

I experimented with the model, x, y, and qr arguments of lm(). If I set them all to FALSE, the size of the object drops by 38%:

    library(MASS)

    fit1 <- lm(medv ~ lstat, data = Boston)
    size1 <- object.size(fit1)
    print(size1, units = "Kb")
    # 127.4 Kb

    fit2 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = FALSE)
    size2 <- object.size(fit2)
    print(size2, units = "Kb")
    # 78.5 Kb

    - ((as.integer(size1) - as.integer(size2)) / as.integer(size1)) * 100
    # -38.37994

but

    summary(fit2)
    # Error in qr.lm(object) : lm object does not have a proper 'qr' component.
    # Rank zero or should not have used lm(.., qr=FALSE).

    predict(fit2, newdata = Boston)
    # Error in qr.lm(object) : lm object does not have a proper 'qr' component.
    # Rank zero or should not have used lm(.., qr=FALSE).

Apparently I need to keep qr=TRUE, which reduces the size of the object by only 9% compared to the default:

    fit3 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = TRUE)
    size3 <- object.size(fit3)
    print(size3, units = "Kb")
    # 115.8 Kb

    - ((as.integer(size1) - as.integer(size3)) / as.integer(size1)) * 100
    # -9.142752
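To see where the bytes in a fitted model go, one option is to list the size of each component. The sketch below is only an illustration: it uses the built-in trees data instead of Boston so that it runs without MASS, and the variable names component_sizes and qr_sizes are made up for this example.

```r
# Fit a small model on a dataset that ships with base R.
fit <- lm(Volume ~ Girth + Height, data = trees)

# Size in bytes of each top-level component, largest first.
component_sizes <- sort(sapply(fit, object.size), decreasing = TRUE)
print(component_sizes)

# The same breakdown for the pieces stored inside the qr component.
# The n-by-p matrix fit$qr$qr grows with the number of rows; the pivot,
# rank and tolerance entries stay tiny.
qr_sizes <- sort(sapply(fit$qr, object.size), decreasing = TRUE)
print(qr_sizes)
```

On a large dataset the components that scale with the number of observations (model, residuals, fitted.values, effects, and the qr matrix) are the ones worth looking at first.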

How can I bring the size of the "lm" object down to a minimum without keeping a lot of unnecessary information in memory and storage?

3 answers

The link below gives a relevant answer (for a glm object, which is very similar to an lm object).

http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/

Basically, predict() uses only the coefficients and a few small bookkeeping entries, which are a very small part of the glm output. The function below (copied from the link) strips out information that predict() does not use.

There is a caveat, however: after trimming, the model can no longer be passed to summary(fit) or other summary functions, since those functions need more than predict() does.

    cleanModel1 = function(cm) {
      # just in case we forgot to set
      # y=FALSE and model=FALSE
      cm$y = c()
      cm$model = c()

      cm$residuals = c()
      cm$fitted.values = c()
      cm$effects = c()
      # drop only the large matrix: predict() still reads cm$qr$pivot,
      # so removing the whole qr component would break it
      cm$qr$qr = c()
      cm$linear.predictors = c()
      cm$weights = c()
      cm$prior.weights = c()
      cm$data = c()
      cm
    }

I am dealing with the same issue. What I use is not ideal for other purposes, but it works for prediction: you can basically remove the qr matrix from the qr slot of the lm object:

    lmFull <- lm(Volume ~ Girth + Height, data = trees)
    lmSlim <- lmFull
    lmSlim$fitted.values <- lmSlim$qr$qr <- lmSlim$residuals <- lmSlim$model <- lmSlim$effects <- NULL

    pred1 <- predict(lmFull, newdata = data.frame(Girth = c(1, 2, 3), Height = c(2, 3, 4)))
    pred2 <- predict(lmSlim, newdata = data.frame(Girth = c(1, 2, 3), Height = c(2, 3, 4)))
    identical(pred1, pred2)
    # [1] TRUE

    as.numeric((object.size(lmFull) - object.size(lmSlim)) / object.size(lmFull))
    # [1] 0.6550523
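The check above covers point predictions only. As far as I understand predict.lm(), standard errors and intervals are computed from the R factor stored in qr$qr and from the residuals, so they are likely to be unavailable on a slimmed object. The sketch below probes this with try() rather than assuming the outcome; the names new_trees, point_ok and interval_ok are invented for the example.

```r
full <- lm(Volume ~ Girth + Height, data = trees)

# Same idea as above: drop the large pieces, keep qr$pivot and rank.
slim <- full
slim$qr$qr <- NULL
slim$residuals <- NULL

new_trees <- data.frame(Girth = c(10, 12), Height = c(70, 80))

# Point predictions: compare the full and the slimmed fit.
point_ok <- isTRUE(all.equal(predict(full, newdata = new_trees),
                             predict(slim, newdata = new_trees)))

# Interval predictions: wrapped in try() because the slimmed fit may
# no longer hold what they need.
interval_try <- try(
  predict(slim, newdata = new_trees, interval = "prediction"),
  silent = TRUE
)
interval_ok <- !inherits(interval_try, "try-error")

print(c(point = point_ok, interval = interval_ok))
```

If intervals are needed later, keeping qr$qr and the residuals (and dropping only model, fitted.values and effects) is the safer trade-off.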

Xappp's answer is good, but it is not the whole story. The model also carries a potentially huge environment, attached to its terms, that you can do something about (see: https://blogs.oracle.com/R/entry/is_the_size_of_your )

Either add this to xappp's function:

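    # Caution: only do this when the model was fitted inside a function.
    # If it was fitted at top level, e is the global environment and
    # the rm() call below clears your workspace.
    e <- attr(cm$terms, ".Environment")
    parent.env(e) <- emptyenv()
    rm(list = ls(envir = e), envir = e)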

Or use this version, which removes less data but still lets you use summary():

    cleanModel1 = function(cm) {
      # just in case we forgot to set
      # y=FALSE and model=FALSE
      cm$y = c()
      cm$model = c()

      e <- attr(cm$terms, ".Environment")
      parent.env(e) <- emptyenv()
      rm(list = ls(envir = e), envir = e)
      cm
    }
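The effect of the captured environment does not show up in object.size(), which by its documentation leaves environments out of the calculation; the length of the serialized object is a closer proxy for what gets saved or sent. The sketch below is an illustration under assumptions: fit_in_function and scratch are hypothetical names, and it uses a more conservative variation of the snippet above that empties the environment but leaves its parent untouched.

```r
# Hypothetical fitting function: the local vector `scratch` is never used
# by the model, but it lives in the environment that the formula captures.
fit_in_function <- function(data) {
  scratch <- rnorm(1e6)
  lm(Volume ~ Girth + Height, data = data)
}

fit <- fit_in_function(trees)
new_trees <- data.frame(Girth = c(10, 12), Height = c(70, 80))
pred_before <- predict(fit, newdata = new_trees)

# object.size() ignores the environment; serialize() includes it.
size_reported   <- as.numeric(object.size(fit))
size_serialized <- length(serialize(fit, NULL))

# Empty the captured environment (the parent is left as it is).
e <- attr(fit$terms, ".Environment")
rm(list = ls(envir = e), envir = e)

size_after <- length(serialize(fit, NULL))
pred_after <- predict(fit, newdata = new_trees)

print(c(reported = size_reported,
        serialized = size_serialized,
        after = size_after))
```

Because the environment is shared by reference, emptying it once affects every place in the object that points to it, such as the terms attribute of the stored model frame.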