How to minimize the size of an object of class "lm" without compromising its use with predict()

Question

I want to run lm() on a large dataset with 50M+ observations and two predictors. The analysis runs on a remote server that has only 10 GB available for storing the data. I tested lm() on 10K observations sampled from the data, and the resulting object was over 2 GB in size.

I need the "lm" object returned by lm() ONLY to produce summary statistics of the model (summary(lm_object)) and to make predictions (predict(lm_object)).

I experimented with the model, x, y, and qr options of lm(). If I set them all to FALSE, the size drops by 38%:

    library(MASS)
    fit1 <- lm(medv ~ lstat, data = Boston)
    size1 <- object.size(fit1)
    print(size1, units = "Kb")
    # 127.4 Kb
    fit2 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = FALSE)
    size2 <- object.size(fit2)
    print(size2, units = "Kb")
    # 78.5 Kb
    - ((as.integer(size1) - as.integer(size2)) / as.integer(size1)) * 100
    # -38.37994

but then both summary() and predict() fail:

    summary(fit2)
    # Error in qr.lm(object) : lm object does not have a proper 'qr' component.
    # Rank zero or should not have used lm(.., qr=FALSE).
    predict(fit2, newdata = Boston)  # predict() takes 'newdata', not 'data'
    # Error in qr.lm(object) : lm object does not have a proper 'qr' component.
    # Rank zero or should not have used lm(.., qr=FALSE).

Apparently I need to keep qr=TRUE, which shrinks the object by only 9% compared to the default:

    fit3 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = TRUE)
    size3 <- object.size(fit3)
    print(size3, units = "Kb")
    # 115.8 Kb
    - ((as.integer(size1) - as.integer(size3)) / as.integer(size1)) * 100
    # -9.142752

How can I bring the size of the "lm" object down to a minimum, without keeping a lot of unnecessary information in memory and storage?

memory r lm

Cptnemo Feb 20 '14 at 1:30

3 answers

xappppp · Answer 1 · 2016-05-03T20:52:09+0000

The link below gives a relevant answer (for a glm object, which is very similar to the lm object that lm() returns).

http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/

Basically, predict() uses only the coefficients, which are a very small part of the glm output. The function below (copied from the link) trims the information that predict() does not use.

It comes with a caveat, though. After trimming, the model can no longer be passed to summary(fit) or other summary functions, since those functions need more than predict() does.

    cleanModel1 = function(cm) {
      # just in case we forgot to set
      # y=FALSE and model=FALSE
      cm$y = c()
      cm$model = c()

      cm$residuals = c()
      cm$fitted.values = c()
      cm$effects = c()
      # drop only the large matrix inside the qr slot;
      # predict() still needs the rest of cm$qr (e.g. the pivot)
      cm$qr$qr = c()
      cm$linear.predictors = c()
      cm$weights = c()
      cm$prior.weights = c()
      cm$data = c()
      cm
    }
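As a rough check of the claim that predict() needs only a small part of the fitted object, here is a minimal sketch on the built-in trees data. The helper name trim_for_predict is hypothetical (it is not from the linked post), and it assumes that a plain predict(fit, newdata = ...) call still reads the pivot from the qr slot, so only the large matrix inside that slot is dropped.

```r
# Hypothetical helper (not from the linked post): drop components that a
# plain predict(fit, newdata = ...) call does not appear to need.
trim_for_predict <- function(fit) {
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$qr$qr <- NULL   # keep rank, pivot, etc.; drop only the n x p matrix
  fit
}

fit_full <- lm(Volume ~ Girth + Height, data = trees)
fit_slim <- trim_for_predict(fit_full)

new_data <- data.frame(Girth = c(10, 12, 14), Height = c(70, 75, 80))
pred_full <- predict(fit_full, newdata = new_data)
pred_slim <- predict(fit_slim, newdata = new_data)

# Point predictions are the design matrix times the coefficients,
# so they can also be reproduced from coef() alone.
X <- cbind(1, new_data$Girth, new_data$Height)
pred_manual <- drop(X %*% coef(fit_full))

print(all.equal(pred_full, pred_slim))
print(all.equal(unname(pred_full), pred_manual))
print(object.size(fit_full), units = "Kb")
print(object.size(fit_slim), units = "Kb")
```

Standard errors and intervals (se.fit = TRUE, interval = ...) rely on the residuals and the QR matrix, so they are not expected to work on the trimmed object.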

Aviad klein · Answer 2 · 2014-02-25T08:26:26+0000

I am trying to deal with the same issue. What I use is not ideal for other purposes, but it works for predict(): you can basically remove the qr element nested inside the qr slot of the lm object:

    lmFull <- lm(Volume ~ Girth + Height, data = trees)
    lmSlim <- lmFull
    lmSlim$fitted.values <- lmSlim$qr$qr <- lmSlim$residuals <- lmSlim$model <- lmSlim$effects <- NULL
    pred1 <- predict(lmFull, newdata = data.frame(Girth = c(1, 2, 3), Height = c(2, 3, 4)))
    pred2 <- predict(lmSlim, newdata = data.frame(Girth = c(1, 2, 3), Height = c(2, 3, 4)))
    identical(pred1, pred2)
    # [1] TRUE
    as.numeric((object.size(lmFull) - object.size(lmSlim)) / object.size(lmFull))
    # [1] 0.6550523
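To decide which components are worth dropping on your own model, it may help to look at the size of each one first. The sketch below does that for the same trees model; the variable names component_sizes and qr_sizes are only illustrative.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Size in bytes of each top-level component of the fitted object.
component_sizes <- vapply(fit, function(x) as.numeric(object.size(x)), numeric(1))

# Largest components first; these are the candidates for removal.
print(sort(component_sizes, decreasing = TRUE))

# The qr slot is itself a list, so it can be broken down the same way.
qr_sizes <- vapply(fit$qr, function(x) as.numeric(object.size(x)), numeric(1))
print(sort(qr_sizes, decreasing = TRUE))
```

On a large dataset the components that grow with the number of rows (model, qr$qr, residuals, fitted.values, effects) should dominate, which is why they are the ones removed above.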

Nightingale · Answer 3 · 2016-11-02T13:54:45+0000

xappppp's answer is good, but it is not the whole story. The model also carries an environment that can be huge, and you can do something about that too (see: https://blogs.oracle.com/R/entry/is_the_size_of_your )

Either add this to xappppp's function:

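    # e is the environment in which the model formula was created.
    # If the model was fitted at top level, e is the global environment,
    # and the rm() call below would clear your workspace.
    e <- attr(cm$terms, ".Environment")
    parent.env(e) <- emptyenv()
    rm(list=ls(envir=e), envir=e)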

Or use this version, which removes less data but still lets you use summary():

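    cleanModel1 = function(cm) {
      # just in case we forgot to set
      # y=FALSE and model=FALSE
      cm$y = c()
      cm$model = c()

      # empty the environment captured by the formula
      # (do not run this on a model fitted at top level, where e is the global environment)
      e <- attr(cm$terms, ".Environment")
      parent.env(e) <- emptyenv()
      rm(list=ls(envir=e), envir=e)
      cm
    }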
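To see how much the captured environment can matter, here is a small sketch under stated assumptions. The model is fitted inside a function (fit_in_function, a made-up name) whose frame also holds a large unrelated vector, and the size is measured as the length of the serialized object, because object.size() does not count environments. The helper serialized_size is also just an illustration.

```r
# Fit inside a function so the formula captures the function's frame,
# which here also holds a large, unrelated vector.
fit_in_function <- function() {
  big_unrelated <- rnorm(1e6)
  lm(Volume ~ Girth + Height, data = trees)
}

# Serialized length in bytes, a proxy for what saveRDS() would write.
serialized_size <- function(x) length(serialize(x, NULL))

fit <- fit_in_function()
size_before <- serialized_size(fit)
coef_before <- coef(summary(fit))

# Same steps as in the snippet above: detach and empty the captured environment.
# Safe here only because e is the frame of fit_in_function(), not the global env.
e <- attr(fit$terms, ".Environment")
stopifnot(!identical(e, globalenv()))
parent.env(e) <- emptyenv()
rm(list = ls(envir = e), envir = e)

size_after <- serialized_size(fit)
coef_after <- coef(summary(fit))

print(c(before = size_before, after = size_after))
print(all.equal(coef_before, coef_after))
```

This only checks summary(). predict() with newdata evaluates the formula's variables with that environment as the enclosure, so it is worth testing predict() on your own model before relying on an environment whose parent is emptyenv().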
