Reusing a model built in R

When building a model in R, how do you keep the model specifications so that they can be reused for new data? Let's say I will build a logistic regression based on historical data, but I will not have new observations until next month. What is the best approach?

Things I reviewed:

  • Saving a model object and loading into a new session
  • I know that some models can be exported using PMML, but have not seen anything about importing PMML

I'm just trying to figure out what you do when you need to use your model in a new session.

Thanks in advance.

r models
Feb 25 2018-11-11T00:
2 answers

Reusing the model to predict new observations

If the model is not computationally expensive to fit, I tend to document the entire model-building process in an R script, which I rerun when necessary. If fitting the model involves a random element, I make sure to set a known random seed.

If the model is computationally expensive to fit, I still use a script as described above, but I save the fitted model object with save() to an .rda file. I then modify the script so that if the saved object exists it is loaded and, if not, the model is refitted, using a simple if()...else clause wrapped around the relevant parts of the code.

When loading a saved model object, be sure to reload any packages it requires, although in your case, if the logit model was fitted using glm(), there are no additional packages to load beyond those that come with R.

Here is an example:

    > set.seed(345)
    > df <- data.frame(x = rnorm(20))
    > df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
    > ## model
    > m1 <- lm(y ~ x, data = df)
    > ## save this model
    > save(m1, file = "my_model1.rda")
    > 
    > ## a month later, new observations are available:
    > newdf <- data.frame(x = rnorm(20))
    > ## load the model
    > load("my_model1.rda")
    > ## predict for the new `x`s in `newdf`
    > predict(m1, newdata = newdf)
            1         2         3         4         5         6 
    6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255 
            7         8         9        10        11        12 
    6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537 
           13        14        15        16        17        18 
    2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407 
           19        20 
    5.4247548 2.6906722 

If you want to automate this, I would probably do something like the following in a script:

    ## data
    df <- data.frame(x = rnorm(20))
    df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

    ## check if model exists? If not, refit:
    if (file.exists("my_model1.rda")) {
        ## load model
        load("my_model1.rda")
    } else {
        ## (re)fit the model
        m1 <- lm(y ~ x, data = df)
        ## save it so that later runs can load it instead of refitting
        save(m1, file = "my_model1.rda")
    }

    ## predict for new observations
    ## new observations
    newdf <- data.frame(x = rnorm(20))
    ## predict
    predict(m1, newdata = newdf)

Of course, the data generation code would be replaced by code that loads your actual data.
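For the logistic regression case raised in the question, the same save-and-load pattern might look like the sketch below. The data are simulated, and the names hist_df, new_df, logit_fit and the temporary file path are made up for illustration. Passing type = "response" asks predict() for probabilities rather than values on the logit scale.

```r
## Minimal sketch with simulated data; all object names are hypothetical
set.seed(1)
hist_df <- data.frame(x = rnorm(100))
hist_df$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * hist_df$x))

## fit the logistic regression on the historical data and save it
logit_fit <- glm(y ~ x, family = binomial, data = hist_df)
model_file <- file.path(tempdir(), "logit_fit.rda")  # placeholder path
save(logit_fit, file = model_file)

## later, in a new session: load() restores the object under its saved name
rm(logit_fit)
load(model_file)

## predicted probabilities for the new observations
new_df <- data.frame(x = rnorm(5))
p_new <- predict(logit_fit, newdata = new_df, type = "response")
p_new
```

In real use the file would live somewhere permanent rather than in tempdir(), which is cleared when the session ends.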

Updating a previously fitted model with new observations

If you want to refit the model using additional new observations, then update() is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to the 'data' argument, and then do the following:

 m2 <- update(m1, . ~ ., data = df) 

where m1 is the original, saved model; . ~ . is the change to the model formula, which in this case means keep all existing variables on both the left- and right-hand sides of ~ (in other words, make no changes to the model formula); and df is the data frame used to fit the original model, extended to include the newly available observations.

Here is a working example:

    > set.seed(123)
    > df <- data.frame(x = rnorm(20))
    > df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
    > ## model
    > m1 <- lm(y ~ x, data = df)
    > m1

    Call:
    lm(formula = y ~ x, data = df)

    Coefficients:
    (Intercept)            x  
          4.960        2.222  

    > 
    > ## new observations
    > newdf <- data.frame(x = rnorm(20))
    > newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
    > ## add on to df
    > df <- rbind(df, newdf)
    > 
    > ## update model fit
    > m2 <- update(m1, . ~ ., data = df)
    > m2

    Call:
    lm(formula = y ~ x, data = df)

    Coefficients:
    (Intercept)            x  
          4.928        2.187  

Others have mentioned formula() in the comments, which extracts the formula from a fitted model:

    > formula(m1)
    y ~ x
    > ## which can be used to set-up a new model call
    > ## so an alternative to update() above is:
    > m3 <- lm(formula(m1), data = df)

However, fitting a model may involve additional arguments, such as 'family' or 'subset' in the more complex model fitting functions, and the formula alone does not carry these over. If update() methods are available for your model fitting function (which they are for many common fitting functions, such as glm()), update() provides a simpler way to refit the model than extracting and reusing the model formula.
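A small sketch of that difference, using simulated data and made-up object names (d1, d2, g1, g2, g3): update() re-evaluates the original call, so the 'family' argument is kept, whereas a new glm() call built only from formula() falls back to the default family.

```r
## Minimal sketch with simulated data; object names are hypothetical
set.seed(2)
d1 <- data.frame(x = rnorm(50))
d1$y <- rbinom(50, size = 1, prob = plogis(d1$x))
g1 <- glm(y ~ x, family = binomial, data = d1)

## extend the data with new observations
d_new <- data.frame(x = rnorm(50))
d_new$y <- rbinom(50, size = 1, prob = plogis(d_new$x))
d2 <- rbind(d1, d_new)

## update() re-evaluates the original call, so family = binomial is kept
g2 <- update(g1, data = d2)

## reusing only the formula drops the family argument;
## glm() then uses its default, the gaussian family
g3 <- glm(formula(g1), data = d2)

family(g2)$family  # "binomial"
family(g3)$family  # "gaussian"
```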

If you intend to do all the modeling and future prediction in R, there really isn’t much point in abstracting the model through PMML or the like.

Feb 25 2018-11-21T00:

If you use the same names for the data frame and the variables, you can (at least for lm() and glm()) use the update function on the saved model:

    Df <- data.frame(X = 1:10, Y = (1:10) + rnorm(10))
    model <- lm(Y ~ X, data = Df)
    model

    Df <- rbind(Df, data.frame(X = 2:11, Y = (10:1) + rnorm(10)))

    update(model)

This is, of course, without any data preparation and so on. It simply reuses the model specification as it was set. Keep in mind that if you change the contrasts in the meantime, the new model is fitted with the new contrasts, not the old ones.
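The contrasts caveat can be seen in a small sketch. The data and the names d, fit_old and fit_new are hypothetical, and the sketch assumes the session starts with R's default treatment contrasts. Because the saved call does not record the contrasts, update() picks up whatever setting is in force when it runs.

```r
## Minimal sketch; data and object names are hypothetical
set.seed(3)
d <- data.frame(f = factor(rep(c("a", "b", "c"), each = 5)), y = rnorm(15))

## fitted under the contrasts in force now (treatment contrasts by default)
fit_old <- lm(y ~ f, data = d)

## change the global contrasts setting, then refit from the stored call
old_opts <- options(contrasts = c("contr.sum", "contr.poly"))
fit_new <- update(fit_old)
options(old_opts)  # restore the previous setting

## the coefficient names differ because the factor is coded differently
names(coef(fit_old))
names(coef(fit_new))
```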

Therefore, using a script is in most cases the better answer. You can wrap all the steps in a convenience function that simply takes the data frame, so you can source the script and then use the function on any new dataset. See also Gavin's answer on this.
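Such a convenience function might look like the sketch below. The name fit_model and its argument dat are invented for illustration; any data preparation would go inside the function so that it is repeated identically for every dataset.

```r
## Hypothetical helper: wraps the fitting steps so they can be rerun on any data frame
fit_model <- function(dat) {
  ## any data preparation would go here
  lm(Y ~ X, data = dat)
}

set.seed(4)
Df <- data.frame(X = 1:10, Y = (1:10) + rnorm(10))
model <- fit_model(Df)

## later: call the same function on the extended data
Df2 <- rbind(Df, data.frame(X = 2:11, Y = (10:1) + rnorm(10)))
model2 <- fit_model(Df2)
```

If the function lives in its own script file, source() on that file makes it available in any new session.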

Feb 25 2018-11-11T00: