Reusing the model to predict new observations
If the model is not computationally expensive to fit, I tend to document the entire model-building process in an R script, which I rerun whenever needed. If fitting the model involves a random element, I make sure to set a known random seed.
If the model is computationally expensive to fit, I still use a script as described above, but I save the model objects with save() to an .rda file. I then modify the script so that if the saved object exists, it is loaded, and if not, the model is refitted, using a simple if()...else clause wrapped around the relevant parts of the code.
When loading a saved model object, be sure to reload any packages it requires. In your case, if the logit model was fitted using glm(), there are no additional packages to load beyond those that come with R.
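For the logit case specifically, the save/load/predict cycle might look like the minimal sketch below. The simulated data, the object name `logit1`, and the temporary file path are assumptions made for illustration; they are not taken from the question.

```r
## Minimal sketch: save a logit model fitted with glm(), reload it, predict.
## Data, object names, and the file path are made up for illustration.
set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))

## fit the logit model; glm() is in the stats package, which ships with R
logit1 <- glm(y ~ x, family = binomial, data = dat)

## save to a temporary file so the sketch leaves nothing behind
model_file <- file.path(tempdir(), "logit_model.rda")
save(logit1, file = model_file)

## later: load into a separate environment so that objects of the same
## name in the workspace are not overwritten
e <- new.env()
load(model_file, envir = e)

## predicted probabilities (not log-odds) for new observations
newdat <- data.frame(x = c(-1, 0, 1))
p <- predict(e$logit1, newdata = newdat, type = "response")
p
```

Note that predict() on a glm() fit returns values on the link (log-odds) scale by default; type = "response" is what gives probabilities.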
Here is an example:
    > set.seed(345)
    > df <- data.frame(x = rnorm(20))
    > df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
    > ## model
    > m1 <- lm(y ~ x, data = df)
    > ## save this model
    > save(m1, file = "my_model1.rda")
    > 
    > ## a month later, new observations are available:
    > newdf <- data.frame(x = rnorm(20))
    > ## load the model
    > load("my_model1.rda")
    > ## predict for the new `x`s in `newdf`
    > predict(m1, newdata = newdf)
            1         2         3         4         5         6 
    6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255 
            7         8         9        10        11        12 
    6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537 
           13        14        15        16        17        18 
    2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407 
           19        20 
    5.4247548 2.6906722 
If I wanted to automate this, I would probably do the following in a script:
    ## data
    df <- data.frame(x = rnorm(20))
    df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

    ## check if model exists? If not, refit:
    if(file.exists("my_model1.rda")) {
        ## load model
        load("my_model1.rda")
    } else {
        ## (re)fit the model
        m1 <- lm(y ~ x, data = df)
    }

    ## predict for new observations
    ## new observations
    newdf <- data.frame(x = rnorm(20))
    ## predict
    predict(m1, newdata = newdf)
Of course, the data generation code would be replaced by code that loads your actual data.
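One possible refinement of the script above, sketched under stated assumptions: the refit branch also saves the model, so the next run finds the cached file. The helper `load_or_fit()` and the temporary file path are hypothetical names introduced here for illustration; neither is part of R or of the script above.

```r
## Sketch only: `load_or_fit()` is a hypothetical helper, not part of R.
load_or_fit <- function(file, data) {
    if (file.exists(file)) {
        ## restore into a private environment so the caller's workspace
        ## is not modified as a side effect
        e <- new.env()
        load(file, envir = e)
        e$m1
    } else {
        ## (re)fit the model and cache it for the next run
        m1 <- lm(y ~ x, data = data)
        save(m1, file = file)
        m1
    }
}

set.seed(345)  # known seed so the simulated data are reproducible
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

## illustrative path inside the session's temporary directory
model_file <- file.path(tempdir(), "my_model1_cached.rda")

fit_first  <- load_or_fit(model_file, df)  # no file yet: fits and saves
fit_second <- load_or_fit(model_file, df)  # file exists: loads the saved fit
```

A cached file says nothing about whether the data have changed since it was written, so the file has to be deleted by hand whenever the model should be refitted.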
Updating a previously fitted model with new observations
If you want to refit the model using additional new observations, then update() is a useful function. All it does is refit the model with one or more of the model arguments changed. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to the 'data' argument, and then do the following:
m2 <- update(m1, . ~ ., data = df)
where m1 is the original, saved model fit; . ~ . is the change to the model formula, which in this case means keep all existing variables on both the left- and right-hand sides of ~ (in other words, make no changes to the model formula); and df is the data frame used to fit the original model, extended to include the newly available observations.
Here is a working example:
    > set.seed(123)
    > df <- data.frame(x = rnorm(20))
    > df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
    > ## model
    > m1 <- lm(y ~ x, data = df)
    > m1

    Call:
    lm(formula = y ~ x, data = df)

    Coefficients:
    (Intercept)            x  
          4.960        2.222  

    > 
    > ## new observations
    > newdf <- data.frame(x = rnorm(20))
    > newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
    > ## add on to df
    > df <- rbind(df, newdf)
    > 
    > ## update model fit
    > m2 <- update(m1, . ~ ., data = df)
    > m2

    Call:
    lm(formula = y ~ x, data = df)

    Coefficients:
    (Intercept)            x  
          4.928        2.187  
Others have mentioned formula() in the comments, which extracts the formula from a fitted model:
    > formula(m1)
    y ~ x
However, the formula alone does not capture any additional arguments used when fitting the model, such as the 'family' or 'subset' arguments of the more complex model fitting functions, so these would have to be specified again by hand. If an update() method is available for your model fitting function (which it is for many common fitting functions, such as glm()), it provides a simpler way to update a model fit than extracting and reusing the model formula.
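The difference can be seen in a small sketch. The simulated data, the grouping variable `g`, and the object names are assumptions made for illustration. update() re-evaluates the original call with only `data` changed, whereas refitting from formula() alone falls back to the defaults of glm().

```r
## Sketch with made-up data: update() versus reusing only the formula.
set.seed(1)
dat <- data.frame(x = rnorm(100), g = rep(c("a", "b"), 50))
dat$y <- rbinom(100, 1, plogis(0.3 + dat$x))

## original fit: logit model on the subset g == "a"
g1 <- glm(y ~ x, family = binomial, data = dat, subset = g == "a")

## new observations arrive and are appended to the data
newdat <- data.frame(x = rnorm(100), g = rep(c("a", "b"), 50))
newdat$y <- rbinom(100, 1, plogis(0.3 + newdat$x))
dat2 <- rbind(dat, newdat)

## update() reuses the stored call, so family and subset carry over
g2 <- update(g1, . ~ ., data = dat2)

## reusing only the formula drops them: glm() defaults to the gaussian
## family and all rows are used
g3 <- glm(formula(g1), data = dat2)

family(g2)$family  # "binomial"
family(g3)$family  # "gaussian"
nobs(g2)           # 100 (rows with g == "a")
nobs(g3)           # 200 (all rows)
```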
If you intend to do all the modeling and future prediction in R, there really isn't much point in abstracting the model through PMML or the like.