Linear model diagnostics for Bayesian models using rstan

I am looking for an effective method for identifying data points that have an outsized influence on the parameters of a linear model. This is straightforward with ordinary linear models, but I'm not sure how to do it with Bayesian linear models.

Here is one way with conventional linear models: we can calculate Cook's distance for each data point and draw diagnostic plots that include the Cook's distances:

# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# make a linear model for each species
xy <- iris %>%
  nest(-Species) %>%
  mutate(model = map(data, ~lm(Sepal.Length ~ Petal.Width, data = .)),
         fit = map(model, augment))

Here we have a data frame with nested list columns; the model column contains a linear model for each species:

> xy
# A tibble: 3 × 4
     Species              data    model                   fit
      <fctr>            <list>   <list>                <list>
1     setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3  virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>

The broom::augment function lets us add the Cook's distance values for each data point to this data frame, and we can inspect them as follows:

# inspect Cook's distance values
xy %>% unnest(fit) %>% arrange(desc(.cooksd))

# A tibble: 150 × 10
      Species Sepal.Length Petal.Width  .fitted    .se.fit     .resid       .hat    .sigma    .cooksd
       <fctr>        <dbl>       <dbl>    <dbl>      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>
1  versicolor          5.9         1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2      setosa          5.0         0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3   virginica          4.9         1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4      setosa          5.7         0.4 5.149247 0.08625887  0.5507534 0.06357957 0.3355980 0.09396588
5      setosa          4.3         0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6   virginica          5.8         2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7   virginica          7.2         1.6 6.310746 0.16207266  0.8892538 0.06909799 0.6084108 0.08293253
8  versicolor          4.9         1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9      setosa          5.8         0.2 4.963212 0.05287342  0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor          6.0         1.0 5.471005 0.11998077  0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>

And using the autoplot method we can make informative diagnostic plots that show the Cook's distance values and help us quickly identify data points with an outsized effect on the model parameters:

# plot model diagnostics
library(ggplot2)
library(ggfortify)

diagnostic_plots_df <- xy %>%
  mutate(diagnostic_plots = map(model,
    ~autoplot(., which = 1:6, ncol = 3, label.size = 3)))

Here is just one of the resulting sets of plots:

> diagnostic_plots_df$diagnostic_plots[[1]]

[image: autoplot diagnostic panels for the setosa model, including the Cook's distance plots]

Now, with a Bayesian linear model, we can similarly fit a linear model for each group in the data frame:

# Bayesian linear model diagnostics
library(rstanarm)

bayes_xy <- iris %>%
  nest(-Species) %>%
  mutate(model = map(data, ~stan_lm(Sepal.Length ~ Petal.Width, data = .,
                                    prior = NULL, chains = 1, cores = 2, seed = 1)),
         fit = map(model, augment))

> bayes_xy
# A tibble: 3 × 4
     Species              data         model                   fit
      <fctr>            <list>        <list>                <list>
1     setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3  virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>

But the broom::augment output has nothing like the Cook's distance values:

# inspect fit diagnostics
bayes_xy %>% unnest(fit)

# A tibble: 150 × 6
   Species Sepal.Length Petal.Width  .fitted    .se.fit      .resid
    <fctr>        <dbl>       <dbl>    <dbl>      <dbl>       <dbl>
1   setosa          5.1         0.2 4.963968 0.06020298  0.13482025
2   setosa          4.9         0.2 4.963968 0.06020298 -0.06517975
3   setosa          4.7         0.2 4.963968 0.06020298 -0.26517975
4   setosa          4.6         0.2 4.963968 0.06020298 -0.36517975
5   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
6   setosa          5.4         0.4 5.151501 0.11299956  0.21818386
7   setosa          4.6         0.3 5.057734 0.05951488 -0.47349794
8   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
9   setosa          4.4         0.2 4.963968 0.06020298 -0.56517975
10  setosa          4.9         0.1 4.870201 0.11408783  0.04313845
# ... with 140 more rows

Nor is there an autoplot method:

# plot model diagnostics
bayes_diagnostic_plots_df <- bayes_xy %>%
  mutate(diagnostic_plots = map(model,
    ~autoplot(., which = 1:6, ncol = 3, label.size = 3)))
# Error: there doesn't seem to be an autoplot method for stan_lm output

shinystan::launch_shinystan(bayes_xy$model[[1]])
# This is quite interesting, but nothing at the level of specific data points

Some of the scientific literature discusses methods such as model perturbations, phi-divergence, Cook's posterior mode distance, Cook's posterior mean distance, the Kullback-Leibler divergence, and so on. But I can't see anywhere that these have been implemented in R code, and I am stuck.
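The closest I can think of is a naive, brute-force sketch (my own idea, not one of the methods from the literature above): refit the model once per observation with that observation left out, and measure how far the posterior point estimate of the slope moves. It is slow (one MCMC run per data point) and assumes rstanarm's update method accepts a new data argument:

# brute-force case deletion: refit without each observation and record the
# shift in the point estimate (posterior median) of the slope
library(rstanarm)

d <- subset(iris, Species == "setosa")
full_fit <- stan_lm(Sepal.Length ~ Petal.Width, data = d,
                    prior = NULL, chains = 1, seed = 1)
full_slope <- coef(full_fit)["Petal.Width"]

slope_shift <- vapply(seq_len(nrow(d)), function(i) {
  refit <- update(full_fit, data = d[-i, ])
  coef(refit)["Petal.Width"] - full_slope
}, numeric(1))

# observations with the largest absolute shifts are the ones the posterior
# is most sensitive to
head(order(-abs(slope_shift)))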

There is an unanswered question on this topic on Cross Validated. I am posting here because I am looking for ideas about code to compute influence statistics (and not about statistical theory and approach, etc., which belongs in that other question).

How can I get something like Cook's distance measures from the output of rstanarm::stan_lm?

+5
2 answers

This post by Aki Vehtari says it best:

The difference between lppd_i and loo_i has been used as a measure of sensitivity (see, e.g., Gelfand et al., 1992). The Pareto shape parameter estimate k can also indicate when the difference between lppd_i and loo_i is large. It is not yet clear whether the Pareto shape parameter estimate k is better than lppd_i - loo_i, but at least we know that the estimate of lppd_i - loo_i is too small if k is close to 1 or larger, so it is better to look at k. In the example with the normal model, k for one observation is large, but with the Student-t model k is smaller. The normal model is the same as the Student-t model but with a very strong prior on the degrees of freedom. So this is not just about having a stronger prior or more shrinkage, but about having a model that can describe the observations well. With increased shrinkage and a poor observation model, one observation could still be surprising. Naturally, it is not always the best solution to switch to a more robust observation model that allows "outliers". Instead, it might be better to make the regression function more nonlinear (that is, use a less strong prior), or to transform covariates, or to add more covariates. So I do recommend looking at the Pareto shape parameter values, but I do not recommend increasing the shrinkage if the values are large.
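As a concrete starting point, here is a minimal sketch of computing the lppd_i - loo_i difference described above, assuming rstanarm's log_lik() and loo() functions; this difference should also match the pointwise p_loo column that loo reports:

# lppd_i - loo_i as a per-observation sensitivity measure
library(rstanarm)
library(loo)

fit <- stan_lm(Sepal.Length ~ Petal.Width,
               data = subset(iris, Species == "virginica"),
               prior = NULL, chains = 1, seed = 1)

ll <- log_lik(fit)  # draws x observations matrix of log-likelihoods

# log pointwise predictive density: log of the mean of exp(log-lik) over the
# posterior draws, computed stably with the log-sum-exp trick
log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))
lppd_i <- apply(ll, 2, log_mean_exp)

loo_out <- loo(fit)
loo_i <- loo_out$pointwise[, "elpd_loo"]

sensitivity <- lppd_i - loo_i  # larger values = more influential observations
head(sort(sensitivity, decreasing = TRUE))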

You can obtain the Pareto shape parameter estimates k from the $pareto_k element of the list created by the loo function in the loo package, which is re-exported by the rstanarm package. If that value is above 0.7 (by default), the loo function will recommend that you refit the model leaving out that observation, because the posterior distribution is probably too sensitive to that observation to satisfy the LOO assumption that each observation has a negligible effect on the posterior distribution.
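Continuing from the sketch above, the same loo_out object carries these estimates (the element name follows the description above; newer loo releases also provide a pareto_k_values() helper):

# extract and rank the Pareto shape parameter estimates
k <- loo_out$pareto_k             # or loo::pareto_k_values(loo_out)
which(k > 0.7)                    # observations loo would suggest refitting without
head(sort(k, decreasing = TRUE))  # most influential observations first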

In the OP's case, only the seventh observation has a Pareto shape parameter estimate slightly greater than 0.5, so that observation probably does not have an outsized effect on the posterior. But you would definitely want to investigate observations with a value greater than 1.0.

You can also call the plot method on the loo object, especially with the non-default option label_points = TRUE, to visualize the Pareto shape parameter estimates.
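For example, continuing from the same loo_out object:

# plot the Pareto k estimate for each observation, labelling influential points
plot(loo_out, label_points = TRUE)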

+3

Looking at some discussion on the stan-users mailing list, I saw that the output from loo contains "pointwise contributions for each observation". So here is an attempt to work with those:

# Bayesian linear model diagnostics
library(dplyr)
library(purrr)
library(tidyr)
library(rstanarm)
library(loo)

bayes_xy <- iris %>%
  nest(-Species) %>%
  mutate(model = map(data, ~stan_lm(Sepal.Length ~ Petal.Width, data = .,
                                    prior = NULL, chains = 1, cores = 2, seed = 1)))

bayes_xy_loo <- bayes_xy %>%
  mutate(loo_out = map(model, ~loo(.)))

library(ggplot2)
library(ggrepel)

n <- 5 # how many points to label

my_plots <- bayes_xy_loo %>%
  select(loo_out) %>%
  mutate(loo_pointwise = map(.$loo_out, ~data.frame(.$pointwise))) %>%
  mutate(plots = map(.$loo_pointwise,
    ~ggplot(., aes(elpd_loo, looic)) +
       geom_point(aes(size = p_loo)) +
       geom_text_repel(data = .[tail(order(.$looic), n), ],
                       aes(label = row.names(.[tail(order(.$looic), n), ])),
                       nudge_x = -0.1, nudge_y = -0.3) +
       geom_label_repel(data = .[tail(order(.$elpd_loo), n), ],
                        aes(label = row.names(.[tail(order(.$elpd_loo), n), ])),
                        nudge_x = 0.1, nudge_y = 0.1) +
       xlab("Expected log pointwise \npredictive density") +
       ylab("LOO information \ncriterion") +
       scale_size_area(name = "Estimated \neffective\nnumber \nof parameters") +
       theme_minimal(base_size = 10)))

do.call(gridExtra::grid.arrange, my_plots$plots)

[image: three elpd_loo vs. looic scatterplots, one per species, with the most extreme observations labelled]

However, the points flagged as influential do not correspond well. For example, in the question we have observations 7, 15 and 30 with high Cook's distance values. In the loo output, only observation 15 seems to be flagged as unusual. So perhaps this is not the right way to do it.
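A possible follow-up (my own sketch, not from the mailing list): line each observation's Pareto shape parameter estimate up against its Cook's distance, rather than using looic. The pareto_k_values() helper and the column binding by row order are assumptions here:

# both tables are in the original iris row order within species, so the
# Pareto k values can be bound column-wise to the Cook's distances from
# the question's xy data frame
k_vs_cooksd <- bayes_xy_loo %>%
  transmute(Species, pareto_k = map(loo_out, loo::pareto_k_values)) %>%
  unnest(pareto_k) %>%
  bind_cols(xy %>% unnest(fit) %>% select(.cooksd))

# if the two diagnostics agree, the same rows should top both rankings
k_vs_cooksd %>% arrange(desc(pareto_k)) %>% head()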

+1
