I am looking for an effective method for identifying data points that adversely affect the parameters of a linear model. This is straightforward with regular linear models, but I'm not sure how to do this with Bayesian linear models.
Here, one way with conventional linear models, we can calculate the Cook distance for each data point and calculate diagnostic graphs that include the Cook distances:
# ordinary linear model diagnostics, similar to my use-case library(dplyr) library(purrr) library(tidyr) library(broom) # make linear models for each type of species xy <- iris %>% nest(-Species) %>% mutate(model = map(data, ~lm(Sepal.Length ~ Petal.Width, data = .)), fit = map(model, augment))
Here we have a data frame with nested lists, the model column contains a linear model for each view:
> xy # A tibble: 3 × 4 Species data model fit <fctr> <list> <list> <list> 1 setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]> 2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]> 3 virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
The broom::augment function allowed us to add the Cook distance values for each data point in this data frame, and we can check them as follows:
# inspect Cook distance values xy %>% unnest(fit) %>% arrange(desc(.cooksd)) # A tibble: 150 × 10 Species Sepal.Length Petal.Width .fitted .se.fit .resid .hat .sigma .cooksd <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 versicolor 5.9 1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448 2 setosa 5.0 0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214 3 virginica 4.9 1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787 4 setosa 5.7 0.4 5.149247 0.08625887 0.5507534 0.06357957 0.3355980 0.09396588 5 setosa 4.3 0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408 6 virginica 5.8 2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693 7 virginica 7.2 1.6 6.310746 0.16207266 0.8892538 0.06909799 0.6084108 0.08293253 8 versicolor 4.9 1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526 9 setosa 5.8 0.2 4.963212 0.05287342 0.8367879 0.02388828 0.3228858 0.07500610 10 versicolor 6.0 1.0 5.471005 0.11998077 0.5289949 0.07546185 0.4340307 0.06475225 # ... with 140 more rows, and 1 more variables: .std.resid <dbl>
And using the autoplot method autoplot we can make informative diagnostic graphs that show the Cook distance values and help us quickly identify data points with a negative effect on model parameters:
# plot model diagnostics library(ggplot2) library(ggfortify) diagnostic_plots_df <- xy %>% mutate(diagnostic_plots = map(model, ~autoplot(., which = 1:6, ncol = 3, label.size = 3)))
Here is just one of the graphs received:
> diagnostic_plots_df[[1]]

Now, using the Bayesian linear model, we can similarly calculate the linear models for each group in the data frame:
# Bayesian linear model diagnostics library(rstanarm) bayes_xy <- iris %>% nest(-Species) %>% mutate(model = map(data, ~stan_lm(Sepal.Length ~ Petal.Width, data = ., prior = NULL, chains = 1, cores = 2, seed = 1)), fit = map(model, augment)) > bayes_xy # A tibble: 3 × 4 Species data model fit <fctr> <list> <list> <list> 1 setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]> 2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]> 3 virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
But the broom::augment method has nothing like the distance value in a cube:
# inspect fit diagnostics bayes_xy %>% unnest(fit) # A tibble: 150 × 6 Species Sepal.Length Petal.Width .fitted .se.fit .resid <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 setosa 5.1 0.2 4.963968 0.06020298 0.13482025 2 setosa 4.9 0.2 4.963968 0.06020298 -0.06517975 3 setosa 4.7 0.2 4.963968 0.06020298 -0.26517975 4 setosa 4.6 0.2 4.963968 0.06020298 -0.36517975 5 setosa 5.0 0.2 4.963968 0.06020298 0.03482025 6 setosa 5.4 0.4 5.151501 0.11299956 0.21818386 7 setosa 4.6 0.3 5.057734 0.05951488 -0.47349794 8 setosa 5.0 0.2 4.963968 0.06020298 0.03482025 9 setosa 4.4 0.2 4.963968 0.06020298 -0.56517975 10 setosa 4.9 0.1 4.870201 0.11408783 0.04313845 # ... with 140 more rows
And not the autoplot method:
# plot model diagnostics bayes_diagnostic_plots_df <- bayes_xy %>% mutate(diagnostic_plots = map(model, ~autoplot(., which = 1:6, ncol = 3, label.size = 3)))
Some scientific literature talks about methods such as model perturbations , phi divergence, the distance in the rear Cook mode and the average distance from the back wall , the Kullback-Leibler divergence , etc. , etc. . But I don’t see anywhere that this has been learned using R code, and I am stuck.
unanswered question on this topic in Cross-validated. I am posting here because I am looking for ideas about writing code to compute impact statistics (and not about statistical theory and approach, etc., which should go on this other issue)
How can I get something like measuring the distance of a cookie at the output of rstanarm::stan_lm ?