Jan 30, 2025
On Tuesday, we learned about various ways to check MCMC convergence and model fit.
Traceplots, effective sample size (\(n_{eff}\)), MC standard error, \(\hat{R}\), sampling issues
Posterior predictive checks
Model checks using shinystan
Today, we will learn about model comparisons.
In statistical modeling, a more complex model almost always results in a better fit to the data.
If one has 10 observations, one can have a model with 10 parameters that can perfectly predict every single data point (by just having a parameter to predict each data point).
There are two problems with overly complex models.
They become increasingly hard to interpret (think a straight line versus a polynomial).
They are more at risk of overfitting, so that they fail to generalize to future observations.
Let’s explore the idea of model fit using an example dataset from the openintro package called bdims.
This dataset contains body girth measurements and skeletal diameter measurements.
Today we will explore the association between height and weight.

When using height to predict weight, we can fit models of increasing complexity using higher-order polynomials.
Let’s fit the following models to the subset of 10 data points:
\[\begin{aligned} \mathbb{E}[weight_i] &= \beta_0 + \beta_1 height_i\\ \mathbb{E}[weight_i] &= \beta_0 + \beta_1 height_i + \beta_2 height_i^2\\ \mathbb{E}[weight_i] &= \beta_0 + \beta_1 height_i + \beta_2 height_i^2 + \beta_3 height_i^3\\ \mathbb{E}[weight_i] &= \beta_0 + \beta_1 height_i + \beta_2 height_i^2 + \beta_3 height_i^3 + \beta_4 height_i^4 \end{aligned}\]
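As a rough illustration (not the lecture code), here is a minimal R sketch that fits these four polynomial models to a random 10-point subset; it assumes the height and weight columns in bdims are named hgt and wgt.
###Hedged sketch: fit polynomials of degree 1-4 to a 10-point subset of bdims
###(assumes bdims has columns hgt for height and wgt for weight)
library(openintro)
set.seed(1)
dat <- bdims[sample(nrow(bdims), 10), ]
fits <- lapply(1:4, function(d) lm(wgt ~ poly(hgt, d), data = dat))
###In-sample R^2 increases with the polynomial degree
sapply(fits, function(m) summary(m)$r.squared)
The in-sample fit improves with every added term, which is exactly the behavior discussed below.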








With more complex models, out-of-sample prediction becomes worse.
This is because a complex model tailors its coefficients to the sampling error and noise in the current data set, so it does not generalize to new observations.
Therefore, our goal in model comparison is to choose a model with the following two properties:
It is complex enough to capture the essence of the data generation process (and thus avoid underfitting),
It avoids overfitting, so that the model remains useful for predicting new observations.
This is the trade-off between underfitting and overfitting (in machine learning it is commonly called the bias-variance trade-off).
A simple model tends to produce biased predictions because it does not capture the essence of the data generating process.
A model that is overly complex is unbiased but results in a lot of uncertainty in the prediction.
Polynomials are merely one example of comparing simple to complex models. You can think about:
Models with and without interactions,
Models with a few predictors versus hundreds of predictors,
Regression analyses versus hierarchical models, etc.
When comparing models, we prefer models that are closer to the true data-generating process.
We need some ways to quantify the degree of closeness to the true model. Note that in this context models refer to the distributional family as well as the parameter values.
For example, the model \(Y_i \sim N(5,2)\) is a different model than \(Y_i \sim N(3,2)\), which is a different model than \(Y_i \sim Gamma(2,2)\).
One way to quantify the degree of closeness to the true model is using Kullback-Leibler (KL) divergence.
\[\begin{aligned} D_{KL}\left(M_0 \| M_1\right) &= \int_{-\infty}^{\infty} f_{M_0}(\mathbf{Y}) \log\frac{f_{M_0}(\mathbf{Y})}{f_{M_1}(\mathbf{Y})} d\mathbf{Y}\\ &= \int_{-\infty}^{\infty} f_{M_0}(\mathbf{Y}) \log f_{M_0}(\mathbf{Y})d\mathbf{Y} - \int_{-\infty}^{\infty} f_{M_0}(\mathbf{Y}) \log f_{M_1}(\mathbf{Y})d\mathbf{Y} \end{aligned}\]
As an example, assume that the data are generated by a true model \(M_0\), and we have two candidate models \(M_1\) and \(M_2\), where
\(M_0: Y_i \sim N(3,2)\)
\(M_1: Y_i \sim N(3.5, 2.5)\)
\(M_2: Y_i \sim Cauchy(3,2)\)
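As a hedged numerical sketch (not part of the lecture code), we can approximate the divergence from \(M_0\) to each candidate by drawing from \(M_0\) and averaging the log density ratio:
###Hedged sketch: Monte Carlo approximation of D_KL(M0 || M1) and D_KL(M0 || M2)
set.seed(1)
y <- rnorm(1e5, mean = 3, sd = 2)                  # draws from the true model M0
kl_m1 <- mean(dnorm(y, 3, 2, log = TRUE) - dnorm(y, 3.5, 2.5, log = TRUE))
kl_m2 <- mean(dnorm(y, 3, 2, log = TRUE) - dcauchy(y, location = 3, scale = 2, log = TRUE))
c(kl_m1, kl_m2)                                    # smaller divergence = closer to M0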

Note that in the expression for \(D_{KL}\), the first term depends only on the true model \(M_0\), so it is identical for every candidate model.
Therefore, it is sufficient to compare models on the second term,
\[\int_{-\infty}^{\infty} f_{M_0}(\mathbf{Y}) \log f_{M_1}(\mathbf{Y})d\mathbf{Y},\] which can also be written as, \(\mathbb{E}_{M_0}\left[\log f_{M_1}(\mathbf{Y})\right].\)
This term is the expected log predictive density (elpd).
A larger elpd is preferred. Why?
In the real world, we do not know \(M_0\).
If we did, we would simply use \(M_0\) as our model and there would be no need for model comparison.
Even if we knew the true model, we would still need to estimate the parameter values.
Thus, we cannot compute elpd, since the expectation is over \(f_{M_0}(\mathbf{Y})\).
We need to estimate elpd!
elpd is an expectation, so we can think about estimating it using Monte Carlo sampling, \[\frac{1}{S}\sum_{s = 1}^S\log f_{M_1}\left(\mathbf{Y}^{(s)}\right)\rightarrow \mathbb{E}_{M_0}\left[\log f_{M_1}(\mathbf{Y})\right], \quad \mathbf{Y}^{(s)} \sim f_{M_0}(\mathbf{Y}).\]
A naive way to approximate draws from \(f_{M_0}(\mathbf{Y})\) is to treat the observed data as if they came from the true model.
This is equivalent to drawing \(\mathbf{Y}^{(s)}\) from the empirical distribution of \(\{Y_1,\ldots,Y_n\}\).
This leads to an overly optimistic estimate and favors complex models.
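A quick sketch of why this is optimistic, using hypothetical simulated data: the in-sample log-likelihood can only improve as polynomial terms are added, so the most complex model always looks best.
###Hedged sketch: in-sample fit always improves with model complexity
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
###Log-likelihoods of polynomial fits of increasing degree (monotonically increasing)
sapply(1:6, function(d) as.numeric(logLik(lm(y ~ poly(x, d)))))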
A better way to estimate elpd is to collect data on a new independent sample that is believed to share the same data generating process as the current sample, and estimate elpd on the new sample.
This is called out-of-sample validation.
The problem, of course, is we usually do not have the resources to collect a new sample.
Therefore, statisticians have worked hard to find ways to estimate elpd from the current sample, and there are two broad approaches, information criteria and cross-validation.
Information criteria: AIC, DIC, and WAIC, which estimate the elpd in the current sample, minus a correction factor.
Cross validation: A method that splits the current sample into \(K\) parts, estimates the parameters in \(K − 1\) parts, and estimates the elpd in the remaining 1 part.
Several information criteria have been proposed that do not require fitting the model several times, including AIC, DIC, and WAIC.
We will introduce the information criteria, assuming a likelihood \(f(\mathbf{Y} | \boldsymbol{\theta})\) for observed data \(\mathbf{Y} = (Y_1,\ldots,Y_n)\) with population parameter \(\boldsymbol{\theta}\).
Information criteria are often presented as deviance, defined as, \(D(\mathbf{Y}|\boldsymbol{\theta}) = −2 \log f(\mathbf{Y}|\boldsymbol{\theta})\).
Ideally, models will have small deviance.
However, if a model is too complex it will have small deviance but be unstable (overfitting).
The Akaike information criterion (AIC) estimates the elpd as,
\[\widehat{\text{elpd}}_{\text{AIC}} = \log f(\mathbf{Y} | \hat{\boldsymbol{\theta}}_{\text{MLE}}) - p,\] where \(p\) is the number of parameters estimated in the model and \(\hat{\boldsymbol{\theta}}_{\text{MLE}}\) is the MLE point estimate.
\(\text{AIC} = -2\log f(\mathbf{Y} | \hat{\boldsymbol{\theta}}_{\text{MLE}}) + 2p\)
\(p\) is an adjustment for overfitting, but once we go beyond linear models, we cannot simply add \(p\).
Informative priors tend to reduce the amount of overfitting.
Models with smaller AIC are preferred.
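For an ordinary linear model, the deviance-plus-penalty formula can be checked against R's built-in AIC(); a minimal sketch with hypothetical simulated data:
###Hedged sketch: AIC by hand vs. stats::AIC() for a simple linear model
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
p <- length(coef(fit)) + 1                  # two coefficients plus the error variance
-2 * as.numeric(logLik(fit)) + 2 * p        # should match AIC(fit)
AIC(fit)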
The deviance information criterion (DIC) estimates the elpd as,
\[\widehat{\text{elpd}}_{\text{DIC}} = \log f(\mathbf{Y} | \hat{\boldsymbol{\theta}}_{\text{Bayes}}) - p_{\text{DIC}},\] where \(\hat{\boldsymbol{\theta}}_{\text{Bayes}}\) is a Bayesian point estimate, typically a posterior mean, and \(p_{\text{DIC}}\) is an estimate of the complexity penalty,
\[p_{\text{DIC}} = 2 \left(\log f(\mathbf{Y} | \hat{\boldsymbol{\theta}}_{\text{Bayes}}) - \mathbb{E}_{\boldsymbol{\theta} | \mathbf{Y}}\left[\log f(\mathbf{Y} | \boldsymbol{\theta}) \right]\right).\]
The second term can be estimated by Monte Carlo integration over the posterior draws.
\(\text{DIC} = -2\log f(\mathbf{Y} | \hat{\boldsymbol{\theta}}_{\text{Bayes}}) + 2p_{\text{DIC}}.\)
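As a hedged illustration (not the lecture code), DIC can be computed directly from posterior draws; here the "posterior" for a normal mean with known standard deviation is a stand-in analytic approximation.
###Hedged sketch: DIC for a normal-mean model with known sd = 1 (hypothetical data)
set.seed(1)
y <- rnorm(20, mean = 1)
###Stand-in posterior draws for mu (flat prior gives a N(ybar, 1/n) posterior)
mu_draws <- rnorm(4000, mean = mean(y), sd = 1 / sqrt(length(y)))
log_lik <- function(mu) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
mu_hat <- mean(mu_draws)                                       # Bayesian point estimate
p_dic <- 2 * (log_lik(mu_hat) - mean(sapply(mu_draws, log_lik)))
dic <- -2 * log_lik(mu_hat) + 2 * p_dic
c(p_DIC = p_dic, DIC = dic)                                    # p_DIC should be near 1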
Advantages of DIC:
The effective number of parameters is a useful measure of model complexity.
Intuitively, if there are \(p\) parameters and we have uninformative priors, then \(p_{\text{DIC}} \approx p\).
However, \(p_{\text{DIC}} \ll p\) if there are strong priors.
Disadvantages of DIC:
DIC can only be used to compare models with the same likelihood.
DIC really only applies when the posterior is approximately normal, and will give misleading results when the posterior is far from normality (e.g., bimodal).
The Watanabe-Akaike or widely applicable information criterion (WAIC) estimates the elpd as,
\[\widehat{\text{elpd}}_{\text{WAIC}} = \text{lppd} - p_{\text{WAIC}}.\]
The log pointwise predictive density (lppd) is given by, \[\text{lppd} = \log \prod_{i=1}^n f(Y_i | \mathbf{Y}) = \sum_{i=1}^n \log \int f\left(Y_i | \boldsymbol{\theta}\right)f(\boldsymbol{\theta}| \mathbf{Y}) d\boldsymbol{\theta}.\]
lppd can be estimated as, \(\sum_{i=1}^n \log \left(\frac{1}{S} \sum_{s = 1}^S f\left(Y_i | \boldsymbol{\theta}^{(s)}\right)\right)\), where \(\boldsymbol{\theta}^{(s)}\) are drawn from the posterior.
There are two common estimates of \(p_{\text{WAIC}}\), both of which can be estimated using MC samples of the posterior. \[\begin{aligned} p_{\text{WAIC}_1} &= 2 \sum_{i=1}^n\left(\log \left( \mathbb{E}_{\boldsymbol{\theta} | \mathbf{Y}}\left[f(Y_i | \boldsymbol{\theta})\right]\right) - \mathbb{E}_{\boldsymbol{\theta} | \mathbf{Y}}\left[\log f(Y_i | \boldsymbol{\theta}) \right]\right)\\ p_{\text{WAIC}_2} &= \sum_{i=1}^n \mathbb{V}_{\boldsymbol{\theta} | \mathbf{Y}}\left(\log f\left(Y_i | \boldsymbol{\theta}\right)\right) \end{aligned}\]
WAIC has the desirable property of averaging over the posterior distribution, instead of conditioning on a point estimate.
\(p_{\text{WAIC}}\) can be thought of as an approximation to the number of unconstrained parameters in the model.
In practice, \(p_{\text{WAIC}_2}\) is often used, since it is theoretically closer to LOO-CV.
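Given an \(S \times n\) matrix of pointwise log-likelihood draws (such as the one we extract with loo::extract_log_lik() later), lppd and \(p_{\text{WAIC}_2}\) can be computed in a few lines; a minimal sketch:
###Hedged sketch: WAIC from an S x n matrix of pointwise log-likelihood draws
waic_by_hand <- function(log_lik) {
  lppd <- sum(log(colMeans(exp(log_lik))))      # log pointwise predictive density
  p_waic <- sum(apply(log_lik, 2, var))         # p_WAIC2: posterior variance of log f(Y_i | theta)
  elpd <- lppd - p_waic
  c(elpd_waic = elpd, p_waic = p_waic, waic = -2 * elpd)
}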
A common approach to compare models is using cross-validation.
This is exactly the same procedure used in classical statistics.
This operates under the assumption that the true model likely produces better out-of-sample predictions than competing models.
Advantages: Simple, intuitive, and broadly applicable.
Disadvantages: Slow because it requires several model fits and it is hard to say a difference is statistically significant.
Split the data into \(K\) equally-sized groups.
Set aside group \(k\) as test set and fit the model to the remaining \(K − 1\) groups.
Make predictions for the test set \(k\) based on the model fit to the training data.
Repeat the previous two steps for \(k = 1, \dots, K\), giving a predicted value \(\widehat{Y}_i\) for all \(n\) observations.
Measure prediction accuracy, e.g.,
\[MSE = \frac{1}{n}\sum_{i=1}^n (Y_i - \widehat{Y}_i)^2.\]
Usually \(K\) is either 5 or 10.
\(K = n\) is called leave-one-out cross-validation (LOO-CV), which is great but slow.
The predicted value \(\widehat{Y}_i\) can be either the posterior predictive mean or median.
Mean squared error (MSE) can be replaced with mean absolute deviation (MAD),
\[MAD = \frac{1}{n}\sum_{i=1}^n |Y_i - \widehat{Y}_i|.\]
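Below is a minimal sketch of 5-fold cross-validation for an ordinary linear model with hypothetical simulated data; for a Bayesian model, the same loop applies with the posterior predictive mean or median as \(\widehat{Y}_i\).
###Hedged sketch: 5-fold cross-validation for a simple linear model (hypothetical data)
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
K <- 5
fold <- sample(rep(1:K, length.out = n))          # random fold assignment
y_hat <- numeric(n)
for (k in 1:K) {
  fit <- lm(y ~ x, data = dat[fold != k, ])       # fit on the K - 1 training folds
  y_hat[fold == k] <- predict(fit, newdata = dat[fold == k, ])
}
mean((dat$y - y_hat)^2)                           # cross-validated MSE
mean(abs(dat$y - y_hat))                          # cross-validated MAD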
Assume the data are partitioned into a training set, \(\mathbf{Y}_{\text{train}}\) and a holdout set \(\mathbf{Y}_{\text{test}}\), thus yielding a posterior distribution \(f(\boldsymbol{\theta} | \mathbf{Y}_{\text{train}})\).
In the setting of LOO-CV, we have \(n\) different \(f\left(\boldsymbol{\theta} | \mathbf{Y}_{-i}\right)\).
The Bayesian LOO-CV estimate of out-of-sample predictive fit is \[\text{lppd}_{\text{LOO-CV}} = \sum_{i=1}^n \log f\left(Y_i | \mathbf{Y}_{-i}\right) = \sum_{i=1}^n \log \int f\left(Y_i | \boldsymbol{\theta}\right) f\left(\boldsymbol{\theta} | \mathbf{Y}_{-i}\right) d\boldsymbol{\theta}.\]
The estimated number of parameters can be computed as,
\[p_{\text{LOO-CV}} = \text{lppd} - \text{lppd}_{\text{LOO-CV}}.\]
Because each prediction already comes from a model fit without observation \(i\), LOO-CV estimates the elpd directly, with no correction factor:
\[\widehat{\text{elpd}}_{\text{LOO-CV}} = \text{lppd}_{\text{LOO-CV}}.\]
For some common models there are shortcuts for computing LOO-CV, but in general these do not exist, so naive LOO-CV requires refitting the model \(n\) times.
WAIC can be treated as a fast approximation of LOO-CV.
In Stan, LOO-CV is approximated using Pareto smoothed importance sampling (PSIS), which makes the process faster by avoiding refitting the model \(n\) times.
To use these tools, we need the pointwise log-likelihood for each observation, so we update the generated quantities block of the Stan program to save log_lik.
###True parameters
sigma <- 1.5 # true measurement error
beta <- matrix(c(-1.5, 3, 1), ncol = 1) # true beta
###Simulation settings
n <- 100 # number of observations
n_pred <- 10 # number of predicted observations
p <- length(beta) - 1 # number of covariates
###Simulate data
set.seed(54) # set seed
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))
X_pred <- cbind(1, matrix(rnorm(n_pred * p), ncol = p))
Y_pred <- as.numeric(X_pred %*% beta + rnorm(n_pred, 0, sigma))
True Model: \(\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\)
Model 1: \(\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i1}\)
Model 2: \(\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\)
###Create stan data object
stan_data_model1 <- list(n = n, p = p - 1, Y = Y, X = X[, -3],
beta0 = 0, sigma_beta = 10, a = 3, b = 1,
n_pred = n_pred, X_pred = X_pred[, -3])
###Compile model separately (requires rstan)
library(rstan)
stan_model <- stan_model(file = "linear_regression_ppd_log_lik.stan")
###Run model 1 and save
fit_model1 <- sampling(stan_model, data = stan_data_model1,
chains = 4, iter = 1000)
saveRDS(fit_model1, file = "linear_regression_ppd_log_lik_fit_model1.rds")
###Create stan data object
stan_data_model2 <- list(n = n, p = p, Y = Y, X = X,
beta0 = 0, sigma_beta = 10, a = 3, b = 1,
n_pred = n_pred, X_pred = X_pred)
###Run model 2 and save
fit_model2 <- sampling(stan_model, data = stan_data_model2,
chains = 4, iter = 1000)
saveRDS(fit_model2, file = "linear_regression_ppd_log_lik_fit_model2.rds")
Begin by extracting the log-likelihood values from the model.
We will use the loo package.
###Load loo package
library(loo)
###Extract log likelihood
log_lik_model1 <- loo::extract_log_lik(fit_model1, parameter_name = "log_lik", merge_chains = TRUE)
log_lik_model2 <- loo::extract_log_lik(fit_model2, parameter_name = "log_lik", merge_chains = TRUE)
###Explore the object
class(log_lik_model1)
[1] "matrix" "array"
dim(log_lik_model1)
[1] 2000  100
###Compute WAIC for the two models
waic_model1 <- loo::waic(log_lik_model1)
waic_model2 <- loo::waic(log_lik_model2)
###Make a comparison
comp_waic <- loo::loo_compare(list("true" = waic_model2, "misspec" = waic_model1))
print(comp_waic, digits = 2)
        elpd_diff se_diff
true         0.00    0.00
misspec    -22.09    5.76
print(comp_waic, digits = 2, simplify = FALSE)
        elpd_diff se_diff elpd_waic se_elpd_waic p_waic se_p_waic   waic se_waic
true         0.00    0.00   -178.86         6.57   3.88      0.58 357.72   13.14
misspec    -22.09    5.76   -200.95         6.56   2.99      0.52 401.90   13.12
###Compute LOO-IC for the two models
loo_model1 <- loo::loo(log_lik_model1)
loo_model2 <- loo::loo(log_lik_model2)
###Make a comparison
comp <- loo::loo_compare(list("true" = loo_model2, "misspec" = loo_model1))
print(comp, digits = 2)
        elpd_diff se_diff
true         0.00    0.00
misspec    -22.08    5.76
print(comp, digits = 2, simplify = FALSE)
        elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo  looic se_looic
true         0.00    0.00  -178.88        6.57  3.90     0.58 357.77    13.14
misspec    -22.08    5.76  -200.96        6.56  3.00     0.52 401.93    13.12
Work on HW 02, which was just assigned
Complete reading to prepare for next Tuesday’s lecture
Tuesday’s lecture: Bayesian Workflow