Model Comparison

Prof. Sam Berchuck

Jan 30, 2025

Review of last lecture

On Tuesday, we learned about various ways to check MCMC convergence and model fit.

  • Traceplots, effective sample size ($n_{eff}$), MC standard error, $\hat{R}$, sampling issues

  • Posterior predictive checks

  • Model checks using shinystan

Today, we will learn about model comparisons.

Model comparison

  • In statistical modeling, a more complex model almost always results in a better fit to the data.

    • A more complex model means one with more parameters.
  • If one has 10 observations, one can have a model with 10 parameters that can perfectly predict every single data point (by just having a parameter to predict each data point).

  • There are two problems with overly complex models.

    1. They become increasingly hard to interpret (think a straight line versus a polynomial).

    2. They are at greater risk of overfitting, such that the model does not generalize to future observations.

Model fit: an example data set

  • Let’s explore the idea of model fit using an example dataset from the openintro package called bdims.

  • This dataset contains body girth measurements and skeletal diameter measurements.

  • Today we will explore the association between height and weight.

Model fit: an example data set

library(openintro)
dat <- data.frame(weight = bdims$wgt * 2.20462, # convert weight to lbs
                  height = bdims$hgt * 0.393701, # convert height to inches
                  sex = ifelse(bdims$sex == 1, "Male", "Female"))

Models of increasing complexity

  • When using height to predict weight, we can fit models of increasing complexity using higher-order polynomials.

  • Let’s fit the following models to the subset of 10 data points:

$$\begin{aligned}
E[\text{weight}_i] &= \beta_0 + \beta_1 \text{height}_i \\
E[\text{weight}_i] &= \beta_0 + \beta_1 \text{height}_i + \beta_2 \text{height}_i^2 \\
E[\text{weight}_i] &= \beta_0 + \beta_1 \text{height}_i + \beta_2 \text{height}_i^2 + \beta_3 \text{height}_i^3 \\
E[\text{weight}_i] &= \beta_0 + \beta_1 \text{height}_i + \beta_2 \text{height}_i^2 + \beta_3 \text{height}_i^3 + \beta_4 \text{height}_i^4
\end{aligned}$$

  • We can compare these models using standard measures of goodness-of-fit, including R2 and root mean squared error (RMSE).
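
To make this concrete (a sketch that is not part of the original slides), the code below fits the four polynomial models to a random subset of 10 rows of dat and computes in-sample R2 and RMSE; the seed and the particular subset are arbitrary assumptions.

###Fit polynomials of degree 1-4 to a random subset of 10 points (arbitrary seed)
set.seed(725)
dat10 <- dat[sample(nrow(dat), 10), ]
fits <- lapply(1:4, function(k) lm(weight ~ poly(height, k), data = dat10))

###In-sample fit improves mechanically with complexity
data.frame(degree = 1:4,
           r_squared = sapply(fits, function(f) summary(f)$r.squared),
           rmse = sapply(fits, function(f) sqrt(mean(residuals(f)^2))))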

Overfitting and underfitting

  • With more complex models, out-of-sample prediction becomes worse.

  • This is because a complex model tailors its coefficients to the sampling error and noise in the current data set, so it will not generalize to new observations.

  • Therefore, our goal in model comparison is to choose a model with the following two properties:

    1. It is complex enough to capture the essence of the data generation process (and thus avoid underfitting),

    2. It avoids overfitting, so that the model remains useful for predicting new observations.

Finding an optimal model

  • Trade-off between overfitting and underfitting (in machine learning this is commonly called bias-variance trade-off).

    • A simple model tends to produce biased predictions because it does not capture the essence of the data generating process.

    • A model that is overly complex is unbiased but results in a lot of uncertainty in the prediction.

  • Polynomials are merely one example of comparing simple to complex models. You can think about:

    • Models with and without interactions,

    • Models with a few predictors versus hundreds of predictors,

    • Regression analyses versus hierarchical models, etc.

Model Comparison

  • When comparing models, we prefer models that are closer to the true data-generating process.

  • We need some ways to quantify the degree of closeness to the true model. Note that in this context models refer to the distributional family as well as the parameter values.

  • For example, the model Yi∼N(5,2) is a different model than Yi∼N(3,2), which is a different model than Yi∼Gamma(2,2).

    • The first two have the same family but different parameter values (different means, same SD), whereas the last two have different distributional families (Normal vs. Gamma).
  • One way to quantify the degree of closeness to the true model is using Kullback-Leibler (KL) divergence.

Kullback-Leibler divergence

  • For two models, M0 and M1, the KL divergence is given by,

$$D_{KL}(M_0 \| M_1) = \int_{-\infty}^{\infty} f_{M_0}(Y) \log \frac{f_{M_0}(Y)}{f_{M_1}(Y)}\, dY = \int_{-\infty}^{\infty} f_{M_0}(Y) \log f_{M_0}(Y)\, dY - \int_{-\infty}^{\infty} f_{M_0}(Y) \log f_{M_1}(Y)\, dY$$

  • Note that $D_{KL}$ is not a distance, because it is not symmetric: $D_{KL}(M_0 \| M_1) \neq D_{KL}(M_1 \| M_0)$.

Kullback-Leibler divergence

As an example, assume that the data are generated by a true model M0, and we have two candidate models M1 and M2, where

  • M0:Yi∼N(3,2)

  • M1:Yi∼N(3.5,2.5)

  • M2:Yi∼Cauchy(3,2)

  • $D_{KL}(M_0 \| M_1) = 0.063$ and $D_{KL}(M_0 \| M_2) = 0.259$, so $M_1$ is a better model than $M_2$.
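
These divergences can be checked numerically (this sketch is not part of the original slides); the helper below uses R's integrate(), and the results should be close to the values quoted above.

###Minimal sketch: D_KL(M0 | M1) = E_M0[log f_M0(Y) - log f_M1(Y)] by numerical integration
kl_divergence <- function(log_f0, log_f1) {
  integrand <- function(y) exp(log_f0(y)) * (log_f0(y) - log_f1(y))
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

log_f_M0 <- function(y) dnorm(y, mean = 3, sd = 2, log = TRUE)
kl_divergence(log_f_M0, function(y) dnorm(y, mean = 3.5, sd = 2.5, log = TRUE))      # ~ 0.063
kl_divergence(log_f_M0, function(y) dcauchy(y, location = 3, scale = 2, log = TRUE)) # ~ 0.259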

Comparing models using KL

  • Note that in the expression for $D_{KL}$, when comparing candidate models against the same true model, the first term is always the same and depends only on $M_0$.

  • Therefore, it is sufficient to compare models on the second term, $\int_{-\infty}^{\infty} f_{M_0}(Y) \log f_{M_1}(Y)\, dY$, which can also be written as $E_{M_0}\left[\log f_{M_1}(Y)\right]$.

  • This term is the expected log predictive density (elpd).

  • A larger elpd is preferred. Why?

Comparing models using KL

  • In the real world, we do not know M0.

    • If we knew it, we would simply choose M0 as our model and there would be no model comparison problem.

    • Even if we knew the true model, we would still need to estimate the parameter values.

  • Thus, we cannot compute elpd, since the expectation is over fM0(Y).

  • We need to estimate elpd!

Comparing models using KL

  • elpd is an expectation, so we can think about estimating it with Monte Carlo sampling, $$\frac{1}{S}\sum_{s=1}^{S} \log f_{M_1}\left(Y^{(s)}\right) \rightarrow E_{M_0}\left[\log f_{M_1}(Y)\right], \quad Y^{(s)} \sim f_{M_0}(Y).$$

    • We need a way to draw the $Y^{(s)}$ from $f_{M_0}(Y)$, which is unknown.
  • A naive approach is to treat the empirical distribution of the observed data as the true model.

    • This is equivalent to drawing $Y^{(s)}$ from $\{Y_1, \ldots, Y_n\}$.

    • This leads to an overly optimistic estimate and favors complex models.

Comparing models using KL

  • A better way to estimate elpd is to collect data on a new independent sample that is believed to share the same data generating process as the current sample, and estimate elpd on the new sample.

    • This is called out-of-sample validation.

    • The problem, of course, is we usually do not have the resources to collect a new sample.

  • Therefore, statisticians have worked hard to find ways to estimate elpd from the current sample, and there are two broad approaches, information criteria and cross-validation.

Overview of comparison methods

  1. Information criteria: AIC, DIC, and WAIC, which estimate the elpd in the current sample, minus a correction factor.

  2. Cross-validation: A method that splits the current sample into K parts, estimates the parameters using K−1 parts, and estimates the elpd on the remaining part.

  • A special case is when K=n so that each time one uses n−1 data points to estimate the model parameters, and estimates the elpd for the observation that was left out. This is called leave-one-out cross-validation (LOO-CV).

Information criteria

  • Several information criteria have been proposed that do not require fitting the model several times, including AIC, DIC, and WAIC.

  • We will introduce the information criteria, assuming a likelihood f(Y|θ) for observed data Y=(Y1,…,Yn) with population parameter θ.

  • Information criteria are often presented in terms of the deviance, defined as $D(Y \mid \theta) = -2\log f(Y \mid \theta)$.

  • Ideally, models will have small deviance.

  • However, if a model is too complex it will have small deviance but be unstable (overfitting).

Akaike information criteria (AIC)

Akaike information criteria (AIC) estimates the elpd as,

$$\widehat{\text{elpd}}_{AIC} = \log f(Y \mid \hat{\theta}_{MLE}) - p,$$ where $p$ is the number of parameters estimated in the model and $\hat{\theta}_{MLE}$ is the MLE point estimate.

  • $AIC = -2\log f(Y \mid \hat{\theta}_{MLE}) + 2p$.

  • The penalty $p$ is an adjustment for overfitting, but once we go beyond linear models we cannot simply count parameters.

  • Informative priors tend to reduce the amount of overfitting.

  • Models with smaller AIC are preferred.
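
As an illustration (not from the original slides), AIC can be computed directly for polynomial regressions of weight on height, using the dat data frame created earlier; note that logLik() for lm() counts the error standard deviation as an estimated parameter.

###AIC = -2 * logLik + 2 * p for polynomials of increasing degree
fits_poly <- lapply(1:4, function(k) lm(weight ~ poly(height, k), data = dat))
data.frame(degree = 1:4,
           logLik = sapply(fits_poly, function(f) as.numeric(logLik(f))),
           p = sapply(fits_poly, function(f) attr(logLik(f), "df")),
           AIC = sapply(fits_poly, AIC)) # smaller AIC is preferred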

Deviance information criteria (DIC)

Deviance information criteria (DIC) estimates the elpd as,

$$\widehat{\text{elpd}}_{DIC} = \log f(Y \mid \hat{\theta}_{Bayes}) - p_{DIC},$$ where $\hat{\theta}_{Bayes}$ is a Bayesian point estimate, typically the posterior mean, and $p_{DIC}$ is an estimate of the complexity penalty,

$$p_{DIC} = 2\left(\log f(Y \mid \hat{\theta}_{Bayes}) - E_{\theta \mid Y}\left[\log f(Y \mid \theta)\right]\right).$$

  • The second term can be estimated as a Monte Carlo average over posterior draws.

  • $DIC = -2\log f(Y \mid \hat{\theta}_{Bayes}) + 2p_{DIC}$.
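
A minimal sketch of this computation from posterior draws is below; log_lik_fun and theta_draws are hypothetical placeholders (a function returning log f(Y | theta) for one parameter vector, and an S x d matrix of posterior samples).

###Minimal sketch: DIC from posterior draws (hypothetical inputs)
dic_from_draws <- function(theta_draws, log_lik_fun) {
  theta_bayes <- colMeans(theta_draws)                      # posterior mean point estimate
  log_lik_at_mean <- log_lik_fun(theta_bayes)               # log f(Y | theta_hat_Bayes)
  mean_log_lik <- mean(apply(theta_draws, 1, log_lik_fun))  # MC estimate of E_{theta|Y}[log f(Y | theta)]
  p_dic <- 2 * (log_lik_at_mean - mean_log_lik)             # complexity penalty
  c(p_DIC = p_dic, DIC = -2 * log_lik_at_mean + 2 * p_dic)
}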

Deviance information criteria (DIC)

  • Advantages of DIC:

    • The effective number of parameters is a useful measure of model complexity.

    • Intuitively, if there are $p$ parameters and we have uninformative priors, then $p_{DIC} \approx p$.

    • However, $p_{DIC} \ll p$ if there are strong priors.

  • Disadvantages of DIC:

    • DIC can only be used to compare models with the same likelihood.

    • DIC really only applies when the posterior is approximately normal, and will give misleading results when the posterior is far from normality (e.g., bimodal).

Watanabe-Akaike information criteria (WAIC)

The Watanabe-Akaike or widely applicable information criterion (WAIC) estimates the elpd as,

$$\widehat{\text{elpd}}_{WAIC} = \text{lppd} - p_{WAIC}.$$

  • The log pointwise predictive density (lppd) is given by, $$\text{lppd} = \log \prod_{i=1}^{n} f(Y_i \mid Y) = \sum_{i=1}^{n} \log \int f(Y_i \mid \theta)\, f(\theta \mid Y)\, d\theta.$$

  • lppd can be estimated as $$\sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} f(Y_i \mid \theta^{(s)})\right),$$ where the $\theta^{(s)}$ are drawn from the posterior.

WAIC

There are two common estimates of $p_{WAIC}$, both of which can be computed from MC samples of the posterior: $$p_{WAIC1} = 2\sum_{i=1}^{n}\left(\log\left(E_{\theta \mid Y}\left[f(Y_i \mid \theta)\right]\right) - E_{\theta \mid Y}\left[\log f(Y_i \mid \theta)\right]\right), \qquad p_{WAIC2} = \sum_{i=1}^{n} V_{\theta \mid Y}\left(\log f(Y_i \mid \theta)\right).$$

  • $WAIC = -2\,\text{lppd} + 2p_{WAIC}$.
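
For intuition (not part of the original slides), WAIC can be computed by hand from an S x n matrix of pointwise log-likelihood values (rows are posterior draws, columns are observations), using the $p_{WAIC2}$ penalty; the loo package used later performs essentially this computation with more numerically stable summation.

###Minimal sketch: WAIC from an S x n log-likelihood matrix
waic_by_hand <- function(log_lik) {
  lppd <- sum(log(colMeans(exp(log_lik)))) # log pointwise predictive density
  p_waic2 <- sum(apply(log_lik, 2, var))   # sum over i of Var_{theta|Y}(log f(Y_i | theta))
  c(elpd_waic = lppd - p_waic2,
    p_waic = p_waic2,
    waic = -2 * (lppd - p_waic2))
}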

WAIC

  • WAIC has the desirable property of averaging over the posterior distribution, instead of conditioning on a point estimate.

  • pWAIC can be thought of as an approximation to the number of unconstrained parameters in the model.

  • In practice, pWAIC2 is often used, since it is theoretically closer to LOO-CV.

Cross-validation

  • A common approach to compare models is using cross-validation.

  • This is exactly the same procedure used in classical statistics.

  • This operates under the assumption that the true model likely produces better out-of-sample predictions than competing models.

  • Advantages: Simple, intuitive, and broadly applicable.

  • Disadvantages: Slow, because it requires several model fits, and it is hard to say whether a difference between models is statistically meaningful.

K-fold cross-validation

  1. Split the data into K equally-sized groups.

  2. Set aside group k as test set and fit the model to the remaining K−1 groups.

  3. Make predictions for the test set k based on the model fit to the training data.

  4. Repeat steps 2 and 3 for $k = 1, \ldots, K$, giving a predicted value $\hat{Y}_i$ for all $n$ observations.

  5. Measure prediction accuracy, e.g.,

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2.$$
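
To make the recipe concrete (a sketch not in the original slides), below is 5-fold cross-validation using lm() on the dat data frame from earlier; with a Bayesian model one would refit the Stan model inside the loop and use the posterior predictive mean as the prediction. The choice K = 5 and the seed are arbitrary.

###Minimal sketch: K-fold cross-validation for predicting weight from height
K <- 5
set.seed(725)                                      # arbitrary seed
folds <- sample(rep(1:K, length.out = nrow(dat)))  # random fold assignment
Y_hat <- numeric(nrow(dat))
for (k in 1:K) {
  fit_k <- lm(weight ~ height, data = dat[folds != k, ])           # fit on the K - 1 training folds
  Y_hat[folds == k] <- predict(fit_k, newdata = dat[folds == k, ]) # predict the held-out fold
}
mean((dat$weight - Y_hat)^2)                       # out-of-sample MSE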

Variants of cross-validation

  • Usually K is either 5 or 10.

  • K=n is called leave-one-out cross-validation (LOO-CV), which is great but slow.

  • The predicted value Y^i can be either the posterior predictive mean or median.

  • Mean squared error (MSE) can be replaced with mean absolute deviation (MAD),

$$MAD = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|.$$

Leave-one-out cross-validation (LOO-CV)

  • Assume the data are partitioned into a training set $Y_{train}$ and a holdout set $Y_{test}$; fitting the model to the training set yields the posterior distribution $f(\theta \mid Y_{train})$.

  • In the setting of LOO-CV, we have $n$ different posteriors, $f(\theta \mid Y_{-i})$.

  • The Bayesian LOO-CV estimate of out-of-sample predictive fit is $$\text{lppd}_{LOO\text{-}CV} = \sum_{i=1}^{n} \log f(Y_i \mid Y_{-i}), \quad \text{where } f(Y_i \mid Y_{-i}) = \int f(Y_i \mid \theta)\, f(\theta \mid Y_{-i})\, d\theta.$$

  • The estimated number of parameters can be computed as,

$$p_{LOO\text{-}CV} = \text{lppd} - \text{lppd}_{LOO\text{-}CV}.$$

  • This is also referred to as the leave-one-out information criterion (LOO-IC).

LOO-CV

LOO-CV estimates the elpd directly as,

$$\widehat{\text{elpd}}_{LOO\text{-}CV} = \text{lppd}_{LOO\text{-}CV};$$ no additional correction factor is needed, because each observation is predicted from data that exclude it.

  • For some common models there are shortcuts for computing it; however, in general these do not exist.

  • WAIC can be treated as a fast approximation of LOO-CV.

  • In Stan, LOO-CV is approximated using Pareto smoothed importance sampling (PSIS), which makes the process faster by avoiding refitting the model $n$ times.

Computing WAIC and LOO-CV using Stan

We need to update the generated quantities code block.

generated quantities {
  ...
  vector[n] log_lik; // pointwise log-likelihood, used by the loo package for WAIC/LOO-CV
  for (i in 1:n) log_lik[i] = normal_lpdf(Y[i] | X[i, ] * beta, sigma);
}

Let’s simulate some data:

###True parameters
sigma <- 1.5 # true measurement error
beta <- matrix(c(-1.5, 3, 1), ncol = 1) # true beta

###Simulation settings
n <- 100 # number of observations
n_pred <- 10 # number of predicted observations
p <- length(beta) - 1 # number of covariates

###Simulate data
set.seed(54) # set seed
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))
X_pred <- cbind(1, matrix(rnorm(n_pred * p), ncol = p))
Y_pred <- as.numeric(X_pred %*% beta + rnorm(n_pred, 0, sigma))

An example model comparison

True Model: $E[Y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$

Model 1: $E[Y_i] = \beta_0 + \beta_1 x_{i1}$

Model 2: $E[Y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$

Fit model 1

###Create stan data object
stan_data_model1 <- list(n = n, p = p - 1, Y = Y, X = X[, -3],
                         beta0 = 0, sigma_beta = 10, a = 3,  b = 1,
                         n_pred = n_pred, X_pred = X_pred[, -3])
  
###Compile model separately
stan_model <- stan_model(file = "linear_regression_ppd_log_lik.stan")

###Run model 1 and save
fit_model1 <- sampling(stan_model, data = stan_data_model1, 
                chains = 4, iter = 1000)
saveRDS(fit_model1, file = "linear_regression_ppd_log_lik_fit_model1.rds")

Fit model 2

###Create stan data object
stan_data_model2 <- list(n = n, p = p, Y = Y, X = X,
                         beta0 = 0, sigma_beta = 10, a = 3,  b = 1,
                         n_pred = n_pred, X_pred = X_pred)

###Run model 2 and save
fit_model2 <- sampling(stan_model, data = stan_data_model2, 
                chains = 4, iter = 1000)
saveRDS(fit_model2, file = "linear_regression_ppd_log_lik_fit_model2.rds")

Computing WAIC

  • Begin by extracting the log-likelihood values from the model.

  • We will use the loo package.

###Load loo package
library(loo)

###Extract log likelihood
log_lik_model1 <- loo::extract_log_lik(fit_model1, parameter_name = "log_lik", merge_chains = TRUE)
log_lik_model2 <- loo::extract_log_lik(fit_model2, parameter_name = "log_lik", merge_chains = TRUE)

###Explore the object
class(log_lik_model1)
[1] "matrix" "array" 
dim(log_lik_model1)
[1] 2000  100

Computing WAIC

###Compute WAIC for the two models
waic_model1 <- loo::waic(log_lik_model1)
waic_model2 <- loo::waic(log_lik_model2)

###Inspect WAIC for model 1
waic_model1

Computed from 2000 by 100 log-likelihood matrix.

          Estimate   SE
elpd_waic   -201.0  6.6
p_waic         3.0  0.5
waic         401.9 13.1

Computing WAIC

###Inspect WAIC for model 2
waic_model2

Computed from 2000 by 100 log-likelihood matrix.

          Estimate   SE
elpd_waic   -178.9  6.6
p_waic         3.9  0.6
waic         357.7 13.1

Computing WAIC

###Make a comparison
comp_waic <- loo::loo_compare(list("true" = waic_model2, "misspec" = waic_model1))
print(comp_waic, digits = 2)
        elpd_diff se_diff
true      0.00      0.00 
misspec -22.09      5.76 
print(comp_waic, digits = 2, simplify = FALSE)
        elpd_diff se_diff elpd_waic se_elpd_waic p_waic  se_p_waic waic   
true       0.00      0.00 -178.86      6.57         3.88    0.58    357.72
misspec  -22.09      5.76 -200.95      6.56         2.99    0.52    401.90
        se_waic
true      13.14
misspec   13.12

Computing LOO-CV/LOO-IC

###Compute LOO-IC for the two models
loo_model1 <- loo::loo(log_lik_model1)
loo_model2 <- loo::loo(log_lik_model2)

###Make a comparison
comp <- loo::loo_compare(list("true" = loo_model2, "misspec" = loo_model1))
print(comp, digits = 2)
        elpd_diff se_diff
true      0.00      0.00 
misspec -22.08      5.76 
print(comp, digits = 2, simplify = FALSE)
        elpd_diff se_diff elpd_loo se_elpd_loo p_loo   se_p_loo looic  
true       0.00      0.00 -178.88     6.57        3.90    0.58   357.77
misspec  -22.08      5.76 -200.96     6.56        3.00    0.52   401.93
        se_looic
true      13.14 
misspec   13.12 

Prepare for next class

  • Work on HW 02, which was just assigned

  • Complete reading to prepare for next Tuesday’s lecture

  • Tuesday’s lecture: Bayesian Workflow
