AE 04: Priors in Stan

Modeling diabetes disease progression

Published

January 23, 2025

Due date

Application exercises (AEs) are submitted by pushing your work to the relevant GitHub repo. AEs from Tuesday lectures should be submitted by Friday by 11:59pm ET, and AEs from Thursday lectures should be submitted by Sunday at 11:59pm ET. Because AEs are intended for in-class activities, there are no extensions given on AEs.

Final .qmd and .pdf files pushed to your GitHub repo
Note: For homeworks and exams, you will also be required to submit your final .pdf file submitted on Gradescope

Introduction

This AE will take another look at the diabetes regression problem from Tuesday. At the end of the AE, we realized that placing priors on the parameters yields incorrect posterior inference. During this AE we will explore data centering approaches to stabilize our Bayesian model and explore the implications of prior specification.

Learning goals

By the end of the AE, you will…

Understand the idea of centering data for stabilizing inference
Gain knowledge on prior specification and its sometimes unintended impact
Be able to compute and interpret a posterior predictive distribution

Getting Started

Clone the repo & start new RStudio project

Go to the course organization at github.com/biostat725-sp25 organization on GitHub.
Click on the repo with the prefix ae-04-. It contains the starter documents you need to complete the AE.
Click on the green CODE button, select Use SSH (this might already be selected by default, and if it is, you’ll see the text Clone with SSH). Click on the clipboard icon to copy the repo URL.
- See the HW 00 instructions if you have not set up the SSH key or configured git.
In RStudio, go to File \(\rightarrow\) New Project \(\rightarrow\) Version Control \(\rightarrow\) Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. Again, please make sure to have SSH highlighted under Clone when you copy the address.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Click AE 04.qmd to open the template Quarto file. This is where you will write up your code and narrative for the AE.

R packages

We will begin by loading R packages that we will use in this AE.

library(tidyverse)    # data wrangling and visualization
library(knitr)        # format output
library(rstan)        # Stan
library(bayesplot)    # figures for post Stan inference

Data

Today we will use data from the paper Least Angle Regression by Efron et al. There are ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

We will focus on the following variables:

Y: measure of disease progression one year after baseline
AGE: age in years
SEX: sex (1 = female, 2 = male)
BMI: body mass index (\(kg/m^2\))
BP: average blood pressure (mm Hg)

The goal of this analysis is to learn the associations between age, sex, BMI, and blood pressure on diabetic disease progression one year after baseline. The data is available in the ae-04- repo and is called diabetes.txt.

diabetes <- read_table("diabetes.txt")
glimpse(diabetes)

Rows: 442
Columns: 11
$ AGE <dbl> 59, 48, 72, 24, 50, 23, 36, 66, 60, 29, 22, 56, 53, 50, 61, 34, 47…
$ SEX <dbl> 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, …
$ BMI <dbl> 32.1, 21.6, 30.5, 25.3, 23.0, 22.6, 22.0, 26.2, 32.1, 30.0, 18.6, …
$ BP  <dbl> 101.00, 87.00, 93.00, 84.00, 101.00, 89.00, 90.00, 114.00, 83.00, …
$ S1  <dbl> 157, 183, 156, 198, 192, 139, 160, 255, 179, 180, 114, 184, 186, 1…
$ S2  <dbl> 93.2, 103.2, 93.6, 131.4, 125.4, 64.8, 99.6, 185.0, 119.4, 93.4, 5…
$ S3  <dbl> 38, 70, 41, 40, 52, 61, 50, 56, 42, 43, 46, 32, 62, 49, 72, 39, 70…
$ S4  <dbl> 4.00, 3.00, 4.00, 5.00, 4.00, 2.00, 3.00, 4.55, 4.00, 4.00, 2.00, …
$ S5  <dbl> 4.8598, 3.8918, 4.6728, 4.8903, 4.2905, 4.1897, 3.9512, 4.2485, 4.…
$ S6  <dbl> 87, 69, 85, 89, 80, 68, 82, 92, 94, 88, 83, 77, 81, 88, 73, 81, 98…
$ Y   <dbl> 151, 75, 141, 206, 135, 97, 138, 63, 110, 310, 101, 69, 179, 185, …

Review of last AE

Last AE we fit the following model in Stan and presented posterior summaries.

\[ Y_i = \alpha + \mathbf{x}_i\boldsymbol{\beta} + \epsilon_i, \hspace{5mm} \epsilon \sim N(0, \sigma^2), \] where \(\mathbf{x}_i = (Age_i, 1(Sex_i = Male), BMI_i, BP_i)\) and flat priors for all parameters: \(f(\alpha,\boldsymbol{\beta},\sigma) \propto c.\) Flat priors are specified by default, so we can omit any prior specification.

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean se_mean    sd     2.5%    97.5% n_eff Rhat
alpha    -209.76    0.39 22.80  -254.10  -165.14  3400    1
beta[1]     0.14    0.00  0.24    -0.35     0.60  3630    1
beta[2]   -10.20    0.10  6.11   -22.54     1.44  3940    1
beta[3]     8.50    0.01  0.71     7.13     9.84  3201    1
beta[4]     1.44    0.00  0.24     0.97     1.91  3077    1
sigma      60.12    0.03  2.01    56.20    64.30  3376    1
lp__    -2433.16    0.05  1.79 -2437.62 -2430.72  1456    1

Samples were drawn using NUTS(diag_e) at Tue Jan 21 21:06:52 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The Bayesian regression results nicely matched the OLS/MLE results.

summary(lm(Y ~ AGE + as.factor(SEX) + BMI + BP, data = diabetes))


Call:
lm(formula = Y ~ AGE + as.factor(SEX) + BMI + BP, data = diabetes)

Residuals:
     Min       1Q   Median       3Q      Max 
-152.417  -43.576   -3.757   42.938  150.054 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -209.2284    22.6318  -9.245  < 2e-16 ***
AGE                0.1353     0.2329   0.581    0.562    
as.factor(SEX)2  -10.1590     5.9219  -1.716    0.087 .  
BMI                8.4843     0.7051  12.032  < 2e-16 ***
BP                 1.4345     0.2393   5.996 4.25e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.98 on 437 degrees of freedom
Multiple R-squared:  0.4003,    Adjusted R-squared:  0.3948 
F-statistic: 72.91 on 4 and 437 DF,  p-value: < 2.2e-16

However, when we changed the priors to the following, our regression results become extremely different from the OLS/MLE.

\[\begin{align*} \alpha &\sim N(0,10^2)\\ \beta_j &\sim N(0,10^2),\quad j=1,\ldots,p\\ \sigma^2 &\sim \text{Inv-Gamma}(3,1). \end{align*}\]

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean se_mean     sd     2.5%    97.5% n_eff Rhat
alpha     -31.02    0.17   9.48   -48.88   -12.42  3072    1
beta[1]    -0.18    0.00   0.24    -0.66     0.29  2956    1
beta[2]    -3.70    0.09   5.39   -14.16     6.80  3853    1
beta[3]     6.05    0.01   0.69     4.69     7.39  2313    1
beta[4]     0.39    0.00   0.22    -0.05     0.82  2049    1
sigma2   4059.02    4.65 273.67  3555.48  4618.54  3464    1
lp__    -2513.33    0.04   1.70 -2517.48 -2510.89  1948    1

Samples were drawn using NUTS(diag_e) at Tue Jan 21 21:07:20 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Centering Data for Stable Inference

Before fitting our regression model, we will center the data. Define, \(\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\) and \(\bar{\mathbf{x}}_i = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\). Then the centered data are \(Y_i^* = Y_i - \bar{Y}\) and \(\mathbf{x}_i^* = \mathbf{x}_i - \bar{\mathbf{x}}_i\). We will then perform regression using the centered data, such that \[Y^*_i = \alpha + \mathbf{x}_i^*\boldsymbol{\beta} + \epsilon_i, \hspace{5mm} \epsilon \sim N(0, \sigma^2).\] Centering our data can stabilize the posterior inference, because the data will be closer to zero. In particular, consider the scenario above where we placed a prior on \(\alpha \sim N(0,10^2)\). We know that the true value of \(\alpha\) (according to OLS/MLE) is close to -200, so our prior, which was intended to be weakly-informative, is actually pulling the posterior towards zero. This then leads to a domino-effect, where the incorrect inference for \(\alpha\) will actually lead to unstable inference for the other parameters. Thus, centering our data means that \(\alpha\) should be closer to 0, leading to stable inference for all of our parameters.

Even though we fit the model on centered data, we are still able to recover the parameters on the scale of our original data.

\[\begin{align*} \mathbb{E}[Y_i^* | \alpha, \boldsymbol{\beta}] &= \alpha + \mathbf{x}_i^*\boldsymbol{\beta}\\ \mathbb{E}[Y_i - \bar{Y} | \alpha, \boldsymbol{\beta}] &= \alpha + (\mathbf{x}_i - \bar{\mathbf{x}}_i)\boldsymbol{\beta}\\ \mathbb{E}[Y_i | \alpha, \boldsymbol{\beta}] &= \bar{Y} + \alpha - \bar{\mathbf{x}}_i\boldsymbol{\beta} + \mathbf{x}_i\boldsymbol{\beta}\\ \mathbb{E}[Y_i | \alpha, \boldsymbol{\beta}] &= \alpha^* + \mathbf{x}_i\boldsymbol{\beta}, \end{align*}\] where \(\alpha^* = \bar{Y} + \alpha - \bar{\mathbf{x}}_i\boldsymbol{\beta}\). Thus, on the original scale of the data the intercept would be equivalent to \(\alpha^*\). The regression slopes \(\boldsymbol{\beta}\) do not need any transformation and can be interpreted on the original scale. The error term, \(\sigma^2\), can also be interpreted on the original scale, because \(\mathbb{V}(Y_i^* | \alpha, \boldsymbol{\beta}) = \sigma^2\).

Exercises

Exercise 1

Fit the centered regression model detailed above with the following priors: \(\alpha \sim N(0,10^2)\), \(\beta_j \sim N(0,10^2)\), and \(\sigma^2 \sim \text{Inv-Gamma}(3,1).\) Obtain posterior samples for this model using Stan. Be sure to present posterior samples for \(\alpha^*\).

Answer:

# add code here

Exercise 2

Stan is extremely flexible in terms of the priors that can be used for parameters. Using the centered data specification, change the priors for all three parameters. Report how sensitive the results are.

Answer:

# add code here

Exercise 3

Compute the posterior predictive distribution for a 60 year old male with a BMI of 25 and average blood pressure of 85.

Answer:

# add code here

Important

To submit the AE:

Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉