Probabilistic Programming (Intro to Stan!)

Prof. Sam Berchuck

Jan 21, 2025

Review of last lecture

On Thursday, we performed posterior inference for Bayesian linear regression using Gibbs and Metropolis sampling.

  • We obtained correlated samples from the posterior using MCMC.

  • Gibbs required a lot of math!

  • Metropolis required tuning!

Today we will introduce Stan, a probabilistic programming language that uses Hamiltonian Monte Carlo to perform general Bayesian inference.

Learning objectives

By the end of this lecture you should:

  • Know how to start coding up a model in Stan.

  • Appreciate how easy Stan makes things for us compared to coding up the algorithm ourselves.

  • Be able to fit a basic linear regression in Stan.

What is Stan and how do we use it?

  • Stan is an intuitive yet sophisticated programming language that does the hard work for us.

  • It is a programming language, like R, Python, Matlab, or C++.

  • Works like most other languages: can use loops, conditional statements, and functions.

  • You code up a model in Stan, and it implements HMC (actually something called NUTS) for us.

Why should we use Stan?

  • Stan is the brainchild of Andrew Gelman at Columbia.

  • Stan uses an extension of HMC called NUTS that tunes itself automatically. It is fast.

  • Stan is simple to learn.

  • Stan has excellent documentation (a manual full of extensive examples).

  • Most important: Stan has a very active and helpful user forum and development team; a typical question is answered within a couple of hours.

How do we use it?

Code up the model in a text editor and save it as a .stan file.

  • Call Stan to run the model from:

    • R, Python, the command line, Matlab, Stata, Julia
  • Use one of the above to analyze the data (of course, you can export to another one).

A straightforward example

Suppose:

  • We record the height, \(Y_i\), of 10 people.

  • We want a model to explain the variation, and choose a normal likelihood: \[Y_i \sim N(\mu, \sigma^2)\]

  • We choose the following (independent) priors on each parameter:

    • \(\mu \sim N(0, 1)\)
    • \(\sigma^2 \sim IG(1, 1)\)
  • Question: how do we code this up in Stan?

An example Stan program

data {
  array[10] real Y; // height for 10 people
}
parameters {
  real mu;
  real<lower = 0> sigma2;
}
model {
  Y ~ normal(mu, sqrt(sigma2)); // likelihood
  mu ~ normal(0, 1); // prior for mu
  sigma2 ~ inv_gamma(1, 1); // prior for sigma2
}

An example Stan program: data block

data {
  array[10] real Y; // height for 10 people
}
  • Declare all data that you will pass to Stan to estimate your model.

  • Terminate all statements with a semi-colon ;.

  • Use // for line comments (or /* */ for block comments).

An example Stan program: data block

data {
  array[10] real Y; // height for 10 people
}
  • We need to tell Stan the type of data variable. For example:

    • real for continuous data.

    • int for discrete data.

    • Arrays: above we specified Y as an array of continuous data of length 10.

An example Stan program: data block

data {
  array[10] real Y; // height for 10 people
}

Can place limits on data, for example:

  • real<lower = 0, upper = 1> X;

  • real<lower = 0> Z;

Vectors and matrices only contain reals and can be used for matrix operations:

array[10] real Y; // array representation
vector[10] Y; // vector representation

An example Stan program: parameter block

parameters {
  real mu;
  real<lower = 0> sigma2;
}
  • Declare all parameters that you use in your model.

  • Place limits on variables, for example:

    • real<lower = 0> sigma2

A multitude of parameter types including some of the aforementioned:

  • real for continuous parameters.

  • Arrays of types, for example array[10] real beta

An example Stan program: parameter block

parameters {
  real mu;
  real<lower = 0> sigma2;
}
  • vector or matrix, specified by:

    • vector[5] beta

    • matrix[5, 3] gamma

  • simplex for a parameter vector with non-negative entries that sum to 1.

  • More exotic types like corr_matrix, or ordered.

An example Stan program: parameter block

parameters {
  real mu;
  real<lower = 0> sigma2;
}

Important: Stan cannot sample discrete parameters, because HMC requires gradients of the log posterior. Options for discrete parameters in Stan:

  • Marginalize out the parameter (see the R sketch after this list). For example, suppose we have \(f(\boldsymbol{\beta}, \theta)\), where \(\boldsymbol{\beta}\) is continuous and \(\theta\) is discrete, taking values \(\theta_1, \ldots, \theta_K\):

\(f(\boldsymbol{\beta}) = \sum_{k = 1}^K f(\boldsymbol{\beta}, \theta_k)\)

  • Some models can be reformulated without discrete parameters.
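
As a concrete illustration of the first option, here is a minimal R sketch (not Stan code) of marginalizing the discrete component label in a hypothetical two-component normal mixture; the function names and numbers are illustrative only:

###Numerically stable log(sum(exp(x))), the workhorse of marginalization
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

###Log-likelihood with the discrete label theta summed out:
###log f(y | mu, pi) = sum_i log sum_k pi_k N(y_i | mu_k, 1)
marginal_loglik <- function(y, mu, pi) {
  sum(sapply(y, function(yi) {
    log_sum_exp(log(pi) + dnorm(yi, mean = mu, sd = 1, log = TRUE))
  }))
}

marginal_loglik(y = c(0.2, -1.1, 2.3), mu = c(-2, 2), pi = c(0.5, 0.5))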

An example Stan program: model block

model {
  Y ~ normal(mu, sqrt(sigma2)); // likelihood
  mu ~ normal(0, 1); // prior for mu
  sigma2 ~ inv_gamma(1, 1); // prior for sigma2
}
  • Used to define:

    • Likelihood.

    • Priors on parameters.

If you don’t specify a prior on a parameter, Stan assumes a flat prior (which can be improper).

An example Stan program: model block

model {
  Y ~ normal(mu, sqrt(sigma2)); // likelihood
  mu ~ normal(0, 1); // prior for mu
  sigma2 ~ inv_gamma(1, 1); // prior for sigma2
}

Huge range of probability distributions covered, across a range of parameterizations. For example:

  • Discrete: Bernoulli, binomial, Poisson, beta-binomial, negative-binomial, categorical, multinomial.

  • Continuous unbounded: normal, skew-normal, student-t, Cauchy, logistic.

An example Stan program: model block

model {
  Y ~ normal(mu, sqrt(sigma2)); // likelihood
  mu ~ normal(0, 1); // prior for mu
  sigma2 ~ inv_gamma(1, 1); // prior for sigma2
}
  • Continuous bounded: uniform, beta, log-normal, exponential, gamma, chi-squared, inverse-chi-squared, Weibull, Wiener diffusion, Pareto.

  • Multivariate continuous: normal, student-t, Gaussian process.

  • Exotics: Dirichlet, LKJ correlation distribution, Wishart and its inverse, Von-Mises.

Running Stan

Write model in a text editing program and save as a .stan file.

  • To create a .stan file from RStudio, File -> New File -> Stan File.
###Load packages
library(rstan)

###Generate fake data
Y <- rnorm(10, mean = 0, sd = 1)

###Compile and run model, and save in fit
fit <- stan(file = 'straightforward.stan', data = list(Y = Y), 
            iter = 1000, chains = 4, seed = 1)

Running Stan on example model

###Compile and run model, and save in fit
fit <- stan(file = 'straightforward.stan', data = list(Y = Y), 
            iter = 1000, chains = 4, seed = 1)

The above R code runs NUTS for our model with the following options:

  • \(S=1,000\) MCMC iterations per chain, of which the first 500 are discarded as warm-up.

  • Across 4 chains.

  • Using a random number seed of 1 (good to ensure you can reproduce results).
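
rstan’s stan() also accepts a warmup argument (it defaults to half of iter), so the same run can be written with the warm-up length made explicit:

###Equivalent call with the warm-up length made explicit
fit <- stan(file = 'straightforward.stan', data = list(Y = Y), 
            iter = 1000, warmup = 500, chains = 4, seed = 1)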

Example model: results

###Print summary statistics
print(fit, probs = c(0.25, 0.5, 0.75))
Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean   sd   25%   50%   75% n_eff Rhat
mu     -0.45    0.01 0.32 -0.66 -0.46 -0.26  1361    1
sigma2  1.23    0.02 0.62  0.81  1.09  1.48   865    1
lp__   -6.80    0.04 1.03 -7.21 -6.49 -6.06   808    1

Samples were drawn using NUTS(diag_e) at Tue Jan 21 14:36:33 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
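
The same diagnostics can also be pulled out programmatically; summary() on a stanfit object returns a matrix whose columns include n_eff and Rhat:

###Extract convergence diagnostics as a matrix
summary(fit)$summary[, c("n_eff", "Rhat")]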

Example model: results

###Extract posterior samples
pars <- extract(fit)
class(pars)
[1] "list"
names(pars)
[1] "mu"     "sigma2" "lp__"  
###Extract samples for particular parameters
pars <- extract(fit, pars = "mu")
class(pars$mu)
[1] "array"
dim(pars$mu)
[1] 2000
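
Once extracted, the draws are just numbers, so posterior summaries are ordinary R computations; for example:

###Posterior summaries computed directly from the draws
mean(pars$mu) # posterior mean
quantile(pars$mu, probs = c(0.025, 0.975)) # 95% credible interval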

Visualize posterior

###Plot the posterior distribution of mu
library(ggplot2)
data.frame(mu = pars$mu) |>
  ggplot(aes(x = mu)) +
  geom_histogram() +
  labs(x = expression(mu), y = "Count", 
       subtitle = bquote("Posterior distribution for " ~ mu))

Quick note: what does \(\sim\) mean?

model {
  Y ~ normal(mu, sqrt(sigma2)); // likelihood
  mu ~ normal(0, 1); // prior for mu
  sigma2 ~ inv_gamma(1, 1); // prior for sigma2
}
  • \(\sim\) doesn’t mean sampling, although oftentimes it can be thought of as sampling

  • MCMC/HMC makes use of the log-posterior

\[\log f(\boldsymbol{\theta} | \mathbf{Y}) \propto \log f(\boldsymbol{\theta}) + \sum_{i=1}^n \log f({Y}_i | \boldsymbol{\theta})\]

  • As such, \(\sim\) really means increment the log probability

  • All we have to do in Stan is specify the log-posterior!

Alternate way of specifying Stan models

model {
  target += normal_lpdf(Y | mu, sqrt(sigma2)); // likelihood
  target += normal_lpdf(mu | 0, 1); // prior for mu
  target += inv_gamma_lpdf(sigma2 | 1, 1); // prior for sigma2
}
  • target is not a variable, but a special object that accumulates the log probability.

  • target is initialized to zero.

  • normal_lpdf is the log of the normal density of y given location mu and scale sigma:

  • See the Stan documentation for the normal distribution.

target += std_normal_lpdf(mu); // prior for mu using standard normal

Linear regression using Stan: data and parameter chunks

data {
  int<lower = 1> n; // number of observations
  int<lower = 1> p; // number of covariates
  vector[n] Y; // outcome vector
  matrix[n, p + 1] X; // design matrix (intercept + covariates)
  real beta0; // location hyperparameter for beta
  real<lower = 0> sigma_beta; // scale hyperparameter for beta
  real<lower = 0> a; // shape hyperparameter for sigma2
  real<lower = 0> b; // scale hyperparameter for sigma2
}
parameters {
  vector[p + 1] beta;
  real<lower = 0> sigma2;
}

Linear regression using Stan: model chunk

model {
  for (i in 1:n) {
    target += normal_lpdf(Y[i] | X[i] * beta, sqrt(sigma2)); // likelihood
  }
  target += normal_lpdf(beta | beta0, sigma_beta); // prior for beta
  target += inv_gamma_lpdf(sigma2 | a, b); // prior for sigma2
}

Linear regression using Stan: vectorization

It is always a good idea to vectorize Stan code for faster and more efficient inference; vectorized statements reduce the size of the expression graph Stan must evaluate at each step.

model {
  target += normal_lpdf(Y | X * beta, sqrt(sigma2)); // likelihood
  target += normal_lpdf(beta | beta0, sigma_beta); // prior for beta
  target += inv_gamma_lpdf(sigma2 | a, b); // prior for sigma2
}

Linear regression using Stan

// saved in linear_regression.stan
data {
  int<lower = 1> n; // number of observations
  int<lower = 1> p; // number of covariates
  vector[n] Y; // outcome vector
  matrix[n, p + 1] X; // design matrix (intercept + covariates)
  real beta0; // location hyperparameter for beta
  real<lower = 0> sigma_beta; // scale hyperparameter for beta
  real<lower = 0> a; // shape hyperparameter for sigma2
  real<lower = 0> b; // scale hyperparameter for sigma2
}
parameters {
  vector[p + 1] beta;
  real<lower = 0> sigma2;
}
model {
  target += normal_lpdf(Y | X * beta, sqrt(sigma2)); // likelihood
  target += normal_lpdf(beta | beta0, sigma_beta); // prior for beta
  target += inv_gamma_lpdf(sigma2 | a, b); // prior for sigma2
}

Let’s simulate some data again

###True parameters
sigma <- 1.5 # true measurement error
beta <- matrix(c(-1.5, 3), ncol = 1) # true beta

###Simulation settings
n <- 100 # number of observations
p <- length(beta) - 1 # number of covariates

###Simulate data
set.seed(54) # set seed
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))
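
As a quick sanity check before turning to Stan, the least-squares estimates should land near the true values (the -1 in the formula is there because X already contains an intercept column):

###Sanity check: OLS estimates should be close to beta
coef(lm(Y ~ X - 1))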

Fit linear regression using Stan

###Load packages
library(rstan)

###Create stan data object
stan_data <- list(n = n,
                  p = p,
                  Y = Y,
                  X = X,
                  beta0 = 0,
                  sigma_beta = 10,
                  a = 3, 
                  b = 1)
  
###Compile model separately
stan_model <- stan_model(file = "linear_regression.stan")

###Run model and save
fit <- sampling(stan_model, data = stan_data, 
                chains = 4, iter = 1000)
saveRDS(fit, file = "linear_regression_fit.rds")
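
Saving the fit with saveRDS() means the sampler does not need to be re-run in a later session:

###Reload the saved fit without re-running the sampler
fit <- readRDS("linear_regression_fit.rds")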

Example model: results

###Print summary statistics
print(fit, probs = c(0.25, 0.5, 0.75))
Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean   sd     25%     50%     75% n_eff Rhat
beta[1]   -1.48    0.00 0.15   -1.58   -1.48   -1.37  1951    1
beta[2]    3.30    0.00 0.15    3.20    3.30    3.40  1674    1
sigma2     2.35    0.01 0.33    2.12    2.32    2.55  1679    1
lp__    -196.98    0.04 1.29 -197.58 -196.63 -196.04   899    1

Samples were drawn using NUTS(diag_e) at Tue Jan 21 14:38:36 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
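
Since we simulated the data, the posterior means can be checked against the truth (\(\boldsymbol{\beta} = (-1.5, 3)\) and \(\sigma^2 = 1.5^2 = 2.25\)); a minimal sketch:

###Compare posterior means to the simulation truth
cbind(truth = c(beta, sigma^2),
      estimate = summary(fit, pars = c("beta", "sigma2"))$summary[, "mean"])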

Stan plots: point estimate and intervals

stan_plot(fit, pars = c("beta", "sigma2"), include_warmup = FALSE,
          point_est = "median", ci_level = 0.8, outer_level = 0.95)

Stan plots: histogram

stan_hist(fit)

Stan plots: density

stan_dens(fit)

Stan: a few of the loops and conditions

Stan has the full range of language constructs, allowing pretty much any model to be coded.

for (i in 1:10) {something;}

while (i > 1) {something;}

if (i > 1) {something1;}
else if (i == 0) {something2;}
else {something3;}

Stan speed concerns

While Stan is fast, it pays to know how often each code block is executed, since that determines its importance for efficiency.

  • data: called once at beginning of execution.

  • transformed data: called once at beginning of execution.

  • parameters: every log probability evaluation!

  • transformed parameters: every log probability evaluation!

  • model: every log probability evaluation!

  • generated quantities: once per sample.

  • functions: how many times it is called depends on the function’s nature.

Stan in parallel

In R, chains can be run in parallel easily using:

library(rstan)
options(mc.cores = 8)
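
To avoid hard-coding the number of cores, a common idiom is:

###Or detect the number of available cores automatically
options(mc.cores = parallel::detectCores())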

Stan summary

  • Stan works by default with an HMC-like algorithm called NUTS.

  • The Stan language is similar in nature to other common languages, with loops, conditional statements, and user-definable functions (not covered here).

  • Stan makes life easier for us than coding up the MCMC algorithms ourselves.

R packages that interface with Stan

  • rstan, brms, cmdstanr, rstanarm

  • With rstan and cmdstanr, you write the Stan code yourself, which gives you the most options.

    • rstan has a more intuitive user interface.

    • cmdstanr is more memory efficient and a lightweight interface to Stan.

  • With rstanarm and brms, you don’t write the Stan code yourself, which makes Stan easier to use but is more limiting.

    • rstanarm’s biggest advantage is that its models are pre-compiled, but this is also its biggest limitation.

    • brms writes Stan code on the fly, so has many more models, some that are pretty advanced.
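
For instance, here is a minimal brms sketch of the same kind of Gaussian linear regression; the data frame dat with columns y and x is hypothetical, and brms writes and compiles the Stan program behind the scenes:

###A minimal brms sketch (hypothetical data frame dat)
library(brms)
fit_brms <- brm(y ~ x, data = dat, family = gaussian(),
                chains = 4, iter = 1000, seed = 1)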

Prepare for next class

  • Work on HW 01, which is due January 30

  • Complete the reading to prepare for Thursday’s lecture

  • Thursday’s lecture: Priors, Posteriors, and PPDs!