BIOSTAT 725 - Spring 2025 – Bayesian Clustering

Learning Objectives

We will introduce the basic mixture modeling framework as a mechanism for model-based clustering and describe computational and inferential challenges.
Variations of the popular finite Gaussian mixture model (GMM) will be introduced to cluster patients according to ED length-of-stay.
We present an implementation of mixture modeling in Stan and discuss challenges therein.
Finally, various posterior summaries will be explored.

Finding subtypes

We revisit data on patients admitted to the emergency department (ED) from the MIMIC-IV-ED demo.

Can we identify subgroups within this population?

The usual setup

Most models introduced in this course are of the form:

$f (Y_{i} ∣ X_{i}) = f (Y_{i} ∣ θ_{i} (X_{i})) .$

$f (\cdot)$ is the density or distribution function of an assumed family (e.g., Gaussian, binomial),
$θ_{i}$ is a parameter (or parameters) that may depend on individual covariates $X_{i}$ .

The usual setup

$f (Y_{i} ∣ X_{i}) = f (Y_{i} ∣ θ_{i} (X_{i}))$

Linear regression:

$f$ is the Gaussian density function, and $θ_{i} (X_{i}) = (X_{i} β, σ^{2})^{⊤}$

Binary classification:

$f$ is the Bernoulli mass function, and $θ_{i} (X_{i}) = \operatorname{logit}^{-1} (X_{i} β)$

Limitations of the usual setup

Suppose patients $i = 1, \dots, n$ are administered a diagnostic test. Their outcome $Y_{i}$ depends only on whether or not they have previously received treatment: $X_{i} = 1$ if yes and $X_{i} = 0$ otherwise. Suppose the diagnostic test has Gaussian-distributed measurement error, so $Y_{i} ∣ X_{i} \sim N (α + β X_{i}, σ^{2}) .$

Now, suppose past treatment information is not included in patients' records, so we cannot condition on $X_{i}$ . Marginalizing over $X_{i}$ ,

$\begin{aligned} f (Y_{i}) & = P (X_{i} = 1) \times N (Y_{i} ∣ α + β, σ^{2}) \\ & \quad + P (X_{i} = 0) \times N (Y_{i} ∣ α, σ^{2}) . \end{aligned}$

Limitations of the usual setup

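library(ggplot2)

# Simulate from two equally likely groups with means 1 and 4.5 and common variance 1
n <- 500; mu <- c(1, 4.5); s2 <- 1
x <- sample(1:2, n, replace = TRUE); y <- rnorm(n, mu[x], sqrt(s2))
ggplot(data.frame(y = y), aes(x = y)) + 
  geom_histogram() + 
  labs(x = "Y", y = "Count")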

Limitations of the usual setup

fit <- lm(y ~ 1)
ggplot(data.frame(residuals = fit$residuals), aes(x = residuals)) + 
  geom_histogram() + 
  labs(x = "Residuals", y = "Count")

Normality of residuals?

Mixture Model

Motivation for using a mixture model: Standard distributional families are not sufficiently expressive.

The inflexibility of the model may be due to unobserved heterogeneity (e.g., unrecorded treatment history).

Generically, $f (Y_{i}) = \sum_{h = 1}^{k} π_{h} \times f_{h} (Y_{i}) .$

Uses of mixture models:

Modeling weird densities/distributions (e.g., bimodal).
Learning latent groups/clusters.

Mixture Model

$f (Y_{i}) = \sum_{h = 1}^{k} π_{h} \times f_{h} (Y_{i})$

This mixture comprises $k$ components indexed by $h = 1, \dots, k$ . For each component, we have a probability density (or mass) function $f_{h}$ and a mixture weight $π_{h} \geq 0$ , where $\sum_{h = 1}^{k} π_{h} = 1$ .
When $k$ is finite, we call this a finite mixture model for $Y_{i}$ .
It is common to let $f_{h} (Y_{i}) = f (Y_{i} ∣ θ_{h}) .$
- The component densities share a functional form and differ in their parameters.

Gaussian Mixture Model

Letting $f_{h} (Y_{i}) = N (Y_{i} ∣ μ_{h}, σ_{h}^{2})$ for $h = 1, \dots, k$ yields the Gaussian mixture model. For $Y_{i} \in \mathbb{R}$ , $f (Y_{i}) = \sum_{h = 1}^{k} π_{h} N (Y_{i} ∣ μ_{h}, σ_{h}^{2}) .$

For multivariate outcomes $Y_{i} \in \mathbb{R}^{p}$ ,

$f (Y_{i}) = \sum_{h = 1}^{k} π_{h} N_{p} (Y_{i} ∣ μ_{h}, Σ_{h}) .$

Gaussian Mixture Model

Consider a mixture model with 3 groups:

Component 1: $μ_{1} = - 1.5, σ_{1} = 1$ .
Component 2: $μ_{2} = 0, σ_{2} = 1.5$ .
Component 3: $μ_{3} = 2, σ_{3} = 0.6$ .

Notice, both means $μ_{h}$ and variances $σ_{h}^{2}$ vary across clusters.
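
As a rough illustration, the density of this three-component mixture can be evaluated directly in R. The slide does not state the mixture weights, so the equal weights below are an assumption, and `dmix` is a hypothetical helper name introduced only for this sketch.

```r
# Component parameters from the slide above
mu    <- c(-1.5, 0, 2)
sigma <- c(1, 1.5, 0.6)
# ASSUMPTION: equal mixture weights (the weights are not given in the slides)
w     <- rep(1 / 3, 3)

# Mixture density: sum over h of w[h] * N(y | mu[h], sigma[h]^2)
dmix <- function(y, w, mu, sigma) {
  vapply(y, function(yi) sum(w * dnorm(yi, mu, sigma)), numeric(1))
}

# Evaluate the density on a grid
grid <- seq(-6, 6, length.out = 401)
dens <- dmix(grid, w, mu, sigma)

# Sanity check: a valid density integrates to (approximately) 1
total_mass <- integrate(dmix, -Inf, Inf, w = w, mu = mu, sigma = sigma)$value

# To visualize: plot(grid, dens, type = "l", xlab = "Y", ylab = "Density")
```

Changing `w` shifts mass between the components without moving their locations or scales.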

Generative perspective on GMM

To simulate from a $k$ -component Gaussian mixture with means $μ_{1}, \dots, μ_{k}$ , variances $σ_{1}^{2}, \dots, σ_{k}^{2}$ , and weights $π_{1}, \dots, π_{k}$ :

Sample the component indicator $z_{i} \in \{1, \dots, k\}$ with probabilities: $P (z_{i} = h) = π_{h} ⟺ z_{i} \sim \text{Categorical} (k, \{π_{1}, \dots, π_{k}\}) .$
Given $z_{i}$ , sample $Y_{i}$ from the appropriate component: $(Y_{i} ∣ z_{i} = h) \sim N (μ_{h}, σ_{h}^{2}) .$

Generative perspective on GMM

n <- 500 
mu <- c(1, 4.5)
s2 <- 1 
# implicit: pi = c(0.5, 0.5)
z <- sample(1:2, n, TRUE)
y <- rnorm(n, mu[z], sqrt(s2))

This is essentially the code used to simulate the missing treatment history example.

Marginalizing Component Indicators

The label $z_{i}$ indicates which component $Y_{i}$ is drawn from—think of this as the cluster label: $f (Y_{i} ∣ z_{i} = h) = N (Y_{i} ∣ μ_{h}, σ_{h}^{2}) .$

But $z_{i}$ is unknown, so we marginalize to obtain:

$\begin{aligned} f (Y_{i}) & = \int_{Z} f (Y_{i} ∣ z) f (z) d z \\ & = \sum_{h = 1}^{k} f (Y_{i} ∣ z = h) P (z = h) \\ & = \sum_{h = 1}^{k} N (Y_{i} ∣ μ_{h}, σ_{h}^{2}) \times π_{h} . \end{aligned}$

This is key to implementing in Stan.

Gaussian mixture in Stan

Component indicators $z_{i}$ are discrete parameters, which Stan cannot sample directly because its Hamiltonian Monte Carlo sampler requires continuous parameters. As before, suppose $f (Y_{i}) = \sum_{h = 1}^{k} π_{h} N (Y_{i} ∣ μ_{h}, σ_{h}^{2})$ .

The log-likelihood contribution of observation $i$ is:

$\begin{aligned} \log f (Y_{i}) & = \log \sum_{h = 1}^{k} \exp (\log [π_{h} N (Y_{i} ∣ μ_{h}, σ_{h}^{2})]) \\ & = \text{log\_sum\_exp} [\log π_{1} + \log N (Y_{i} ∣ μ_{1}, σ_{1}^{2}), \\ & \qquad \dots, \\ & \qquad \log π_{k} + \log N (Y_{i} ∣ μ_{k}, σ_{k}^{2})], \end{aligned}$

where log_sum_exp is a Stan function that computes the log of a sum of exponentials in a numerically stable way.
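
To see why working on the log scale matters, here is a minimal R sketch of the same computation. The R function `log_sum_exp` below is a hand-written stand-in for Stan's built-in (it is not part of base R), and the parameter values are hypothetical, borrowed from the simulated two-group example.

```r
# Numerically stable log-sum-exp: subtract the maximum before exponentiating
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# Hypothetical parameter values, for illustration only
w     <- c(0.5, 0.5)
mu    <- c(1, 4.5)
sigma <- c(1, 1)

# Log-density of a single observation under the mixture
log_mix_density <- function(y, w, mu, sigma) {
  log_sum_exp(log(w) + dnorm(y, mu, sigma, log = TRUE))
}

# For a typical observation, the stable and naive versions agree
y_typical <- 2
stable <- log_mix_density(y_typical, w, mu, sigma)
naive  <- log(sum(w * dnorm(y_typical, mu, sigma)))

# Far in the tail, every component density underflows to 0,
# so the naive version returns log(0) = -Inf
y_far      <- 60
stable_far <- log_mix_density(y_far, w, mu, sigma)
naive_far  <- log(sum(w * dnorm(y_far, mu, sigma)))
```

The stable version keeps a finite log-density (and hence a usable gradient) even when every component assigns the observation negligible probability.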

Gaussian mixture in Stan

// saved in mixture1.stan
data {
  int<lower = 1> k;          // number of mixture components
  int<lower = 1> n;          // number of data points
  array[n] real Y;           // observations
}
parameters {
  simplex[k] pi; // mixing proportions
  ordered[k] mu; // means of the mixture components
  vector<lower=0>[k] sigma; // sds of the mixture components
}
model {
  target += normal_lpdf(mu |0.0, 10.0);
  target += exponential_lpdf(sigma | 1.0);
  vector[k] log_probs = log(pi);
  for (i in 1:n){
    vector[k] lps = log_probs;
    for (h in 1:k){
      lps[h] += normal_lpdf(Y[i] | mu[h], sigma[h]);
    }
    target += log_sum_exp(lps);
  }
}

Of note: simplex and ordered types.

First fit

library(rstan)

ed <- read.csv("exam1_data.csv")
# Center length-of-stay before fitting
dat <- list(Y = (ed$los - mean(ed$los)),
            n = length(ed$los),
            k = 2)
mod1 <- stan_model("mixture1.stan")
fit1 <- sampling(mod1, data=dat, chains=4, iter=5000, control=list("adapt_delta"=0.99))
print(fit1, pars = c("pi", "mu", "sigma"), probs = c(0.025, 0.975))

Inference for Stan model: anon_model.
4 chains, each with iter=5000; warmup=2500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

          mean se_mean   sd  2.5% 97.5% n_eff Rhat
pi[1]     0.37    0.17 0.24  0.16  0.83     2 6.35
pi[2]     0.63    0.17 0.24  0.17  0.84     2 6.35
mu[1]    -3.17    0.09 0.51 -4.49 -2.49    33 1.05
mu[2]     0.45    3.96 5.69 -3.17 13.08     2 5.14
sigma[1] 13.25    4.48 6.46  2.09 20.11     2 4.97
sigma[2]  5.03    3.28 4.68  2.03 14.78     2 7.56

Samples were drawn using NUTS(diag_e) at Sat Mar 22 13:54:38 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

What is going on?

pairs(fit1, pars = c("mu", "sigma"))

Bimodal posterior

In one mode, $σ_{1}^{2} ≪ σ_{2}^{2}$ and in the other, $σ_{1}^{2} ≫ σ_{2}^{2}$

Bimodal posterior

The Gaussian clusters have light tails, so outlying values of $Y$ force large values of $σ_{h}^{2}$ . When $σ_{h}^{2}$ is large, small changes to $μ_{h}$ have little impact on the log-likelihood, and the ordering constraint is not sufficient to identify the clusters.

Things to consider when your mixture model is mixed up

Mixture modeling, especially when clusters are of interest, can be fickle.

Different mixtures can give similar fit to data, leading to multimodal posteriors that are difficult to sample from (previous slides).
Clusters will depend on your choice of $f_{h}$ —a Gaussian mixture model can only find Gaussian-shaped clusters.
Increasing $k$ often improves fit, but may muddle cluster interpretation.

Things to consider when your mixture model is mixed up

Employ informative priors.
Vary the number of clusters.
Change the form of the kernel.

Updated model

// saved in mixture2.stan
data {
  int<lower = 1> k;          // number of mixture components
  int<lower = 1> n;          // number of data points
  array[n] real Y;         // observations
}
parameters {
  simplex[k] pi; // mixing proportions
  ordered[k] mu; // means of the mixture components
  vector<lower = 0>[k] sigma; // sds of the mixture components
  vector<lower = 1>[k] nu; // degrees of freedom of the mixture components
}
model {
  target += normal_lpdf(mu | 0.0, 10.0);
  target += normal_lpdf(sigma | 2.0, 0.5);
  target += gamma_lpdf(nu | 5.0, 0.5);
  vector[k] log_probs = log(pi);
  for (i in 1:n){
    vector[k] lps = log_probs;
    for (h in 1:k){
      lps[h] += student_t_lpdf(Y[i] | nu[h], mu[h], sigma[h]);
    }
    target += log_sum_exp(lps);
  }
}

Informative prior on the scale $σ_{h}$ .
Mixture of Student-t components in place of Gaussians.

Updated model fit

mod2 <- stan_model("mixture2.stan")
fit2 <- sampling(mod2, data=dat, chains=4, iter=5000, control=list("adapt_delta"=0.99))
print(fit2, pars=c("pi", "mu", "sigma", "nu"))

Inference for Stan model: anon_model.
4 chains, each with iter=5000; warmup=2500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
pi[1]     0.81    0.00 0.04  0.73  0.79  0.81  0.83  0.87  5567    1
pi[2]     0.19    0.00 0.04  0.13  0.17  0.19  0.21  0.27  5567    1
mu[1]    -2.96    0.00 0.22 -3.39 -3.10 -2.96 -2.81 -2.53  7318    1
mu[2]     8.30    0.02 1.10  6.03  7.66  8.35  9.01 10.28  4306    1
sigma[1]  2.20    0.00 0.17  1.88  2.09  2.20  2.31  2.55  5987    1
sigma[2]  2.82    0.00 0.40  2.06  2.55  2.81  3.09  3.64  6745    1
nu[1]    11.98    0.04 4.44  5.14  8.75 11.38 14.57 22.22  9860    1
nu[2]     1.54    0.00 0.43  1.03  1.25  1.46  1.74  2.50  9286    1

Samples were drawn using NUTS(diag_e) at Sat Mar 29 12:52:38 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Updated model results

From marginal mixture model to clusters

Stan cannot directly infer categorical component indicators $z_{i}$ . Instead, for each individual, we compute

$\begin{aligned} P (z_{i} = h ∣ Y_{i}, μ, σ, π) & = \frac{f (Y_{i} ∣ z_{i} = h, μ_{h}, σ_{h}) P (z_{i} = h ∣ π_{h})}{\sum_{h' = 1}^{k} f (Y_{i} ∣ z_{i} = h', μ_{h'}, σ_{h'}) P (z_{i} = h' ∣ π_{h'})} \\ & = \frac{N (Y_{i} ∣ μ_{h}, σ_{h}^{2}) π_{h}}{\sum_{h' = 1}^{k} N (Y_{i} ∣ μ_{h'}, σ_{h'}^{2}) π_{h'}} = p_{i h} . \end{aligned}$

For the Student-t mixture, the Gaussian density is replaced by the corresponding Student-t density.

Given these cluster membership probabilities, we can recover cluster indicators through simulation: $(z_{i} ∣ -) \sim \text{Categorical} (k, \{p_{i 1}, \dots, p_{i k}\}) .$
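
The two steps above can be sketched in R for the Gaussian case. The parameter values and observations below are made up and stand in for a single posterior draw; in practice the calculation is repeated for every draw.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical parameter values standing in for one posterior draw
w     <- c(0.5, 0.5)
mu    <- c(1, 4.5)
sigma <- c(1, 1)
y     <- c(0.2, 2.75, 5.1)   # made-up observations

k <- length(w)

# n x k matrix with entries log(pi_h) + log N(y_i | mu_h, sigma_h^2)
log_num <- sapply(seq_len(k), function(h) {
  log(w[h]) + dnorm(y, mu[h], sigma[h], log = TRUE)
})

# Row-wise log of the denominator, computed stably via log-sum-exp
log_den <- apply(log_num, 1, function(r) max(r) + log(sum(exp(r - max(r)))))

# Membership probabilities p_ih; each row sums to 1
p <- exp(log_num - log_den)

# One sampled label per observation: z_i ~ Categorical(p_i1, ..., p_ik)
z <- apply(p, 1, function(p_row) sample.int(k, size = 1, prob = p_row))
```

Here the second observation sits exactly halfway between the two means, so with equal weights and equal scales its membership probabilities are 0.5 each.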

From marginal mixture model to clusters

...

generated quantities {
  matrix[n,k] lPrZik;                 // log membership probabilities
  array[n] int<lower=1, upper=k> z;   // sampled cluster labels
  for (i in 1:n){
    for (h in 1:k){
      lPrZik[i,h] = log(pi[h]) + student_t_lpdf(Y[i] | nu[h], mu[h], sigma[h]);
    }
    lPrZik[i] -= log_sum_exp(lPrZik[i]);  // normalize on the log scale
    z[i] = categorical_rng(exp(lPrZik[i]'));
  }
}

Co-clustering probabilities

Recovering $z_{i}$ allows us to make the following pairwise comparison: what is the probability that unit $i$ and unit $j$ are in the same cluster? This is the “co-clustering probability”.

It is common to arrange these probabilities in a co-clustering matrix $C$ , where the $(i, j)$ entry is given by $C_{i j} = P (z_{i} = z_{j} ∣ -) \approx \frac{1}{S} \sum_{s = 1}^{S} \mathbb{1} [z_{i}^{(s)} = z_{j}^{(s)}] ,$ where $z_{i}^{(s)}$ is the label of unit $i$ in the $s$-th of $S$ posterior draws.
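
A minimal sketch of this Monte Carlo estimate, using a toy matrix of sampled labels in place of real posterior draws. With rstan, an $S \times n$ matrix of this shape could presumably be obtained from `extract(fit, pars = "z")$z` once `z` is declared in generated quantities; the toy values below are invented.

```r
# Toy S x n matrix of sampled labels: S = 4 draws (rows), n = 3 units (columns)
z_draws <- rbind(
  c(1, 1, 2),
  c(1, 1, 2),
  c(2, 2, 1),
  c(1, 2, 2)
)

S <- nrow(z_draws)
n <- ncol(z_draws)

# Accumulate the indicator 1[z_i == z_j] over draws, then average
C <- matrix(0, n, n)
for (s in seq_len(S)) {
  # outer() compares every pair of labels within draw s
  C <- C + outer(z_draws[s, ], z_draws[s, ], "==")
}
C <- C / S
```

In the third draw the labels are swapped relative to the first two, yet units 1 and 2 still count as co-clustered: the matrix depends only on whether labels match, not on their values.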