Multiclass Classification

Prof. Sam Berchuck

Feb 20, 2025

Review of last lecture

  • On Tuesday, we learned about classification using logistic regression.

  • Today, we will focus on multiclass classification: multinomial regression, ordinal regression.

Multiclass regression

  • Often one encounters a categorical outcome variable with more than two categories.

  • If there is no inherent rank or order to the variable, we can use multinomial regression. Examples include:

    • gender (male, female, non-binary),

    • blood type (A, B, AB, O).

  • If there is an order to the variable, we can use ordinal regression. Examples include:

    • stages of cancer (stage I, II, III, IV),

    • pain level (mild, moderate, severe).

Multinomial random variable

  • Assume an outcome \(Y_i \in \{1,\ldots,K\}\) for \(i = 1,\ldots,n\).

  • The likelihood in multinomial regression can be written as the following categorical likelihood,

\[f(\mathbf{Y} | \boldsymbol{\beta}) = \prod_{i=1}^n \prod_{j=1}^K P(Y_i = j)^{\delta_{ij}},\]

  • \(\delta_{ij} = 1(Y_i = j)\) is an indicator (Kronecker delta) equal to 1 when \(Y_i = j\) and 0 otherwise.

  • Since \(Y_i\) is discrete, we only need to specify \(P(Y_i = j)\) for all \(i\) and \(j\).
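  • For example, with \(K = 3\) categories and an observation \(Y_i = 2\), the indicators select a single factor from the product,

\[\prod_{j=1}^{3} P(Y_i = j)^{\delta_{ij}} = P(Y_i = 1)^{0}\, P(Y_i = 2)^{1}\, P(Y_i = 3)^{0} = P(Y_i = 2).\]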

Log-linear regression

  • One way to motivate multinomial regression is through a log-linear specification:

\[\log P(Y_i = j) = \mathbf{x}_i\boldsymbol{\beta}_j - \log Z.\]

  • \(\boldsymbol{\beta}_j\) is a category-specific vector of regression parameters for category \(j\).

  • \(P(Y_i = j) = \exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}/Z\).

  • \(Z\) is a normalizing constant (depending on \(\mathbf{x}_i\)) that guarantees that \(\sum_{j=1}^K P(Y_i = j) = 1\).

Finding the normalizing constant

  • We know that,

\[\begin{aligned} 1 &= \sum_{j=1}^K P(Y_i = j) = \frac{1}{Z}\sum_{j=1}^K \exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}\\ &\implies Z = \sum_{j=1}^K \exp\{\mathbf{x}_i\boldsymbol{\beta}_j\} \end{aligned}\]

Multinomial probabilities

  • Thus, we have the following,

\[P(Y_i = k) = \frac{\exp\{\mathbf{x}_i\boldsymbol{\beta}_k\}}{\sum_{j = 1}^K\exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}}.\]

  • This function is called the softmax function (see the Stan sketch below).

  • Unfortunately, this specification is not identifiable.
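  • As an aside, Stan provides this mapping as a built-in; here is a minimal sketch of a functions block that uses it (the name category_probs is illustrative):
functions {
  // Map a vector of K linear predictors to category probabilities,
  // i.e., exp(eta) / sum(exp(eta)), using Stan's built-in softmax().
  vector category_probs(vector eta) {
    return softmax(eta);
  }
}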

Identifiability issue

  • We can add a vector \(\mathbf{c}\) to all parameters and get the same result,

\[\begin{aligned} \frac{\exp\{\mathbf{x}_i(\boldsymbol{\beta}_k + \mathbf{c})\}}{\sum_{j = 1}^K\exp\{\mathbf{x}_i(\boldsymbol{\beta}_j + \mathbf{c})\}} &= \frac{\exp\{\mathbf{x}_i\boldsymbol{\beta}_k\}}{\sum_{j = 1}^K\exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}}. \end{aligned}\]

  • A common solution is to set: \(\boldsymbol{\beta}_K = \mathbf{0}\).

Updating the probabilities

  • Using the identifiability constraint of \(\boldsymbol{\beta}_K = \mathbf{0}\), the probabilities become, \[\begin{aligned} P(Y_i = k) &= \frac{\exp\{\mathbf{x}_i\boldsymbol{\beta}_k\}}{1 + \sum_{j = 1}^{K-1}\exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}}, \quad k \in \{1,\ldots,K-1\},\\ P(Y_i = K) &= \frac{1}{1 + \sum_{j = 1}^{K-1}\exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}}. \end{aligned}\]

  • How to interpret the \(\boldsymbol{\beta}_k\)?

Deriving the additive log ratio model

  • Using our specification of the probabilities, it can be seen that,

\[\begin{aligned} P(Y_i = k) &= \frac{\exp\{\mathbf{x}_i\boldsymbol{\beta}_k\}}{1 + \sum_{j = 1}^{K-1}\exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}}\\ &= \left[\frac{1}{1 + \sum_{j = 1}^{K-1}\exp\{\mathbf{x}_i\boldsymbol{\beta}_j\}}\right]\exp\{\mathbf{x}_i\boldsymbol{\beta}_k\}\\ &= P(Y_i = K) \exp\{\mathbf{x}_i \boldsymbol{\beta}_k\}. \end{aligned}\]

\(\implies \log \frac{P(Y_i = k)}{P(Y_i = K)} = \mathbf{x}_i \boldsymbol{\beta}_k, \quad k \in \{1,\ldots,K-1\}\)

Additive log ratio model

  • If outcome \(K\) is chosen as the reference, the \(K - 1\) regression equations are:

\[\log \frac{P(Y_i = k)}{P(Y_i = K)} = \mathbf{x}_i \boldsymbol{\beta}_k, \quad k \in \{1,\ldots, K-1\}\]

  • This formulation is called the additive log ratio model.

  • \(\beta_{jk}\) represents the change in the log-odds of being in category \(k\) relative to the baseline category \(K\) for a one-unit increase in \(X_{ij}\).

    • \(\exp (\beta_{jk})\) is an odds ratio.
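  • For intuition: when \(K = 2\) and \(\boldsymbol{\beta}_2 = \mathbf{0}\), this is exactly the binary logistic regression from last lecture. As an illustration with a hypothetical coefficient value, if \(\beta_{jk} = 0.7\), then \(\exp(0.7) \approx 2.01\), so a one-unit increase in \(X_{ij}\) roughly doubles the odds of category \(k\) relative to the baseline \(K\), holding the other covariates fixed.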

Getting back to the likelihood

  • The log-likelihood can be written as,

\[\log f(\mathbf{Y} | \boldsymbol{\beta}) = \sum_{i=1}^n \sum_{j=1}^K \delta_{ij} \log P(Y_i = j).\]

  • The \(P(Y_i = j)\) are given by the additive log ratio model.

  • As Bayesians, we only need to specify priors for \(\boldsymbol{\beta}_k, k \in \{1,\ldots,K-1\}\).

Multinomial regression in Stan

  • Hard coding the likelihood.
// additive_log_ratio.stan
functions {
  matrix compute_alr_probs(int n, int K, int p, matrix X, matrix beta) {
    matrix[n, K] probs;
    matrix[n, K - 1] expXbeta = exp(X * beta);
    for (i in 1:n) {
      real sum_i = sum(expXbeta[i, ]);
      for (j in 1:K) {
        if (j < K) {
          probs[i, j] = expXbeta[i, j] / (1 + sum_i);
        }
        if (j == K) probs[i, j] = 1 - sum(probs[i, 1:(K - 1)]);
      }
    }
    return probs;
  }
}
data {
  int<lower = 1> K;
  int<lower = 1> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
  matrix[n, K] delta; // delta[i, j] = 1 if Y[i] == j, 0 otherwise
}
parameters {
  matrix[p, K - 1] beta;
}
model {
  matrix[n, K] probs = compute_alr_probs(n, K, p, X, beta);
  for (i in 1:n) {
    for (j in 1:K) {
      target += delta[i, j] * log(probs[i, j]);
    }
  }
  target += normal_lpdf(to_vector(beta) | 0, 10);
}

Multinomial regression in Stan

  • Non-identifiable version.
// multi_logit_bad.stan
data {
  int<lower = 1> K;
  int<lower = 1> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
}
parameters {
  matrix[p, K] beta;
}
model {
  matrix[n, K] Xbeta = X * beta;
  for (i in 1:n) {
    target += categorical_logit_lpmf(Y[i] | Xbeta[i]');
  }
  target += normal_lpdf(to_vector(beta) | 0, 10);
}

See the Stan documentation for categorical_logit.

Multinomial regression in Stan

  • Identifiable version, using the constraint \(\boldsymbol{\beta}_K = \mathbf{0}\).
// multi_logit.stan
data {
  int<lower = 1> K;
  int<lower = 1> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
}
transformed data {
  vector[p] zeros = rep_vector(0, p);
}
parameters {
  matrix[p, K - 1] beta_raw;
}
transformed parameters {
  matrix[p, K] beta = append_col(beta_raw, zeros);
}
model {
  matrix[n, K] Xbeta = X * beta;
  for (i in 1:n) {
    target += categorical_logit_lpmf(Y[i] | Xbeta[i]');
  }
  target += normal_lpdf(to_vector(beta) | 0, 10);
}
generated quantities {
  matrix[n, K] Xbeta = X * beta; // recompute: model-block locals are not visible here
  matrix[p, K - 1] ors = exp(beta_raw); // odds ratios relative to the baseline category K
  vector[n] Y_pred;
  vector[n] log_lik;
  for (i in 1:n) {
    Y_pred[i] = categorical_logit_rng(Xbeta[i]');
    log_lik[i] = categorical_logit_lpmf(Y[i] | Xbeta[i]');
  }
}

Ordinal regression

Let \(Y_i \in \{1,\ldots,K\}\) be an ordinal outcome with \(K\) categories.

  • The likelihood in ordinal regression is identical to the one from multinomial regression,

\[f(\mathbf{Y} | \boldsymbol{\beta}) = \prod_{i=1}^n \prod_{j=1}^K P(Y_i = j)^{\delta_{ij}}.\]

  • We need to add additional constraints that guarantee ordinality.

Proportional odds assumption

  • \(P(Y_i \leq k)\) is the cumulative probability of \(Y_i\) less than or equal to a specific category \(k=1,\ldots,K-1\).

  • The odds of being less than or equal to a particular category can be defined as, \[\frac{P(Y\leq k)}{P(Y > k)} \text { for } k=1,\ldots,K-1.\]

  • Not defined for \(k = K\), since \(P(Y > K) = 0\) and we cannot divide by zero.
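  • For example, with \(K = 3\),

\[P(Y \leq 1) = P(Y = 1), \quad P(Y \leq 2) = P(Y = 1) + P(Y = 2), \quad P(Y \leq 3) = 1,\]

and the model places regression structure on the two odds \(P(Y \leq 1)/P(Y > 1)\) and \(P(Y \leq 2)/P(Y > 2)\).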

Proportional odds regression

The log odds can then be modeled as follows, \[\log \frac{P(Y_i\leq k)}{P(Y_i > k)} = \text{logit}P(Y_i\leq k) = \alpha_k - \mathbf{x}_i \boldsymbol{\beta}\]

  • Why \(-\boldsymbol{\beta}\)? (See the derivation after this list.)

  • \(\boldsymbol{\beta}\) is a common regression parameter. For a one-unit increase in \(x_{ij}\), \(\beta_j\) is the change in log odds of moving to a more severe level of the outcome \(Y_i\).

  • \(\alpha_k\) for \(k = 1,\ldots,K-1\) are \(k\)-specific intercepts; \(\alpha_k\) is the log odds of \(Y_i \leq k\) when \(\mathbf{x}_i = \mathbf{0}\).
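  • To see why the minus sign is used, flip the odds:

\[\log \frac{P(Y_i > k)}{P(Y_i \leq k)} = -(\alpha_k - \mathbf{x}_i \boldsymbol{\beta}) = \mathbf{x}_i \boldsymbol{\beta} - \alpha_k,\]

so a one-unit increase in \(x_{ij}\) increases the log odds of a more severe outcome (\(Y_i > k\)) by \(\beta_j\), and this shift is the same for every \(k\): the proportional odds assumption.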

Understanding the probabilities

  • One can solve for \(P(Y_i \leq k)\), for \(k = 1,\ldots,K-1\),

\[P(Y_i \leq k) = \frac{\exp\{\alpha_k - \mathbf{x}_i\boldsymbol{\beta}\}}{1 + \exp\{\alpha_k - \mathbf{x}_i\boldsymbol{\beta}\}} = \text{expit}(\alpha_k - \mathbf{x}_i\boldsymbol{\beta}).\]

  • \(P(Y_i \leq K) = 1\).

The individual probabilities are then given by,

\[\begin{aligned} P(Y_i = k) &= P(Y_i \leq k) - P(Y_i \leq k-1)\\ &= \text{expit}(\alpha_k - \mathbf{x}_i\boldsymbol{\beta}) - \text{expit}(\alpha_{k-1} - \mathbf{x}_i\boldsymbol{\beta}). \end{aligned}\]
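With the convention \(\alpha_0 = -\infty\) (so that \(\text{expit}(\alpha_0 - \mathbf{x}_i\boldsymbol{\beta}) = 0\)) and using \(P(Y_i \leq K) = 1\), the boundary categories are

\[P(Y_i = 1) = \text{expit}(\alpha_1 - \mathbf{x}_i\boldsymbol{\beta}), \qquad P(Y_i = K) = 1 - \text{expit}(\alpha_{K-1} - \mathbf{x}_i\boldsymbol{\beta}).\]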

A latent variable representation

  • Define a latent variable,

\[Y_i^* = \mathbf{x}_i \boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}(0, 1).\]

  • \(\mathbb{E}[\epsilon_i] = 0\).

  • \(\mathbb{V}(\epsilon_i) = \pi^2/3\).

  • CDF: \(P(\epsilon_i \leq x) = \frac{1}{1 + \exp\{-x\}} = \frac{\exp\{x\}}{1 + \exp\{x\}} = \text{expit}(x).\)

Visualizing the latent process

Adding thresholds (\(c_k\))

Getting category probabilities

A latent variable representation

  • Define a set of \(K-1\) cut-points, \((c_1,\ldots,c_{K-1}) \in \mathbb{R}^{K-1}\). We also define \(c_0 = -\infty, c_K = \infty\).

  • Our ordinal random variable can be generated as,

\[Y_i = \begin{cases} 1 & c_0 < Y_i^* \leq c_1\\ 2 & c_1 < Y_i^* \leq c_2\\ \vdots & \\ K & c_{K-1} < Y_i^* \leq c_K \end{cases}\]
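  • A minimal Stan sketch of this generative process (the file name simulate_ordinal.stan is hypothetical; \(\mathbf{X}\), \(\boldsymbol{\beta}\), and the cut-points \(c_k\) are assumed to be supplied as data):
// simulate_ordinal.stan
data {
  int<lower = 2> K;
  int<lower = 1> n;
  int<lower = 1> p;
  matrix[n, p] X;
  vector[p] beta;
  ordered[K - 1] c; // cut-points c_1 < ... < c_{K-1}
}
generated quantities {
  vector[n] Y_star;
  array[n] int Y;
  for (i in 1:n) {
    // latent continuous response with standard logistic noise
    Y_star[i] = X[i, ] * beta + logistic_rng(0, 1);
    // map the latent value to a category using the cut-points
    Y[i] = 1;
    for (k in 1:(K - 1)) {
      if (Y_star[i] > c[k]) Y[i] += 1;
    }
  }
}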

Equivalence of the two specifications

  • Probabilities under the latent specification: \[\begin{aligned} P(Y_i = k) &= P(c_{k-1} < Y_i^* \leq c_k)\\ &= P(c_{k-1} < \mathbf{x}_i \boldsymbol{\beta} + \epsilon_i \leq c_k)\\ &= P(\mathbf{x}_i \boldsymbol{\beta} + \epsilon_i \leq c_k) - P(\mathbf{x}_i \boldsymbol{\beta} + \epsilon_i < c_{k-1})\\ &= P(\epsilon_i \leq c_k - \mathbf{x}_i \boldsymbol{\beta}) - P(\epsilon_i < c_{k-1} - \mathbf{x}_i \boldsymbol{\beta})\\ &= \text{expit}(c_k - \mathbf{x}_i\boldsymbol{\beta}) - \text{expit}(c_{k-1} - \mathbf{x}_i\boldsymbol{\beta}). \end{aligned}\]

  • Equivalency:

    • \(\alpha_k = c_k, \quad k = 1,\ldots, K-1\), assuming that \(\alpha_k < \alpha_{k+1}\).

Ordinal regression using Stan

// ordinal.stan
data {
  int<lower = 2> K;
  int<lower = 0> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
}
parameters {
  vector[p] beta;
  ordered[K - 1] alpha;
}
model {
  target += ordered_logistic_glm_lpmf(Y | X, beta, alpha);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(alpha | 0, 10);
}
generated quantities {
  vector[p] ors = exp(beta);
  vector[n] Y_pred;
  vector[n] log_lik;
  for (i in 1:n) {
    Y_pred[i] = ordered_logistic_rng(X[i, ] * beta, alpha);
    log_lik[i] = ordered_logistic_glm_lpmf(Y[i] | X[i, ], beta, alpha);
  }
}

See the Stan documentation for ordered_logistic_glm_lpmf.

Enforcing order in the cutoffs

  • In Stan, when you define a parameter as ordered[K-1] alpha;, the values of alpha are automatically constrained to be strictly increasing.

  • This transformation ensures that the alpha values follow the required order, i.e., alpha[1] < alpha[2] < ... < alpha[K-1].

  • Stan doesn’t sample alpha directly but instead works with an unconstrained parameter vector, which we will call gamma.

Enforcing order in the cutoffs

parameters {
  vector[K - 1] gamma;
}
transformed parameters {
  vector[K - 1] alpha;
  alpha[1] = gamma[1];
  for (j in 2:(K - 1)) {
    alpha[j] = alpha[j - 1] + exp(gamma[j]);
  }
}
  • Here, gamma represents a vector of independent, unconstrained variables, and the transformation ensures that alpha is strictly increasing by construction.

  • Luckily, we can simply declare the parameter as ordered, since Stan takes care of this transformation (including the Jacobian adjustment) in the background.
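  • For reference, the transform above has \(\partial \alpha_j / \partial \gamma_j = \exp(\gamma_j)\) for \(j \geq 2\) (and 1 for \(j = 1\)), so the log absolute Jacobian determinant is

\[\log \left| \det \frac{\partial \boldsymbol{\alpha}}{\partial \boldsymbol{\gamma}} \right| = \sum_{j=2}^{K-1} \gamma_j,\]

which Stan adds to the target automatically when the parameter is declared ordered.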

Prepare for next class

  • Work on HW 03.

  • Complete the reading to prepare for next Tuesday's lecture.

  • Tuesday’s lecture: Missing data