Multiclass Classification

Prof. Sam Berchuck

Feb 20, 2025

Review of last lecture

  • On Tuesday, we learned about classification using logistic regression.

  • Today, we will focus on multiclass classification: multinomial regression and ordinal regression.

Multiclass regression

  • One often encounters an outcome variable that is categorical with more than two categories.

  • If there is no inherent rank or order to the variable, we can use multinomial regression. Examples include:

    • gender (male, female, non-binary),

    • blood type (A, B, AB, O).

  • If there is an order to the variable, we can use ordinal regression. Examples include:

    • stages of cancer (stage I, II, III, IV),

    • pain level (mild, moderate, severe).

Multinomial random variable

  • Assume an outcome $Y_i \in \{1,\dots,K\}$ for $i = 1,\dots,n$.

  • The likelihood in multinomial regression can be written as the following categorical likelihood,

$$f(Y \mid \beta) = \prod_{i=1}^{n} \prod_{j=1}^{K} P(Y_i = j)^{\delta_{ij}},$$

  • $\delta_{ij} = 1(Y_i = j)$ is the Kronecker delta.

  • Since $Y_i$ is discrete, we only need to specify $P(Y_i = j)$ for all $i$ and $j$; a small worked example follows.
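
  • For concreteness, here is a small worked example (hypothetical values, not from the original slides): suppose $K = 3$ and observation $i$ has $Y_i = 2$, so $\delta_{i1} = \delta_{i3} = 0$ and $\delta_{i2} = 1$. The likelihood contribution of observation $i$ reduces to the probability of the observed category:

$$\prod_{j=1}^{3} P(Y_i = j)^{\delta_{ij}} = P(Y_i = 1)^{0}\, P(Y_i = 2)^{1}\, P(Y_i = 3)^{0} = P(Y_i = 2).$$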

Log-linear regression

  • One way to motivate multinomial regression is through a log-linear specification:

$$\log P(Y_i = j) = x_i \beta_j - \log Z.$$

  • $\beta_j$ is a category-specific ($j$-specific) vector of regression parameters.

  • $P(Y_i = j) = \exp\{x_i \beta_j\}/Z$.

  • $Z$ is a normalizing constant that guarantees that $\sum_{j=1}^{K} P(Y_i = j) = 1$.

Finding the normalizing constant

  • We know that,

$$1 = \sum_{j=1}^{K} P(Y_i = j) = \frac{1}{Z} \sum_{j=1}^{K} \exp\{x_i \beta_j\} \;\implies\; Z = \sum_{j=1}^{K} \exp\{x_i \beta_j\}.$$

Multinomial probabilities

  • Thus, we have the following,

$$P(Y_i = k) = \frac{\exp\{x_i \beta_k\}}{\sum_{j=1}^{K} \exp\{x_i \beta_j\}}.$$

  • This function is called the softmax function.

  • Unfortunately, this specification is not identifiable.

Identifiability issue

  • We can add a vector c to all parameters and get the same result,

$$\frac{\exp\{x_i(\beta_k + c)\}}{\sum_{j=1}^{K} \exp\{x_i(\beta_j + c)\}} = \frac{\exp\{x_i \beta_k\}}{\sum_{j=1}^{K} \exp\{x_i \beta_j\}}.$$

  • A common solution is to set $\beta_K = 0$; a numerical illustration of this invariance is given below.
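
  • As a quick numerical check (hypothetical numbers, not from the slides), take $K = 3$ with linear predictors $(x_i\beta_1, x_i\beta_2, x_i\beta_3) = (1, 0, -1)$, and add a constant $x_i c = 1$ to each entry to get $(2, 1, 0)$:

$$\operatorname{softmax}(1, 0, -1) = \operatorname{softmax}(2, 1, 0) \approx (0.665,\ 0.245,\ 0.090).$$

  • Both sets of linear predictors give identical probabilities, so the data cannot distinguish them; fixing $\beta_K = 0$ removes this redundancy.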

Updating the probabilities

  • Using the identifiability constraint $\beta_K = 0$, the probabilities become

$$P(Y_i = k) = \frac{\exp\{x_i \beta_k\}}{1 + \sum_{j=1}^{K-1} \exp\{x_i \beta_j\}}, \quad k \in \{1,\dots,K-1\}, \qquad P(Y_i = K) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\{x_i \beta_j\}}.$$

  • How should we interpret the $\beta_k$?

Deriving the additive log ratio model

  • Using our specification of the probabilities, it can be seen that,

$$P(Y_i = k) = \frac{\exp\{x_i \beta_k\}}{1 + \sum_{j=1}^{K-1} \exp\{x_i \beta_j\}} = \left[\frac{1}{1 + \sum_{j=1}^{K-1} \exp\{x_i \beta_j\}}\right] \exp\{x_i \beta_k\} = P(Y_i = K)\, \exp\{x_i \beta_k\}.$$

$$\implies \log \frac{P(Y_i = k)}{P(Y_i = K)} = x_i \beta_k, \quad k \in \{1,\dots,K-1\}.$$

Additive log ratio model

  • If outcome $K$ is chosen as the reference, the $K-1$ regression equations are:

$$\log \frac{P(Y_i = k)}{P(Y_i = K)} = x_i \beta_k, \quad k \in \{1,\dots,K-1\}.$$

  • This formulation is called the additive log ratio model.

  • $\beta_{jk}$ (the $j$-th element of $\beta_k$) represents the change in the log-odds of being in category $k$ relative to the baseline category $K$ for a one-unit increase in $X_{ij}$.

    • $\exp(\beta_{jk})$ is an odds ratio (see the example below).
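
    • For instance (hypothetical numbers, not from the slides), if $\beta_{jk} = 0.7$, then $\exp(0.7) \approx 2.01$: a one-unit increase in $X_{ij}$ roughly doubles the odds of category $k$ relative to the baseline category $K$, holding the other covariates fixed.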

Getting back to the likelihood

  • The log-likelihood can be written as,

$$\log f(Y \mid \beta) = \sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{ij}\, \log P(Y_i = j).$$

  • The $P(Y_i = j)$ are given by the additive log ratio model.

  • As Bayesians, we only need to specify priors for $\beta_k$, $k \in \{1,\dots,K-1\}$.
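
  • As a concrete sketch (this is the choice made in the Stan programs below, not a requirement), one can place independent, weakly informative normal priors on every coefficient:

$$\beta_{jk} \sim N(0, 10^2), \quad j = 1,\dots,p, \; k = 1,\dots,K-1.$$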

Multinomial regression in Stan

  • Hard coding the likelihood.
// additive_log_ratio.stan
functions {
  // Compute the n x K matrix of category probabilities under the additive
  // log ratio (ALR) parameterization, with the constraint beta_K = 0.
  matrix compute_alr_probs(int n, int K, int p, matrix X, matrix beta) {
    matrix[n, K] probs;
    matrix[n, K - 1] expXbeta = exp(X * beta);
    for (i in 1:n) {
      real sum_i = sum(expXbeta[i, ]);
      for (j in 1:K) {
        if (j < K) {
          probs[i, j] = expXbeta[i, j] / (1 + sum_i);
        }
        if (j == K) probs[i, j] = 1 - sum(probs[i, 1:(K - 1)]);
      }
    }
    return probs;
  }
}
data {
  int<lower = 1> K;                      // number of categories
  int<lower = 1> n;                      // number of observations
  int<lower = 1> p;                      // number of covariates
  array[n] int<lower = 1, upper = K> Y;  // outcomes (enter the model through delta)
  matrix[n, p] X;                        // design matrix
  matrix[n, K] delta;                    // delta[i, j] = 1(Y[i] == j)
}
parameters {
  matrix[p, K - 1] beta;                 // beta_K = 0 is implicit
}
model {
  matrix[n, K] probs = compute_alr_probs(n, K, p, X, beta);
  // Hard-coded categorical log-likelihood: sum_i sum_j delta[i, j] * log P(Y_i = j)
  for (i in 1:n) {
    for (j in 1:K) {
      target += delta[i, j] * log(probs[i, j]);
    }
  }
  target += normal_lpdf(to_vector(beta) | 0, 10);  // weakly informative prior
}
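
  • As an aside (my own suggestion, not shown on the slides), the delta matrix does not have to be passed in as data: since Y is already available, it can be built inside the program. A minimal sketch, assuming the data block above with the delta declaration removed:

transformed data {
  matrix[n, K] delta = rep_matrix(0, n, K);
  for (i in 1:n) delta[i, Y[i]] = 1;  // delta[i, j] = 1(Y[i] == j)
}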

Multinomial regression in Stan

  • Non-identifiable version: all $K$ columns of beta are free.
// multi_logit_bad.stan
data {
  int<lower = 1> K;
  int<lower = 1> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
}
parameters {
  matrix[p, K] beta;  // all K columns free: not identifiable (softmax is shift-invariant)
}
model {
  matrix[n, K] Xbeta = X * beta;
  for (i in 1:n) {
    target += categorical_logit_lpmf(Y[i] | Xbeta[i]');  // softmax applied internally
  }
  target += normal_lpdf(to_vector(beta) | 0, 10);
}

  • See the Stan function reference for categorical_logit.

Multinomial regression in Stan

  • Version with the zero identifiability constraint ($\beta_K = 0$).
// multi_logit.stan
data {
  int<lower = 1> K;
  int<lower = 1> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
}
transformed data {
  vector[p] zeros = rep_vector(0, p);
}
parameters {
  matrix[p, K - 1] beta_raw;
}
transformed parameters {
  matrix[p, K] beta = append_col(beta_raw, zeros);
}
model {
  matrix[n, K] Xbeta = X * beta;
  for (i in 1:n) {
    target += categorical_logit_lpmf(Y[i] | Xbeta[i]');
  }
  target += normal_lpdf(to_vector(beta) | 0, 10);  // the fixed zero column contributes only a constant
}
generated quantities {
  matrix[p, K - 1] ors = exp(beta_raw);  // odds ratios relative to the baseline category K
  vector[n] Y_pred;
  vector[n] log_lik;
  {
    matrix[n, K] Xbeta = X * beta;  // recompute: locals from the model block are not visible here
    for (i in 1:n) {
      Y_pred[i] = categorical_logit_rng(Xbeta[i]');
      log_lik[i] = categorical_logit_lpmf(Y[i] | Xbeta[i]');
    }
  }
}

Ordinal regression

Let $Y_i \in \{1,\dots,K\}$ be an ordinal outcome with $K$ ordered categories.

  • The likelihood in ordinal regression is identical to the one from multinomial regression,

$$f(Y \mid \beta) = \prod_{i=1}^{n} \prod_{j=1}^{K} P(Y_i = j)^{\delta_{ij}}.$$

  • We need additional constraints that guarantee the model respects the ordering of the categories.

Proportional odds assumption

  • $P(Y_i \le k)$ is the cumulative probability that $Y_i$ is less than or equal to a specific category $k = 1,\dots,K-1$.

  • The odds of being less than or equal to a particular category can be defined as $\dfrac{P(Y \le k)}{P(Y > k)}$ for $k = 1,\dots,K-1$.

  • These odds are not defined for $k = K$, since $P(Y > K) = 0$ and division by zero is undefined.
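
  • For example (a hypothetical case with $K = 4$ ordered categories), the cumulative odds at $k = 2$ are

$$\frac{P(Y \le 2)}{P(Y > 2)} = \frac{P(Y = 1) + P(Y = 2)}{P(Y = 3) + P(Y = 4)}.$$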

Proportional odds regression

The log odds can then be modeled as follows,

$$\log \frac{P(Y_i \le k)}{P(Y_i > k)} = \operatorname{logit} P(Y_i \le k) = \alpha_k - x_i \beta.$$

  • Why $-\beta$?

  • $\beta$ is a common regression parameter (shared across all $k$). For a one-unit increase in $x_{ij}$, $\beta_j$ is the change in the log odds of moving to a more severe level of the outcome $Y_i$. The minus sign in $\alpha_k - x_i\beta$ is what makes a positive $\beta_j$ push probability toward higher (more severe) categories.

  • $\alpha_k$, for $k = 1,\dots,K-1$, are category-specific intercepts: $\alpha_k$ is the log odds of $Y_i \le k$ versus $Y_i > k$ when $x_i = 0$.

Understanding the probabilities

  • One can solve for $P(Y_i \le k)$, $k = 1,\dots,K-1$,

$$P(Y_i \le k) = \frac{\exp\{\alpha_k - x_i\beta\}}{1 + \exp\{\alpha_k - x_i\beta\}} = \operatorname{expit}(\alpha_k - x_i\beta).$$

  • $P(Y_i \le K) = 1$.

The individual probabilities are then given by (with the conventions $\alpha_0 = -\infty$ and $\alpha_K = \infty$, so that $\operatorname{expit}(\alpha_0 - x_i\beta) = 0$ and $\operatorname{expit}(\alpha_K - x_i\beta) = 1$),

$$P(Y_i = k) = P(Y_i \le k) - P(Y_i \le k-1) = \operatorname{expit}(\alpha_k - x_i\beta) - \operatorname{expit}(\alpha_{k-1} - x_i\beta).$$
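
  • As a small worked example (hypothetical values, not from the slides), take $K = 3$, $\alpha_1 = -1$, $\alpha_2 = 1$, and $x_i\beta = 0.5$:

$$P(Y_i = 1) = \operatorname{expit}(-1.5) \approx 0.18, \quad P(Y_i = 2) = \operatorname{expit}(0.5) - \operatorname{expit}(-1.5) \approx 0.44, \quad P(Y_i = 3) = 1 - \operatorname{expit}(0.5) \approx 0.38.$$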

A latent variable representation

  • Define a latent variable,

$$Y_i^* = x_i\beta + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}(0, 1).$$

  • $E[\epsilon_i] = 0$.

  • $V(\epsilon_i) = \pi^2/3$.

  • CDF: $P(\epsilon_i \le x) = \dfrac{1}{1 + \exp\{-x\}} = \dfrac{\exp\{x\}}{1 + \exp\{x\}} = \operatorname{expit}(x)$.

Visualizing the latent process

[Figure slides: visualizing the latent process, adding thresholds ($c_k$), and getting category probabilities.]

A latent variable representation

  • Define a set of $K-1$ cut-points, $(c_1,\dots,c_{K-1}) \in \mathbb{R}^{K-1}$. We also define $c_0 = -\infty$ and $c_K = \infty$.

  • Our ordinal random variable can be generated as,

$$Y_i = \begin{cases} 1 & c_0 < Y_i^* \le c_1, \\ 2 & c_1 < Y_i^* \le c_2, \\ \;\vdots & \\ K & c_{K-1} < Y_i^* \le c_K. \end{cases}$$
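
  • To make this generative view concrete, here is a minimal simulation sketch in Stan (my own illustration, not from the slides; the file name and variable names are hypothetical). It draws $\epsilon_i \sim \text{Logistic}(0,1)$, forms $Y_i^* = x_i\beta + \epsilon_i$, and bins $Y_i^*$ using the cut-points:

// simulate_ordinal.stan (illustration only; run with the fixed_param algorithm)
data {
  int<lower = 2> K;
  int<lower = 1> n;
  int<lower = 1> p;
  matrix[n, p] X;
  vector[p] beta;
  vector[K - 1] c;  // cut-points, assumed increasing: c[1] < ... < c[K - 1]
}
generated quantities {
  array[n] int<lower = 1, upper = K> Y_sim;
  for (i in 1:n) {
    real Y_star = X[i, ] * beta + logistic_rng(0, 1);  // latent Y* = x_i * beta + eps
    int k = 1;
    while (k < K && Y_star > c[k]) k += 1;  // find the interval (c[k-1], c[k]] containing Y*
    Y_sim[i] = k;
  }
}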

Equivalence of the two specifications

  • Probabilities under the latent specification:

$$\begin{aligned} P(Y_i = k) &= P(c_{k-1} < Y_i^* \le c_k) = P(c_{k-1} < x_i\beta + \epsilon_i \le c_k) \\ &= P(x_i\beta + \epsilon_i \le c_k) - P(x_i\beta + \epsilon_i < c_{k-1}) \\ &= P(\epsilon_i \le c_k - x_i\beta) - P(\epsilon_i < c_{k-1} - x_i\beta) \\ &= \operatorname{expit}(c_k - x_i\beta) - \operatorname{expit}(c_{k-1} - x_i\beta). \end{aligned}$$

  • Equivalency:

    • $\alpha_k = c_k$, $k = 1,\dots,K-1$, assuming that $\alpha_k < \alpha_{k+1}$.

Ordinal regression using Stan

// ordinal.stan
data {
  int<lower = 2> K;
  int<lower = 0> n;
  int<lower = 1> p;
  array[n] int<lower = 1, upper = K> Y;
  matrix[n, p] X;
}
parameters {
  vector[p] beta;
  ordered[K - 1] alpha;  // cut-points, automatically constrained to be increasing
}
model {
  target += ordered_logistic_glm_lpmf(Y | X, beta, alpha);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(alpha | 0, 10);
}
generated quantities {
  vector[p] ors = exp(beta);
  vector[n] Y_pred;
  vector[n] log_lik;
  for (i in 1:n) {
    Y_pred[i] = ordered_logistic_rng(X[i, ] * beta, alpha);
    log_lik[i] = ordered_logistic_glm_lpmf(Y[i] | X[i, ], beta, alpha);
  }
}

  • See the Stan function reference for ordered_logistic_glm_lpmf.
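
  • As a usage note (my own aside, not from the slides), the GLM form is a vectorized convenience; an equivalent model block written with the element-wise ordered_logistic distribution would be, assuming the same data, parameters, and priors as in ordinal.stan above:

model {
  for (i in 1:n) {
    target += ordered_logistic_lpmf(Y[i] | X[i, ] * beta, alpha);  // same likelihood, one row at a time
  }
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(alpha | 0, 10);
}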

Enforcing order in the cutoffs

  • In Stan, when you define a parameter as ordered[K-1] alpha;, the values of alpha are automatically constrained to be strictly increasing.

  • This transformation ensures that the alpha values follow the required order, i.e., alpha[1] < alpha[2] < ... < alpha[K-1].

  • Stan doesn’t sample alpha directly but instead works with an unconstrained parameter vector, which we will call gamma.

Enforcing order in the cutoffs

parameters {
  vector[K - 1] gamma;  // unconstrained
}
transformed parameters {
  vector[K - 1] alpha;
  alpha[1] = gamma[1];
  for (j in 2:(K - 1)) {
    alpha[j] = alpha[j - 1] + exp(gamma[j]);  // each step adds a positive increment
  }
}
  • Here, gamma represents a vector of independent, unconstrained variables, and the transformation ensures that alpha is strictly increasing by construction.

  • Luckily, we can use ordered, since Stan takes care of this transformation (including the Jacobian adjustment) in the background; a sketch of that adjustment is given below.
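
  • For completeness, here is a minimal sketch (my own illustration, not from the slides) of the adjustment that would be needed if you used the manual transform above and placed the prior directly on alpha. Because $\alpha_j = \alpha_{j-1} + \exp(\gamma_j)$, the Jacobian of the map $\gamma \mapsto \alpha$ is lower triangular with diagonal $(1, \exp(\gamma_2), \dots, \exp(\gamma_{K-1}))$, so its log determinant is $\sum_{j=2}^{K-1} \gamma_j$:

model {
  // Prior stated on alpha, so add the log Jacobian of gamma -> alpha (assumes K > 2).
  target += sum(gamma[2:(K - 1)]);       // log |det J| = gamma[2] + ... + gamma[K - 1]
  target += normal_lpdf(alpha | 0, 10);  // same prior as in ordinal.stan
}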

Prepare for next class

  • Work on HW 03.

  • Complete the reading to prepare for next Tuesday’s lecture.

  • Tuesday’s lecture: Missing data
