What is a Gaussian mixture model?

Author

Joshua Noble

Data Scientist

Gaussian mixture models, defined

A Gaussian mixture model (GMM) is a probabilistic model that represents data as a combination of several Gaussian distributions, each with its own mean and variance, weighted by a mixing coefficient. GMMs are commonly used for clustering and density estimation, since they can capture complex, multimodal distributions where data points may naturally group around different centers rather than a single mean.

A single Gaussian distribution, also called a “normal distribution,” describes many kinds of natural phenomena. The distribution of students’ heights in a classroom, newborn infants’ weights and mechanical parts’ operational lifespans are often Gaussian distributions.

However, a single Gaussian distribution isn’t suitable for modeling datasets with multiple clusters of data or those with a significant skew or heavy tails. In these cases, a GMM might be more appropriate.

A GMM uses unsupervised learning to generate a probabilistic model that assumes data is generated from a combination of several Gaussian distributions. Instead of assuming all data comes from a single normal distribution (one Gaussian model), a GMM assumes there are multiple normal distributions, each representing a different “cluster” or “subpopulation” in the dataset, and each of which has its own mean and variance.

In the case of students, imagine heights with a bimodal distribution, but the students’ gender identity is unknown. In the case of machine parts, imagine that parts may have come from two different suppliers, one of which makes higher quality parts than the other. In both cases, it could be useful to calculate which sub-population a data point belongs to and the characteristics of that sub-population.


How Gaussian mixture models work 

GMMs have many real-world applications beyond clustering: segmentation, density estimation, anomaly detection and pattern recognition all can be approximated by a GMM.

Here is a challenging distribution that is clearly not Gaussian:

One could attempt to find the equation of this curve using polynomial fitting or trigonometric approximation, but GMMs offer a robust alternative that can be less computationally demanding. This distribution is actually three different Gaussian distributions combined:

A GMM would decompose the above distribution into three different Gaussian distributions and calculate the parameters for each one. The distributions shown above have a single dimension, but a GMM works for higher dimensional distributions as well. A 2D mixture of two Gaussians can be decomposed into two different distributions.

 

When used as a clustering algorithm, each Gaussian in the mixture model has three key parameters:

  • Mean vector (μ): the center of the cluster. In a 1D distribution, this is a single value. In an n-dimensional distribution, it is a vector with n components.

  • Covariance matrix (Σ): this is the spread/shape of the Gaussian itself. In a 1D distribution, this is a single value (the variance); in an n-dimensional distribution, it is an n x n matrix.

  • Mixing weight (π): this is the probability that a randomly chosen data point was generated by that component. It isn't a feature of the Gaussian distribution itself but of the mixture model, which combines the weighted Gaussian distributions to represent the data being fitted.
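Once a model is fitted, these three parameter sets can be read off directly. The following is a minimal sketch, assuming scikit-learn's GaussianMixture (used again later in this article) and synthetic data from make_blobs; the attribute names means_, covariances_ and weights_ come from that library's API.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# fit a two-component GMM to synthetic 2D data
X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)        # mean vectors (mu), one row per component
print(gmm.covariances_)  # covariance matrices (Sigma), one 2 x 2 matrix per component
print(gmm.weights_)      # mixing weights (pi), which sum to 1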

 


Fitting a GMM

The goal of a GMM is to estimate both the parameters of each Gaussian distribution in the model and which of those Gaussians each data point belongs to. The latent variable, often referred to as z, indicates which Gaussian component, out of all components in the model, generated a given data point. This variable is "latent" because it is hidden (unobserved) in the data and must be inferred by the model.

For each data point xi, there is a latent variable zi that indicates which of the K Gaussian components generated xi. zi is never observed in the data; only the point xi is. Because the component that produced xi cannot be observed directly, the model's expectation-maximization algorithm instead infers a distribution over the possible values of z.

Each Gaussian component is weighted by a mixing coefficient πk, which represents the prior probability that a data point was generated by that component. In a clustering scenario, the mixing weight reflects the relative size of each cluster. The GMM states: to find the probability of x, imagine first choosing one Gaussian at random according to πk, then drawing x from that Gaussian. So p(x) is a mixture of the component densities. If x is close to multiple means μk, several Gaussians may assign it a high probability, and their contributions add up. The full model is the weighted sum of these Gaussian probability distributions.

Mathematically, the probability density function of a data point x under a GMM with K components is:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

To break this down:

πk: this is the mixing weight for mixture component k, which is an estimate of how much Gaussian k contributes to the overall mixture.

N(x | μk, Σk): this is the Gaussian distribution with:

  • μk: the mean vector, which can be thought of as the center of Gaussian k

  • Σk: the covariance matrix, which represents the "spread and orientation" of Gaussian k

The total probability density at x is p(x), which is a weighted sum of all the Gaussian components.
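As a quick numerical illustration of this formula (not from the original article), the density can be computed by summing the weighted component densities; the sketch below assumes SciPy's multivariate_normal and uses made-up parameter values.

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical parameters for a two-component, 2D mixture (illustrative values only)
weights = [0.6, 0.4]                                      # pi_k, must sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]      # mu_k
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]    # Sigma_k

def mixture_density(x):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([1.0, 1.0])))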

Expectation maximization

A GMM is most often fitted using the expectation-maximization (EM) algorithm, which iteratively assigns each data point a probability of belonging to each Gaussian (the E-step) and then updates the parameters of each Gaussian (the M-step).

EM is a powerful way to estimate parameters when direct maximum likelihood estimation (MLE) is difficult to apply, as it is for a GMM. A GMM is almost always fit by maximizing a log-likelihood function, but because the mixture places a sum of Gaussians inside the logarithm, that log-likelihood is non-linear and cannot be maximized analytically. Additionally, a GMM has latent variables (the component assignments), which are not directly observable in the data, so MLE cannot estimate them directly.

Another approach, stochastic gradient descent (SGD), requires that the underlying objective function be differentiable, which may not always be the case. Additionally, unlike EM, SGD can’t be easily parallelized and requires significant compute resources for large data. Parallelizing EM using an approach like map-reduce is a powerful optimization.

EM consists of four steps:

 1. Initialization

The EM algorithm starts with randomized initial parameter values and assumes the observed data comes from a model that can be estimated. Many implementations of GMM will allow practitioners to pick from a variety of initializations like setting initial responsibilities using K-means, random values or sampling from the training data.
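For instance, scikit-learn's GaussianMixture exposes an init_params argument that chooses how the initial responsibilities are set, and an n_init argument that runs EM from several different starts and keeps the best result. The snippet below is a sketch of that behavior rather than a prescription; the argument values are illustrative.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

# 'kmeans' seeds responsibilities from a k-means run; 'random' starts from random responsibilities
gmm_kmeans = GaussianMixture(n_components=3, init_params='kmeans', n_init=5, random_state=0).fit(X)
gmm_random = GaussianMixture(n_components=3, init_params='random', n_init=5, random_state=0).fit(X)

# compare the converged lower bounds on the log-likelihood for the two initializations
print(gmm_kmeans.lower_bound_, gmm_random.lower_bound_)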

2. E-Step (expectation step)

We compute the posterior probability, which is a “soft assignment” of data points to components. This asks, in effect, given the current guesses of parameters, “how much does each Gaussian ‘own’ each data point?”

First, the posterior probability of each latent variable is calculated based on the observed data. The probability that zi = k, i.e., that xi belongs to the k-th component, can be computed using Bayes' rule:

P(z_i = k \mid x_i; \theta) = \frac{p(x_i \mid z_i = k; \theta) \, P(z_i = k; \theta)}{p(x_i; \theta)}

Next, the log-likelihood of the observed data is computed using the current parameter estimates. The expected log-likelihood with respect to the distribution of the latent variables can now be written as follows:

Q(\theta, \theta^{\text{old}}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(z_{ik}) \log\left[ w_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k) \right]

The function Q is a weighted sum of the log-likelihoods of all the data points under each Gaussian component, with the weights being the responsibilities. The log-likelihood calculates how likely it is, given the current estimates of each Gaussian component's parameters, that the data point could arise from that distribution. This is different from the likelihood of the observed data under the mixture model as a whole. Instead, this Q-function represents an expected log-likelihood over both the observed data and the estimated latent variable distributions.
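A minimal NumPy sketch of this E-step follows, assuming parameters are stored as arrays and using SciPy for the Gaussian densities; gamma[i, k] holds the responsibility γ(zik), that is, the posterior probability that point i came from component k.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    # gamma[i, k] = P(z_i = k | x_i; theta), computed with Bayes' rule
    n, K = X.shape[0], len(weights)
    gamma = np.zeros((n, K))
    for k in range(K):
        gamma[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    gamma /= gamma.sum(axis=1, keepdims=True)  # divide by p(x_i), the denominator in Bayes' rule
    return gamma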

3. M-step (maximization step)

The M-step updates three distinct values for each Gaussian distribution:

  • The mean, typically represented as μk^new

  • The covariance matrix, typically represented as Σk^new

  • The mixture weight, typically represented as wk^new

The next step is updating the model parameters by maximizing the log-likelihood of the model producing the data. The better the model, the higher this value will be.

\mu_k^{\text{new}} = \frac{\sum_{i=1}^{n} \gamma(z_{ik}) \, x_i}{\sum_{i=1}^{n} \gamma(z_{ik})}

That is, the new mean of the k-th component is a weighted average of all the data points, with the weights being the probabilities that these points belong to component k.

\Sigma_k^{\text{new}} = \frac{\sum_{i=1}^{n} \gamma(z_{ik}) \, (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^{\top}}{\sum_{i=1}^{n} \gamma(z_{ik})}

This represents how the new covariance of component k is a weighted average of the squared deviations of each data point from the component's mean, where the weights are the probabilities of the points assigned to that component.

Finally, the M-Step updates the mixture weights:

w_k^{\text{new}} = \frac{1}{n} \sum_{i=1}^{n} \gamma(z_{ik})

The new weight of the k-th component is the total probability of the points belonging to this component, normalized by the number of points n.
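The three updates can be written compactly in NumPy. The sketch below assumes the same layout as the E-step sketch above, with gamma[i, k] holding the responsibilities; it is an illustration of the formulas, not a production implementation.

import numpy as np

def m_step(X, gamma):
    n, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                 # effective number of points assigned to each component
    weights = Nk / n                       # w_k^new
    means = (gamma.T @ X) / Nk[:, None]    # mu_k^new: responsibility-weighted averages
    covs = []
    for k in range(K):
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # Sigma_k^new
    return weights, means, covs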

4. Convergence

Finally, EM checks that the model parameters are stable and converging. If the changes in log-likelihood or parameters are below a set threshold, the algorithm stops. If not, EM repeats steps 2 and 3 until convergence is reached.

In short, the EM algorithm consists of two steps iteratively repeated. First, the E-step computes the responsibilities of all the Gaussians for each data point. Then the M-step uses those updated responsibilities to re-estimate the parameters of each Gaussian. EM then compares the change in log-likelihood and, if it is below a set threshold, assumes convergence and stops iterating.
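Putting the pieces together, a hedged sketch of the overall loop with a log-likelihood convergence check is shown below; it reuses the hypothetical e_step and m_step functions from the earlier sketches, and the tolerance and iteration cap are arbitrary.

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    # log-likelihood of the data under the current mixture
    density = sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
                  for w, m, c in zip(weights, means, covs))
    return np.log(density).sum()

def fit_em(X, weights, means, covs, tol=1e-6, max_iter=200):
    previous = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, weights, means, covs)    # E-step: soft assignments
        weights, means, covs = m_step(X, gamma)    # M-step: parameter updates
        current = log_likelihood(X, weights, means, covs)
        if current - previous < tol:               # stop once the improvement is negligible
            break
        previous = current
    return weights, means, covs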

Comparing GMMs

GMMs are powerful, but they rely on Gaussian assumptions. For GMMs to represent data well, the clusters need to be roughly elliptical and the densities across the clusters smooth. Clusters with non-elliptical shapes or data with highly dense and sparse sections might not be represented well by a GMM.

When used for clustering, GMMs are similar to k-means clustering but have several key differences. First, unlike k-means, which assigns each point to one cluster, GMMs give probabilities of belonging to each cluster. This is called “soft clustering.” Since clusters can be both elliptical and overlapping, GMMs are often more flexible and allow for more uncertainty in cluster boundaries.
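This difference shows up directly in library APIs. As an illustrative sketch (assuming scikit-learn), KMeans returns a single hard label per point, while GaussianMixture's predict_proba returns one probability per cluster:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
soft_probs = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X)

print(hard_labels[:5])           # one cluster index per point
print(soft_probs[:5].round(3))   # rows sum to 1; overlapping points get split probabilities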

For binary or categorical data, GMMs do not work well, but a similar mixture approach using Bernoulli or multinomial distributions can fit such data. Conversely, those models will not fit data that consists of continuous variables, where a GMM often will.

Since GMMs try to estimate the parameters of Gaussian distributions, some data will be better modeled using a nonparametric method like kernel density estimation (KDE). A KDE doesn't make any assumptions about the distributions of clusters or sub-populations; instead, it estimates density using small, local kernels centered on each data point. This approach is useful when the data consists of complex distributions that don't follow any particular shape.
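For comparison, a minimal sketch of kernel density estimation with scikit-learn's KernelDensity is shown below; the bandwidth value is illustrative and would normally be tuned to the data.

import numpy as np
from sklearn.neighbors import KernelDensity

# bimodal 1D data with no parametric form assumed
X = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 0.5, 100)]).reshape(-1, 1)

kde = KernelDensity(kernel='gaussian', bandwidth=0.4).fit(X)
grid = np.linspace(-4, 8, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))   # score_samples returns the log-density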

An extension of GMM is the variational autoencoder (VAE), a generative model that learns flexible latent distributions. The overall goal is similar, but a VAE doesn't use EM; it uses a probabilistic encoder-decoder framework to learn latent representations, much as a GMM assigns responsibilities to each data point. The primary difference is that EM requires the posterior probability to be computed exactly, whereas a VAE only approximates it, making it far more flexible. The tradeoff is that a VAE is often more complex and time-consuming to train.
 

GMM use cases

This explainer has focused extensively on clustering since it provides an intuitive introduction to GMMs, but there are other scenarios in which GMMs can be helpful. Feature engineering, anomaly detection and density estimation are all common tasks in which GMMs can be powerful.

Feature engineering: while some machine learning algorithms like XGBoost can learn from a wide variety of input feature distributions, others are stricter in their requirements. Linear and logistic regression, linear discriminant analysis (LDA) and methods based on the multivariate Gaussian distribution typically expect features to be normally distributed and may not function well if the data is multimodal. There are also analytical and visual reasons to deal with multimodality, and a GMM can help.

Unsupervised classification: GMM works similarly to the K-means algorithm but allows probabilistic determination of class membership, unlike K-means, where each point receives a single hard assignment. This can be especially beneficial for use cases that require custom thresholds for categorization or a probabilistic output.

Anomaly detection: a fitted GMM can be used to identify data points that have a low probability under the mixture's Gaussian distributions. In this way, a GMM can help find two types of anomalous data: anomalies that are outliers of a population (for example, a mistake in data entry) and anomalies that form their own group (such as credit card fraud behavior).
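As a hedged sketch of this idea with scikit-learn: score_samples returns the log-density of each point under the fitted mixture, and points below a chosen percentile can be flagged as candidate anomalies (the 1% cutoff here is purely illustrative).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=2, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

log_density = gmm.score_samples(X)            # log p(x) for each point under the mixture
threshold = np.percentile(log_density, 1)     # flag the lowest-density 1% as anomalies
anomalies = X[log_density < threshold]
print(len(anomalies), "candidate anomalies")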

The GMM is a model suitable for a wide variety of tasks that can be trained quickly and optimized easily. Though it has some limitations in terms of the types of data that it is well-suited to deal with, it can be valuable in a wide variety of machine learning and data science tasks.

Implementing GMMs

In Python, one can use the scikit-learn library to quickly create a GMM:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score

# create some clusters
X, y = make_blobs(n_samples=400, centers=3, cluster_std=0.75, random_state=0)

# fit the GMM
gmm = GaussianMixture(n_components=3).fit(X)

Visualizing the results of GMMs clustering:

import matplotlib.pyplot as plt

# predict the labels themselves
labels = gmm.predict(X)

# print the accuracy
# note: GMM cluster labels are arbitrary, so a low score may only mean the
# cluster numbering is permuted relative to the ground-truth labels in y
print(f"Accuracy is {accuracy_score(y, labels)}")

# scatterplot the X values, colored by predicted cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
plt.show()

In R, a package called mclust (short for model-based clustering) can be used to create GMMs.

# Install (if needed) and load the 'mclust' package
# install.packages("mclust")
library(mclust)

# create a matrix of data drawn from normal distributions
set.seed(0)
data <- rbind(matrix(c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 1, sd = 1.25)), ncol = 2),
              matrix(c(rnorm(50, mean = 4, sd = 1), rnorm(50, mean = 2, sd = 1.25)), ncol = 2),
              matrix(c(rnorm(50, mean = 8, sd = 1.25), rnorm(50, mean = 4, sd = 0.75)), ncol = 2))

# Perform GMM clustering; G is the number of expected clusters
gmm_model <- Mclust(data, G = 3)

# Get the cluster assignments
cluster_assignments <- predict(gmm_model)$classification

# Visualize the results
plot(data, col = cluster_assignments, main = "GMM Clustering Results")
# parameters$mean stores one column per component, so transpose it before plotting the centers
points(t(gmm_model$parameters$mean), col = 1:3, pch = 8)

In both Python and R, the developer needs to set the hyperparameter that specifies the number of clusters for the GMM. As with k-means, a commonly used strategy for selecting this number is to train models with different numbers of clusters and compare them. The most commonly used metrics for comparing models, illustrated with a short sketch after this list, are:

Silhouette coefficient: this is defined for each sample and is composed of two scores: a, the mean distance between a sample and all other points in the same cluster, and b, the mean distance between a sample and all points in the next-nearest cluster. The coefficient for a sample is (b - a) / max(a, b).

Jensen-Shannon distance: this measures the divergence between two distributions. It is a symmetrized version of the Kullback-Leibler (KL) divergence: each distribution's KL divergence from the average (midpoint) of the two distributions is calculated, and the two resulting values are averaged. The smaller the JS distance between two GMMs, the more those GMMs agree on how to fit the data.

Bayesian information criterion (BIC): this gives an estimate of how well the model predicts the data, balanced against the number of parameters the model contains. If K is too small, the log-likelihood of the model will be low and the BIC value large. If K is too large, the likelihood may be high, but the penalty on the extra parameters (which guards against overfitting) will also produce a larger BIC value.

Akaike information criterion (AIC): this works very similarly to BIC but applies a smaller penalty for the number of parameters.
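A short sketch of comparing candidate models with these criteria follows, assuming scikit-learn's bic and aic helpers on a fitted GaussianMixture; the candidate range of 1 to 8 components is arbitrary.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.75, random_state=0)

for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # lower BIC/AIC indicates a better fit-versus-complexity tradeoff
    print(k, round(gmm.bic(X), 1), round(gmm.aic(X), 1))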
