The return of the Box-Muller method due to GPUs

Sooner or later in life, you will need to generate a couple normal (or Gaussian) variables. Maybe you’re simulating some thermal noise or the stock market. Perhaps you’re sampling a point process with Gaussian coordinates. Or you’re running a generative artificial intelligence (Gen AI) model, such as a variational autoencoder, which needs to select a Gaussian point on an abstract mathematical space.

Whatever the reason, you can get your independent normal variables by applying the Box-Muller transform to independent uniform random variables. All you need to do is change from Cartesian to polar coordinates. Easy peasy, lemon squeezy.
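
For the record, here's a minimal Python sketch of the transform (for any real work you would just call a library routine such as numpy.random.normal):

import numpy as np

rng = np.random.default_rng()

def box_muller(numbRand):
    # pairs of independent uniform random variables on (0,1)
    U1 = rng.uniform(size=numbRand)
    U2 = rng.uniform(size=numbRand)
    # polar coordinates: Rayleigh-distributed radius and uniform angle
    rho = np.sqrt(-2 * np.log(U1))
    theta = 2 * np.pi * U2
    # changing back to Cartesian coordinates gives two independent standard normals
    return rho * np.cos(theta), rho * np.sin(theta)

Z1, Z2 = box_muller(10**5)
print(Z1.mean(), Z1.std())  # should be close to 0 and 1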

Box-Muller was not the favourite

I previously wrote about the Box-Muller method, ending on the note that the method, though very cool, isn't used so much because it relies upon mathematical functions that have historically been computationally expensive for processors.

Box-Muller the favourite on GPUs

But that conventional wisdom has changed in recent years, as processors can now (at a hardware level) readily evaluate such functions. More significantly, the fast rise of graphics processing units (GPUs) has also changed the rules of the game.

In a comment on my original Box-Muller post, it was pointed out that the Box-Muller method is the preferred choice for GPUs, giving a link to the Nvidia website. The reason is that GPUs do not handle algorithms with loops and branches well, so you should use methods that avoid these algorithmic steps. And the Box-Muller method is one that does just that.

The NVIDIA website says:

Because GPUs are so sensitive to looping and branching, it turns out that the best choice for the Gaussian transform is actually the venerable Box-Muller transform…

Indeed.

Determinantal learning for selecting subsets in wireless networks

We can’t all talk all the time, while still being heard. That’s true for people and it’s true for wireless networks.

Networks need to schedule when their network nodes can transmit and receive data. In other words, a collection or subset of network nodes is chosen to transmit, while another subset is chosen to receive. (In telecommunication language, this is part of what is called the medium access control or MAC layer.)

The classic scheduler (spatial) Aloha involves each node flipping a biased coin with probability \(p\) of heads occurring. If heads comes up, a node can transmit. If not, it must stand ready to receive data. Simple.
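
In code, the Aloha rule is about as short as schedulers get; here's a Python sketch, where the node count and the probability p are made-up values for illustration:

import numpy as np

rng = np.random.default_rng()
numbNodes = 10  # number of network nodes (illustrative value)
p = 0.3         # probability of heads (ie of transmitting) for the biased coin

# each node independently flips its coin; True means the node transmits this slot
booleTransmit = rng.uniform(size=numbNodes) < p
print("Transmitting nodes:", np.flatnonzero(booleTransmit))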

But can we do better?

Determinantal scheduling

A few years ago, a couple collaborators and I became interested in this problem. To develop an intelligent (or, at least, not stupid) scheduler, we adapted a machine (or statistical) learning framework and published the results in a conference paper:

  • 2020 – Błaszczyszyn, Brochard, and Keeler, Coverage probability in wireless networks with determinantal scheduling.

The hint is in the title. Using a determinantal scheduler, we derived mathematical expressions for the coverage probability. I discussed that paper in an earlier post. I produced the numerical results with code located here and here.

From fermions to generative model

We used a learning framework based on determinantal point processes, which were originally used to model subatomic particles called fermions, such as electrons. A key property is that determinantal points, like fermions, repel each other, so this is a random model for diversity, anti-clustering, or negative correlations.

Defined with a kernel matrix and determinants, these random models have convenient mathematical properties. For example, you can both simulate them (exactly) and train (or fit) them with maximum likelihood methods. Using this fermion model, Kulesza and Taskar proposed and pioneered a machine learning framework.
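
To give a feel for the "kernel matrix and determinants" part, here's a small Python sketch using the L-ensemble formulation that Kulesza and Taskar work with, where the probability of selecting a subset \(S\) is \(\det(L_S)/\det(L+I)\), with \(L_S\) being the submatrix of the kernel \(L\) indexed by \(S\); the kernel below is an arbitrary made-up example.

import numpy as np
from itertools import chain, combinations

# a small symmetric positive semi-definite kernel matrix L (arbitrary example)
B = np.array([[1.0, 0.2, 0.0],
              [0.5, 1.0, 0.3],
              [0.1, 0.0, 1.0]])
L = B @ B.T  # B B^T guarantees positive semi-definiteness

def prob_subset(L, S):
    # L-ensemble probability P(S) = det(L_S) / det(L + I)
    L_S = L[np.ix_(S, S)]  # the determinant of an empty matrix is one
    return np.linalg.det(L_S) / np.linalg.det(L + np.eye(L.shape[0]))

# the probabilities over all subsets should sum to one
allSubsets = chain.from_iterable(combinations(range(3), k) for k in range(4))
print(sum(prob_subset(L, list(S)) for S in allSubsets))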

When properly trained (or fitted), determinantal models can act as generative artificial intelligence (or Gen AI) models. In some sense, these point processes are special types of Gibbsian point processes, which are defined with a general energy function called a Hamiltonian. In the field of AI and computer science more broadly, there are many energy-based models, some of which are used for generative models such as Boltzmann machines.

Recent work

A few days ago I was curious what was happening in this research area, and by chance, I noticed this recently uploaded preprint:

  • 2025 – Tu, Saha, and Dhillon –  Determinantal Learning for Subset Selection in Wireless Networks.

I feel more work remains to be done in this area.

Further reading

Wireless networks

Our work grew out of an earlier paper of ours:

  • 2019 – Błaszczyszyn and Keeler, Determinantal thinning of point processes with network learning applications.

In that paper, which I mentioned in an earlier post, we discussed the potential of using determinantal point processes for scheduling network transmission (or selecting subsets). We pointed out that such a model could be trained on optimized network configurations.

The new work by Tu, Saha, and Dhillon is based on previous work, such as this ambitiously titled paper:

  • 2019 – Saha and Dhillon – Machine Learning meets Stochastic Geometry: Determinantal Subset Selection for Wireless Networks.

This work by Saha and Dhillon was independent of our work on scheduling, but came out around the same time.

Machine learning

I should stress that none of the above work would have been possible without the pioneering efforts by Kulesza and Taskar in a series of papers, much of which appears in their book:

  • 2012 – Kulesza and Taskar – Determinantal point processes for machine learning, Now Publishers.

You can find a version of the book here.

Learning Theory from First Principles

A new book on statistical or machine learning has been published:

  • 2024 – Bach – Learning Theory from First Principles, MIT Press.

The author, Francis Bach, is a well-known researcher at Inria in Paris. There’s a free copy of his book on his website.

In a typical French way, the book is focused on theory, especially the proofs of essential results in statistical learning. But I stress that I’ve only glanced at the book.

Bach has also uploaded the code used in the book, written in both Python and MATLAB, which is something I often do.

Lecture slides based on Baccelli, Błaszczyszyn, and Karray

For those interested in the theory of point processes, Mohamed Karray uploaded some lecture slides, which are based on the book manuscript:

  • 2024 – Baccelli, Błaszczyszyn, and Karray – Random Measures, Point Processes, and Stochastic Geometry.

I have written about this manuscript in a previous post.

Although both the slides and manuscript lean heavily on the side of mathematical rigour, applications are the driving motivation. For example, the third set of slides covers Palm calculus, which hinges upon the important concept of conditioning on a point (of the random point process) existing at a specific location. This idea is frequently used in mathematical models of wireless telecommunication networks based on stochastic geometry. (Unfortunately the Wikipedia article on Palm calculus is sorely lacking and only looks at the temporal, not spatial, setting.)

Karray, a researcher at Orange Labs in Paris, also has slides and (mostly forthcoming) manuscripts on machine learning and statistics, which take a rather rigorous route. It’s a French touch, mathematically, as measure theory appears.

The Metropolis-Hastings algorithm in C

There's an old saying, which I just made up: if you really want to understand something, code it up in C. (Taking it to the next level, I recall an undergraduate subject on programming Motorola chips using assembly. But I wouldn't recommend any of that.)

I wrote some C code that basically does what my code in this previous post does, without the pretty pictures. The code performs a Metropolis-Hastings algorithm, simulating random variables (or, more correctly, generating random variates) for a joint probability density describing two random variables.

In previous posts, I have covered the topic of Markov chain Monte Carlo (MCMC) methods, particularly the central workhorse, the Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm.  These methods are frequently used in Bayesian statistics, high-dimensional integration, and optimization. For more details on how they work, I have written a couple posts, starting with this one and ending with this one, where I detail the mechanics of MCMC methods.

Let's call this code MCMC C code. (A wise man once wrote: C:/DOS C:/DOS/RUN RUN/DOS/RUN.)

Code considerations

The C programming language was not written for playing with random numbers. The standard uniform random number generator in C, simply called rand, is not good for research-level random simulations. But it does the job for illustration purposes.

In lieu of this generator, the Mersenne Twister is a popular algorithm for producing such numbers, which is widely recommended and used. There are implementations of this algorithm in the CUDA and MKL libraries for Nvidia GPUs and Intel CPUs; see this CUDA page and this MKL page for details. Check out this PDF file for further details on the CUDA version.
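
For comparison, outside of C you can usually just ask your numerical library for a Mersenne Twister; for example, here's a sketch using NumPy's MT19937 bit generator (assuming a reasonably recent NumPy):

import numpy as np

# a random number generator backed by the Mersenne Twister (MT19937)
rng = np.random.Generator(np.random.MT19937(seed=42))
print(rng.uniform(size=5))  # five uniform random numbers on [0, 1)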

In addition to that, my C code needs to simulate normal (or Gaussian) random variables. For that I wrote my own simple code using the Box-Muller transform, which I covered in a previous post, so the code would be self-contained and less opaque. But in reality, you should always use functions from a pre-written library for generating variates according to a normal or whatever other distribution you need.

Finally, C was never intended as a scientific language, despite its wide use as one. (Historically, that was Fortran's job, and Fortran is still the workhorse for many serious number-crunching institutes, hence there's a Fortran-ready version of CUDA.) So when handling sets of numbers, such as vectors and matrices, one has to use pointers and malloc more often than not, which can be tricky. This is the case here, though the use of pointers in this code is relatively simple.

Where are the pretty pictures?

Unfortunately, when number crunching in C, you don’t immediately have access to plotting libraries that are available in scientific programming languages such as Python and Julia.

But you can simply create .csv (or text) files, and then plot them using whichever library you prefer. And if you have gnuplot installed, you can perform simple one-dimensional histograms using one-line commands such as:

gnuplot -e "plot 'MCMCData_1D.csv' using 1 bins=20;" -persist

In the above command, the file MCMCData_1D.csv has the random variates (the simulated random variables) stored in a single column.

Code

I only present code for the more complicated two-dimensional case. The code can be found here.

Note that there’s an option in the code of MCMC_1D.c to plot the results using gnuplot, if it’s installed on your machine. I didn’t include code for plotting the results of the 2-D case as gnuplot doesn’t do a 2-D histogram.

Warning: The code uses the standard pseudo-random number generator in C, which is known for being bad. I only used the built-in C generator to keep my code self-contained. For the same reason, I also wrote my own code for generating Gaussian (or normal) random variables, using the Box-Muller transform, but in reality one would never do that for research or industry purposes.

/***********************************************************************
 * Runs a simple Metropolis-Hastings (ie MCMC) algorithm to simulate two
 * jointly distributed random variables with probability density
 * p(x,y)=exp(-(x^4+x*y+y^2)/s^2)/consNorm, where s>0 and consNorm is a
 * normalization constant. The probability density function is defined in
 * the function pdf_single.
 *
 * NOTE: In practice, the value of the normalization constant is not needed, as it cancels out in the algorithm.
 *
 * NOTE: This code will *create* a local file (see variable strFilename) to store results. It will *overwrite* that file if it already exists.
 *
 * WARNING: This code uses the default C random number generator, which is known for failing various tests of randomness.
 * Strongly recommended to use another generator for purposes beyond simple illustration.
 *
 * Author: H. Paul Keeler, 2024.
 * Website: hpaulkeeler.com
 * Repository: github.com/hpaulkeeler/posts
 *
 ***********************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <stdbool.h>
#include <string.h>

const long double pi = 3.14159265358979323846; // constant pi for generating polar coordinates

// helper function declarations; see below for definitions
static double *unirand(double *randValues, unsigned numbRand);                           // generate  uniform random variables on (0,1)
static double *normrand(double *randValues, unsigned numbRand, double mu, double sigma); // generate normal random variables
static double pdf_single(double x, double y, double s);                      // define probability density to be simulated
static double mean_var(double *set_sample, unsigned numbSim, double *varX);              // calculate mean and variance

int main(int argc, char *argv[])
{

    if (argc > 1)
    {
        fprintf(stderr, "This program takes no arguments...\n");
        exit(1);
    }
    else
    {

        char strFilename[] = "MCMCData_2D.csv"; // filename for storing simulated random variates

        // intializes (pseudo)-random number generator
        time_t timeCPU; // use CPU time for seed
        srand((unsigned)time(&timeCPU));
        // srand(42); //to reproduce results

        bool booleWriteData = true; // write data to file
        bool booleStats = true;     // perform simple mean/std stats

        // parameters
        unsigned numbSim = 1e4;   // number of random variables simulated
        unsigned numbSteps = 200; // number of steps for the Markov process
        double sigma = 2;         // standard deviation for normal random steps

        // probability density parameters
        double s = .5; // scale parameter for distribution to be simulated

        // Metropolis-Hastings variables
        // proposal for a new position in the random walk
        double zxRandProposal;      
        double zyRandProposal;      
        double pdfProposal; // density for proposed position
        double pdfCurrent;  // density of current position
        double ratioAccept; // ratio of densities (ie acceptance probability)
        double uRand;       // uniform variable for Bernoulli trial (ie a coin flip)
        // random step (normally distributed)
        double *p_numbNormX = (double *)malloc(1 * sizeof(double));
        double *p_numbNormY = (double *)malloc(1 * sizeof(double));
//positions of the random walk (ie the simulated random variables after numbSteps)
        double *p_xRand = (double *)malloc(numbSim * sizeof(double));
        double *p_yRand = (double *)malloc(numbSim * sizeof(double));

        (void)unirand(p_xRand, numbSim); // random initial values
        (void)unirand(p_yRand, numbSim); // random initial values

        unsigned i, j; // loop variables
        for (i = 0; i < numbSim; i++)
        {
            // loop through each random walk instance (or random variable to be simulated)

            pdfCurrent = pdf_single(*(p_xRand + i), *(p_yRand + i), s); // current probability density

            for (j = 0; j < numbSteps; j++)
            {
                // loop through each step of the random walk
                (void)normrand(p_numbNormX, 1, 0, sigma);
                (void)normrand(p_numbNormY, 1, 0, sigma);
                // take a (normally distributed) random step in x and y
                zxRandProposal = (*(p_xRand + i)) + (*p_numbNormX);
                zyRandProposal = (*(p_yRand + i)) + (*p_numbNormY);

                pdfProposal = pdf_single(zxRandProposal, zyRandProposal, s); // proposed probability density

                // acceptance rejection step
                (void)unirand(&uRand, 1);
                ratioAccept = pdfProposal / pdfCurrent;
                if (uRand < ratioAccept)
                {
                    // update state of random walk / Markov chain
                    *(p_xRand + i) = zxRandProposal;
                    *(p_yRand + i) = zyRandProposal;
                    pdfCurrent = pdfProposal;
                }
            }
        }

        free(p_numbNormX);
        free(p_numbNormY);

        if (booleStats)
        {

            // initialize statistics variables (for testing results)
            char strVariable[] = "XY";
            double *p_AllRand[] = {p_xRand, p_yRand};
            double meanTemp = 0;
            double varTemp = 0;
            double stdTemp = 0;
            char strTemp='X';
            for (i = 0; i < 2; i++)
            {
                meanTemp = mean_var(p_AllRand[i], numbSim, &varTemp);
                stdTemp = sqrt(varTemp);
                strTemp=strVariable[i];
                printf("The average of the %c random variables is %lf.\n", strTemp, meanTemp);
                printf("The standard deviation of the %c random  variables is %lf.\n", strTemp, stdTemp);
            }
        }

        if (booleWriteData)
        {
            // print to file
            FILE *outputFile;
            outputFile = fopen(strFilename, "w");
            for (i = 0; i < numbSim; i++)
            {
                fprintf(outputFile, "%lf,%lf\n", *(p_xRand + i), *(p_yRand + i)); // output to file
            }
            fclose(outputFile);
            printf("Data printed to file.\n");
        }
        free(p_xRand);
        free(p_yRand);

        return (0);
    }
}

static double pdf_single(double x, double y, double s)
{
    // returns the probability density of a single point (x,y) inside a simulation window defined below
    double pdf_output;

    // non-zero density window parameters
    double xMin = -1;
    double xMax = 1;
    double yMin = -1;
    double yMax = 1;

    if ((x >= xMin) && (x <= xMax) && (y >= yMin) && (y <= yMax))
    {
        pdf_output = exp(-((pow(x, 4) + x * y + pow(y, 2)) / (s * s)));
    }
    else
    {
        pdf_output = 0;
    }
    return pdf_output;
}

static double *normrand(double *randValues, unsigned numbRand, double mu, double sigma)
{
    // simulate pairs of iid normal variables using Box-Muller transform
    // https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

    double U1, U2, thetaTemp, rhoTemp, Z1, Z2;
    int i = 0;
    while (i < numbRand)
    {
        // simulate variables in polar coordinates (theta, rho)
        (void)unirand(&U1, 1);
        thetaTemp = 2 * pi * U1; // create uniform theta values
        (void)unirand(&U2, 1);
        rhoTemp = sqrt(-2 * log(U2)); // create Rayleigh rho values

        // change to Cartesian coordinates
        Z1 = rhoTemp * cos(thetaTemp);
        Z1 = sigma * Z1 + mu;
        randValues[i] = Z1; // assign first of random variable pair
        i++;
        if (i < numbRand)
        {
            // if more variables are needed, generate second value of random pair
            Z2 = rhoTemp * sin(thetaTemp);
            Z2 = sigma * Z2 + mu;
            randValues[i] = Z2; // assign second of random variable pair
            i++;
        }
        else
        {
            break;
        }
    }
    return randValues;
}

static double *unirand(double *randValues, unsigned numbRand)
{ // simulate numbRand uniform random variables on the unit interval
  // storing them in randValues which must be allocated by the caller
  // with enough space for numbRand doubles

    for (int i = 0; i < numbRand; i++)
    {
        randValues[i] = (double)rand() / RAND_MAX;
    }
    return randValues;
}

static double mean_var(double *set_sample, unsigned numbSim, double *varX)
{
    // mean and variance of set_sample
    int i;
    // initialize statistics variables (for testing results)
    double meanX = 0;
    double meanXSquared = 0;
    double tempX;
    for (i = 0; i < numbSim; i++)
    {
        tempX = *(set_sample + i);
        meanX += tempX / ((double)numbSim);
        meanXSquared += tempX * tempX / ((double)numbSim);
    }

    *varX = meanXSquared - meanX * meanX;
    return meanX;
}

Further reading

Here’s some information on the rand function and random number generation:

Acknowledg(e)ments

A hat tip to C and CUDA guru Alex Stivala who pointed out a couple of minor issues in my original C code.

The Metropolis-Hastings algorithm in Python, Julia, R, and MATLAB

I wrote this code a little while ago, and I thought now would be a good time to present it, as I have covered the topic of Markov chain Monte Carlo (MCMC) methods, particularly the central workhorse, the Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm. For more details, I have written about these methods in a couple posts, starting with this one and ending with this particularly relevant one.

Update: I re-wrote this code in C, located here, where I used my own code for generating Gaussian (or normal) variables, to keep the code entirely self-contained, but in practice you should never do that.

What the code does

The code basically does the same thing in four different (scientific) programming languages, namely Python, Julia, R, and MATLAB. It performs a Metropolis-Hastings algorithm, simulating random variables (or, more correctly, generating random variates) for two respective probability densities in one dimension and two dimensions.

Two examples

The one-dimensional case is particularly simple, as the code only simulates an exponential variable. But in practice you would never use an MCMC method for this random variable, because you can simply simulate exponential random variables directly. I discussed in a previous post how this direct approach is used to simulate Poisson variables.

The two-dimensional case uses a probability density that is slightly more complicated than the classic joint Gaussian (or normal) probability density, for which you would use other simulation methods. But the idea can be extended to \(n\) dimensions, which is often the case when dealing with the probability distributions and their corresponding integrals that arise in Bayesian statistics.

Implementation considerations

Variance of the walk

To create the random walk, the code uses normal (or Gaussian) random variables, where the mean is simply the current position of the random walk. This is a standard approach due to the convenient properties of the normal distribution.

There's also the standard deviation \(\sigma>0\) of the normal variables. In machine learning circles, this is what they call a hyperparameter. The random walk Metropolis-Hastings algorithm will, in theory, work regardless of the value, but some values lead to faster convergence than others. This issue is covered briefly in this question on Stats Exchange. And it raises an important question when using MCMC methods in general.
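
One rough but standard diagnostic for choosing \(\sigma\) is the empirical acceptance rate of the walk; below is a hedged Python sketch for the one-dimensional exponential example, where the \(\sigma\) values and step counts are just for illustration.

import numpy as np

rng = np.random.default_rng()

def fun_p(x):
    # un-normalized exponential probability density (the one-dimensional example)
    return np.exp(-x) * (x >= 0)

def acceptance_rate(sigma, numbSteps=10**4):
    x = 1.0  # arbitrary starting point
    pdfCurrent = fun_p(x)
    numbAccept = 0
    for _ in range(numbSteps):
        z = x + sigma * rng.normal()  # proposed (normally distributed) step
        pdfProposal = fun_p(z)
        if rng.uniform() < pdfProposal / pdfCurrent:
            x, pdfCurrent = z, pdfProposal
            numbAccept += 1
    return numbAccept / numbSteps

for sigma in [0.1, 1, 10]:
    print(sigma, acceptance_rate(sigma))  # very high or very low rates suggest a poor sigma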

Convergence tests

How many simulation steps are enough to ensure that the random variables being simulated behave closely enough to the desired random variables? In other words, how long does it take for the algorithm to reach the target distribution?

The simulation time elapsed before the algorithm has reached a sufficient level of convergence is called the burn-in period. Due to its vital importance, it is a central topic in the development and implementation of MCMC methods.

There are tests for assessing the degree of convergence during the simulation run, such as the Gelman-Rubin test. This particular test involves taking simple empirical means and variances over the last \(m\) samples, within each simulation run and between simulation runs. Then a ratio of variances is calculated, which should be close to one if the algorithm has sufficiently converged. I haven't implemented any such tests here, but it's something that one should do in practice.
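
For concreteness, here's a minimal Python sketch of the (univariate) Gelman-Rubin statistic computed from several independent simulation runs; the chains below are placeholder standard normal samples, and in practice you would feed in your actual MCMC output.

import numpy as np

def gelman_rubin(chains):
    # chains has shape (m, n): m independent runs, each with n retained samples
    m, n = chains.shape
    chainMeans = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()  # average within-run variance
    B = n * chainMeans.var(ddof=1)         # between-run variance (scaled by n)
    varPooled = (n - 1) / n * W + B / n    # pooled variance estimate
    return np.sqrt(varPooled / W)          # should be close to one after convergence

rng = np.random.default_rng()
print(gelman_rubin(rng.normal(size=(4, 10**3))))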

Perhaps I’ll cover that topic in a future post.

Testing the results

To test the results, you can empirically calculate the first two moments (that is, the mean and variance) of the simulated random variables. That acts as a very good sanity check.

But if you need more convincing, you can perform a histogram on the results, which is effectively a way to empirically estimate the probability density of the simulated random variables. Fortunately, all four programming languages have built-in histogram (counting) functions for one and two dimensions.

I already covered this topic in a previous post on checking Poisson point process simulations.

Code

I'll only include the code for Python and Julia, and refer the reader to the rest of the code found here.

Python

I used the Matplotlib library to plot the probability density and its estimate.

For the histogram sections, I used the histogram and histogram2d functions to estimate the probability densities in the one-dimensional and two-dimensional cases respectively, with the density (pdf) option for normalization.

import numpy as np;  # NumPy package for arrays, random number generation, etc
import matplotlib.pyplot as plt  # for plotting
from matplotlib import cm  # for heatmap plotting
from mpl_toolkits import mplot3d  # for 3-D plots
from scipy import integrate  # for integrating

plt.close("all");  # close all previous plots

# Simulation window parameters
xMin = -1;
xMax = 1;
yMin = -1;
yMax = 1;

numbSim = 10 ** 4;  # number of random variables simulated
numbSteps = 200;  # number of steps for the Markov process
numbBins = 50;  # number of bins for histogram
sigma = 2;  # standard deviation for normal random steps

# probability density parameters
s = .5;  # scale parameter for distribution to be simulated

def fun_lambda(x, y):
    return np.exp(-(x ** 4 + x*y + y ** 2) / s ** 2);

# normalization constant
consNorm = integrate.dblquad(fun_lambda, xMin, xMax, lambda x: yMin, lambda y: yMax)[0];
#un-normalized joint density of variables to be simulated
def fun_p(x, y):
    return (fun_lambda(x, y) ) * (x >= xMin) * (y >= yMin) * (x <= xMax) * (y <= yMax);

xRand = np.random.uniform(xMin, xMax, numbSim);  # random initial values
yRand = np.random.uniform(yMin, yMax, numbSim);  # random initial values

pdfCurrent = fun_p(xRand, yRand);  # current transition (probability) densities

for jj in range(numbSteps):
    zxRand = xRand + sigma * np.random.normal(0, 1, numbSim);  # take a (normally distributed) random step
    zyRand = yRand + sigma * np.random.normal(0, 1, numbSim);  # take a (normally distributed) random step
    # Conditional random step needs to be symmetric in x and y
    # For example: Z|x ~ N(x,1) (or Y=x+N(0,1)) with probability density
    # p(z|x)=e(-(z-x)^2/2)/sqrt(2*pi)
    pdfProposal = fun_p(zxRand, zyRand);  # proposed probability

    # acceptance rejection step
    booleAccept = np.random.uniform(0, 1, numbSim) < pdfProposal / pdfCurrent;
    # update state of random walk/Markov chain
    xRand[booleAccept] = zxRand[booleAccept];
    yRand[booleAccept] = zyRand[booleAccept];
    # update transition (probability) densities
    pdfCurrent[booleAccept] = pdfProposal[booleAccept];

# for histogram, need to reshape as vectors
xRand = np.reshape(xRand, numbSim);
yRand = np.reshape(yRand, numbSim);

p_Estimate, xxEdges, yyEdges = np.histogram2d(xRand, yRand, bins=numbBins, density=True);
xValues = (xxEdges[1:] + xxEdges[0:xxEdges.size - 1]) / 2;  # mid-points of bins
yValues = (yyEdges[1:] + yyEdges[0:yyEdges.size - 1]) / 2;  # mid-points of bins
X, Y = np.meshgrid(xValues, yValues);  # create x/y matrices for plotting

# analytic solution of (normalized) joint probability density
p_Exact = fun_p(X, Y) / consNorm;

# Plotting
# Plot empirical estimate
fig1 = plt.figure();
ax = plt.axes(projection="3d");
#plt.rc("text", usetex=True);
#plt.rc("font", family="serif");
surf = ax.plot_surface(X, Y, p_Estimate, cmap=plt.cm.plasma);
plt.xlabel("x");
plt.ylabel("y");
plt.title("p(x,y) Estimate");

# Plot exact expression
fig2 = plt.figure();
#plt.rc("text", usetex=True);
#plt.rc("font", family="serif")
ax = plt.axes(projection="3d");
surf = ax.plot_surface(X, Y, p_Exact, cmap=plt.cm.plasma);
plt.xlabel("x");
plt.ylabel("y");
plt.title("p(x,y) Exact Expression");

Julia

using Distributions #for random simulations
using PyPlot #for plotting
using StatsBase #for histograms etc
using Random
using LinearAlgebra
using HCubature #for numerical integration
#using LaTeXStrings #for LateX in labels/titles etc
PyPlot.close("all");  # close all PyPlot figures

#set random seed for reproducibility
#Random.seed!(1234)

# Simulation window parameters
xMin = -1;
xMax = 1;
yMin = -1;
yMax = 1;

numbSim = 10 ^ 5;  # number of random variables simulated
numbSteps = 25;  # number of steps for the Markov process
numbBins = 50;  # number of bins for histogram
sigma = 2;  # standard deviation for normal random steps

# probability density parameters
s = .5;  # scale parameter for distribution to be simulated

function fun_lambda(x,y)
    return (exp.(-(x.^4+x.*y+y.^2)./s^2));
end

#normalization constant
consNorm,errorCub=hcubature(x -> fun_lambda(x[1],x[2]), [xMin, yMin], [xMax, yMax]);
#un-normalized joint density of variables to be simulated
function fun_p(x,y)
    return((fun_lambda(x,y)).*(x.>=xMin).*(y.>=yMin).*(x.<=xMax).*(y.<=yMax));
end
xRand=(xMax-xMin).*rand(numbSim).+xMin; #random initial values
yRand=(yMax-yMin).*rand(numbSim).+yMin; #random initial values

pdfCurrent=fun_p(xRand,yRand); #current transition (probability) densities
for jj=1:numbSteps
    zxRand= xRand.+sigma.*rand(Normal(),numbSim);#take a (normally distributed) random step
    zyRand= yRand.+sigma.*rand(Normal(),numbSim);#take a (normally distributed) random step

    # Conditional random step needs to be symmetric in x and y
    # For example: Z|x ~ N(x,1) (or Y=x+N(0,1)) with probability density
    # p(z|x)=e(-(z-x)^2/2)/sqrt(2*pi)
    pdfProposal = fun_p(zxRand, zyRand);  # proposed probability

    # acceptance rejection step
    booleAccept=rand(numbSim) .< pdfProposal./pdfCurrent;
    # update state of random walk/Markov chain
    xRand[booleAccept] = zxRand[booleAccept];
    yRand[booleAccept] = zyRand[booleAccept];
    # update transition (probability) densities
    pdfCurrent[booleAccept] = pdfProposal[booleAccept];
end

#histogram section: empirical probability density
histXY=fit(Histogram, (xRand,yRand),nbins=numbBins); #find histogram data
histXY=normalize(histXY,mode=:pdf); #normalize histogram
binEdges=histXY.edges; #retrieve bin edges
xValues=(binEdges[1][2:end]+binEdges[1][1:end-1])./2; #mid-points of bins
yValues=(binEdges[2][2:end]+binEdges[2][1:end-1])./2; #mid-points of bins
p_Estimate=(histXY.weights)
#create a meshgrid
X=[xValues[ii] for ii=1:length(xValues), jj=1:length(yValues)];
Y=[yValues[jj] for ii=1:length(xValues), jj=1:length(yValues)];

#analytic solution of (normalized) joint probability density
p_Exact = fun_p(X, Y)./consNorm;

# Plotting
# Plot empirical estimate
fig1 = PyPlot.figure();
PyPlot.rc("text", usetex=true);
PyPlot.rc("font", family="serif");
surf(X, Y, p_Estimate, cmap=PyPlot.cm.plasma);
PyPlot.xlabel("x");
PyPlot.ylabel("y");
PyPlot.title("p(x,y) Estimate");

# Plot exact expression
fig2 = PyPlot.figure();
PyPlot.rc("text", usetex=true);
PyPlot.rc("font", family="serif")
surf(X, Y, p_Exact, cmap=PyPlot.cm.plasma);
PyPlot.xlabel("x");
PyPlot.ylabel("y");
PyPlot.title("p(x,y) Exact Expression");

 

The Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm

Consider a collection of random variables described by a joint probability distribution. Often, in any field with probability or statistics, one faces the task of simulating these random variables, which typically depend on each other in some fashion.

A now standard way for simulating or sampling such random variables is to use the Metropolis-Hastings algorithm, undoubtedly the cornerstone of Markov chain Monte Carlo methods. This method creates a discrete-time Markov chain whose stationary or invariant distribution is the aforementioned distribution.

The algorithm was born out of a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method. It is typically called the Metropolis-Hastings or the Metropolis algorithm. And some have called it the M(RT)2 H algorithm.

(The history is a bit complicated, but perhaps we should drop the name Metropolis. The late Arianna (née Wright) Rosenbluth did most of the work. She was also a dab hand at fencing.)

Although the algorithm's initial adoption and use was slow, taking decades partly due to slower computers, the Metropolis-Hastings algorithm is now a widely used method for simulating collections of random variables. This in turn gives fast ways for exploring, integrating, and optimizing otherwise unwieldy mathematical functions, such as those found in Bayesian statistics, machine learning, statistical physics, and combinatorial optimization. The algorithm serves as the foundation for other random simulation methods, such as the Gibbs sampler, hence it's been called the workhorse of Markov chain Monte Carlo methods.

There are many books, articles, lecture notes, and websites describing the Metropolis-Hastings algorithm; see the Further reading section below. But I’ll detail the core ideas here. This post is designed to be somewhat self-contained, but it arose from a series of posts, starting with this one and ending with this particularly relevant one.

Constructing a Markov process

Take a collection of \(n\) random variables \(X_1,\dots,X_n\) with a (joint) probability distribution \(\pi(x)=\pi(x_1,\dots,x_n)\). This distribution will either be a (joint) probability mass function or (joint) probability density for discrete or continuous random variables, respectively.

We wish to construct a Markov chain on an abstract mathematical space \(\mathbb{X}\). We assume we can write a point \(x\in \mathbb{X}\) as \(x=(x_1,\dots,x_n)\). More specifically, the space \(\mathbb{X}\) is a Cartesian product of spaces \(\mathbb{X}_1,\dots,\mathbb{X}_n\) on which the variables are defined.

Which mathematical space is \(\mathbb{X}\)? That will, of course, depend on the random variables you’re trying to simulate. But it’s usually the lattice \(\mathbb{Z}^n\), Euclidean space \(\mathbb{R}^n\), or a subset of one of these two spaces.

For this post, we'll assume that the space \(\mathbb{X}\) is discrete, which makes things simpler. But the mathematics is similar for continuous spaces, with small changes such as replacing probabilities and sums with probability densities and integrals, respectively. We'll further assume the space is finite, so we can use matrices to describe the Markov transition kernels. But everything covered here extends to infinite spaces such as \(\mathbb{R}^n\), which is the most common space used in practice.

Again, our overall aim is to construct a Markov chain whose stationary distribution \(\pi\) is the same as the distribution that we want to sample.

Jumper process

There's a random jumper that wants to jump around the space \(\mathbb{X}\). The jumper randomly jumps from one point in this mathematical space \(x\in \mathbb{X}\) to another point \(y\in \mathbb{X}\) according to the probability \(J(x,y)\). If the state space \(\mathbb{X}\) is finite, then \(J\) becomes a matrix. The matrix row \(J(x,\cdot)\) is a probability mass function for each \(x\in \mathbb{X}\), so it sums to one. By definition, this random jumping forms a Markov chain.

The only thing we ask is that, for our jumper, every point \(x\) in \(\mathbb{X}\) where \(\pi(x)>0\) is reachable with positive probability in a single step. This implies the easy-to-achieve condition \(J(x,y)>0\) where \(\pi(x)>0\) for all points \(x,y\in\mathbb{X}\).

Now we have a Markovian jumper on the space \(\mathbb{X}\). But this turns out to be too much jumping for our jumper. Furthermore, the jumper is jumping more in certain directions than others. Occasionally the jumper wants to stay put (and have a rest) with the aim of balancing the jump directions.

The jumper still wants to jump sometimes from a point \(x\in \mathbb{X}\) to another point \(y\in \mathbb{X}\) based on \(J(x,y)\). But now at each time step, after choosing the jump direction but before jumping, the jumper flips a biased coin whose success probability \(\alpha(x,y)\) depends on the current position \(x\in \mathbb{X}\) and the (potential) next position \(y\in \mathbb{X}\). For the coin, the acceptance probability, which allows (or not) the jumper to move from \(x\) to \(y\), is given by

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{J(y,x)}{J(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The function \(\alpha(x,y)\) is clearly never negative. The minimum in the above expression ensures that \(\alpha(x,y)\) is a valid probability.
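
To make the coin flip concrete, here's a minimal Python sketch of a single step of the jumper with this acceptance probability, on a small discrete space with a made-up target distribution \(\pi\) and jump matrix \(J\); note that \(\pi\) only enters through the ratio, so an un-normalized version would work just as well.

import numpy as np

rng = np.random.default_rng()

pi = np.array([0.1, 0.2, 0.3, 0.4])  # target distribution (made-up example)
J = np.full((4, 4), 0.25)            # jump matrix: propose any point uniformly at random

def mh_step(x):
    y = rng.choice(4, p=J[x])  # propose a jump from x to y
    alpha = min(1, (pi[y] * J[y, x]) / (pi[x] * J[x, y]))  # acceptance probability
    return y if rng.uniform() < alpha else x  # accept the jump, or stay put

# run the jumper and check that the empirical distribution approaches pi
x, counts = 0, np.zeros(4)
for _ in range(10**5):
    x = mh_step(x)
    counts[x] += 1
print(counts / counts.sum())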

Metropolis-Hastings ratio

The ratio in the expression for \(\alpha(x,y)\) is sometimes called the Metropolis-Hastings ratio, which we’ll soon see is designed specifically to balance the jump directions. The ratio means that a constant factor in the target distribution \(\pi(x)\) will vanish due to cancellation.

More specifically, if we can write the target distribution as \(\pi(x)=f(x)/C\), where \(C>0\) is some constant and \(f(x)\) is a non-negative function, then the ratio term \(\pi(y)/\pi(x)=f(y)/f(x)\). The same reasoning applies to a constant factor in the jumper's transition kernel \(J\).

The constant factor being irrelevant in the target distribution \(\pi(x)\) is very useful. It is particularly important for posterior distributions in Bayesian statistics and the Gibbs distributions in statistical physics, as these distributions typically have difficult-to-calculate constants.

Metropolis-Hastings process

The pairing of the original jumper Markov chain with the coin flipping creates a new Markov chain, which we call the Metropolis-Hastings process. What’s remarkable is its stationary distribution will be the target distribution \(\pi(x)\).

Transition kernel (matrix)

For the Metropolis-Hastings process, we can readily reason the structure of the transition kernel (matrix) \(M\) that describes this Markov chain. First we’ll look at the off-diagonal entries of \(M\).

Jumping from \(x\) to \(y\neq x\)

To jump from point \(x\) and to another point \(y\neq x\), the probability is simply the previous probability \(J(x,y)\) multiplied by the probability of that proposed jump being accepted, which is \(\alpha(x,y)\), giving

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

Now we examine the diagonal entries of \(M\).

Jumping from \(x\) to \(x\)

There are two different ways for the jumper to remain at point \(x\). The first way is that the jumper simply jumps from \(x\) to \(x\), which happens with probability \(J(x,x)\). This proposed jump, so to speak, is accepted with probability one because \(\alpha(x,x)=1\). Consequently, we can write this probability as \(\alpha(x,x)J(x,x)\) or \(J(x,x)\).

The second way consists of all the possible jumps from \(x\) to \(z\), but then for each of those proposed jumps to be rejected, which happens (for each jump) with probability \([1-\alpha(x,z)]\). (I am using here the dummy or bound variable \(z\) instead of \(y\) for clarity.) Adding up the probabilities of these events gives the probability of the second way being the sum \(\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.\)

Consequently, for a single time step, the probability that the jumper starts at point \(x\) and remains at \(x\) is

$$ M(x,x) = \alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$$

The transition matrix \(M\) should be stochastic, so the rows sum to one, which we see is the case

$$ \sum_{y\in\mathbb{X}}M(x,y)=1\,.$$

Of course, we could have derived the diagonal entry \(M(x,x)\) immediately by starting with the above sum, but that approach lacks intuition into what’s happening.

Expression for the transition kernel \(M\)

$$M(x,y) = \begin{cases}
\alpha(x,y) J(x,y) & \text{if } x\neq y\,,\\
\alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) & \text{if } x=y\,.
\end{cases}$$

Often the above expression is written as a single line by placing an indicator function or similar in front of the sum for the diagonal entries. (In the continuous case, where the sum is replaced with an integral, a Dirac delta distribution is often used instead.)
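
As a sanity check on this expression, here's a short Python sketch that builds the matrix \(M\) on a small finite space (with an arbitrary target \(\pi\) and a uniform jump matrix \(J\)), then verifies that its rows sum to one, that \(\pi\) is stationary, and that detailed balance holds (see the next section):

import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])  # target distribution (arbitrary example)
numbStates = len(pi)
J = np.full((numbStates, numbStates), 1 / numbStates)  # uniform jump matrix

# acceptance probabilities alpha(x,y) = min(1, pi(y) J(y,x) / (pi(x) J(x,y)))
alpha = np.minimum(1, (pi[None, :] * J.T) / (pi[:, None] * J))

M = alpha * J  # off-diagonal entries M(x,y) = alpha(x,y) J(x,y)
# diagonal entries M(x,x) = alpha(x,x) J(x,x) + sum_z [1 - alpha(x,z)] J(x,z)
np.fill_diagonal(M, np.diag(alpha) * np.diag(J) + ((1 - alpha) * J).sum(axis=1))

print(M.sum(axis=1))                                       # rows should sum to one
print(np.allclose(pi @ M, pi))                             # stationarity: pi M = pi
print(np.allclose(pi[:, None] * M, (pi[:, None] * M).T))   # detailed balance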

Reversibility

A Markov process on \(\mathbb{X}\) with kernel (matrix) \(K\) is (time) reversible with respect to the distribution \(\mu\) if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution \(\mu\). By summing over \(x\), we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as \(\mu K=\mu\), which says that the transition kernel \(K\) has a stationary distribution \(\mu\).

(Strictly speaking, we don’t necessarily need reversibility, as long as the Markov chain has a stationary distribution. Reversibility just makes the algebra easier.)

The Metropolis-Hastings process is reversible

We can show that the Metropolis-Hastings process with the transition kernel \(M\) satisfies the reversibility condition. The proof essentially just requires the swapping of rows and columns. Clearly then we only need to look at the off-diagonal entries, so we assume \(x\neq y\).

We further assume that \(\pi(y)J(y,x)\geq \pi(x) J(x,y)\). Then \(\alpha(x,y)=1\), so \(M(x,y)=J(x,y)\). We use this last fact in the final line of the algebra that follows:

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= \pi(y)J(y,x) \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}\\ &= \pi(x)J(x,y)\\&= \pi(x)M(x,y)\,,\end{aligned}$$

which shows that the reversibility condition holds with the last assumption.

Of course, by symmetry, we can reverse that assumption, so that \(\pi(y)J(y,x)\leq \pi(x) J(x,y)\); then \(\alpha(x,y)=[\pi(y)J(y,x)]/[\pi(x) J(x,y)]\) and \(\alpha(y,x)=1\), and the same argument goes through. Hence the reversibility condition holds in all cases.

We could have shown this more quickly by observing that

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x)\alpha(y,x)\\ &= \pi(y)J(y,x) \min[1, \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}]\\ &= \min[\pi(y)J(y,x) , \pi(x)J(x,y)]\,,\end{aligned}$$

which is symmetric in \(x\) and \(y\).

In summary, the Metropolis-Hastings process is reversible and has the stationary distribution \(\pi(x)\).

Continuous case

As mentioned earlier, the continuous case is similar to the discrete case with a few modifications. For example, assuming now \(\mathbb{X}\) is continuous, the stationary distribution is described by a probability density \(\pi(x)\). The jumper process will be described by a transition kernel (function) \(j(x,y)\), where \(j(x,\cdot)\) is now a probability density for each \(x\in \mathbb{X}\).

It follows that the acceptance probability is

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{j(y,x)}{j(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The transition kernel (function) is

$$m(x,y) = \begin{cases}
\alpha(x,y) j(x,y) & \text{if } x\neq y\,,\\
\alpha(x,x) j(x,x)+\int_{\mathbb{X}}[1-\alpha(x,z)]j(x,z)\, dz & \text{if } x=y\,.
\end{cases}$$

For more details, there are derivations online such as this one, as well as the sources cited in the Further reading section below.

Libraries

Of course, before writing your own code, I would check out any pre-written functions in your favourite language that already implement the Metropolis-Hastings algorithm.

For example, in MATLAB there's the function mhsample. Again, I would use this before writing my own code, unless it's for illustration or educational purposes.

In Python there’s the library pymcmcstat. For those with a machine learning bent, there’s also a TensorFlow function MetropolisHastings.

In R, which is a language designed for statistics, implementations of any Markov chain Monte Carlo methods will be couched in the language and notation of (Bayesian) statistics. For a specific library, I can’t say I know which is the best, but you can check out the MCMCPack library. For R libraries, be wary of using the ones that were published and are (or were?) maintained by a single contributor.

Code

For a couple of simple examples in both one dimension and two dimensions, I’ve implemented the Metropolis-Hastings algorithm in the programming languages R, MATLAB, Python (NumPy) and Julia. The code can be found here.

Further reading

There are good articles explaining the Metropolis-Hastings approach, as well as its history. On this topic, the articles are probably better resources than the books.

Articles

Historical

The two important papers behind the Metropolis-Hastings algorithm are:

  • 1953 – Metropolis, Rosenbluth, Rosenbluth, Teller, Teller – Equation of state calculations by fast computing machines;
  • 1970 – Hastings – Monte Carlo sampling methods using Markov chains and their applications.
Introductory

There are several tutorial and historical articles covering the Metropolis-Hastings algorithm. An earlier explanatory article is

  • 1995 – Chib and Greenberg – Understanding the Metropolis-Hastings Algorithm.

I also recommend:

  • 1994 – Tierney – Markov chains for exploring posterior distributions;
  • 1998 – Diaconis and Saloff-Coste – What do we know about the Metropolis algorithm?;
  • 2001 – Billera and Diaconis – A Geometric Interpretation of the Metropolis-Hastings Algorithm;
  • 2015 – Minh and Minh – Understanding the Hastings Algorithm;
  • 2016 – Robert – The Metropolis-Hastings algorithm.

The above article by Billera and Diaconis shows that the Metropolis-Hastings algorithm is actually the minimization (on the space of possible samplers) using an \(L^1\) norm. (Note that \(K(x,x)\) should be \(K(x,y)\) in equation (1.3), so it agrees with equation (2.1).)

The following article covers the Metropolis-Hastings algorithm in the context of machine learning (particularly Bayesian statistics):

  • 2003 – Andrieu, de Freitas, Doucet, and Jordan – An Introduction to MCMC for Machine Learning.

For a surprising real world application, in the following article, Diaconis briefly recounts a story about a couple of graduate students using the Metropolis-Hastings algorithm to decipher a letter from a prison inmate who had used a simple substitution cipher:

  • 2009 – Diaconis – The Markov chain Monte Carlo revolution.

For Markov chains on general state spaces, the following paper gives a survey of results (often with new proofs):

  • 2004 – Roberts and Rosenthal – General state space Markov chains and MCMC algorithms.
History

The history of this algorithm and related methods are described in the following articles:

  • 2003 – Hitchcock – A History of the Metropolis-Hastings Algorithm;
  • 2005 – Gubernatis – Marshall Rosenbluth and the Metropolis algorithm;
  • 2011 – Robert and Casella – A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.

Incidentally, the first author of the third paper, Christian P. Robert, posts regularly on Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm, as well as many other topics.

Books

Modern books on stochastic simulations and Monte Carlo methods will often detail this method. For example, see the Handbook of Monte Carlo Methods (Section 6.1) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Chapter XIII, Section 3. (For my remark on reversibility, see Remark 3.2 in Asmussen and Glynn.) There is also the book Monte Carlo Strategies in Scientific Computing by Liu; see Chapter 5.

Websites

There are many, many websites covering this topic. Searching around, the following link is probably my favourite, as it gives a detailed explanation of how the Metropolis-Hastings algorithm works and includes Python code:

This post also covers it with Python code:

This site details the algorithm with R code:

Here’s a quick introduction with a one-dimensional example implemented in R:

Creating a reversible Markov chain using acceptance(-rejection)

The study of Markov chains is generally the study of their long term behaviour, which, under certain conditions, is captured by them having a unique stationary distribution. Stationarity is an important property. It is, in a sense, a local property of a Markov chain.

For a Markov chain, a more global property is something called reversibility. Markov chains with this property must possess a stationary distribution, which, as we see below, is an immediate consequence of reversibility. Reversible Markov chains (or processes) with discrete time are the cornerstone of Markov chain Monte Carlo (MCMC) methods.

(On the terminology of chain versus process, Asmussen writes in his book Applied Probability and Queues: "In this book, we use the terminology that a Markov chain has discrete time and a Markov process has continuous time (the state space may be discrete or general). However, one should note that it is equally common to let 'chain' refer to a discrete state space and 'process' to a general one (time may be discrete or continuous).")

In this post we look at how a reversible Markov chain is constructed from a non-reversible (but irreducible) Markov chain by introducing an acceptance-rejection step. This post complements another post I wrote on the Metropolis-Hastings algorithm.

Reversibility

A Markov process on state space \(\mathbb{X}\) with kernel \(K\) is (time) reversible with respect to the distribution \(\mu\) if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution \(\mu\). By summing over \(x\), we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as \(\mu K=\mu\), which says that the transition kernel \(K\) has a stationary distribution \(\mu\).

First Markov chain

We consider a Markov chain with kernel \(J\) defined on a finite state space \(\mathbb{X}\). If the Markov chain is at state \(x\in \mathbb{X}\), it visits another state \(y\in \mathbb{X}\) with the probability \(J(x,y)\). This is a simple time-homogeneous finite Markov chain.

Irreducibility

For our Markov chain, we assume that every state \(x\) in \(\mathbb{X}\) where \(\pi(x)>0\) is reachable with positive probability in a single step. This implies the easy-to-achieve condition \(J(x,y)>0\) where \(\pi(x)>0\) for all points \(x,y \in \mathbb{X}\). This requirement is a stronger form of irreducibility.

Creating a new Markov chain with acceptance

We create a new Markov chain by introducing an acceptance step. For the Markov chain with kernel \(J\), we assume that at each time step, after choosing the jump direction but before jumping, a biased coin is flipped. The success probability \(\alpha(x,y)\) depends on the current position \(x\in \mathbb{X}\) and the (potential) next position \(y\in \mathbb{X}\).

Transition kernel

For our new Markov chain, we can quickly reason the transition kernel \(M\). We first look at the off-diagonal elements of the kernel (matrix) \(M\). To go from state \(x\) and to another state \(y\neq x\), the probability is simply

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

The transition matrix \(M\) needs to be stochastic, so the rows sum to one, so \(\sum_{y\in\mathbb{X}}M(x,y)=1\). That gives us the diagonal elements of \(M\), although their exact form is not needed to show reversibility.

\(\alpha(x,y)\) needs a symmetric function \(s(x,y)\)

For reversibility, we just need to swap rows and columns. Clearly we only need to look at the off-diagonal entries, which implies the requirement

$$ \pi(x)M(x,y) = \pi (y) M(y,x)\quad x\neq y\,.$$

Both sides are symmetric in \(x\) and \(y\), meaning they are equal to some non-negative symmetric function \(s(x,y)=s(y,x)\). Looking at the right-hand side, we get

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= s(y,x)\,.\end{aligned}$$

This implies that the function \(\alpha\) is a non-negative function such that \(\alpha\leq 1\), to ensure it’s a probability, with the form

$$ \alpha(x,y)=\frac{s(x,y)}{\pi(x)J(x,y)}\,.$$

The only task remaining now is to choose a reasonable symmetric function \(s\) such that \(\alpha\leq 1\), ensuring \(\alpha\) is a probability. Of course, our choice for the symmetric function \(s\) should also be a function of the stationary distribution \(\pi\) and the underlying kernel \(J\).

Examples

I’ll give two principal examples of the symmetric function \(s(x,y)\). Working in reverse chronological order, I’ll give the simpler of the two examples first.

Barker

A somewhat natural example is

$$s(x,y) = \frac{\pi(x)J(x,y)\pi(y)J(y,x)}{\pi(x)J(x,y)+\pi(y)J(y,x)}\,.$$

This is clearly a symmetric function, which only has the terms \(\pi\) and \(J\). The acceptance probability becomes

$$\alpha(x,y) = \frac{\pi(y)J(y,x)}{\pi(x)J(x,y)+\pi(y)J(y,x)}\,.$$

A.A. Barker proposed this function in a 1965 paper as part of his PhD work in mathematical physics at the University of Adelaide. Barker had been inspired by a previous 1953 paper, which brings us to the next example.

Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings

The now most important example is

$$s(x,y) = \min[\pi(x)J(x,y),\pi(y)J(y,x)]\,.$$

We can see that this is a symmetric function. The acceptance probability becomes

$$\alpha(x,y) = \min[1,\frac{\pi(y)J(y,x)}{\pi(x)J(x,y)}]\,.$$

This example is very famous in the world of Markov chain Monte Carlo methods. It is the main part of the so-called Metropolis-Hastings algorithm, which comes from a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method.
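
As a quick numerical comparison of the two examples, here's a Python sketch evaluating both acceptance functions in terms of the ratio \(r=\pi(y)J(y,x)/[\pi(x)J(x,y)]\); the Metropolis-Hastings choice accepts at least as often as Barker's for every value of \(r\).

import numpy as np

# r = pi(y) J(y,x) / (pi(x) J(x,y)), the ratio appearing in both acceptance probabilities
r = np.array([0.1, 0.5, 1.0, 2.0, 10.0])

alphaBarker = r / (1 + r)   # Barker: pi(y)J(y,x) / [pi(x)J(x,y) + pi(y)J(y,x)]
alphaMH = np.minimum(1, r)  # Metropolis-Hastings: min(1, r)

for row in zip(r, alphaBarker, alphaMH):
    print("ratio = %5.2f   Barker = %5.3f   MH = %5.3f" % row)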

Further reading

Articles

Historical

The two important historical papers that gave the two examples for \(s(x,y)\) are:

  • 1965 – Barker – Monte Carlo calculations of the radial distribution functions for a proton-electron plasma;
  • 1953 – Metropolis, Rosenbluth, Rosenbluth, Teller, Teller – Equation of state calculations by fast computing machines.

This paper is also important:

  • 1970 – Hastings – Monte Carlo sampling methods using Markov chains and their applications.
Introductory

A good explanatory article about the Metropolis-Hastings algorithm is:

  • 1995 – Chib and Greenberg – Understanding the Metropolis-Hastings Algorithm.

I also recommend:

  • 2015 – Minh and Minh – Understanding the Hastings Algorithm;
  • 2016 – Robert – The Metropolis-Hastings algorithm;
  • 2017 – Gonçalves, Łatuszyński, Roberts – Barker’s algorithm for Bayesian inference with intractable likelihoods.
History

The history of this and related methods is described in the following article:

  • 2011 – Robert and Casella – A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.

Incidentally, the first author, Christian P. Robert, writes on his blog about Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm, as well as many other topics.

Books

Modern books on stochastic simulations and Monte Carlo methods will often detail this method. For example, see the Handbook of Monte Carlo Methods (Section 6.1) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Chapter XIII, Section 3. (For my remark on reversibility, see Remark 3.2 in Asmussen and Glynn.) There is also the book Monte Carlo Strategies in Scientific Computing by Liu; see Chapter 5.

Websites

The web abounds with pages dedicated to explaining Markov chain Monte Carlo methods with the focus typically on the Metropolis-Hastings algorithm. I would try the following one because it gives a detailed explanation of how the Metropolis-Hastings algorithm works:

  • https://gregorygundersen.com/blog/2019/11/02/metropolis-hastings/

Markov goes to Monte Carlo

Numerical and scientific fields such as statistics, computational physics, and optimization methods heavily rely upon Markov chain Monte Carlo methods. These simulation techniques use the power of Markov chains to sample general probability distributions, which in turn give Monte Carlo methods for estimating integrals and optimal solutions.

This is the third part of a multi-part series of posts. The first part covers Markov chains. The second part covers the basics of Monte Carlo methods. This post combines the ideas from the first two parts. Overall, the posts sketch the mechanics of Markov chain Monte Carlo (MCMC) methods.

In a later post I detail the principal MCMC method known as the Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings method, where you can see these mechanics come together. I’ve also given posts presenting code (in Julia, Python, MATLAB, R, and C) for this method here and here.

Here we will only examine Markov chains with countable state spaces. In this setting, we only need standard probability and matrix knowledge. But the results extend to general state spaces, such as Euclidean space.

By the way, I doubt that Andrey A. Markov ever went to Monte Carlo in the small country of Monaco. (But Herbie did.)

Monte Carlo conditions

Sum

In the first post we covered Markov chains with countable state spaces, because the mathematics is more transparent. Consequently, we can use such Markov chains to estimate the sum of a function \(f\) over a countable (possibly infinite) set \(\mathbb{S}\). We can write the sum as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \\&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\) gives the samples \(y_1,\dots,y_n\), which yield an unbiased Monte Carlo estimate of the sum, namely

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

The closer the function \(q\) is (up to normalization) to \(|f|\), the better the estimate.

Integral

We can write an integral over some region \(\mathbb{S}\) as

$$\begin{align}\int_{\mathbb{S}} f(x) dx &= \int_{\mathbb{S}} \frac{f(x)}{p(x)}p(x)dx\\ & =\mathbb{E}_p[\frac{f(Y)}{p(Y)}]\,,\end{align}$$

where \(p\) is the probability density of a random variable (or point) \(Y\) with support \(\mathbb{S}\). Then by sampling \(Y\), the resulting samples \(y_1, \dots, y_n\in \mathbb{S}\) give the unbiased Monte Carlo estimate

$$I_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{p(y_i)}\,.$$
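
As a toy illustration of the estimate \(I_n(f)\), here is a short Python sketch; the integrand \(f(x)=e^{-x^2}\) and the choice of \(p\) as a standard normal density are my own assumptions, picked so the exact answer \(\sqrt{\pi}\) is known.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

f = lambda x: np.exp(-x**2)                              # integrand (illustrative choice)
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # sampling density (standard normal)

y = rng.normal(size=n)         # samples y_1, ..., y_n from p
I_n = np.mean(f(y) / p(y))     # unbiased Monte Carlo estimate

print(I_n, np.sqrt(np.pi))     # estimate versus the exact integral of exp(-x^2)
```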

Ergodic sequence

To estimate the sum or integral, we’ll need a sequence of ergodic random variables \(\{Y_i\}_{i\in\mathbb{N}}\), which we can achieve with an ergodic Markov chain with a state space \(\mathbb{S}\).

Markov chain conditions

In the context of Markov chains and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers. A Markov chain with a countable state space needs some conditions to ensure ergodicity.

Regularity conditions for a countable Markov chain

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Positive recurrence

In general, most of the needed structure for the Markov chain comes from irreducibility and the existence of a stationary distribution. Sometimes the above conditions are redundant. For example, an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent. We’ll look at ways to satisfy these conditions in the Monte Carlo setting when the underlying state space is countable.

Stationary distribution \(\pi\)

A stationary distribution \(\pi\) satisfies the equation

$$ \pi=\pi P\,,$$

where we treat \(\pi\) as a row vector over the state space.

We know exactly what our stationary distribution \(\pi\) needs to be. It is the one used in the Monte Carlo method, namely the distribution of \(Y\). The challenge is constructing a Markov chain with such a stationary distribution. This will of course depend on the Markov (transition) kernel \(P\) of the Markov chain.
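
For a small finite state space you can find (or check) a stationary distribution directly with linear algebra: writing \(\pi\) as a row vector, it is a left eigenvector of \(P\) with eigenvalue one. Here is a minimal sketch with a made-up transition matrix of my own.

```python
import numpy as np

# A made-up row-stochastic transition matrix P(x, y).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Left eigenvector of P with eigenvalue 1, normalized to sum to one.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1.0))])
pi = pi / pi.sum()

print(pi)            # stationary distribution
print(pi @ P - pi)   # close to zero, confirming pi P = pi
```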

We’ll look at the other conditions on the Markov kernel \(P\).

Aperiodicity

We recall that the period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P^n(x,x)>0\), where \(P^n\) denotes the \(n\)-step transition kernel. We need every state of the countable Markov chain to be aperiodic, meaning \(d_x=1\) for all \(x\in\mathbb{X}\).

One sufficient way to achieve this requirement is to first have the countable Markov chain be irreducible, which we’ll cover in the next section. Then if there exists at least one state \(x\in \mathbb{X}\) such that \(P(x,x)>0\), the Markov chain is aperiodic. (See Lemma 1.8.2 in Markov Chains by Norris or Remark 6.2.5 in Probability and Stochastic Processes by Brémaud.)

Clearly the last condition is not difficult to satisfy, putting the focus now on the irreducibility requirement.

Irreducibility

The irreducibility property says that it is possible for the Markov chain to get from any point to any other point in a finite number of steps. More precisely, for an irreducible Markov chain with countable state space, there must exist for all \(x,y\in\mathbb{X}\) a natural number \(s\) (possibly depending on \(x\) and \(y\)) such that \(P^s(x,y)>0\).

We can achieve this requirement by introducing a stronger requirement, which is easier to verify. A countable Markov chain with a transition kernel \(P\) will be irreducible if for all \(x,y\in\mathbb{X}\) the condition \(P(x,y)>0\) holds. We have simply required that the above number \(s=1\) for all \(x,y\in\mathbb{X}\).

Furthermore, if \(P(x,x)>0\) for all \(x\in\mathbb{X}\), then our irreducible countable Markov chain is also aperiodic.
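
For a finite state space, both conditions can be checked numerically. Here is a sketch of my own (not from the references above): a nonnegative matrix \(P\) is irreducible exactly when every entry of \((I+P)^{m-1}\) is positive, where \(m\) is the number of states, and, given irreducibility, a single positive diagonal entry of \(P\) gives aperiodicity.

```python
import numpy as np

def is_irreducible(P):
    # A nonnegative matrix P is irreducible if and only if every entry
    # of (I + P)^(m-1) is positive, where m is the number of states.
    m = P.shape[0]
    reach = np.linalg.matrix_power(np.eye(m) + P, m - 1)
    return bool(np.all(reach > 0))

def is_aperiodic_given_irreducible(P):
    # For an irreducible chain, one state with P(x, x) > 0 is enough.
    return bool(np.any(np.diag(P) > 0))

# A made-up chain that is irreducible but periodic (period 2).
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

print(is_irreducible(P), is_aperiodic_given_irreducible(P))  # True False
```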

Positive recurrence

For a point or state \(x\in\mathbb{X}\), we recall its first return time being defined as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). For a countable Markov chain, if all the states in the state space are positive recurrent, then we say the Markov chain is positive recurrent.

A countable irreducible Markov chain is positive recurrent if (and only if) it has a stationary distribution \(\pi\). Furthermore, that stationary distribution will be unique and nonzero, meaning \(\pi(x)>0\) for all \(x\in\mathbb{X}\). (See Theorem 1.7.7 in Markov Chains by Norris or Theorem 6.3.14 in Probability and Stochastic Processes by Brémaud.)

For positive recurrence, we need to prove that the Markov chain has a stationary distribution.
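
For intuition, here is a small simulation sketch with a made-up chain of my own: estimate the mean first return time to a state by simulation. As a standard fact not needed above, for an irreducible positive recurrent chain this mean equals \(1/\pi(x)\), which the simulation roughly confirms.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up irreducible transition matrix on three states.
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])

def first_return_time(x):
    # Run the chain from x until it first returns to x.
    state = rng.choice(3, p=P[x])
    t = 1
    while state != x:
        state = rng.choice(3, p=P[state])
        t += 1
    return t

x = 0
mean_return = np.mean([first_return_time(x) for _ in range(20_000)])

# Standard fact (an aside): for an irreducible positive recurrent chain,
# the mean return time to x equals 1 / pi(x).
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1.0))])
pi = pi / pi.sum()

print(mean_return, 1.0 / pi[x])   # the two numbers should be close
```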

Ergodicity

A countable Markov chain that is aperiodic, irreducible, and positive recurrent is ergodic. In other words, an ergodic Markov process \(X\) with stationary distribution \(\pi\) is a random sequence \(X_0, X_1,\dots\) such that the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i) \rightarrow \sum_{x\in\mathbb{X}}g(x)\pi(x)$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). Conveniently, we can use such an ergodic Markov chain to make statistical estimates.

Sampling with a Markov chain

For the Monte Carlo sample \(y_1,\dots,y_n\), we simply use the Markov chain samples \(x_0,x_1,\dots,x_{n-1}\), giving the Monte Carlo estimate. In the discrete case, this is simply

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n g(x_{i-1})\,,$$

which, due to ergodicity, converges to \(S(f)=\mathbb{E}_{q}[g(Y)]\).
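
Here is a hedged sketch of that recipe in Python for a small made-up chain: run the chain, average \(g\) along the path, and compare with the stationary average \(\sum_x g(x)\pi(x)\). The transition matrix and the function \(g\) are my own toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up chain satisfying P(x, y) > 0 for all states, with a function g.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
g = np.array([1.0, 2.0, 5.0])   # g evaluated at the states 0, 1, 2

# Time average of g along one long run of the chain.
n = 200_000
x = 0
total = 0.0
for _ in range(n):
    total += g[x]
    x = rng.choice(3, p=P[x])
time_average = total / n

# Stationary average for comparison.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1.0))])
pi = pi / pi.sum()

print(time_average, g @ pi)   # ergodicity: the two averages should be close
```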

Summary

In summary, a sufficient way to have a countable ergodic Markov chain is for the transition kernel to satisfy \(P(x,y)>0\) for all \(x,y\in\mathbb{X}\) and for a (stationary) distribution \(\pi=\pi P\) to exist. For the first requirement, one imagines using a non-negative function such as \(P(x,y)=e^{-(x-y)^2}/C\), where \(C>0\) is a suitable normalization constant.

The requirements seem possible, but to bring them all together is still a great challenge for the uninitiated. Fortunately, some great minds almost seven decades ago proposed the first Markov chain that meets all the requirements for a specific stationary distribution, which was used for a Monte Carlo estimate.

The first such Markov chain Monte Carlo method appeared in a paper written by five scientists, including two husband-wife pairs. It became known as the Metropolis or Metropolis-Hastings algorithm, and it will be the subject of a future post.

Further reading

Books

Markov chains

Any stochastic process book will include a couple of chapters on Markov chains. For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn and Tweedie – Markov Chains and Stochastic Stability.

All the above books have a section detailing Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm and the Gibbs sampler.

Monte Carlo methods

Books dedicated to Monte Carlo approaches include the ones mentioned earlier, such as:

  • Kroese, Taimre, and Botev – Handbook of Monte Carlo Methods;
  • Asmussen and Glynn – Stochastic Simulation: Algorithms and Analysis;
  • Liu – Monte Carlo Strategies in Scientific Computing.

For a statistics context, there’s also the good book:

  • Casella and Robert – Monte Carlo Statistical Methods.

The second MC in MCMC methods

In calculus, differentiating a function is easy. Integrating one is hard. Much research has gone into finding ways of calculating integrals. But often functions cannot be integrated, at least analytically.

No problem. You do it numerically. You take the function’s curve, split it up into nice rectangles or, even better, trapezoids, and calculate their areas, and add them up. Not accurate enough? Split the curve up even finer until you’re happy. Done.

That works fine in one dimension and even two dimensions. But then, as the number of dimensions increases, the amount of splitting and adding grows exponentially, soon hitting unwieldy numbers. This is a mountain too steep for even the fastest computers. The problem is what applied mathematics guru Richard E. Bellman famously called the curse of dimensionality.

But thankfully randomness offers a path around this mountain. The approach is called Monte Carlo integration, and we use it to estimate sums and integrals.

This is the second part of a series of posts on Markov chain Monte Carlo methods. The first part covers Markov chains. This post covers the basics of Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods.

We’ll focus on estimating sums, but the same logic applies to integrals.

Monte Carlo estimates

In the discrete (or countable) case, consider a real-valued function \(f\) defined on a countable state space \(\mathbb{S}\). This state space may be, for example, the lattice \(\mathbb{Z}^d\) or some subset of it. We write the sum as

$$S(f)=\sum_{x\in \mathbb{S}} f(x) $$

Such sums arise in many fields, particularly statistical physics, which is where the (Markov chain) Monte Carlo methods were born. (In general, we use integrals for continuous spaces, which are uncountable, whereas we use sums for discrete spaces, which are countable. The two are generalized by the concept of integration with respect to a measure.)

The idea behind a Monte Carlo estimate of a sum is remarkably simple. You randomly sample the function \(f\), find the average, and use it as an estimate of the sum.

Estimating with uniform variables

For a finite set \(\mathbb{S}=\{x_1,\dots,x_m\}\), we write the sum of \(f\) over the set \(\mathbb{S}\) as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}}  f(x) \frac{m}{m} \\&= \sum_{x\in\mathbb{S}}  \frac{f(x)p(x)}{p(x)}\\& = \mathbb{E}_{p}[ \frac{f(U)}{p(U)}]\,,\end{align}$$

where \(U\) is a discrete uniform random variable, which has the probability mass function \(p\) with support \(\mathbb{S}\), meaning \(p(x)=1/m\) if \(x\in \mathbb{S}\) and \(p(x)=0\) otherwise. But using a uniform random variable is a naive approach, as there are better random variables we can use to sample \(f\).
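
As a quick sketch with a made-up finite sum (my own toy choices of \(f\) and \(m\)), the uniform-sampling estimate looks like this in Python:

```python
import numpy as np

rng = np.random.default_rng(1)

m = 1000
states = np.arange(1, m + 1)     # finite state space S = {1, ..., m}
f = lambda x: np.sqrt(x)         # toy summand

n = 100_000
u = rng.integers(1, m + 1, size=n)    # uniform samples, p(x) = 1/m
estimate = np.mean(f(u) / (1.0 / m))  # equivalently m * mean(f(u))

print(estimate, f(states).sum())      # Monte Carlo estimate versus exact sum
```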

Estimating with general variables

More generally, we can write the sum of \(f\) over a countable (possibly infinite) set \(\mathbb{S}\) as

$$\begin{align}S(f)&= \sum_{x\in\mathbb{S}}  \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\) gives the samples \(y_1,\dots,y_n\), which yield an unbiased Monte Carlo estimate of the sum

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

If we can (quickly) sample \(Y\), then we can (quickly) estimate the sum. The closer the (probability mass) function \(q\) is (up to normalization) to \(|f|\), the better the estimate.

We can also use the above reasoning with integrals, replacing the sum and probability mass function with an integral and a probability density.

Error rate

If the samples \(y_1, \dots, y_n \) are random and independent, the (strong) law of large numbers guarantees that the unbiased Monte Carlo estimate converges (almost surely) to the correct answer.

Further assuming \(\mathbb{E}_q[g^2(Y)]<\infty\), the typical error of the estimate \(|S(f)-S_n(f)|\) is proportional to

$$\frac{\mathbb{V}_q(g)^{1/2}}{n^{1/2}}\,,$$

where

$$\mathbb{V}_q(g)=\mathbb{E}_q[g^2(Y)]-(\mathbb{E}_q[g(Y)])^2$$

is the variance of the function \(g=f/q\) under the sampling distribution \(q\). (This is not the variation of a function, which is something different.) The error \(|S(f)-S_n(f)|\) reduces at the rate \(O(n^{-1/2})\). The Monte Carlo approach allows for continual updating until the estimate is sufficiently converged.
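
To see the \(O(n^{-1/2})\) rate numerically, here is a sketch (again with a made-up finite sum and uniform sampling, both my own choices): each tenfold increase in \(n\) shrinks the error by roughly a factor of \(\sqrt{10}\).

```python
import numpy as np

rng = np.random.default_rng(1)

m = 1000
f = lambda x: np.sqrt(x)                    # toy summand
exact = f(np.arange(1, m + 1)).sum()        # exact value of the sum

for n in [10**3, 10**4, 10**5, 10**6]:
    u = rng.integers(1, m + 1, size=n)      # uniform samples, q(x) = 1/m
    estimate = m * np.mean(f(u))
    print(n, abs(estimate - exact))         # error shrinks roughly like n**(-1/2)
```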

The above logic applies to integrals of functions on general spaces.

The dimension \(d\) of the underlying space, such as \(\mathbb{S}^d\), does not appear in the error. The independence of the dimension is precisely why we tackle high-dimensional sums and integrals with Monte Carlo methods. The general rule of thumb is to use Monte Carlo integration for functions of four or more dimensions.

Losing the independence

Again, a sequence of independent random variables allows our Monte Carlo estimate to converge. We can loosen this requirement and allow some dependence between variables. To obtain a Monte Carlo estimate, we just need to use a sequence of random variables that are ergodic. Then we recover the (strong) law of large numbers that gives our Monte Carlo estimate.

Ergodicity

The word ergodicity is one of those technical terms you’ll see and hear in different fields of mathematics and physics, often when randomness is involved. The concept is also central to deterministic dynamical systems.

There are different versions and strengths of ergodicity, depending on the system and applications. But loosely speaking, an ergodic system is one in which temporal averages and spatial (or ensemble) averages converge to the same value.

That means when you want to find an average, you have two options in an ergodic system. The temporal option: You can measure something over a long time. The spatial (or ensemble) option: You can measure many instances of the same something. The two different averages will eventually converge to the same value.

Ergodic sequences

In the context of Markov processes and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers.

Consider a stochastic process \(X\), which is a sequence of random variables \(X_0,X_1,\dots\) indexed by some parameter \(t\). We assume the probability distributions \(\pi_0,\pi_1,\dots\) of the random variables converge to some final stationary distribution \(\pi\). For a countable state space, the (probability) distribution \(\pi\) corresponds to a probability mass function.

We then say a stochastic process \(X\) with a countable state space \(\mathbb{S}\) is ergodic if the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i)\rightarrow \sum_{x\in \mathbb{S}}g(x)\pi(x)=\mathbb{E}_{\pi} [g(X)]\,,$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). The subscript of the expectation simply denotes that the expectation is taken with respect to the (stationary) distribution \(\pi\).

The equation tells the story. On the left-hand side you have a temporal average. On the right-hand side you have a spatial (or ensemble) average. As time proceeds, the first average will converge to the second average. Ergodicity.

Further reading

There are many books dedicated entirely to the subject of Monte Carlo methods. For example, a good book with a detailed section on Markov chains applied to Monte Carlo methods in the context of statistics is:

  • Casella and Robert – Monte Carlo Statistical Methods.

These books also have sections on Monte Carlo integration: