Quantum-enhanced Markov chain Monte Carlo

The not-so-mathematical journal Nature recently published a paper proposing a new Markov chain Monte Carlo method:

  • 2023 – Layden, Mazzola, Mishmash, Motta, Wocjan, Kim, and Sheldon – Quantum-enhanced Markov chain Monte Carlo.

The paper appeared earlier as this preprint, and its publication in such a journal is a rare event indeed. This post notes this, as well as the fact that we can already simulate the paper’s test model, the Ising (or Potts) model, perfectly. (For small instances of the model, we can do this directly. For large instances, we can use coupling from the past, proposed by Propp and Wilson. As for the model’s name, Wilhelm Lenz asked his PhD student Ernst Ising to study the one-dimensional version of the model, while Renfrey Potts studied the generalization and presented it in his PhD.) But this is a quantum algorithm, which is exciting, and that explains how it could end up in that journal.

The algorithm

The paper’s proposed algorithm adds a quantum mechanical edge or enhancement to the classic Metropolis-Hastings algorithm. (More accurately, it should be called the Metropolis-Rosenbluth-Rosenbluth-Teller-Teller-Hastings algorithm.) As I covered in a recent post, the original algorithm uses a Markov chain defined on some mathematical space. Running on a traditional or classical computer, at each time step the algorithm proposes a random jump and then accepts or rejects it. Owing to the magic of Markov chains, in the long run, the algorithm simulates a desired probability distribution.

The new quantum version of the algorithm uses a quantum computer to propose the jump, while still using a classical computer to accept the proposal or not. (In terms of my Metropolis-Hastings post, the classical jumper process, a discrete-time Markov chain, is replaced with a quantum mechanical variant.) The quantum jump proposals are driven by a time-independent Hamiltonian, which is a central object in quantum physics and, in fact, all of physics. This leads to a Boltzmann (or Gibbs) probability distribution for the jumping process.

Then, running the quantum part on a quantum computer, the algorithm will hopefully outperform its classical counterpart. The paper nurtures this hope by giving empirical evidence of the algorithm’s convergence speed. The researchers performed the numerical experiments on a 27-qubit quantum processor at IBM using the platform Qiskit.

Quantum is so hot right now

In recent years researchers have been focusing on such algorithms, which exploit the strangeness and spookiness of quantum mechanics. You will see more and more quantum versions of algorithms that appear in statistics, machine learning, and related fields, as suggested by this survey paper, which also appeared in Nature.

Quantum lite

Sometimes quantum mechanics only loosely inspires algorithms and models. In this setting, some of my machine learning work uses determinantal point processes. This kernel-based random model draws direct inspiration from the wave function, a standard object in quantum mechanics. Under suitable simplifying conditions, the model describes the locations of particles known as fermions such as electrons and protons. Still, it’s fascinating that a quantum physics model inspired an interesting random object that has found applications in spatial statistics and machine learning.

The Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm

Consider a collection of random variables described by a joint probability distribution. Often, in any field with probability or statistics, one faces the task of simulating these random variables, which typically depend on each other in some fashion.

A now standard way for simulating or sampling such random variables is to use the Metropolis-Hastings algorithm, undoubtedly the cornerstone of Markov chain Monte Carlo methods. This method creates a discrete-time Markov chain whose stationary (or invariant) distribution is the aforementioned distribution.

The algorithm was born out of a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method. It is typically called the Metropolis-Hastings or the Metropolis algorithm. And some have called it the \(M(RT)^2H\) algorithm.

(The history is a bit complicated, but perhaps we should drop the name Metropolis. The late Arianna (née Wright) Rosenbluth did most of the work. She was also a dab hand at fencing.)

Although the algorithm’s initial adoption and use was slow, taking decades partly due to slower computers, the Metropolis-Hastings algorithm is now a widely used method for simulating collections of random variables. This in turn gives fast ways for exploring, integrating, and optimizing otherwise unwieldy mathematical functions, such as those found in Bayesian statistics, machine learning, statistical physics, and combinatorial optimization. The algorithm serves as the foundation for other random simulation methods, such as the Gibbs sampler, hence it’s been called the workhorse of Markov chain Monte Carlo methods.

There are many books, articles, lecture notes, and websites describing the Metropolis-Hastings algorithm; see the Further reading section below. But I’ll detail the core ideas here. This post is designed to be somewhat self-contained, but it arose from a series of posts, starting with this one and ending with this particularly relevant one.

Constructing a Markov process

Take a collection of \(n\) random variables \(X_1,\dots,X_n\) with a (joint) probability distribution \(\pi(x)=\pi(x_1,\dots,x_n)\). This distribution will either be a (joint) probability mass function or (joint) probability density for discrete or continuous random variables, respectively.

We wish to construct a Markov chain on an abstract mathematical space \(\mathbb{X}\). We assume we can write a point \(x\in \mathbb{X}\) as \(x=(x_1,\dots,x_n)\). More specifically, the space \(\mathbb{X}\) is a Cartesian product of spaces \(\mathbb{X}_1,\dots,\mathbb{X}_n\) on which the variables are defined.

Which mathematical space is \(\mathbb{X}\)? That will, of course, depend on the random variables you’re trying to simulate. But it’s usually the lattice \(\mathbb{Z}^n\), Euclidean space \(\mathbb{R}^n\), or a subset of one of these two spaces.

For this post, we’ll assume that the space \(\mathbb{X}\) is discrete, which makes things simpler. But the mathematics is similar for continuous spaces, with small changes such as replacing probabilities and sums with probability densities and integrals, respectively. We’ll further assume the space is finite, so we can use matrices to describe the Markov transition kernels. But everything covered here will work on infinite spaces such as \(\mathbb{R}^n\), which is the most common space used in practice.

Again, our overall aim is to construct a Markov chain whose stationary distribution \(\pi\) is the distribution that we want to sample.

Jumper process

There’s a random jumper that wants to jump around the space \(\mathbb{X}\). The jumper randomly jumps from one point in this mathematical space \(x\in \mathbb{X}\) to another point \(y\in \mathbb{X}\) according to the probability \(J(x,y)\). If the state space \(\mathbb{X}\) is finite, then \(J\) becomes a matrix. The matrix row \(J(x,\cdot)\) is a probability mass function for each \(x\in \mathbb{X}\), so it sums to one. By definition, this random jumping forms a Markov chain.

The only thing we ask is that, for our jumper, every point \(y\) in \(\mathbb{X}\) where \(\pi(y)>0\) is reachable with positive probability in a single step. This amounts to the easy-to-achieve condition that \(J(x,y)>0\) whenever \(\pi(y)>0\), for all points \(x,y\in\mathbb{X}\).

Now we have a Markovian jumper on the space \(\mathbb{X}\). But this turns out to be too much jumping for our jumper. Furthermore, the jumper is jumping more in certain directions than others. Occasionally the jumper wants to stay put (and have a rest) with the aim of balancing the jump directions.

The jumper still wants to jump sometimes from a point \(x\in \mathbb{X}\) to another point \(y\in \mathbb{X}\) based on \(J(x,y)\). But now at each time step, after choosing the jump direction but before jumping, the jumper flips a biased coin whose success probability \(\alpha(x,y)\) depends on the current position \(x\in \mathbb{X}\) and the (potential) next position \(y\in \mathbb{X}\). For the coin, the acceptance probability, which allows (or not) the jumper to move from \(x\) to \(y\), is given by

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{J(y,x)}{J(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The function \(\alpha(x,y)\) is clearly never negative. The minimum in the above expression ensures that \(\alpha(x,y)\) is a valid probability.

Metropolis-Hastings ratio

The ratio in the expression for \(\alpha(x,y)\) is sometimes called the Metropolis-Hastings ratio, which we’ll soon see is designed specifically to balance the jump directions. The ratio means that a constant factor in the target distribution \(\pi(x)\) will vanish due to cancellation.

More specifically, if we can write the target distribution as \(\pi(x)=f(x)/C\), where \(C>0\) is some constant and \(f(x)\) is a non-negative function, then the ratio term \(\pi(y)/\pi(x)=f(y)/f(x)\). This reasoning also applies to a constant factor in the jump kernel \(J\).

The constant factor being irrelevant in the target distribution \(\pi(x)\) is very useful. It is particularly important for posterior distributions in Bayesian statistics and the Gibbs distributions in statistical physics, as these distributions typically have difficult-to-calculate constants.
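To make this cancellation concrete, here is a minimal sketch in Python; the target \(f\) and proposal kernel \(J\) are made-up stand-ins for illustration, and only the unnormalized values of \(f\) ever enter the calculation.

```python
def acceptance_probability(x, y, f, J):
    """Metropolis-Hastings acceptance probability alpha(x, y).

    f : unnormalized target, proportional to pi (the constant C cancels).
    J : proposal kernel, where J(x, y) is the probability of proposing y from x.
    """
    ratio = (f(y) * J(y, x)) / (f(x) * J(x, y))
    return min(1.0, ratio)

# Toy example: an unnormalized geometric-like target on {0, 1, 2, ...}
# with a symmetric nearest-neighbour proposal (so the J terms cancel too).
f = lambda x: 0.5 ** x if x >= 0 else 0.0
J = lambda x, y: 0.5 if abs(x - y) == 1 else 0.0

print(acceptance_probability(2, 3, f, J))  # 0.5: jumping to a less likely point
print(acceptance_probability(3, 2, f, J))  # 1.0: jumping to a more likely point
```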

Metropolis-Hastings process

The pairing of the original jumper Markov chain with the coin flipping creates a new Markov chain, which we call the Metropolis-Hastings process. What’s remarkable is its stationary distribution will be the target distribution \(\pi(x)\).

Transition kernel (matrix)

For the Metropolis-Hastings process, we can readily reason the structure of the transition kernel (matrix) \(M\) that describes this Markov chain. First we’ll look at the off-diagonal entries of \(M\).

Jumping from \(x\) to \(y\neq x\)

To jump from point \(x\) to another point \(y\neq x\), the probability is simply the original jump probability \(J(x,y)\) multiplied by the probability of that proposed jump being accepted, which is \(\alpha(x,y)\), giving

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

Now we examine the diagonal entries of \(M\).

Jumping from \(x\) to \(x\)

There are two different ways for the jumper to remain at point \(x\). The first way is that the jumper simply jumps from \(x\) to \(x\), which happens with probability \(J(x,x)\). This proposed jump, so to speak, is accepted with probability one because \(\alpha(x,x)=1\). Consequently, we can write this probability as \(\alpha(x,x)J(x,x)\) or \(J(x,x)\).

The second way consists of all the possible jumps from \(x\) to \(z\), but then for each of those proposed jumps to be rejected, which happens (for each jump) with probability \([1-\alpha(x,z)]\). (I am using here the dummy or bound variable \(z\) instead of \(y\) for clarity.) Adding up the probabilities of these events gives the probability of the second way, namely the sum \(\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.\)

Consequently, for a single time step, the probability that the jumper starts at point \(x\) and remains at \(x\) is

$$ M(x,x) = \alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$$

The transition matrix \(M\) should be stochastic, so the rows sum to one, which we see is the case

$$ \sum_{y\in\mathbb{X}}M(x,y)=1\,.$$

Of course, we could have derived the diagonal entry \(M(x,x)\) immediately by starting with the above sum, but that approach lacks intuition into what’s happening.

Expression for the transition kernel \(M\)

$$M(x,y) = \begin{cases}
\alpha(x,y) J(x,y) & \text{if } x\neq y\,,\\
\alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) & \text{if } x=y\,.
\end{cases}$$

Often the above expression is written as a single line by placing an indicator function or similar in front of the sum for the diagonal entries. (In the continuous case, where the sum is replaced with an integral, a Dirac delta distribution is often used instead.)
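As a sanity check of the above expression, here is a small sketch in Python (NumPy) that builds \(M\) for a made-up three-point target \(\pi\) and jump matrix \(J\), then verifies that the rows sum to one, that \(\pi\) is stationary, and that detailed balance (covered next) holds.

```python
import numpy as np

# Made-up target distribution pi and jump (proposal) matrix J on three points.
pi = np.array([0.2, 0.3, 0.5])
J = np.array([[0.2, 0.4, 0.4],
              [0.3, 0.3, 0.4],
              [0.5, 0.3, 0.2]])  # each row sums to one

n = len(pi)
alpha = np.ones((n, n))
for x in range(n):
    for y in range(n):
        alpha[x, y] = min(1.0, (pi[y] * J[y, x]) / (pi[x] * J[x, y]))

M = alpha * J                  # off-diagonal entries M(x, y) = alpha(x, y) J(x, y)
for x in range(n):             # diagonal entries: stay put by proposal or by rejection
    M[x, x] = alpha[x, x] * J[x, x] + np.sum((1 - alpha[x, :]) * J[x, :])

print(M.sum(axis=1))                                      # rows sum to one (stochastic)
print(pi @ M)                                             # equals pi (stationarity)
print(np.allclose(pi[:, None] * M, (pi[:, None] * M).T))  # detailed balance holds
```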

Reversibility

A Markov process on \(\mathbb{X}\) with kernel (matrix) \(K\) is (time) reversible with respect to the distribution \(\mu\) if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution \(\mu\). By summing over \(x\), we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as \(\mu=\mu K\), which says that the transition kernel \(K\) has a stationary distribution \(\mu\).

(Strictly speaking, we don’t necessarily need reversibility, as long as the Markov chain has a stationary distribution. Reversibility just makes the algebra easier.)

The Metropolis-Hastings process is reversible

We can show that the Metropolis-Hastings process with the transition kernel \(M\) satisfies the reversibility condition. The proof essentially just requires the swapping of rows and columns. Clearly then we only need to look at the off-diagonal entries, so we assume \(x\neq y\).

We further assume that \(\pi(y)J(y,x)\geq \pi(x) J(x,y)\). Then \(\alpha(x,y)=1\), so \(M(x,y)=J(x,y)\). We use this last fact in the final line of the algebra that follows:

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= \pi(y)J(y,x) \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}\\ &= \pi(x)J(x,y)\\&= \pi(x)M(x,y)\,,\end{aligned}$$

which shows that the reversibility condition holds with the last assumption.

But of course, due to symmetry, we can reverse that assumption: if \(\pi(y)J(y,x)\leq \pi(x) J(x,y)\), then \(\alpha(x,y)=[\pi(y)J(y,x)]/[\pi(x) J(x,y)]\) and \(\alpha(y,x)=1\), and the same algebra goes through. Hence the reversibility condition holds in all cases.

We could have shown this more quickly by observing that

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x)\alpha(y,x)\\ &= \pi(y)J(y,x) \min[1, \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}]\\ &= \min[\pi(y)J(y,x) , \pi(x)J(x,y)]\,,\end{aligned}$$

which is symmetric in \(x\) and \(y\).

In summary, the Metropolis-Hastings process is reversible and has the stationary distribution \(\pi(x)\).

Continuous case

As mentioned earlier, the continuous case is similar to the discrete-case with a few modifications. For example, assuming now \(\mathbb{X}\) is continuous, then the stationary distribution is described by a probability density \(\pi(x)\). The jumper process will be described by transition kernel (function) \(j(x,y)\), where \(j(x,\cdot)\) is now a probability density for each \(x\in \mathbb{X}\).

It follows that the acceptance probability is

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{j(y,x)}{j(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The transition kernel (function) is

$$m(x,y) = \begin{cases}
\alpha(x,y) j(x,y) & \text{if } x\neq y\,,\\
\alpha(x,x) j(x,x)+\int_{\mathbb{X}}[1-\alpha(x,z)]j(x,z)\, dz & \text{if } x=y\,.
\end{cases}$$

For more details, there are derivations online such as this one, as well as the sources cited in the Further reading section below.

Libraries

Of course, before writing your own code, I would check out any pre-written functions in your favourite language that already implement the Metropolis-Hastings algorithm.

For example, in MATLAB there’s the function mhsample. Again I would be using this before writing my own code, unless it’s for illustration or educational purposes.

In Python there’s the library pymcmcstat. For those with a machine learning bent, there’s also a TensorFlow function MetropolisHastings.

In R, which is a language designed for statistics, implementations of any Markov chain Monte Carlo methods will be couched in the language and notation of (Bayesian) statistics. For a specific library, I can’t say I know which is the best, but you can check out the MCMCPack library. For R libraries, be wary of using the ones that were published and are (or were?) maintained by a single contributor.

Code

For a couple of simple examples in both one dimension and two dimensions, I’ve implemented the Metropolis-Hastings algorithm in the programming languages R, MATLAB, Python (NumPy) and Julia. The code can be found here.
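For illustration only (and not the linked code), here is a minimal one-dimensional random-walk Metropolis sketch in Python (NumPy); the target is a standard normal density known only up to a constant, and the Gaussian step size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def unnormalized_target(x):
    # Standard normal density without its normalization constant.
    return np.exp(-0.5 * x ** 2)

def random_walk_metropolis(n_samples, x0=0.0, step=1.0):
    samples = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        y = x + step * rng.standard_normal()   # symmetric proposal, so J cancels
        if rng.random() < min(1.0, unnormalized_target(y) / unnormalized_target(x)):
            x = y                               # accept the proposed jump
        samples[i] = x                          # otherwise the jumper stays put
    return samples

samples = random_walk_metropolis(50_000)
print(samples.mean(), samples.var())  # roughly 0 and 1 for the standard normal
```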

Further reading

There are good articles explaining the Metropolis-Hastings approach, as well as its history. On this topic, the articles are probably better resources than the books.

Articles

Historical

The two important papers behind the Metropolis-Hastings algorithm are:

  • 1953 – Metropolis, Rosenbluth, Rosenbluth, Teller, Teller – Equation of state calculations by fast computing machines;
  • 1970 – Hastings – Monte Carlo sampling methods using Markov chains and their applications.
Introductory

There are several tutorial and historical articles covering the Metropolis-Hastings algorithm. An earlier explanatory article is

  • 1995 – Chib and Greenberg – Understanding the Metropolis-Hastings Algorithm.

I also recommend:

  • 1994 – Tierney – Markov chains for exploring posterior distributions;
  • 1998 – Diaconis and Saloff-Coste – What do we know about the Metropolis algorithm?;
  • 2001 – Billera and Diaconis – A Geometric Interpretation of the Metropolis-Hastings Algorithm;
  • 2015 – Minh and Minh – Understanding the Hastings Algorithm;
  • 2016 – Robert – The Metropolis-Hastings algorithm.

The above article by Billera and Diaconis shows that the Metropolis-Hastings algorithm is actually a minimization (over the space of possible samplers) with respect to an \(L^1\) norm. (Note that \(K(x,x)\) should be \(K(x,y)\) in equation (1.3), so it agrees with equation (2.1).)

The following article covers the Metropolis-Hastings algorithm in the context of machine learning (particularly Bayesian statistics):

  • 2003 – Andrieu, de Freitas, Doucet, and Jordan – An Introduction to MCMC for Machine Learning.

For a surprising real world application, in the following article, Diaconis briefly recounts a story about a couple of graduate students using the Metropolis-Hastings algorithm to decipher a letter from a prison inmate who had used a simple substitution cipher:

  • 2009 – Diaconis – The Markov chain Monte Carlo revolution.

For Markov chains on general state spaces, the following paper gives a survey of results (often with new proofs):

  • 2004 – Roberts and Rosenthal – General state space Markov chains and MCMC algorithms.
History

The history of this algorithm and related methods are described in the following articles:

  • 2003 – Hitchcock – A History of the Metropolis-Hastings Algorithm;
  • 2005 – Gubernatis – Marshall Rosenbluth and the Metropolis algorithm;
  • 2011 – Robert and Casella – A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.

Incidentally, the first author of the third paper, Christian P. Robert, posts regularly on Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm, as well as many other topics.

Books

Modern books on stochastic simulations and Monte Carlo methods will often detail this method. For example, see the Handbook of Monte Carlo Methods (Section 6.1) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Chapter XIII, Section 3. (For my remark on reversibility, see Remark 3.2 in Asmussen and Glynn.) There is also the book Monte Carlo Strategies in Scientific Computing by Liu; see Chapter 5.

Websites

There are many, many websites covering this topic. Searching around, the following link is probably my favourite, as it gives a detailed explanation on how the Metropolis-Hastings algorithm works and includes Python code:

This post also covers it with Python code:

This site details the algorithm with R code:

Here’s a quick introduction with a one-dimensional example implemented in R:

Connectivity in device-to-device networks in Poisson-Voronoi cities

Here’s a recently uploaded manuscript:

  • 2023 – Keeler, Błaszczyszyn, Cali – Connectivity and interference in device-to-device networks in Poisson-Voronoi cities.

https://arxiv.org/abs/2309.02137

This work presents numerical results complementing our earlier mathematical work. The work concerns (continuum) percolation results for a special network model based on Poisson-Voronoi tessellations.

The two most relevant papers are (the first being somewhat seminal):

  1. Dousse, Franceschetti, Macris, Meester, Thiran, Percolation in the signal to interference ratio graph, 2006.
  2. Le Gall, Błaszczyszyn, Cali, and En-Najjary, Continuum line-of-sight percolation on Poisson-Voronoi tessellations, 2021

Our work effectively seeks to combine these two papers. We obtain the equivalent results from the first paper by coupling its connectivity model with the connectivity model and network model (based on a Cox point process) presented in the second paper.

If you want a more detailed version, here’s the abstract:

To study the overall connectivity in device-to-device networks in cities, we incorporate a signal-to-interference-plus-noise connectivity model into a Poisson-Voronoi tessellation model representing the streets of a city. Relays are located at crossroads (or street intersections), whereas (user) devices are scattered along streets. Between any two adjacent relays, we assume data can be transmitted either directly between the relays or through users, given they share a common street. Our simulation results reveal that the network connectivity is ensured when the density of users (on the streets) exceeds a certain critical value. But then the network connectivity disappears when the user density exceeds a second critical value. The intuition is that for longer streets, where direct relay-to-relay communication is not possible, users are needed to transmit data between relays, but with too many users the interference becomes too strong, eventually reducing the overall network connectivity. This observation on the user density evokes previous results based on another wireless network model, where transmitter-receivers were scattered across the plane. This effect disappears when interference is removed from the model, giving a variation of the classic Gilbert model and recalling the lesson that neglecting interference in such network models can give overly optimistic results. For physically reasonable model parameters, we show that crowded streets (with more than six users on a typical street) lead to a sudden drop in connectivity. We also give numerical results outlining a relationship between the user density and the strength of any interference reduction techniques.

In future posts I’ll detail the above work as well as our more mathematical work on this type of percolation model.

Creating a reversible Markov chain using acceptance(-rejection)

The study of Markov chains is generally the study of their long term behaviour, which, under certain conditions, is captured by them having a unique stationary distribution. Stationarity is an important property. It is, in a sense, a local property of a Markov chain.

For a Markov chain, a more global property is something called reversibility. Markov chains with this property must possess a stationary distribution, which, as we see below, is an immediate consequence of reversibility. Reversible Markov chains (or processes) with discrete time are the cornerstone of Markov chain Monte Carlo (MCMC) methods.1

1Asmussen writes in his book Applied Probability and Queues:

In this book, we use the terminology that a Markov chain has discrete time and a Markov process has continuous time (the state space may be discrete or general). However, one should note that it is equally common to let “chain” refer to a discrete state space and “process” to a general one (time may be discrete or continuous).

In this post we look at how a reversible Markov chain is constructed from a non-reversible (but irreducible) Markov chain by introducing an acceptance-rejection step. This post complements another post I wrote on the Metropolis-Hastings algorithm.

Reversibility

A Markov process on state space \(\mathbb{X}\) with kernel \(K\) is (time) reversible with respect to the distribution \(\mu\) if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution \(\mu\). By summing over \(x\), we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as \(\mu=\mu K\), which says that the transition kernel \(K\) has a stationary distribution \(\mu\).

First Markov chain

We consider a Markov chain with kernel \(J\) defined on a finite state space \(\mathbb{X}\). If the Markov chain is at state \(x\in \mathbb{X}\), it visits another state \(y\in \mathbb{X}\) with the probability \(J(x,y)\). This is a simple time-homogeneous finite Markov chain.

Irreducibility

For our Markov chain, we assume that every state \(y\) in \(\mathbb{X}\) where \(\pi(y)>0\) is reachable with positive probability in a single step. This amounts to the easy-to-achieve condition that \(J(x,y)>0\) whenever \(\pi(y)>0\), for all points \(x,y \in \mathbb{X}\). This requirement is a stronger form of irreducibility.

Creating a new Markov chain with acceptance

We create a new Markov chain by introducing an acceptance step. For the Markov chain with kernel \(J\), we assume that at each time step, after choosing the jump direction but before jumping, a biased coin is flipped. The success probability \(\alpha(x,y)\) depends on the current position \(x\in \mathbb{X}\) and the (potential) next position \(y\in \mathbb{X}\).

Transition kernel

For our new Markov chain, we can quickly reason the transition kernel \(M\). We first look at the off-diagonal elements of the kernel (matrix) \(M\). To go from state \(x\) to another state \(y\neq x\), the probability is simply

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

The transition matrix \(M\) needs to be stochastic, meaning the rows sum to one, so \(\sum_{y\in\mathbb{X}}M(x,y)=1\). That gives us the diagonal elements of \(M\), although their exact form is not needed to show reversibility.

\(\alpha(x,y)\) needs a symmetric function \(s(x,y)\)

For reversibility, we just need the detailed balance condition to hold when we swap \(x\) and \(y\). Clearly we only need to look at the off-diagonal entries, which implies the requirement

$$ \pi(x)M(x,y) = \pi (y) M(y,x)\quad x\neq y\,.$$

Both sides are symmetric in \(x\) and \(y\), meaning they are equal to some non-negative symmetric \(s(x,y)=s(y,x)\). Looking at the right-hand side, we get

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= s(y,x)\,.\end{aligned}$$

This implies that the function \(\alpha\), which must be non-negative and at most one to be a probability, has the form

$$ \alpha(x,y)=\frac{s(x,y)}{\pi(x)J(x,y)}\,.$$

The only task remaining now is to choose a reasonable symmetric function \(s\) such that \(\alpha\leq 1\), ensuring \(\alpha\) is a probability. Of course, our choice for the symmetric function \(s\) should also be a function of the stationary distribution \(\pi\) and the underlying kernel \(J\).

Examples

I’ll give two principal examples of the symmetric function \(s(x,y)\). Working in reverse chronological order, I’ll give the simpler of the two examples first.

Barker

A somewhat natural example is

$$s(x,y) = \frac{\pi(x)J(x,y)\pi(y)J(y,x)}{\pi(x)J(x,y)+\pi(y)J(y,x)}\,.$$

This is clearly a symmetric function, which only has the terms \(\pi\) and \(J\). The acceptance probability becomes

$$\alpha(x,y) = \frac{\pi(y)J(y,x)}{\pi(x)J(x,y)+\pi(y)J(y,x)}\,.$$

A.A. Barker proposed this function in a 1965 paper as part of his PhD work in mathematical physics at the University of Adelaide. Barker had been inspired by a previous 1953 paper, which brings us to the next example.

Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings

The now most important example is

$$s(x,y) = \min[\pi(x)J(x,y),\pi(y)J(y,x)]\,.$$

We can see that this is a symmetric function. The acceptance probability becomes

$$\alpha(x,y) = \min[1,\frac{\pi(y)J(y,x)}{\pi(x)J(x,y)}]\,.$$

This example is very famous in the world of Markov chain Monte Carlo methods. It is the main part of the so-called Metropolis-Hastings algorithm, which comes from a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method.
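To compare the two choices of \(s\), here is a small sketch in Python computing both acceptance probabilities for made-up values of \(\pi\) and \(J\) at a single proposed jump; the Metropolis choice always accepts at least as often as Barker's.

```python
def barker_alpha(pi_x, pi_y, J_xy, J_yx):
    # Barker: s(x, y) = pi(x)J(x,y) pi(y)J(y,x) / [pi(x)J(x,y) + pi(y)J(y,x)].
    return (pi_y * J_yx) / (pi_x * J_xy + pi_y * J_yx)

def metropolis_alpha(pi_x, pi_y, J_xy, J_yx):
    # Metropolis-Hastings: s(x, y) = min[pi(x)J(x,y), pi(y)J(y,x)].
    return min(1.0, (pi_y * J_yx) / (pi_x * J_xy))

# Made-up values for one proposed jump from x to y.
pi_x, pi_y = 0.2, 0.3
J_xy, J_yx = 0.5, 0.4

print(barker_alpha(pi_x, pi_y, J_xy, J_yx))      # about 0.545
print(metropolis_alpha(pi_x, pi_y, J_xy, J_yx))  # 1.0
```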

Further reading

Articles

Historical

The two important historical papers that gave the two examples for \(s(x,y)\) are:

  • 1965 – Barker – Monte Carlo calculations of the radial distribution functions for a proton-electron plasma;
  • 1953 – Metropolis, Rosenbluth, Rosenbluth, Teller, Teller – Equation of state calculations by fast computing machines.

This paper is also important:

  • 1970 – Hastings – Monte Carlo sampling methods using Markov chains and their applications.
Introductory

A good explanatory article about the Metropolis-Hastings algorithm is:

  • 1995 – Chib and Greenberg – Understanding the Metropolis-Hastings Algorithm.

I also recommend:

  • 2015 – Minh and Minh – Understanding the Hastings Algorithm;
  • 2016 – Robert – The Metropolis-Hastings algorithm;
  • 2017 – Gonçalves, Łatuszyński, Roberts – Barker’s algorithm for Bayesian inference with intractable likelihoods.
History

The history of this and related methods is described in the following article:

  • 2011 – Robert and Casella – A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.

Incidentally, the first author, Christian P. Robert, writes on his blog about Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm, as well as many other topics.

Books

Modern books on stochastic simulations and Monte Carlo methods will often detail this method. For example, see the Handbook of Monte Carlo Methods (Section 6.1) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Chapter XIII, Section 3. (For my remark on reversibility, see Remark 3.2 in Asmussen and Glynn.) There is also the book Monte Carlo Strategies in Scientific Computing by Liu; see Chapter 5.

Websites

The web abounds with pages dedicated to explaining Markov chain Monte Carlo methods with the focus typically on the Metropolis-Hastings algorithm. I would try this following one because it gives a detailed explanation on how the Metropolis-Hastings algorithm works:

  • https://gregorygundersen.com/blog/2019/11/02/metropolis-hastings/

Markov goes to Monte Carlo

Numerical and scientific fields such as statistics, computational physics, and optimization rely heavily upon Markov chain Monte Carlo methods. These simulation techniques use the power of Markov chains to sample general probability distributions, which in turn give Monte Carlo methods for estimating integrals and optimal solutions.

This is the third part of a multi-part series of posts. The first part covers Markov chains. The second part covers the basics of Monte Carlo methods. This post combines the ideas from the first two parts. Overall, the posts sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We will only examine Markov chains with countable state spaces. In this setting, we only need standard probability and matrix knowledge. But the results extend to general state spaces, such as Euclidean space.

By the way, I doubt that Andrey A. Markov ever went to Monte Carlo in the small country of Monaco. (But Herbie did.)

Monte Carlo conditions

Sum

In the first post we covered Markov chains with countable state spaces, because the mathematics is more transparent. Consequently, we can use such Markov chains to estimate the sum of a function \(f\) over a countable (possibly infinite) set \(\mathbb{S}\). We can write the sum as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \\&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\) gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum, namely

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

The closer the function \(q\) is to \(|f|\), the better the estimate.

Integral

We can write an integral over some region \(\mathbb{S}\) as

$$\begin{align}\int_{\mathbb{S}} f(x) dx &= \int_{\mathbb{S}} \frac{f(x)}{p(x)}p(x)dx\\ & =\mathbb{E}_p[\frac{f(Y)}{p(Y)}]\,,\end{align}$$

where \(p\) is the probability density of a random variable (or point) \(Y\) with support \(\mathbb{S}\). Then by sampling \(Y\), the resulting samples \(y_1, \dots, y_n\in \mathbb{S}\) give the unbiased Monte Carlo estimate

$$I_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{p(y_i)}\,.$$
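As a quick illustration of the integral estimate, here is a sketch in Python (NumPy) for a made-up one-dimensional example: the integral \(\int_0^\infty e^{-x}\sqrt{x}\,dx=\Gamma(3/2)=\sqrt{\pi}/2\) is estimated by sampling \(Y\) from the exponential density.

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate the integral of f(x) = exp(-x) * sqrt(x) over (0, infinity)
# by sampling Y from the exponential probability density p(x) = exp(-x).
f = lambda x: np.exp(-x) * np.sqrt(x)
p = lambda x: np.exp(-x)

n = 100_000
y = rng.exponential(scale=1.0, size=n)
estimate = np.mean(f(y) / p(y))

print(estimate)              # roughly 0.886
print(np.sqrt(np.pi) / 2)    # exact value, Gamma(3/2)
```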

Ergodic sequence

To estimate the sum or integral, we’ll need a sequence of ergodic random variables \(\{Y_i\}_{i\in\mathbb{N}}\), which we can achieve with an ergodic Markov chain with a state space \(\mathbb{S}\).

Markov chain conditions

In the context of Markov chains and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers. A Markov chain with a countable state space needs some conditions to ensure ergodicity.

Regularity conditions for a countable Markov chain

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Positive recurrence

In general, most of the needed structure for the Markov chain comes from irreducibility and the existence of a stationary distribution. Sometimes the above conditions are redundant. For example, an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent. We look at ways to satisfy these conditions in the Monte Carlo setting when the underlying state space is countable.

Stationary distribution \(\pi\)

A stationary distribution \(\pi\) satisfies the equation

$$ \pi=\pi P\,.$$

We know exactly what our stationary distribution \(\pi\) needs to be. It is the one used in the Monte Carlo method, namely the distribution of \(Y\). The challenge is constructing a Markov chain with such a stationary distribution. This will of course depend on the Markov (transition) kernel \(P\) of the Markov chain.

We’ll look at the other conditions on the Markov kernel \(P\).

Aperiodicity

We recall that the period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P^n(x,x)>0\), where \(P^n\) denotes the \(n\)-step kernel. We need every state of the countable Markov chain to be aperiodic, meaning \(d_x=1\) for all \(x\in\mathbb{X}\).

One sufficient way to achieve this requirement is to first have the countable Markov chain be irreducible, which we’ll cover in the next section. Then if there exists at least one state \(x\in \mathbb{X}\) such that \(P(x,x)>0\), the Markov chain is aperiodic. (See Lemma 1.8.2 in Markov Chains by Norris or Remark 6.2.5 in Probability Theory and Stochastic Processes by Brémaud.)

Clearly the last condition is not difficult to satisfy, putting the focus now on the irreducibility requirement.

Irreducibility

The irreducibility property says that it is possible for the Markov chain to get from any point to any other point in a finite number of steps. More precisely, for an irreducible Markov chain with countable state space, there must exist for all \(x,y\in\mathbb{X}\) a natural number \(s\) (possibly depending on \(x\) and \(y\)) such that \(P^s(x,y)>0\).

We can achieve this requirement by introducing a stronger requirement, which is easier to verify. A countable Markov chain with a transition kernel \(P\) will be irreducible if the condition \(P(x,y)>0\) holds for all \(x,y\in\mathbb{X}\). We have simply required that the above number \(s=1\) for all \(x,y\in\mathbb{X}\).

Furthermore, if \(P(x,x)>0\) for all \(x\in\mathbb{X}\), then our irreducible countable Markov chain is also aperiodic.

Positive recurrence

For a point or state \(x\in\mathbb{X}\), we recall its first return time being defined as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). For a countable Markov chain, if all the states in the state space are positive recurrent, then we say the Markov chain is positive recurrent.

A countable irreducible Markov chain is positive recurrent if (and only if) it has a stationary distribution \(\pi\). Furthermore, that stationary distribution will be unique and nonzero, meaning \(\pi(x)>0\) for all \(x\in\mathbb{X}\). (See Theorem 1.7.7 in Markov Chains by Norris or Theorem 6.3.14 in Probability Theory and Stochastic Processes by Brémaud.)

For positive recurrence, we need to prove that the Markov chain has a stationary distribution.

Ergodicity

A countable Markov chain that is aperiodic, irreducible, and positive recurrent is ergodic. In other words, an ergodic Markov process \(X\) with stationary distribution \(\pi\) is a random sequence \(X_0, X_1,\dots\) such that the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i) \rightarrow \sum_{x\in\mathbb{X}}g(x)\pi(x)$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). Conveniently, we can use such an ergodic Markov chain to make statistical estimates.

Sampling with a Markov chain

For the Monte Carlo sample \(y_1,\dots,y_n\), we simply use the Markov chain samples \(x_0,x_1,\dots,x_{n-1}\), giving the Monte Carlo estimate. In the discrete case, this is simply

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n g(x_{i-1})\,,$$

which, due to ergodicity, converges to \(S(f)=\mathbb{E}_{q}[g(Y)]\).
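Here is a toy sketch of that idea in Python (NumPy): a made-up (irreducible, aperiodic) transition matrix \(P\) is run for many steps, its stationary distribution \(\pi\) is computed separately as a left eigenvector, and the time average of \(g\) along the chain is compared with the stationary average.

```python
import numpy as np

rng = np.random.default_rng(3)

# A made-up transition matrix on three states (each row sums to one).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

g = np.array([1.0, 4.0, 9.0])   # the function to average

# Stationary distribution: left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

# Run the chain and form the ergodic (time) average of g.
t_max = 200_000
x = 0
total = 0.0
for _ in range(t_max):
    total += g[x]
    x = rng.choice(3, p=P[x])

print(total / t_max)   # time average along the chain
print(g @ pi)          # stationary (ensemble) average; the two should be close
```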

Summary

In summary, a sufficient way to have a countable ergodic Markov chain is for the transition kernel to be such that \(P(x,y)>0\) for all \(x,y\in\mathbb{X}\) and for there to exist a (stationary) distribution \(\pi=\pi P\). For the first requirement, one imagines using a non-negative function such as \(e^{-(x-y)^2}/C\), where \(C>0\) is a suitable normalization constant.

The requirements seem possible, but to bring them all together is still a great challenge for the uninitiated. Fortunately, some great minds almost seven decades ago proposed the first Markov chain that meets all the requirements for a specific stationary distribution, which was used for a Monte Carlo estimate.

The first such Markov chain Monte Carlo method appeared in a paper written by five scientists, including two husband-wife pairs. It became known as the Metropolis or Metropolis-Hastings algorithm, and it will be the subject of a future post.

Further reading

Books

Markov chains

Any stochastic process book will include a couple of chapters on Markov chains. For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section detailing a Markov chain Monte Carlo method, such as the Metropolis-Hastings algorithm and the Gibbs sampler.

Monte Carlo methods

Books dedicated to Monte Carlo approaches include:

For a statistics context, there’s also the good book:

  • Casella and Robert – Monte Carlo Statistical Methods.

The second MC in MCMC methods

In calculus differentiating a function is easy. Integrating one is hard. Much research has gone into finding ways for calculating integrals. But often functions cannot be integrated, at least analytically.

No problem. You do it numerically. You take the function’s curve, split it up into nice rectangles or, even better, trapezoids, and calculate their areas, and add them up. Not accurate enough? Split the curve up even finer until you’re happy. Done.

That works fine in one dimension and even two dimensions. But then, as the number of dimensions increases, the amount of splitting and adding grows exponentially, soon hitting unwieldy numbers. This is a mountain too steep for even the fastest computers. The problem is what applied mathematics guru Richard E. Bellman famously called the curse of dimensionality.

But thankfully randomness offers a path around this mountain. The approach is called Monte Carlo integration, and it can be used for estimating sums and integrals.

This is the second part of a series of posts on Markov chain Monte Carlo methods. The first part covers Markov chains. This post covers the basics of Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We’ll focus on estimating sums, but the same logic applies to integrals.

Monte Carlo estimates

In the discrete (or countable) case, consider a function \(f\) that maps from a countable state space \(\mathbb{S}\). This state space may be, for example, the lattice \(\mathbb{Z}^d\) or some subset of it. We write the sum as

$$S(f)=\sum_{x\in \mathbb{S}} f(x) $$

Such sums arise in many fields, particularly statistical physics, which is where the (Markov chain) Monte Carlo methods were born. (In general, we use integrals for continuous spaces, which are uncountable, whereas we use sums for discrete spaces, which are countable. The two are unified by integration with respect to a suitable measure.)

The idea behind a Monte Carlo estimate of a sum is remarkably simple. You randomly sample the function \(f\), find the average, and use it as an estimate of the sum.

Estimating with uniform variables

For a finite set \(\mathbb{S}=\{x_1,\dots,x_m\}\), we write the sum of \(f\) over the set \(\mathbb{S}\) as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}}  f(x) \frac{m}{m} \\&= \sum_{x\in\mathbb{S}}  \frac{f(x)p(x)}{p(x)}\\& = \mathbb{E}_{p}[ \frac{f(U)}{p(U)}]\,,\end{align}$$

where \(U\) is a discrete uniform random variable, which has the probability mass function \(p\) with support \(\mathbb{S}\), meaning \(p(x)=1/m\) if \(x\in \mathbb{S}\) and \(p(x)=0\) otherwise. But using a uniform random variable is a naive approach, as there are better random variables we can use to sample \(f\).

Estimating with general variables

More generally, we can write the sum of a countable (possibly infinite) set \(\mathbb{S}\) as

$$\begin{align}S(f)&= \sum_{x\in\mathbb{S}}  \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\) gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

If we can (quickly) sample \(Y\), then we can (quickly) estimate the integral. The closer the (probability mass) function \(q\) is to \(|f|\), the better the estimate.

We can also use the above reasoning with integrals, replacing the sum and probability mass function with an integral and a probability density.
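As a quick sketch in Python (NumPy) of the sum estimate, here is a made-up example: the sum \(\sum_{x\geq 0} e^{-x} = 1/(1-e^{-1})\) is estimated by sampling \(Y\) from a geometric probability mass function on \(\{0,1,2,\dots\}\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Estimate S(f) = sum_{x >= 0} exp(-x) = 1 / (1 - exp(-1)) ~ 1.582
# by sampling Y from a geometric probability mass function q on {0, 1, 2, ...}.
f = lambda x: np.exp(-x)
a = 0.5
q = lambda x: a * (1 - a) ** x       # proposal pmf q(x) = a (1 - a)^x

n = 100_000
y = rng.geometric(a, size=n) - 1     # shift support from {1, 2, ...} to {0, 1, ...}
estimate = np.mean(f(y) / q(y))

print(estimate)                # roughly 1.582
print(1 / (1 - np.exp(-1)))    # exact value of the sum
```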

Error rate

If the samples \(y_1, \dots, y_n \) are random and independent, the (strong) law of large numbers guarantees that the unbiased Monte Carlo estimate converges (almost surely) to the correct answer.

Further assuming the variance below is finite, the error of the estimate \(|S(f)-S_n(f)|\) is, in the root-mean-square sense, proportional to

$$\frac{[\mathbb{V}_q(g)]^{1/2}}{n ^{1/2}},
$$
where
$$\mathbb{V}_q(g)=\mathbb{E}_q[g^2(Y)]-(\mathbb{E}_q[g(Y)])^2,$$
is the variance of the function \(g=f/q\). (This is not the variation of a function, which is something different.) The error \(|S(f)-S_n(f)|\) reduces at the rate \(O(n^{-1/2})\). The Monte Carlo approach allows for continual updating until the estimate is sufficiently converged.

The above logic applies to integrals of functions on general spaces.

The dimension \(d\) of the underlying space, such as \(\mathbb{Z}^d\) or \(\mathbb{R}^d\), does not appear in the error. The independence of the dimension is precisely why we tackle high-dimensional sums and integrals with Monte Carlo methods. The general rule of thumb is to use Monte Carlo integration for functions of four or more dimensions.

Losing the independence

Again, a sequence of independent random variables allow our Monte Carlo estimate to converge. We can loosen this requirement and allow some dependence between variables. To obtain a Monte Carlo estimate, we just need to use a sequence of random variables that are ergodic. Then we recover the (strong) law of large numbers that gives our Monte Carlo estimate.

Ergodicity

The word ergodicity is one of those technical terms you’ll see and hear in different fields of mathematics and physics, often when randomness is involved. The concept is also central to deterministic dynamical systems.

There are different versions and strengths of ergodicity, depending on the system and applications. But loosely speaking, an ergodic system is one in which temporal averages and spatial (or ensemble) averages converge to the same value.

That means when you want to find an average, you have two options in an ergodic system. The temporal option: You can measure something over a long time. The spatial (or ensemble) option: You can measure many instances of the same something. The two different averages will eventually converge to the same value.

Ergodic sequences

In the context of Markov processes and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers.

Consider a stochastic process \(X\), which is a sequence of random variables \(X_0,X_1,\dots\) indexed by some parameter \(t\). We assume the probability distributions \(\pi_0,\pi_1,\dots\) of the random variables converge to some final stationary distribution \(\pi\). For a countable state space, the (probability) distribution \(\pi\) corresponds to a probability mass function.

We then say the stochastic process \(X\) is ergodic if the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i)\rightarrow \sum_{x\in \mathbb{S}}g(x)\pi(x)=\mathbb{E}_{\pi} [g(X)]\,,$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). The subscript of the expectation simply denotes that the expectation is taken with respect to the (stationary) distribution \(\pi\).

The equation tells the story. On the left-hand side you have a temporal average. On the right-hand side you have a spatial (or ensemble) average. As time proceeds, the first average will converge to the second average. Ergodicity.

Further reading

There are many books dedicated entirely to the subject of Monte Carlo methods. For example, a good book with detailed section on Markov chains applied to Monte Carlo methods in the context of statistics is:

  • Casella and Robert, Monte Carlo Statistical Methods.

These books also have sections on Monte Carlo integration:

The first MC in MCMC methods

Markov chains form a fundamentally important class of stochastic processes. It would be hard to overstate their importance in probability, statistics, and, more broadly, science and technology. They are indispensable in random simulations, particularly those based on Markov chain Monte Carlo methods. In this post, we’ll have a look at some Markov chain basics needed for such simulation methods.

This is the first part of a series of posts on Markov chain Monte Carlo methods. This post covers the basics of Markov chains, which is the more involved part. The second part will cover Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

Markov chains vs Markov processes

All Markov chains are Markov processes. Some people use the term Markov chain to refer to discrete-time Markov processes with general state spaces. Other people prefer the term Markov chain for Markov processes with countable state spaces, whether in discrete or continuous time.1

1In his book Applied Probability and Queues, Asmussen writes:

In this book, we use the terminology that a Markov chain has discrete time and a Markov process has continuous time (the state space may be discrete or general). However, one should note that it is equally common to let “chain” refer to a discrete state space and “process” to a general one (time may be discrete or continuous).

Nevertheless, the first MC in the MCMC suggests the Markov chain Monte Carlo crowd prefers the former sense of Markov chain, given the use of discrete-time Markov processes in their simulations.

Markov the frog

Some writers introduce Markov chains with a mental image of a frog jumping around lily pads scattered over a pond. (Presumably the frog never misses a lily pad.) We assume the frog randomly chooses the next lily pad through some random mechanism. Perhaps the distances between lily pads or their sizes influence the chances that the frog will jump between them.

We further assume that the frog is a bit particular, preferring to jump in certain directions more than others. More precisely, the probability of our frog jumping from a lily pad labelled \(x\) to another labelled \(y\) is \(P(x,y)\). But jumping in the opposite direction happens with probability \(P(y,x)\), which in general is not equal to \(P(x,y)\).

I typically use the term points, but the Markov literature usually says that the Markov chain visits states.

State space

We can interpret a Markov chain, a type of stochastic process, as a collection or sequence of random variables. (I briefly detailed stochastic processes in a previous post.) The values of the random variables are points in some mathematical space \(\mathbb{X}\). This space can be quite abstract, but in practice it’s usually the lattice \(\mathbb{Z}^n\), Euclidean space \(\mathbb{R}^n\), or a subset of one of these two spaces. For our frog example, all the lily pads in the pond form the state space.

We’ll only consider countable Markov chains where the number of points in the state space \(\mathbb{X}\) is countable. Although the results and theory generally hold for more general state spaces, the accompanying work requires more technical mathematics. For finite and countable state spaces, we can use standard probability and matrix knowledge. But when we use uncountable state spaces such as \(\mathbb{R}^n\), we enter the world of measure theory and functional analysis.

I will often write a point \(x\) in a (state) space \(\mathbb{X}\). But you can say an element \(x\) of a set \(\mathbb{X}\). Many authors refer to the points or elements as states of the Markov chain. In the frog example, each lily pad is a different state.

Markov property

A discrete-time countable Markov chain is a random process that jumps between points of some countable mathematical space \(\mathbb{X}\) such that, when at point \(x \in \mathbb{X}\), the next position is chosen according to a probability distribution \(P(x,·)\) depending only on \(x\).

More specifically, a sequence of random variables \((X_0, X_1, . . .)\) is a discrete-time Markov chain \(X\) with a countable state space
\(\mathbb{X}\) and kernel \(P\) if for all \(x,y \in \mathbb{X}\) and all \(t \geq 1\) satisfying \(\mathbb{P}[X_{t−1}=x_{t-1},\dots,X_0=x_0]>0\), we have

$$ \begin{align}\mathbb{P}[X_{t+1} =y|&X_{t}=x,X_{t−1}=x_{t-1},\dots,X_0=x_0]\\&=\mathbb{P}[X_{t+1} =y|X_t =x]\\&=P(x,y)\,.\end{align}$$

This equation is often called the Markov property.

The Markov property says that the conditional probability of jumping from point \(x\) to \(y\) remains the same, regardless of which points or states \(x_0,x_1,\dots,x_{t-1}\) were previously visited. This is precisely why the kernel \(P\) contains all the information needed to describe the future random evolution of the Markov chain.

We have assumed the probabilities given by \(P\) are fixed, meaning we have described a homogeneous Markov chain.

Markov kernel

The kernel \(P\) is called the Markov (transition) kernel or probability kernel. Assuming a countable state space \(\mathbb{X}\), we can reference any probability value of the kernel \(P\) with two variables \(x,y\in\mathbb{X}\). If we assume a finite state space \(\mathbb{X}\), then the kernel \(P\) becomes an ordinary matrix, as taught in linear algebra. An infinite but countable state space gives an infinite matrix \(P\). The rows of the kernel matrix \(P\) must add up to one, because each row is a probability measure.

A more general space, such as Euclidean space \(\mathbb{R}^n\), results in a more general kernel with respect to a suitable measure. In this setting, \(P(x,·)\) is no longer a probability mass function, but a general probability measure.
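To make the kernel concrete, here is a tiny sketch in Python (NumPy) of a frog-style Markov chain on a finite state space, where each step samples the next state from the row \(P(x,\cdot)\); the matrix is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# A made-up Markov kernel P on the three states {0, 1, 2};
# each row is a probability mass function, so each row sums to one.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])

x = 0                           # initial state (lily pad)
trajectory = [x]
for _ in range(10):
    x = rng.choice(3, p=P[x])   # jump according to the row P(x, .)
    trajectory.append(x)

print(trajectory)               # one random path of the chain
```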

Initial distribution

At time \(t=0\) we describe the random initial configuration of a Markov process with a probability distribution \(\mu_0\). For a finite or countable Markov chain, this initial distribution \(\mu_0\) corresponds to a probability mass function encoded as a row vector.

Jumping from \(x\) to \(y\)

The probability distribution \(\mu_0\) gives the probability of starting in state (or at point) \(x\in\mathbb{X}\). After one time step, we can write down the probability distribution \(\mu_1\) that gives us the different probabilities of the Markov chain being at different states. At time step \(n=1\), basic matrix algebra and probability rules give us the matrix equation

$$\mu_1=\mu_0 P\,.$$

By induction, after \(n\) time steps we have the expression

$$\mu_n=\mu_0 P^n\,,$$

where the superscript \(n\) denotes matrix power. We can write the \(n\)-step kernel as \(P_{(n)}\), which for a finite Markov chain is given by the matrix equation \(P_{(n)}=P^n\).

Seeing how \(P_{(n)}\) behaves as \(n\) approaches infinity forms part of work that studies the convergence and ergodicity properties of Markov chains. I’ll make these concepts clearer below. But first I’ll give some conditions that are typically needed.
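As a quick illustration with a made-up kernel, the following Python (NumPy) sketch iterates \(\mu_n=\mu_0 P^n\) and shows the distribution settling down to the same limit that the rows of \(P^n\) approach.

```python
import numpy as np

# A made-up Markov kernel on three states (each row sums to one).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])

mu = np.array([1.0, 0.0, 0.0])     # initial distribution mu_0: start in state 0
for _ in range(50):
    mu = mu @ P                    # one time step: mu_{n+1} = mu_n P

print(mu)                                # approximately the limiting distribution
print(np.linalg.matrix_power(P, 50)[1])  # any row of P^50 approaches the same limit
```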

Regularity conditions

A Markov chain with a countable state space needs some conditions to ensure convergence and ergodicity.

Regularity conditions

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Positive recurrence

The nature of the state space and the kernel will dictate these conditions. These conditions are also not necessarily logically distinct. For example, on a finite state space, you’ll get positive recurrence for free, because an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent.

We now briefly detail these conditions, and in another post I’ll give examples of how the conditions can be met.

Stationary distribution \(\pi\)

It’s possible to encounter a probability distribution \(\pi\) where applying the kernel \(P\) returns the same distribution \(\pi\), meaning

$$ \pi=\pi P\,.$$

This (fixed-point) equation is called the balance equation.

The distribution \(\pi\) is called the stationary, invariant or steady-state distribution. A Markov chain does not need to have a stationary distribution. And if a Markov chain does have one, it may not be unique. Its existence and uniqueness will depend on the Markov kernel \(P\).

Showing that a unique stationary distribution exists and that it is possible to reach it with probability one is the stuff of Markov convergence results. Markov chain Monte Carlo methods hinge upon these results.

Aperiodicity

It is possible for a Markov chain to get trapped in a loop, periodically visiting the same states. The period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P^n(x,x)>0\), where \(P^n=P_{(n)}\) is the \(n\)-step kernel. If the period of a point is \(d_x=1\), then we say it’s aperiodic. If every state of a Markov chain is aperiodic, we say it’s an aperiodic Markov chain.

Aperiodicity means there are no loops to trap the Markov chain. This property is typically needed for convergence results.

Irreducibility

A Markov chain with a countable state space \(\mathbb{X}\) is irreducible if the Markov chain can go from any state \(x\in\mathbb{X}\) to any other state \(y\in\mathbb{X}\) with positive probability in a finite number of time steps. In other words, for all \(x,y\in\mathbb{X}\) there exists a natural number \(s\) (which may depend on \(x\) and \(y\)) such that \(P_{(s)}(x,y)>0\).

Irreducibility ensures that a Markov chain will visit all the states in its state space. This property is also needed for convergence results.

Recurrence

When studying Markov processes, a quantity of interest is how much time it takes to return to a state or point. For a point \(x\in\mathbb{X}\), we define its first return time as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

As the name suggests, this random variable is the number of time steps for the Markov process to return to state \(x\), by whichever path, conditioned on it starting at \(x\).

We call a state \(x\) recurrent if the probability of its first return time being finite is one, meaning \(\mathbb{P}_x(T_x^+<\infty)=1\). Otherwise the state \(x\) is said to be transient.

Positive recurrence

We can classify different types of recurrence based on the expected value of the first return times. A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). Otherwise state \(x\) is null recurrent.

For a countable Markov chain, if all the states in the state space are (positive) recurrent, so \(\mathbb{E}_x(T_x^+)<\infty\) for all \(x\in\mathbb{X}\), then we say the Markov chain is (positive) recurrent.

Again, the concept of positive recurrence is needed for convergence results.

Ergodicity

We say a countable Markov chain is ergodic if it is irreducible, aperiodic and positive recurrent.3See, for example, Basics of Applied Stochastic Processes by Serfozo (page 26) or Probability Theory and Stochastic Processes by Brémaud (page 262). Ergodicity allows one to find averages by employing a more general form of the law of large numbers, which Monte Carlo methods rely upon. We stress that definitions of ergodicity vary somewhat, but in general it means that convergence results and laws of large numbers exist.

Further reading

Any book on stochastic processes will include a couple of chapters on Markov chains. For example:

    • Brémaud – Probability Theory and Stochastic Processes;
    • Serfozo – Basics of Applied Stochastic Processes.

For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn and Tweedie – Markov Chains and Stochastic Stability.

All the above books have a section on Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm or the Gibbs sampler.

The acceptance(-rejection) method for simulating random variables

In a previous post, I covered a simple but much used method for simulating random variables or, rather, generating random variates. To simulate a random variable, the method requires writing down, in a tractable manner, the inverse of its cumulative distribution function. But you cannot always do that.

In lieu of this, the great John von Neumann wrote in a 1951 paper that you can sample a sequence of values from another probability distribution, accepting only the values that meet a certain condition based on this other distribution and the desired distribution, while rejecting all the others. The accepted values will follow the desired probability distribution. This method of simulation or sampling is called the rejection method or the acceptance method, and it even has the double-barrelled name of the acceptance-rejection (AR) method.

Details

Let \(X\) be a continuous random variable with a (probability) density \(p(x)\), which is the derivative of its cumulative probability distribution \(P(X\leq x)\). The density \(p(x)\) corresponds to the desired or target distribution from which we want to sample. For whatever reason, we cannot directly simulate the random variable \(X\). (Maybe we cannot use the inverse method because \(P(X\leq x)\) is too complicated.)

The idea that von Neumann had was to assume that we can easily simulate another random variable, say, \(Y\) with the (probability) density \(q(x)\). The density \(q(x)\) corresponds to a proposal distribution that we can sample (by using, for example, the inverse method).

Now we further assume that there exists some finite constant \(M>0\) such that we can bound \(p(x)\) by \(Mq(x)\), meaning

$$ p(x) \leq M q(x), \text{ for all } x . $$

Provided this, we can then sample the random variable \(Y\) and accept a value of it (for a value of \(X\)) with probability

$$\alpha = \frac{p(Y)}{Mq(Y)}.$$

If the sampled value of \(Y\) is not accepted (which happens with probability \(1-\alpha\)), then we must repeat this random experiment until a sampled value of \(Y\) is accepted.

Algorithm

We give the pseudo-code for the acceptance-rejection method suggested by von Neumann.

Random variable \(X\) with density \(p(x)\)

  1. Sample a random variable \(Y\) with density \(q(x)\), giving a sample value \(y\).
  2. Calculate the acceptance probability \(\alpha = \frac{p(y)}{Mq(y)}\).
  3. Sample a uniform random variable \(U\sim U(0,1)\), giving a sample value \(u\).
  4. Return the value \(y\) (for the value of \(X\)) if \(u\leq \alpha\), otherwise go to Step 1 and repeat.

As covered in a previous post, Steps 3 and 4 are equivalent to accepting the value \(y\) with probability \(\alpha\).
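
To make the steps concrete, here's a minimal sketch of the acceptance-rejection method in Python with NumPy. The target and proposal densities are just illustrative choices of mine: the target is the Beta(2,2) density \(p(x)=6x(1-x)\) on \((0,1)\) and the proposal is uniform on \((0,1)\), so \(q(x)=1\) and the bound \(M=1.5\) works because \(p(x)\leq 1.5\) for all \(x\) in \((0,1)\).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def p(x):
    # Target density: Beta(2, 2), that is, p(x) = 6 x (1 - x) on (0, 1).
    return 6 * x * (1 - x)

def q(x):
    # Proposal density: uniform on (0, 1).
    return 1.0

M = 1.5  # constant such that p(x) <= M q(x) for all x in (0, 1)

def sample_acceptance_rejection():
    while True:
        y = rng.uniform()            # Step 1: sample y from the proposal q
        alpha = p(y) / (M * q(y))    # Step 2: acceptance probability
        u = rng.uniform()            # Step 3: sample a uniform u
        if u <= alpha:               # Step 4: accept, otherwise repeat
            return y

samples = np.array([sample_acceptance_rejection() for _ in range(10_000)])
print(samples.mean())  # roughly 0.5, the mean of the Beta(2, 2) distribution
```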

Point process application

In the context of point processes, this method is akin to thinning point processes independently. This gives a method for positioning points non-uniformly by first placing the points uniformly. The method then thins points based on the desired intensity function. As I covered in a previous post, this is one way to simulate an inhomogeneous (or nonhomogeneous) Poisson point process.

Efficiency

Basic probability theory tells us that the number of experiment runs (Steps 1 to 3) until acceptance is a geometric variable with parameter \(1/M\), because each proposed value is accepted with probability \(\mathbb{E}[p(Y)/(Mq(Y))]=1/M\). On average, the acceptance(-rejection) method will take \(M\) simulations to sample one value of the random variable \(X\) of the target distribution. The key then is to choose a proposal density \(q(x)\) such that the constant \(M\) is as close to one as possible, while still keeping the inequality \(p(x) \leq M q(x)\).

Higher dimensions

The difficulty of the acceptance(-rejection) method is finding a good proposal distribution such that the product \(Mq(x)\) is not much larger than the target density \(p(x)\). In one dimension, this can often be done, but in higher dimensions it becomes increasingly difficult. Consequently, this method is typically not used in higher dimensions.

Another approach with an acceptance step is the Metropolis-Hastings method, which is the quintessential Markov chain Monte Carlo (MCMC) method. This method and its cousins have become exceedingly popular, as they give ways to simulate collections of dependent random variables that have complicated (joint) distributions.

Further reading

The original paper where the acceptance(-rejection) method appears (on page 769 in the right-hand column) is:

  • von Neumann, Various techniques used in connection with random digits, 1951.

The usual books on stochastic simulations and Monte Carlo methods will detail this method. For example, see the book by Devroye (Section II.3) or the more recent Handbook of Monte Carlo Methods (Section 3.1.5) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Section 2b.

Other books include those by Fishman (Section 8.5) and Gentle (Section 4.5) respectively.

The Box-Muller method for simulating normal variables

In the previous post, I covered a simple but much used method for simulating random variables or, rather, generating random variates. To simulate a random variable, the method requires writing down, in a tractable manner, the inverse of its cumulative distribution function.

But in the case of the normal (or Gaussian) distribution, there is no closed-form expression for its cumulative distribution function nor its inverse. This means you cannot, in an elegant and fast way at least, generate with the inverse method a single normal random variable using a single uniform random variable.

Interestingly, however, you can generate two (independent) normal variables with two (independent) uniform variables using the Box-Muller method, originally proposed by George Box and Mervin E. Muller. This approach uses the inverse method, but in practice it’s not used much (see below). I detail this method because I find it neat and it highlights the connection between the normal distribution and rotational symmetry, which has been the subject of some recent 3Blue1Brown videos on YouTube.

(This method was also used to simulate the Thomas point process, which I covered in a previous post.)

Incidentally, this connection is also mentioned in a previous post on simulating a Poisson point process on the surface of a sphere. In that post, Method 2 uses an observation by Muller that normal random variables can be used to position points uniformly on spheres.

I imagine this method was first observed by transforming two normal variables from Cartesian to polar coordinates, instead of guessing various distribution pairs that would work. So I'll sketch the proof in that direction, though it works in both directions.

Proof outline

The joint probability density of two independent variables is simply the product of the two individual probability densities. The joint density of two standard normal variables is then

$$\begin{align}f_{X,Y}(x,y)&=\left[\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\right]\left[\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\right]\\&=\frac{1}{{2\pi}}e^{-(x^2+y^2)/2}\,.\end{align}$$

The proof then requires a change of coordinates in two dimensions (from Cartesian to polar) using a Jacobian determinant, which in this case is \(|J(\theta,r)|=r\).1Alternatively, you can simply recall the so-called area element \(dA=r\,dr\,d\theta\). This gives the new joint probability density

$$f_{\Theta,R}(\theta,r)=\left[\frac{1}{2\pi}\right]\left[ r\,e^{-r^2/2}\right]\,.$$

Now we just identify the two probability densities. The first probability density corresponds to a uniform variable on \([0, 2\pi]\), whereas the second is that of a Rayleigh variable with parameter \(\sigma=1\). Of course the proof works in the opposite direction because the transformation (between Cartesian and polar coordinates) is a one-to-one function.

Algorithm

Here’s the Box-Muller method for simulating two (independent) standard normal variables with two (independent) uniform random variables.

Two (independent) standard normal random variables \(Z_1\) and \(Z_2\)

  1. Generate two (independent) uniform random variables \(U_1\sim U(0,1)\) and \(U_2\sim U(0,1)\).
  2. Return \(Z_1=\sqrt{-2\ln U_1}\cos(2\pi U_2)\) and \(Z_2=\sqrt{-2\ln U_1}\sin(2\pi U_2)\).

The method effectively samples a uniform angular variable \(\Theta=2\pi U_2\) on the interval \([0,2\pi]\) and a radial variable \(R=\sqrt{-2\ln U_1}\) with a Rayleigh distribution.

The algorithm produces two independent standard normal variables. Of course, as many of us learn in high school, if \(Z\) is a standard normal variable, then the random variable \(X=\sigma Z +\mu\) is a normal variable with mean \(\mu\) and standard deviation \(\sigma>0\).
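
Here's a minimal sketch of the Box-Muller method in Python with NumPy. (In practice, as discussed below, you would just call the language's built-in normal generator; the values of \(\mu\) and \(\sigma\) here are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def box_muller(n):
    # Generate n pairs of independent standard normal variables.
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2 * np.log(u1))   # Rayleigh-distributed radius
    theta = 2 * np.pi * u2         # uniform angle on [0, 2 pi]
    z1 = r * np.cos(theta)
    z2 = r * np.sin(theta)
    return z1, z2

z1, z2 = box_muller(10_000)

# Scale and shift to get a normal variable with mean mu and standard deviation sigma.
mu, sigma = 2.0, 3.0
x = sigma * z1 + mu
print(x.mean(), x.std())  # roughly 2 and 3
```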

The Box-Muller method is rarely used

Sadly this method isn’t typically used, as historically computer processors were slow at doing such calculations, so other methods were employed, such as the ziggurat algorithm. Also, although processors can now do these calculations much faster, many languages, not just scientific ones, come with functions for generating normal variables. Consequently, there’s not much need to implement this method.

Further reading

Websites

Many websites detail this method.

Papers

The original paper (which is freely available here) is:

  • 1958 – Box and Muller, A Note on the Generation of Random Normal Deviates.

Another paper by Muller connects normal variables and the (surface of a) sphere:

  • 1959 – Muller, A note on a method for generating points uniformly on n-dimensional spheres.

Books

Many books on stochastic simulation cover the Box-Muller method. The classic book by Devroye with the descriptive title Non-Uniform Random Variate Generation covers this method; see Section 4.1. There’s also the Handbook of Monte Carlo Methods by Kroese, Taimre and Botev; see Section 3.1.2.7. Ripley also covers the method (and remarks with some snark that many people incorrectly spell it the Box-Müller method); see Section 3.1. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also mentions the method and a variation by Marsaglia; see Examples 2.11 and 2.12.

The inverse method for simulating random variables

We will cover a simple but much used method for simulating random variables or, rather, generating random variates. Although the material here is found in introductory probability courses, it frequently forms the foundation of more advanced stochastic simulation techniques, such as Markov chain Monte Carlo methods.

Details

The basics of probability theory tell us that any random variable can, in theory, be written as a function of a uniform random variable \(U\) distributed on the interval \((0,1)\), which is usually written as \(U\sim U(0,1)\). All one needs is the inverse of the cumulative distribution function of the desired random variable.

More specifically, let \(X\) be a random variable with a cumulative distribution function \(F(x)=\mathbb{P}(X\leq x)\). The function \(F\) is nondecreasing in \(x\), so its inverse can be defined as \(F^{-1}(y)=\inf\{x:F(x)\geq y\}\), which is known as the generalized inverse of \(F(x)\).

Some authors assume the minimum is attained so the infimum is replaced with the minimum, giving \(F^{-1}(y)=\min\{x:F(x)\geq y\}\).

In short, the following result is all that we need.

Transform of a uniform variable \(U\sim U(0,1)\)

For a uniform random variable \(U\sim U(0,1)\), the random variable \(F^{-1}(U)\) has the cumulative distribution function \(\mathbb{P}(F^{-1}(U)\leq x)=\mathbb{P}(U\leq F(x))=F(x)\).

Algorithm

The above observation gives a method, which I like to call the direct method, for exactly simulating a random variable \(X\) with the (cumulative) distribution (function) \(F\).

Random variable \(X\) with distribution \(F\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. Return the value \(x=F^{-1}(u)\) as the sampled value of \(X\).

But this approach only works if we can write down (in a relatively straightforward way) the inverse \(F^{-1}\), which is usually not the case. This means you cannot, for example, simulate a normal random variable with a single uniform random variable by using just the inverse method, as we cannot write down the inverse of its cumulative distribution function.

(Interestingly, with two (independent) uniform random variables, we can use the transform method to simulate two (independent) normal (or Gaussian) random variables. This approach is called the Box-Muller transform, which I’ll cover in another post.)

Nevertheless, we can apply the inverse method to some useful distributions.

Examples

Warning: The following examples are only for illustration purposes. Except for the Bernoulli example, you would never use them in standard scientific languages such as MATLAB, Python (with NumPy), R or Julia, because those languages already have much better functions for simulating these and many other random variables (or variates). If you are writing a function in a language that lacks such functions, I would consult one of the references mentioned below. Although the inverse method is usually intuitive and elegant, it is often not the fastest method.

Bernoulli distribution

The simplest random variable is that with the Bernoulli distribution. With probability \(p\), a Bernoulli random variable \(X\) takes the value one. Otherwise, \(X\) takes the value zero (with probability \(1-p\)). This gives the (cumulative) distribution (function):

$$ F_B(x)=\begin{cases}
0 & \text{if } x < 0 \\
1 - p & \text{if } 0 \leq x < 1 \\
1 & \text{if } x \geq 1
\end{cases}$$

This gives a very simple way to simulate (or sample) a Bernoulli variable \(X\) with parameter \(p\).

Bernoulli random variable \(X\) with parameter \(p\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. If \(u\leq p\), return \(x=1\); otherwise return \(x=0\).
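
In code, this Bernoulli step amounts to a single comparison. Here's a minimal sketch in Python with NumPy (the parameter value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def bernoulli(p):
    # Inverse method for a Bernoulli variable: return 1 with probability p,
    # and 0 with probability 1 - p.
    u = rng.uniform()
    return 1 if u <= p else 0

samples = [bernoulli(0.3) for _ in range(10_000)]
print(sum(samples) / len(samples))  # roughly 0.3
```
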
Application: Acceptance simulation methods

In random simulation code, whenever you do something (or not) with some probability \(p\) (or probability \(1-p\)), the code will perform the above step. Consequently, you see this in the (pseudo-)code of many stochastic simulations with random binary choices, particularly schemes that have an acceptance step, such as the Metropolis-Hastings method and other Markov chain Monte Carlo (MCMC) methods.

In MCMC schemes, a random jump or move is proposed, and it is accepted with a certain probability, say, \(\alpha\). This is equivalent to accepting the proposed move if some uniform random variable \(U\) meets the condition \(U\leq \alpha\).

This explains why the pseudo-code of the same algorithm can vary. Some pseudo-code will say accept with probability \(\alpha\), while other pseudo-code will say do if \(U\leq \alpha\). The two formulations are equivalent.

Exponential distribution

The cumulative distribution function of an exponential variable with mean \(1/\lambda\) is \(F_E(x)= 1-e^{-\lambda x}\), which has the inverse \(F^{-1}_E(y)=-(1/\lambda)\ln[1-y]\). We can use the fact that on the interval \((0,1)\), a uniform variable \(U\sim U(0,1)\) and \(1-U\) have the same distribution. Consequently, the random variables \(\ln [1-U]\) and \(\ln U\) are equal in distribution.

This gives a method for simulating exponential random variables.

Exponential random variable \(X\) with mean \(1/\lambda\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. Return \(x=-(1/\lambda)\ln u\).
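
Here's a minimal sketch of this exponential sampler in Python with NumPy (the value of \(\lambda\) is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def exponential(lam):
    # Inverse method: X = -(1/lambda) ln(U) is exponential with mean 1/lambda.
    u = rng.uniform()
    return -np.log(u) / lam

lam = 2.0
samples = np.array([exponential(lam) for _ in range(10_000)])
print(samples.mean())  # roughly 1/lambda = 0.5
```
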
Application: Poisson simulation method

Of course you can use this method to simulate exponential random variables, but it has another application. In a previous post on simulating Poisson variables, I mentioned that exponential random variables can be used to simulate a Poisson random variable in a direct (or exact) manner. That method is based on the distances between the points of a homogeneous Poisson point process (on the real line) being exponential random variables.

But this method is only suitable for low values of \(\lambda\), less than, say, fifteen.

Rayleigh distribution

The Rayleigh distribution has the cumulative distribution function \(\mathbb{P}(X\leq x)= 1-e^{-x^2/(2\sigma^2)}\), where \(\sigma>0\) is its scale parameter. The square root of an exponential variable with mean \(1/\lambda\) has a Rayleigh distribution with scale parameter \(\sigma=1/\sqrt{2\lambda}\).

Consequently, the generation method is similar to the previous example.

Rayleigh random variable \(Y\) with scale parameter \(\sigma>0\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. Return \(y=\sigma\sqrt{-2\ln u}\).
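
And a minimal sketch of this Rayleigh sampler in Python with NumPy (the value of \(\sigma\) is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def rayleigh(sigma):
    # Inverse method: Y = sigma sqrt(-2 ln(U)) has a Rayleigh distribution
    # with scale parameter sigma.
    u = rng.uniform()
    return sigma * np.sqrt(-2 * np.log(u))

sigma = 1.0
samples = np.array([rayleigh(sigma) for _ in range(10_000)])
print(samples.mean())  # roughly sigma * sqrt(pi / 2), which is about 1.25
```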

Other methods

The inverse method is intuitive and often succinct. But most functions for simulating random variables (or, more correctly, generating random variates) do not use these methods, as they are not fast under certain parameter regimes, such as large means. Consequently, other methods are used, such as approximations (with, say, normal random variables), like the ones I outlined in this post on simulating Poisson random variables.

More complicated random systems, such as collections of dependent variables, can be simulated using Markov chain Monte Carlo methods, which is the direction we’ll take in a couple of posts after this one.

Further reading

The inverse technique is in your favourite introductory book on probability theory. The specific examples here are covered in books on stochastic simulations and Monte Carlo methods. The classic book by Devroye covers these topics; see Section 2.1 and the examples (inverse method) in Chapter 2.

For a modern take, there’s the extensive Handbook of Monte Carlo Methods by Kroese, Taimre and Botev; see Section 3.1.1, Algorithm 4.1 (Bernoulli), Algorithm 4.29 (exponential), and Algorithm 4.66 (Rayleigh). There’s also the book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn; in Chapter 2, see Example 2.1 (Bernoulli) and Example 2.3 (exponential).

Other books include those by Fishman (Section 8.1) and Gentle (Section 4.1) respectively. (Warning: the book by Gentle has a mistake on page 105 in the algorithm for sampling Bernoulli variables, as noted by the author. It should be \(1-\pi\) and not \(\pi\) when zero is returned for the sampled value of the Bernoulli variable.)