Creating a reversible Markov chain using acceptance(-rejection)

The study of Markov chains is generally the study of their long term behaviour, which, under certain conditions, is captured by them having a unique stationary distribution. Stationarity is an important property. It is, in a sense, a local property of a Markov chain.

For a Markov chain, a more global property is something called reversibility. Markov chains with this property must possess a stationary distribution, which, we see below, is an immediate consequence of reversibility. Reversible Markov chains (or processes) with discrete time1Asmussen writes in his book Applied Probability and Queues:

In this book, we use the terminology that a Markov chain has discrete time and a Markov process has continuous time (the state space may be discrete or general). However, one should note that it is equally common to let “chain” refer to a discrete state space and “process” to a general one (time may be discrete or continuous).
are the cornerstone of Markov chain Monte Carlo (MCMC) methods.

In this post we look at how a reversible Markov chain is constructed from a non-reversible (but irreducible) Markov chain by introducing an acceptance-rejection step. This post complements another post I wrote on the Metropolis-Hastings algorithm.

Reversibility

A Markov process on state space \(\mathbb{X}\) with kernel \(K\) is (time) reversible with respect to the distribution \(\mu\) if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution \(\mu\). By summing over \(x\), we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as \(\mu=K\mu\), which says that the transition kernel \(K\) has a stationary distribution \(\mu \).

First Markov chain

We consider a Markov chain with kernel \(J\) defined on a finite state space \(\mathbb{X}\). If the Markov chain is at state \(x\in \mathbb{X}\), it visits another state \(y\in \mathbb{X}\) with the probability \(J(x,y)\). This is a simple time-homogeneous finite Markov chain.

Irreducibility

For our Markov chain, we assume that every state \(x\) in \(\mathbb{X}\) where \(\pi(x)>0\) is reachable with positive probability in a single step. This implies the easy-to-achieve condition \(J(x,y)>0\) where \(\pi(x)>0\) for all points \(x,y \in \mathbb{X}\). This requirement is a stronger form of irreducibility.

Creating a new Markov chain with acceptance

We create a new Markov chain by introducing an acceptance step. For the Markov chain with kernel \(J\), we assume that each time step, after choosing the jump direction but before jumping, a biased coin is flipped . The success probability \(\alpha(x,y)\) depends on the current position \(x\in \mathbb{X}\) and the (potential) next position \(y\in \mathbb{X}\).

Transition kernel

For our new Markov chain, we can quickly reason the transition kernel \(M\). We first look at the off-diagonal elements of the kernel (matrix) \(M\). To go from state \(x\) and to another state \(y\neq x\), the probability is simply

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

The transition matrix \(M\) needs to be stochastic, so the rows sum to one, so \(\sum_{y\in\mathbb{X}}M(x,y)=1\). That gives us the diagonal elements of \(M\), although their exact form is not needed to show reversibility.

\(\alpha(x,y)\) needs a symmetric function \(s(x,y)\)

For reversibility, we just need to swap rows and columns. Clearly we only need to look at the off-diagonal entries, which implies the requirement

$$ \pi(x)M(x,y) = \pi (y) M(y,x)\quad x\neq y\,.$$

Both sides are symmetric in \(x\) and \(y\), meaning they are equal to some non-negative symmetric \(s(x,y)=s(y,x)\). Looking at the right-hand side, we get

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= s(y,x)\,.\end{aligned}$$

This implies that the function \(\alpha\) is a non-negative function such that \(\alpha\leq 1\), to ensure it’s a probability, with the form

$$ \alpha(x,y)=\frac{s(x,y)}{\pi(x)J(x,y)}\,.$$

The only task remaining now is to choose a reasonable symmetric function \(s\) such that \(\alpha\leq 1\), ensuring \(\alpha\) is a probability. Of course, our choice for the symmetric function \(s\) should also be a function of the stationary distribution \(\pi\) and the underlying kernel \(J\).

Examples

I’ll give two principal examples of the symmetric function \(s(x,y)\). Working in reverse chronological order, I’ll give the simpler of the two examples first.

Barker

A somewhat natural example is

$$s(x,y) = \frac{\pi(x)J(x,y)\pi(y)J(y,x)}{\pi(x)J(x,y)+\pi(y)J(y,x)}\,.$$

This is clearly a symmetric function, which only has the terms \(\pi\) and \(J\). The acceptance probability becomes

$$\alpha(x,y) = \frac{\pi(y)J(y,x)}{\pi(x)J(x,y)+\pi(y)J(y,x)}\,.$$

A.A. Barker proposed this function in a 1965 paper as part of his PhD work in mathematical physics at the University of Adelaide. Barker had been inspired by a previous 1953 paper, which brings us to the next example.

Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings

The now most important example is

$$s(x,y) = \min[\pi(x)J(x,y),\pi(y)J(y,x)]\,.$$

We can see that this is a symmetric function. The acceptance probability becomes

$$\alpha(x,y) = \min[1,\frac{\pi(y)J(y,x)}{\pi(x)J(x,y)}]\,.$$

This example is very famous in the world of Markov chain Monte Carlo methods. It is the main part of the so-called Metropolis-Hastings algorithm, which comes from a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method.

Further reading

Articles

Historical

The two important historical papers that gave the two examples for \(s(x,y)\) are:

  • 1965 – Barker – Monte Carlo calculations of the radial distribution functions for a proton-electron plasma;
  • 1953 – Metropolis, Rosenbluth, Rosenbluth, Teller, Teller – Equation of state calculations by fast computing machines.

This paper is also important:

  • 1970 – Hastings – Monte Carlo sampling methods using Markov chains and their applications.
Introductory

A good explanatory article about the Metropolis-Hastings algorithm is:

  • 1995 – Chib and Greenberg – Understanding the Metropolis-Hastings Algorithm.

I also recommend:

  • 2015 – Minh and Minh – Understanding the Hastings Algorithm;
  • 2016 – Robert – The Metropolis-Hastings algorithm;
  • 2017 – Gonçalves, Łatuszyński, Roberts – Barker’s algorithm for Bayesian inference with intractable likelihoods.
History

The history of this and related methods is described in the following article:

  • 2011 – Robert and Casella – A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.

Incidentally, the first author, Christian P. Robert, writes on his blog about Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm, as well as many other topics.

Books

Modern books on stochastic simulations and Monte Carlo methods will often detail this method. For example, see the Handbook of Monte Carlo Methods (Section 6.1) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Chapter XIII, Section 3. (For my remark on reversibility, see Remark 3.2 in Asmussen and Glynn.) There is also the book Monte Carlo Strategies in Scientific Computing by Liu; see Chapter 5.

Websites

The web abounds with pages dedicated to explaining Markov chain Monte Carlo methods with the focus typically on the Metropolis-Hastings algorithm. I would try this following one because it gives a detailed explanation on how the Metropolis-Hastings algorithm works:

  • https://gregorygundersen.com/blog/2019/11/02/metropolis-hastings/