Markov goes to Monte Carlo

Numerical and scientific fields such as statistics, computational physics, and optimization methods heavily rely upon Markov chain Monte Carlo methods. These simulation techniques use the power of Markov chains to sample general probability distributions, which in turn give Monte Carlo methods for estimating integrals and optimal solutions.

This is the third part of a multi-part series of posts. The first part covers Markov chains. The second part covers the basics of Monte Carlo methods. This post combines the ideas from the first two parts. Overall, the posts sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We will only examine Markov chains with countable state spaces. In this setting, we only need standard probability and matrix knowledge. But the results extend to general state spaces, such as Euclidean space.

By the way, I doubt that Andrey A. Markov ever went to Monte Carlo in the small country of Monaco.1But Herbie did.

Monte Carlo conditions

Sum

In the first post we covered Markov chains with countable state spaces, because the mathematics is more transparent. Consequently, we can use such Markov chains to estimate the sum of a function \(f\) over a countable (possibly infinite) set \(\mathbb{S}\). We can write the sum as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \\&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\), gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum, namely

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

The closer the function \(q\) is to \(|f|\), the better the estimate.

Integral

We can write an integral over some region \(\mathbb{S}\) as

$$\begin{align}\int_{\mathbb{S}} f(x) dx &= \int_{\mathbb{S}} \frac{f(x)}{p(x)}p(x)dx\\ & =\mathbb{E}_p[\frac{f(Y)}{p(Y)}]\,,\end{align}$$

where \(p\) is the probability density of a random variable (or point) \(Y\) with support \(A\). Then by sampling \(Y\), the resulting samples \(y_1, \dots, y_n\in [0,1]\) give the unbiased Monte Carlo estimate

$$I_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{p(y_i)}\,.$$

Ergodic sequence

To estimate the sum or integral, we’ll need a sequence of ergodic random variables \(\{Y_i\}_{i\in\mathbb{N}}\), which we can achieve with an ergodic Markov chain with a state space \(\mathbb{S}\).

Markov chain conditions

In the context of Markov chains and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers. A Markov chain with a countable state space needs some conditions to ensure ergodicity.

Regularity conditions for countable a Markov chain

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Postive recurrence

In general, most of the needed structure for the Markov chain comes from irreducibility and a stationary distribution existing. Sometimes the above conditions are redundant. For example, an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent. We look at ways on how to satisfy these conditions in the Monte Carlo setting when the underlying state space is countable.

Stationary distribution \(\pi\)

A stationary distribution \(\pi\) satisfies equation

$$ \pi=P\pi.$$

We know exactly what our stationary distribution \(\pi\) needs to be. It is the one used in the Monte Carlo method, namely the distribution of \(Y\). The challenge is constructing a Markov chain with such a stationary distribution. This will of course depend on the Markov (transition) kernel \(P\) of the Markov chain.

We’ll look at the other conditions on the Markov kernel \(P\).

Aperiodicity

We recall that the period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P(x,x)^n>0\). We need every state of the countable Markov chain to be aperiodic, meaning \(d_x=1\) for all \(x\in\mathbb{X}\).

One sufficient way to achieve this requirement is to first have the countable Markov chain to be irreducible, which we’ll cover in the next section. Then if there exists at least one state \(x\in \mathbb{x}\) such that \(P(x,x)>0\), then the Markov chain is aperiodic.2Lemma 1.8.2 in Markov Chains by Norris or Remark 6.2.5 in Probability and Stochastic Processes by Brémaud.

Clearly the last condition is not difficult to satisfy, putting the focus now on the irreducibility requirement.

Irreducibility

The irreducibility property says that it is possible for the Marko chain to get from any point to another point in a finite number of steps. More precisely, for an irreducible Markov chain with countable state space, there must exist for all \(x,y\in\mathbb{X}\) a natural number \(s\) (possibly depending \(x\) and \(y\)) such that \(P(x,y)^s>0\).

We can achieve this requirement by introducing a stronger requirement, which is easier to verify. A countable Markov chain with a transition \(P\) will be irreducible if for all \(x,y\in\mathbb{X}\) the condition \(P(x,y)>0\) holds. We have simply required that the above number \(s=1\) for all \(x,y\in\mathbb{X}\).

Furthermore, clearly \(P(x,x)>0\) for all \(x\in\mathbb{X}\), then our irreducible countable Markov chain is also aperiodic.

Positive recurrence

For a point or state \(x\in\mathbb{X}\), we recall its first return time being defined as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). For a countable Markov chain, if all the states in the state space are positive recurrent, then we say the Markov chain is positive recurrent.

A countable irreducible Markov chain is positive recurrent if (and only if) it has a stationary distribution \(\pi\). Furthermore, that stationary distribution will be unique and nonzero, meaning \(\pi(x)>0\) for all \(x\in\mathbb{X}\).3Theorem 1.7.7 in Markov Chains by Norris or Theorem 6.3.14 in Probability and Stochastic Processes by Brémaud.

In other words, for positive recurrence, we need to prove that the Markov chain has a stationary distribution.

Ergodicity

A countable Markov chain that is aperiodic and irreducible is ergodic. In other words, an ergodic Markov process \(X\) with stationary distribution \(\pi\) is a random sequence \(X_0, X_1,\dots\) such that the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i) \rightarrow \sum_{x\in\mathbb{X}}g(x)\pi(x)$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). Consequently, we can use such an ergodic Markov chain to make statistical estimates.

Sampling with a Markov chain

For the Monte Carlo sample \(y_1,\dots,y_n\), we simply use the Markov chain samples \(x_0,x_1,\dots,x_{n-1}\), giving the Monte Carlo estimate. In the discrete case, this is simply

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n g(x_{i-1}).$$

which, due to ergodicity, converges to \(S(f)=\mathbb{E}_{Y}[g(Y)]\).

Summary

In summary, a sufficient way to have a countable ergodic Markov chain is for the transition kernel to be such that \(P(x,y)>0\) for all \(x\in\mathbb{X}\) and there exists a (stationary) distribution \(\pi=P\pi\). For the first requirement, one imagines using a non-negative function such as \(e^{-(x-y)^2}/C\), where \(C>0\) is a suitable normalization constant.

The requirements seem possible, but to bring them all together is still a great challenge for the uninitiated. Fortunately, some great minds almost seven decades ago proposed the first Markov chain that meets all the requirements for a specific stationary distribution, which was used for a Monte Carlo estimate.

The first such Markov chain Monte Carlo method appeared in a paper written by five scientists, including two husband-wife pairs. It became known as the Metropolis or Metropolis-Hastings algorithm, and it will be the subject of a future post.

Further reading

Books

Markov chains

Any stochastic process book will include a couple of chapters on Markov chains. For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section detailing a Markov chain Monte Carlo method, such as the Metropolis-Hastings algorithm and the Gibbs sampler.

Monte Carlo methods

Books dedicated to Monte Carlo approaches include:

For a statistics context, there’s also the good book:

  • Casella and Robert – Monte Carlo Statistical Methods.

The second MC in MCMC methods

In calculus differentiating a function is easy. Integrating one is hard. Much research has gone into finding ways for calculating integrals. But often functions cannot be integrated, at least analytically.

No problem. You do it numerically. You take the function’s curve, split it up into nice rectangles or, even better, trapezoids, and calculate their areas, and add them up. Not accurate enough? Split the curve up even finer until you’re happy. Done.

That works fine in one dimension and even two dimensions. But then, as the number of dimensions increases, the amount of splitting and adding grows exponentially, soon hitting unwieldy numbers. This is a mountain too steep for even the fastest computers. The problem is what applied mathematics guru Richard E. Bellman famously called the curse of dimensionality.

But thankfully randomness offers a path around this mountain. The approach is called Monte Carlo integration for estimating sums and integrals

This is the second part of a series of posts on Mark chain Monte Carlo methods. The first part covers Markov chains. This posts covers the basics of Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We’ll focus on estimating sums, but the same logic applies to integrals.

Monte Carlo estimates

In the discrete (or countable) case, consider a function \(f\) that maps from a countable state space \(\mathbb{S}\). This state space maybe, for example, the lattice \(\mathbb{Z}^d\) or some subset of it. We write the sum as

$$S(f)=\sum_{x\in \mathbb{S}} f(x) $$

Such sums arise in many fields, particularly statistical physics, which is where the (Markov chain) Monte Carlo methods were born.1In general, we use integrals for continuous spaces, which are uncountable, whereas we use sums for discrete spaces, which are countable spaces. (The two are generalized by the concept of an integral measure.)

The idea behind a Monte Carlo estimate of a sum is remarkably simple. You randomly sample the function \(f\), find the average, and use it as an estimate of the sum.

Estimating with uniform variables

For a finite set \(\mathbb{S}=\{x_1,\dots,x_m\}\), we write the sum of \(f\) over the set \(\mathbb{S}\) as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}}  f(x) \frac{m}{m} \\&= \sum_{x\in\mathbb{S}}  \frac{f(x)p(x)}{p(x)}\\& = \mathbb{E}_{p}[ \frac{f(U)}{p(U)}]\,,\end{align}$$

where \(U\) is a discrete uniform random variables, which has the probability mass function \(p\) with support \(\mathbb{S}\), meaning \(p(x)=1/m\) if \(x\in \mathbb{S}\) and \(p(x)=0\) otherwise. But using a uniform random variable is a naive way, as there are better random variables we can use to sample \(f\).

Estimating with general variables

More generally, we can write the sum of a countable (possibly infinite) set \(\mathbb{S}\) as

$$\begin{align}S(f)&= \sum_{x\in\mathbb{S}}  \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=p/q\) and \(Y\) is a suitable random variable that has a probability mass function  \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\), gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

If we can (quickly) sample \(Y\), then we can (quickly) estimate the integral. The closer the (probability mass) function \(q\) is to \(|f|\), the better the estimate.

We can also use the above reasoning with integrals, replacing the sum and probability mass function with an integral and a probability density.

Error rate

If the samples \(y_1, \dots, y_n \) are random and independent, the (strong) law of large numbers guarantees that the unbiased Monte Carlo estimate converges (almost surely) to the correct answer.

Further assuming \(S(f^2)<\infty\), the error of the integral estimate \(|S(f)-S_n(f)|\) is proportional to

$$\frac{\mathbb{V}_q(f)}{n ^{1/2}},
$$
where
$$\mathbb{V}(f)_q=E_q[f^2(Y)]-(E_q[f(Y)])^2,$$
is the variance of the function \(f\). (This is not the variation of a function, which is something different.) The error \(|S(f)-S_n(f)|\) reduces at the rate \(O(n^{-1/2})\). The Monte Carlo approach allows for continual updating until the estimate is sufficiently converged.

The above logic applies to integrals of functions on general spaces.

The dimension \(d\) of the underlying space, as $\mathbb{S}^d$, does not appear in the error. The independence of the dimension is precisely why we tackle high-dimensional sums and integrals with Monte Carlo methods. The general rule of thumb is to use Monte Carlo integration for functions of four or more dimensions.

Losing the independence

Again, a sequence of independent random variables allow our Monte Carlo estimate to converge. We can loosen this requirement and allow some dependence between variables. To obtain a Monte Carlo estimate, we just need to use a sequence of random variables that are ergodic. Then we recover the (strong) law of large numbers that gives our Monte Carlo estimate.

Ergodicity

The word ergodicity is one of those technical terms you’ll see and hear in different fields of mathematics and physics, often when randomness is involved. The concept is also central to deterministic dynamical systems.

There are different versions and strengths of ergodicity, depending on the system and applications. But loosely speaking, an ergodic system is one in which temporal averages and spatial (or ensemble) averages converge to the same value.

That means when you want to find an average, you have two options in an ergodic system. The temporal option: You can measure something over a long time. The spatial (or ensemble) option: You can measure many instances of the same something. The two different averages will eventually converge to the same value.

Ergodic sequences

In the context of Markov processes and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers.

Consider a stochastic process \(X\), which is a sequence of random variables \(X_0,X_1,\dots\) indexed by some parameter \(t\). We assume the probability distributions \(\pi_0,\pi_1,\dots\) of the random variables converge to some final stationary distribution \(\pi\). For a countable state space, the (probability) distribution \(\pi\) corresponds to a probability mass function.

We then say a stochastic process \(X\) state space is ergodic if the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^t g(X_i)\rightarrow \sum_{x\in \mathbb{Z}}g(x)\pi(x)=\mathbb{E}_{\pi} [g(X)]\,,$$

as \(t\rightarrow\infty\) for all bounded functions, nonnegative functions \(g\). The subscript of the expectation simply denotes that the expectation is taken with respect to the (stationary) distribution \(\pi\).

The equation tells the story. On the left-hand side you have a temporal average. On the right-hand side you have a spatial (or ensemble) average. As time proceeds, the first average will converge to the second average. Ergodicity.

Further reading

There are many books dedicated entirely to the subject of Monte Carlo methods. For example, a good book with detailed section on Markov chains applied to Monte Carlo methods in the context of statistics is:

  • Casella and Robert, Monte Carlo Statistical Methods.

These books also have sections on Monte Carlo integration:

The inverse method for simulating random variables

We will cover a simple but much used method for simulating random variables or, rather, random variates. Although the material here is found in introductory probability courses, it frequently forms the foundation of more advance stochastic simulation techniques, such as Markov chain Monte Carlo methods.

Details

The basics of probability theory tell us that any random variable can, in theory, be written as a function of a uniform random variable \(U\) distributed on the interval \((0,1)\), which is usually written as \(U\sim U(0,1)\). All one needs is the inverse of the cumulative distribution function of the desired random variable.

More specifically, let \(X\) be a random variable with a cumulative distribution function \(F(x)=\mathbb{P}(X\leq x)\). The function \(F\) is nondecreasing in \(x\), so its inverse can be defined as \(F^{-1}(y)=\inf\{x:F(x)\geq y\}\), which is known as the generalized inverse of \(F(x)\).

Some authors assume the minimum is attained so the infimum is replaced with the minimum, giving \(F^{-1}(y)=\min\{x:F(x)\geq y\}\).

In short, the following result is all that we need.

Transform of a uniform variable \(U\sim U(0,1)\)

For a uniform random variable \(U\sim U(0,1)\), the random variable \(F^{-1}(U)\) has the cumulative distribution function \(\mathbb{P}(F^{-1}(U)\leq x)=P(U\leq F(x))=F(x)\).

Algorithm

The above observation gives a method, which I like to call the direct method, for exactly simulating a random variable \(X\) with the (cumulative) distribution (function) \(F\).

Random variable \(X\) with distribution \(F\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. Return the value \(x=F^{-1}(u)\) as the sampled value of \(U\).

But this approach only works if we can write down (in a relatively straightforward way) the inverse \(F^{-1}\), which is usually not the case. This means you cannot generate, for example, simulate a normal random variable with a single uniform random variable by using just the inverse method, as we cannot write down the inverse of its cumulative distribution function.

(Interestingly, with two (independent) uniform random variables, we can use the transform method to simulate two (independent) normal (or Gaussian) random variables. This approach is called the Box-Muller transform, which I’ll cover in another post.)

Nevertheless, we can apply the inverse method to some useful distributions.

Examples

Warning: The following examples are only for illustration purposes. Except for the Bernoulli example, you would never use them in standard scientific languages such as MATLAB, Python (with NumPy), R or Julia, because those languages already have much better functions for simulating these and many other random variables (or variates). If you are writing a function in a language that lacks such functions, I would consult one of the references mentioned below. Although the inverse method is usually intuitive and elegant, it is often not the fastest method.

Bernoulli distribution

The simplest random variable is that with the Bernoulli distribution. With probability \(p\), a Bernoulli random variable \(X\) takes the value one. Otherwise, \(X\) takes the value zero (with probability \(1-p\)). This gives the (cumulative) distribution (function):

$$ F_B(x)=\begin{cases}
0 & \text{if } x < 0 \\
1 – p & \text{if } 0 \leq x < 1 \\
1 & \text{if } x \geq 1
\end{cases}$$

This gives a very simple way to simulate (or sample) a Bernoulli variable \(X\) with parameter \(p\).

Bernoulli random variable \(X\) with parameter \(p\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. If \(u\leq p\), return \(x=1\); otherwise return \(x=0\).
Application: Acceptance simulation methods

In random simulation code, whenever you do something (or not) with some probability \(p\) (or probability \(1-p\)), then the code will perform the above step. Consequently, you see this in the (pseudo-)code of many stochastic simulations with random binary choices, particularly schemes that have an acceptance step such the Metropolis-Hastings method and other Markov chain Monte Carlo (MCMC) methods.

In MCMC schemes, a random (binary) choice is proposed and it is accepted with a certain probability, say, \(\alpha \). This is the equivalent of accepting the proposed choice if some uniform random variable \(U\) meets the condition \(U\leq \alpha\).

This explains why the pseudo-code of the same algorithm can vary. Some pseudo-code will say accept with probability \(\alpha\), while other pseudo-code will say do if \(U\leq \alpha\). It’s two equivalent formulations.

Exponential distribution

The cumulative distribution function of an exponential variable with mean \(1/\lambda\) is \(F_E(x)= 1-e^{-\lambda x}\), which has the inverse \(F^{-1}_E(y)=-(1/\mu)\ln[1-y]\). We can use the fact that on the interval \((0,1)\), a uniform variable \(U\sim U(0,1)\) and \(1-U\) have the same distribution. Consequently, the random variables \(\ln [1-U]\) and \(\ln U\) are equal in distribution.

This gives a method for simulating exponential random variables.

Exponential random variable \(X\) with mean \(1/\lambda\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. Return \(x=-(1/\lambda)\ln u\).
Application: Poisson simulation method

Of course you can use this method to simulate exponential random variables, but it has another application. In a previous post on simulating Poisson variables, I mentioned that exponential random variables can be used to simulate a Poisson random variable in a direct (or exact) manner. That method is based on the distances between the points of a homogeneous Poisson point process (on the real line) being exponential random variables.

But this method is only suitable for low values of \(\lambda\), less than say fifteen.

Rayleigh distribution

The Rayleigh distribution is \(\mathbb{P}(X\leq x)= (x/\sigma^2)e^{-x^2/(2\sigma^2)}\), where \(\sigma>0\) is its scale parameter. The square root of an exponential variable with mean \(1/\lambda\) has a Rayleigh distribution with scale parameter \(\sigma=1/\sqrt{2\lambda}\).

Consequently, the generation method is similar to the previous example.

Rayleigh random variable \(Y\) with scale parameter \(\sigma>0\)

  1. Sample a uniform random variable \(U\sim U(0,1)\), giving a value \(u\).
  2. Return \(y=\sigma\sqrt{-2\ln u}\).

Other methods

The inverse method is intuitive and often succinct. But most functions for simulating random variables (or, more correctly, generating random variates) do not use these methods, as they are not fast under certain parameter regimes, such as large means. Consequently, other method are used such as approximations (with, say, normal random variables), such as the ones I outlined in this post on simulating Poisson random variables.

More complicated random systems, such as collections of dependent variables, can be simulated using Markov chain Monte Carlo methods, which is the direction we’ll take in a couple posts after this one.

Further reading

The inverse technique is in your favourite introductory book on probability theory. The specific examples here are covered in books on stochastic simulations and Monte Carlo methods. The classic book by Devroye covers these topics; see Section 2.1 and the examples (inverse method) in Chapter 2.

For a modern take, there’s the extensive Handbook of Monte Carlo Methods by Kroese, Taimre and Botev; see Section 3.1.1, Algorithm 4.1 (Bernoulli) and Algorithm 4.29 (exponential), and Algorithm 4.66 (Rayleigh). There’s also the book Stochastic Simulation: Algorithms and Analysis by Asmussen and Gynn; in Chapter 2, see Example 2.1 (Bernoulli) and Exampe 2.3 (exponential).

Other books include those by Fishman (Section 8.1) and Gentle (Section 4.1) respectively. (Warning: the book by Gentle has a mistake on page 105 in algorithm for sampling Bernoulli variables, as noted by the author. It should be \(1-\pi\) and not \(\pi\) when zero is returned for the sampled value of the Bernoulli variable.)

Who was S.R. Broadbent?

I found myself recently wondering who the first author was of a seminal paper that created a mathematical field. I did some online digging and I noted my findings here for others who may ask the same question.

Percolation theory

The 1957 paper by S.R. Broadbent and J.M. Hammersley gave birth to percolation theory. I noticed the author affiliations:

S.R. Broadbent, United Glass Bottle Manufacturers
J.M. Hammersley, United Kingdom Atomic Energy Research Establishment Harwell

I knew John Hammersley’s name already, as he had co-written a classic book on Monte Carlo methods. Indeed, if you look at the end of the paper, you see the paper’s idea was born at an early conference on Monte Carlo methods.

But who was S.R. Broadbent? And what was he doing at a bottle factory?

Probability work

This Broadbent character was a bit mysterious. At first, it seemed as though the writer always published as S.R. Broadbent — so much so, I thought he might be a she. Such things have happened before, as the story behind the influential 1960 paper by A.H. Land and A.G. Doig demonstrated.1Alisa H. Land only died in 2021. Alison G. Doig is still alive but she has been known as Alison Harcourt and has worked as a tutor at the University of Melbourne for many, many years.

From the start I could see S.R. Broadbent was strong in probability. In the 1950s Broadbent wrote probability papers, including a 1953 paper co-written by the British father of stochastic geometry David G. Kendall. 2For those across the La Manche, the French father of stochastic geometry is Georges Matheron. The Broadbent and Kendall paper has the improbable title The Random Walk of Trichostrongylus retortaeformis. (It’s a type of larva.)

And then I was thrown by the title of a paper co-written by Broadbent in the 1960s: A computer assessment of media schedules.

Simon says ley lines are just spatial coincidences

Then I lost track of S.R. Broadbent in terms of academic papers. I had almost given up, assuming Broadbent had left academia. But then I saw Broadbent’s name mentioned in the acknowledgements in a famous spatial statistics paper written by David Kendall in 1989. In the paper Kendalls cites a 1980 paper written by S.R. Broadbent. The subject was the statistics of ley lines — it’s a thing — showing they can be explained by mere chance. And in the acknowledgements Kendall thanks a certain Simon Broadbent — presumably he’s S.R. Broadbent whose paper Kendall cited.

Had I found the S.R. Broadbent? I was a bit skeptical, but then he’s mentioned as Simon Broadbent in this obituary of Hammersley:

Despite the title of the paper, ‘Poor man’s Monte Carlo’, the lasting contributions are the clear statement of the problem of counting self-avoiding walks, the use of subadditivity to prove the existence of the connective constant, and the discussion of random media that culminated in Simon Broadbent’s formulation of the percolation model.

That reassured me.

And then I found two 1950s papers (1954 and 1958) by a Simon Broadbent, not S.R. Broadbent, published in the Journal of the Royal Statistical Society Series C: Applied Statistics. I had actually stumbled upon the 1958 paper earlier in my searches. Initially I didn’t think it was S.R. Broadbent because the journal referred to the author as Dr Simon Broadbent, and I thought only medical doctors do that to each other. But it’s a statistics journal. So it’s probably S.R. Broadbent. Also, for this 1954 paper in the same journal they call him Mr Simon Broadbent.

And then I looked at the paper by Broadbent that David Kendall cited, and in that paper Broadbent acknowledges Kendall. And then I noticed that this paper was actually written by Simon Broadbent, but Kendall had cited the paper and written its author as S.R. Broadbent. The final link. S.R. Broadbent was Simon Broadbent — whoever that was.

I thought I had come to the end. But I hadn’t.

Advertising guru

A non-scientific article had floated up in my search results talking about the death of some big name in advertising. Seeing the title and reading the start, I thought it couldn’t be the same Simon Broadbent.

Simon asked simple questions: How does advertising work? How much advertising is enough? How can we tell whether our ad campaign has succeeded? Is adspend profitable?
When he entered advertising in the 60s, none of these questions had satisfactory answers. Many doubted they could ever be answered. Simon did more than anyone to show that advertising can contribute to profit, and that it is not a cost that can be cut with no effect on sales.

But after reading more, I realized this had to be the very same Broadbent.

He answered advertising questions with the rigour of a mathematician and the clarity of a poet. He read engineering at Cambridge, took an applauded first in mathematics and a diploma in statistics at Magdalen, Oxford, and completed his doctorate in statistics at Imperial, London. His poetry was published by Blackwell’s while he was still an undergraduate.

His lifelong interest was applying statistics to problems that nobody had thought amenable to statistical analysis. His paper to the Royal Statistical Society, In Search of the Ley Hunter, debunked claims for the existence of megalithic ley lines, pointing out that there were fewer alleged lines than would be expected from a random distribution of points between which lines could be drawn. Preparing the paper also allowed him to crawl about in Greenwich Park, testing the likely accuracy of stone-age surveying devices using instruments he had built himself.

Simon R. Broadbent made quite the impression in advertising, even being called a global legend in media research. Some of his contributions to this field are mentioned in this published tribute. It seems this field remembered him better than his original field. I couldn’t find any tributes focusing on his mathematical contributions.

Mystery solved, except for what the R stands for.

Back to percolation

Now knowing the S stood for Simon, I noticed that Simon Broadbent is mentioned in the Wikipedia article on percolation theory. The article cites a tutorial article by John Hammersley and Dominic Welsh, which gives the original motivation of percolation theory:

The study of percolation originated from a question posed to one of us by S. R. Broadbent (see Hammersley and Morton 1954). Broadbent was at that time working at the British Coal Utilization Research Association on the design of gas masks for use in coal mines. These masks contained porous carbon granules (the medium) into which gas (the fluid) could penetrate. The pores in a granule constituted a random network of tiny interconnecting tunnels, into which the gas could move by surface adsorption; and the question was what properties of such a random network would assist or inhibit the uptake of gas. If the pores were large enough and richly enough connected, the gas could permeate the interior of a granule; but if the pores were too small or inadequately connected, the gas could not get beyond the outer surface of a granule. Thus there was a critical point, above which the mask worked well and below which it was ineffective. Critical phenomena of this sort are typical of percolation processes, and do not, normally arise in ordinary diffusion.

The original Broadbent and Hammersley paper did mention that Broadbent had previously worked at British Coal Utilisation Research Association (BCURA).

(Another aside, the same Wikipedia article says Rosalind Franklin, whose X-ray work helped reveal the structure of DNA, also worked on porous material at the British Coal Utilisation Research Association, before leaving in 1946, which I guess was before Broadbent started working there. You can read about her career, for example, here.)

Summary

Simon R. Broadbent was a trained engineer and mathematician who co-founded the thriving mathematical discipline of percolation theory, before leaving that world to go revolutionize the field of advertising by using quantitative methods and writing influential books.