Markov goes to Monte Carlo

Numerical and scientific fields such as statistics, computational physics, and optimization methods heavily rely upon Markov chain Monte Carlo methods. These simulation techniques use the power of Markov chains to sample general probability distributions, which in turn give Monte Carlo methods for estimating integrals and optimal solutions.

This is the third part of a multi-part series of posts. The first part covers Markov chains. The second part covers the basics of Monte Carlo methods. This post combines the ideas from the first two parts. Overall, the posts sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We will only examine Markov chains with countable state spaces. In this setting, we only need standard probability and matrix knowledge. But the results extend to general state spaces, such as Euclidean space.

By the way, I doubt that Andrey A. Markov ever went to Monte Carlo in the small country of Monaco.1But Herbie did.

Monte Carlo conditions

Sum

In the first post we covered Markov chains with countable state spaces, because the mathematics is more transparent. Consequently, we can use such Markov chains to estimate the sum of a function \(f\) over a countable (possibly infinite) set \(\mathbb{S}\). We can write the sum as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \\&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\), gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum, namely

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

The closer the function \(q\) is to \(|f|\), the better the estimate.

Integral

We can write an integral over some region \(\mathbb{S}\) as

$$\begin{align}\int_{\mathbb{S}} f(x) dx &= \int_{\mathbb{S}} \frac{f(x)}{p(x)}p(x)dx\\ & =\mathbb{E}_p[\frac{f(Y)}{p(Y)}]\,,\end{align}$$

where \(p\) is the probability density of a random variable (or point) \(Y\) with support \(A\). Then by sampling \(Y\), the resulting samples \(y_1, \dots, y_n\in [0,1]\) give the unbiased Monte Carlo estimate

$$I_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{p(y_i)}\,.$$

Ergodic sequence

To estimate the sum or integral, we’ll need a sequence of ergodic random variables \(\{Y_i\}_{i\in\mathbb{N}}\), which we can achieve with an ergodic Markov chain with a state space \(\mathbb{S}\).

Markov chain conditions

In the context of Markov chains and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers. A Markov chain with a countable state space needs some conditions to ensure ergodicity.

Regularity conditions for countable a Markov chain

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Postive recurrence

In general, most of the needed structure for the Markov chain comes from irreducibility and a stationary distribution existing. Sometimes the above conditions are redundant. For example, an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent. We look at ways on how to satisfy these conditions in the Monte Carlo setting when the underlying state space is countable.

Stationary distribution \(\pi\)

A stationary distribution \(\pi\) satisfies equation

$$ \pi=P\pi.$$

We know exactly what our stationary distribution \(\pi\) needs to be. It is the one used in the Monte Carlo method, namely the distribution of \(Y\). The challenge is constructing a Markov chain with such a stationary distribution. This will of course depend on the Markov (transition) kernel \(P\) of the Markov chain.

We’ll look at the other conditions on the Markov kernel \(P\).

Aperiodicity

We recall that the period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P(x,x)^n>0\). We need every state of the countable Markov chain to be aperiodic, meaning \(d_x=1\) for all \(x\in\mathbb{X}\).

One sufficient way to achieve this requirement is to first have the countable Markov chain to be irreducible, which we’ll cover in the next section. Then if there exists at least one state \(x\in \mathbb{x}\) such that \(P(x,x)>0\), then the Markov chain is aperiodic.2Lemma 1.8.2 in Markov Chains by Norris or Remark 6.2.5 in Probability and Stochastic Processes by Brémaud.

Clearly the last condition is not difficult to satisfy, putting the focus now on the irreducibility requirement.

Irreducibility

The irreducibility property says that it is possible for the Marko chain to get from any point to another point in a finite number of steps. More precisely, for an irreducible Markov chain with countable state space, there must exist for all \(x,y\in\mathbb{X}\) a natural number \(s\) (possibly depending \(x\) and \(y\)) such that \(P(x,y)^s>0\).

We can achieve this requirement by introducing a stronger requirement, which is easier to verify. A countable Markov chain with a transition \(P\) will be irreducible if for all \(x,y\in\mathbb{X}\) the condition \(P(x,y)>0\) holds. We have simply required that the above number \(s=1\) for all \(x,y\in\mathbb{X}\).

Furthermore, clearly \(P(x,x)>0\) for all \(x\in\mathbb{X}\), then our irreducible countable Markov chain is also aperiodic.

Positive recurrence

For a point or state \(x\in\mathbb{X}\), we recall its first return time being defined as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). For a countable Markov chain, if all the states in the state space are positive recurrent, then we say the Markov chain is positive recurrent.

A countable irreducible Markov chain is positive recurrent if (and only if) it has a stationary distribution \(\pi\). Furthermore, that stationary distribution will be unique and nonzero, meaning \(\pi(x)>0\) for all \(x\in\mathbb{X}\).3Theorem 1.7.7 in Markov Chains by Norris or Theorem 6.3.14 in Probability and Stochastic Processes by Brémaud.

In other words, for positive recurrence, we need to prove that the Markov chain has a stationary distribution.

Ergodicity

A countable Markov chain that is aperiodic and irreducible is ergodic. In other words, an ergodic Markov process \(X\) with stationary distribution \(\pi\) is a random sequence \(X_0, X_1,\dots\) such that the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i) \rightarrow \sum_{x\in\mathbb{X}}g(x)\pi(x)$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). Consequently, we can use such an ergodic Markov chain to make statistical estimates.

Sampling with a Markov chain

For the Monte Carlo sample \(y_1,\dots,y_n\), we simply use the Markov chain samples \(x_0,x_1,\dots,x_{n-1}\), giving the Monte Carlo estimate. In the discrete case, this is simply

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n g(x_{i-1}).$$

which, due to ergodicity, converges to \(S(f)=\mathbb{E}_{Y}[g(Y)]\).

Summary

In summary, a sufficient way to have a countable ergodic Markov chain is for the transition kernel to be such that \(P(x,y)>0\) for all \(x\in\mathbb{X}\) and there exists a (stationary) distribution \(\pi=P\pi\). For the first requirement, one imagines using a non-negative function such as \(e^{-(x-y)^2}/C\), where \(C>0\) is a suitable normalization constant.

The requirements seem possible, but to bring them all together is still a great challenge for the uninitiated. Fortunately, some great minds almost seven decades ago proposed the first Markov chain that meets all the requirements for a specific stationary distribution, which was used for a Monte Carlo estimate.

The first such Markov chain Monte Carlo method appeared in a paper written by five scientists, including two husband-wife pairs. It became known as the Metropolis or Metropolis-Hastings algorithm, and it will be the subject of a future post.

Further reading

Books

Markov chains

Any stochastic process book will include a couple of chapters on Markov chains. For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section detailing a Markov chain Monte Carlo method, such as the Metropolis-Hastings algorithm and the Gibbs sampler.

Monte Carlo methods

Books dedicated to Monte Carlo approaches include:

For a statistics context, there’s also the good book:

  • Casella and Robert – Monte Carlo Statistical Methods.

The second MC in MCMC methods

In calculus differentiating a function is easy. Integrating one is hard. Much research has gone into finding ways for calculating integrals. But often functions cannot be integrated, at least analytically.

No problem. You do it numerically. You take the function’s curve, split it up into nice rectangles or, even better, trapezoids, and calculate their areas, and add them up. Not accurate enough? Split the curve up even finer until you’re happy. Done.

That works fine in one dimension and even two dimensions. But then, as the number of dimensions increases, the amount of splitting and adding grows exponentially, soon hitting unwieldy numbers. This is a mountain too steep for even the fastest computers. The problem is what applied mathematics guru Richard E. Bellman famously called the curse of dimensionality.

But thankfully randomness offers a path around this mountain. The approach is called Monte Carlo integration for estimating sums and integrals

This is the second part of a series of posts on Mark chain Monte Carlo methods. The first part covers Markov chains. This posts covers the basics of Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We’ll focus on estimating sums, but the same logic applies to integrals.

Monte Carlo estimates

In the discrete (or countable) case, consider a function \(f\) that maps from a countable state space \(\mathbb{S}\). This state space maybe, for example, the lattice \(\mathbb{Z}^d\) or some subset of it. We write the sum as

$$S(f)=\sum_{x\in \mathbb{S}} f(x) $$

Such sums arise in many fields, particularly statistical physics, which is where the (Markov chain) Monte Carlo methods were born.1In general, we use integrals for continuous spaces, which are uncountable, whereas we use sums for discrete spaces, which are countable spaces. (The two are generalized by the concept of an integral measure.)

The idea behind a Monte Carlo estimate of a sum is remarkably simple. You randomly sample the function \(f\), find the average, and use it as an estimate of the sum.

Estimating with uniform variables

For a finite set \(\mathbb{S}=\{x_1,\dots,x_m\}\), we write the sum of \(f\) over the set \(\mathbb{S}\) as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}}  f(x) \frac{m}{m} \\&= \sum_{x\in\mathbb{S}}  \frac{f(x)p(x)}{p(x)}\\& = \mathbb{E}_{p}[ \frac{f(U)}{p(U)}]\,,\end{align}$$

where \(U\) is a discrete uniform random variables, which has the probability mass function \(p\) with support \(\mathbb{S}\), meaning \(p(x)=1/m\) if \(x\in \mathbb{S}\) and \(p(x)=0\) otherwise. But using a uniform random variable is a naive way, as there are better random variables we can use to sample \(f\).

Estimating with general variables

More generally, we can write the sum of a countable (possibly infinite) set \(\mathbb{S}\) as

$$\begin{align}S(f)&= \sum_{x\in\mathbb{S}}  \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=p/q\) and \(Y\) is a suitable random variable that has a probability mass function  \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\), gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

If we can (quickly) sample \(Y\), then we can (quickly) estimate the integral. The closer the (probability mass) function \(q\) is to \(|f|\), the better the estimate.

We can also use the above reasoning with integrals, replacing the sum and probability mass function with an integral and a probability density.

Error rate

If the samples \(y_1, \dots, y_n \) are random and independent, the (strong) law of large numbers guarantees that the unbiased Monte Carlo estimate converges (almost surely) to the correct answer.

Further assuming \(S(f^2)<\infty\), the error of the integral estimate \(|S(f)-S_n(f)|\) is proportional to

$$\frac{\mathbb{V}_q(f)}{n ^{1/2}},
$$
where
$$\mathbb{V}(f)_q=E_q[f^2(Y)]-(E_q[f(Y)])^2,$$
is the variance of the function \(f\). (This is not the variation of a function, which is something different.) The error \(|S(f)-S_n(f)|\) reduces at the rate \(O(n^{-1/2})\). The Monte Carlo approach allows for continual updating until the estimate is sufficiently converged.

The above logic applies to integrals of functions on general spaces.

The dimension \(d\) of the underlying space, as $\mathbb{S}^d$, does not appear in the error. The independence of the dimension is precisely why we tackle high-dimensional sums and integrals with Monte Carlo methods. The general rule of thumb is to use Monte Carlo integration for functions of four or more dimensions.

Losing the independence

Again, a sequence of independent random variables allow our Monte Carlo estimate to converge. We can loosen this requirement and allow some dependence between variables. To obtain a Monte Carlo estimate, we just need to use a sequence of random variables that are ergodic. Then we recover the (strong) law of large numbers that gives our Monte Carlo estimate.

Ergodicity

The word ergodicity is one of those technical terms you’ll see and hear in different fields of mathematics and physics, often when randomness is involved. The concept is also central to deterministic dynamical systems.

There are different versions and strengths of ergodicity, depending on the system and applications. But loosely speaking, an ergodic system is one in which temporal averages and spatial (or ensemble) averages converge to the same value.

That means when you want to find an average, you have two options in an ergodic system. The temporal option: You can measure something over a long time. The spatial (or ensemble) option: You can measure many instances of the same something. The two different averages will eventually converge to the same value.

Ergodic sequences

In the context of Markov processes and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers.

Consider a stochastic process \(X\), which is a sequence of random variables \(X_0,X_1,\dots\) indexed by some parameter \(t\). We assume the probability distributions \(\pi_0,\pi_1,\dots\) of the random variables converge to some final stationary distribution \(\pi\). For a countable state space, the (probability) distribution \(\pi\) corresponds to a probability mass function.

We then say a stochastic process \(X\) state space is ergodic if the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^t g(X_i)\rightarrow \sum_{x\in \mathbb{Z}}g(x)\pi(x)=\mathbb{E}_{\pi} [g(X)]\,,$$

as \(t\rightarrow\infty\) for all bounded functions, nonnegative functions \(g\). The subscript of the expectation simply denotes that the expectation is taken with respect to the (stationary) distribution \(\pi\).

The equation tells the story. On the left-hand side you have a temporal average. On the right-hand side you have a spatial (or ensemble) average. As time proceeds, the first average will converge to the second average. Ergodicity.

Further reading

There are many books dedicated entirely to the subject of Monte Carlo methods. For example, a good book with detailed section on Markov chains applied to Monte Carlo methods in the context of statistics is:

  • Casella and Robert, Monte Carlo Statistical Methods.

These books also have sections on Monte Carlo integration:

Campbell’s theorem (formula)

In a previous post, I wrote about the concept of shot noise of a point process. In the simplest terms, shot noise is just the sum of some function over all the points of a point process. The name stems from the original mathematical models of the noise in old electronic devices, which was compared to shot (used in guns) hitting a surface.

In this post I will present a result known as Campbell’s theorem or  Campbell’s formula, which gives the expectation of shot noise as as simple integral expression. This both a general holding for all point processes. It is also useful, as shot noise naturally arises in mathematical models. One application is wireless network models, where the interference term is shot noise.

But to present the main result, I first need to give some basics of point processes, most of which I already covered in this post.

Point process basics

We consider a point processes \(\Phi\) defined on some underlying mathematical space \(\mathbb{S}\), which is often \(\mathbb{R}^n\).  Researchers typically interpret a point process as a random counting measure, resulting in the use of integral and measure theory notation. For example, \(\Phi(B)\) denotes the number of points located in some (Borel measurable) set \(B\), which is a subset of \(\mathbb{S}\).

For point processes, researchers often use a dual notation such that \(\Phi\) denotes both a random set or a random measure.  Then we can write, for example, \(\Phi=\{X_i\}_i\) to stress that \(\Phi\) is a random sets of points.  (Strictly speaking, you can only use the set notation if the point process is simple, meaning that no two points coincide with probability one.)

The first moment measure of a point process, also called the mean measure or intensity measure, is defined as

$$\Lambda(B)= \mathbb{E} [\Phi(B)]. $$

In other words, the first moment measure can be interpreted as the expected number of points of \(\Phi\) falling inside the set \(B \subseteq \mathbb{S}\).

Shot noise definition

We assume a point process \(\Phi=\{X_i\}_i\) is defined on some space \(\mathbb{S}\). We consider a non-negative function \(f\) with the domain \(\mathbb{S}\), so \(f:\mathbb{S} \rightarrow [0,\infty)\).  If the point process \(\Phi\) is a simple, we can use set notation and define the shot noise as

$$
S= \sum_{X_i\in \Phi} f(X_i)\,.
$$

More generally, the shot noise is defined as

$$
S= \int_{ \mathbb{S}} f(x) \Phi(dx)\,.
$$

(We recall that an integral is simply a more general type of sum, which is why the integral sign comes from the letter S.)

Campbell’s theorem

We now state Campbell’s theorem.

Campbell’s theorem says that for any point process \(\Phi\) defined on a space \(\mathbb{S}\) the following formula holds

$$
\mathbb{E}[ S] = \int_{ \mathbb{S}} f(x) \Lambda(dx)\,,
$$

where \(\Lambda= \mathbb{E} [\Phi(B)]\) is the intensity measure of the point process \(\Phi\).

Interpretation

The integral formula is just an application of Fubini’s theorem, as we have simply changed the order of integration.  The formula holds for general processes because it is simply a result on first moments, so it is leveraging the linearity of sums and integrals, including the expectation operator. Put more simply, the sum of parts does equal the whole.

Some history

At the beginning of the 20th century, Norman R. Campbell studied shot noise in Britain and wrote two key papers. In one of these papers appears a version of the result we now called Campbell’s theorem or Campbell’s formula. Interestingly, Campbell was a physicist who credited his mathematical result to renown pure mathematician G. H. Hardy.  Hardy claimed years later that, given he only researched pure mathematics, none of his work would lead to applications. But Hardy’s claim is simply not true due to this result, as well as his results in number theory, which are famously used in modern day cryptography.

Further reading

For some basics on point processes, I suggest the classic text Stochastic Geometry and its Applications by Chiu, Stoyan, Kendall and Mecke, which covers point processes and the varying notation in Chapters 2 and 4. Haenggi also wrote a very readable introductory book called Stochastic Geometry for Wireless networks, where he gives the basics of point process theory.

These moment formula are also presented with proofs in Applied Probability by Lange; see Section 6.9.