The second MC in MCMC methods

In calculus differentiating a function is easy. Integrating one is hard. Much research has gone into finding ways for calculating integrals. But often functions cannot be integrated, at least analytically.

No problem. You do it numerically. You take the function’s curve, split it up into nice rectangles or, even better, trapezoids, and calculate their areas, and add them up. Not accurate enough? Split the curve up even finer until you’re happy. Done.

That works fine in one dimension and even two dimensions. But then, as the number of dimensions increases, the amount of splitting and adding grows exponentially, soon hitting unwieldy numbers. This is a mountain too steep for even the fastest computers. The problem is what applied mathematics guru Richard E. Bellman famously called the curse of dimensionality.

But thankfully randomness offers a path around this mountain. The approach is called Monte Carlo integration for estimating sums and integrals

This is the second part of a series of posts on Mark chain Monte Carlo methods. The first part covers Markov chains. This posts covers the basics of Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods.

We’ll focus on estimating sums, but the same logic applies to integrals.

Monte Carlo estimates

In the discrete (or countable) case, consider a function $f$ that maps from a countable state space $\mathbb{S}$. This state space maybe, for example, the lattice $\mathbb{Z}^d$ or some subset of it. We write the sum as

$$S(f)=\sum_{x\in \mathbb{S}} f(x) $$

Such sums arise in many fields, particularly statistical physics, which is where the (Markov chain) Monte Carlo methods were born.¹

The idea behind a Monte Carlo estimate of a sum is remarkably simple. You randomly sample the function $f$, find the average, and use it as an estimate of the sum.

Estimating with uniform variables

For a finite set $\mathbb{S}=\{x_1,\dots,x_m\}$, we write the sum of $f$ over the set $\mathbb{S}$ as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \frac{m}{m} \\&= \sum_{x\in\mathbb{S}} \frac{f(x)p(x)}{p(x)}\\& = \mathbb{E}_{p}[ \frac{f(U)}{p(U)}]\,,\end{align}$$

where $U$ is a discrete uniform random variables, which has the probability mass function $p$ with support $\mathbb{S}$, meaning $p(x)=1/m$ if $x\in \mathbb{S}$ and $p(x)=0$ otherwise. But using a uniform random variable is a naive way, as there are better random variables we can use to sample $f$.

Estimating with general variables

More generally, we can write the sum of a countable (possibly infinite) set $\mathbb{S}$ as

$$\begin{align}S(f)&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where $g=p/q$ and $Y$ is a suitable random variable that has a probability mass function $q$ with support $\mathbb{S}$.

Sampling the random variable (or point) $Y$, gives the samples $y_1,\dots,y_n$, which yields an unbiased Monte Carlo estimate of the sum

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

If we can (quickly) sample $Y$, then we can (quickly) estimate the integral. The closer the (probability mass) function $q$ is to $|f|$, the better the estimate.

We can also use the above reasoning with integrals, replacing the sum and probability mass function with an integral and a probability density.

Error rate

If the samples $y_1, \dots, y_n $ are random and independent, the (strong) law of large numbers guarantees that the unbiased Monte Carlo estimate converges (almost surely) to the correct answer.

Further assuming $S(f^2)<\infty$, the error of the integral estimate $|S(f)-S_n(f)|$ is proportional to

$$\frac{\mathbb{V}_q(f)}{n ^{1/2}},
$$
where
$$\mathbb{V}(f)_q=E_q[f^2(Y)]-(E_q[f(Y)])^2,$$
is the variance of the function $f$. (This is not the variation of a function, which is something different.) The error $|S(f)-S_n(f)|$ reduces at the rate $O(n^{-1/2})$. The Monte Carlo approach allows for continual updating until the estimate is sufficiently converged.

The above logic applies to integrals of functions on general spaces.

The dimension $d$ of the underlying space, as $\mathbb{S}^d$, does not appear in the error. The independence of the dimension is precisely why we tackle high-dimensional sums and integrals with Monte Carlo methods. The general rule of thumb is to use Monte Carlo integration for functions of four or more dimensions.

Losing the independence

Again, a sequence of independent random variables allow our Monte Carlo estimate to converge. We can loosen this requirement and allow some dependence between variables. To obtain a Monte Carlo estimate, we just need to use a sequence of random variables that are ergodic. Then we recover the (strong) law of large numbers that gives our Monte Carlo estimate.

Ergodicity

The word ergodicity is one of those technical terms you’ll see and hear in different fields of mathematics and physics, often when randomness is involved. The concept is also central to deterministic dynamical systems.

There are different versions and strengths of ergodicity, depending on the system and applications. But loosely speaking, an ergodic system is one in which temporal averages and spatial (or ensemble) averages converge to the same value.

That means when you want to find an average, you have two options in an ergodic system. The temporal option: You can measure something over a long time. The spatial (or ensemble) option: You can measure many instances of the same something. The two different averages will eventually converge to the same value.

Ergodic sequences

In the context of Markov processes and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers.

Consider a stochastic process $X$, which is a sequence of random variables $X_0,X_1,\dots$ indexed by some parameter $t$. We assume the probability distributions $\pi_0,\pi_1,\dots$ of the random variables converge to some final stationary distribution $\pi$. For a countable state space, the (probability) distribution $\pi$ corresponds to a probability mass function.

We then say a stochastic process $X$ state space is ergodic if the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^t g(X_i)\rightarrow \sum_{x\in \mathbb{Z}}g(x)\pi(x)=\mathbb{E}_{\pi} [g(X)]\,,$$

as $t\rightarrow\infty$ for all bounded functions, nonnegative functions $g$. The subscript of the expectation simply denotes that the expectation is taken with respect to the (stationary) distribution $\pi$.

The equation tells the story. On the left-hand side you have a temporal average. On the right-hand side you have a spatial (or ensemble) average. As time proceeds, the first average will converge to the second average. Ergodicity.