Markov goes to Monte Carlo

Numerical and scientific fields such as statistics, computational physics, and optimization methods heavily rely upon Markov chain Monte Carlo methods. These simulation techniques use the power of Markov chains to sample general probability distributions, which in turn give Monte Carlo methods for estimating integrals and optimal solutions.

This is the third part of a multi-part series of posts. The first part covers Markov chains. The second part covers the basics of Monte Carlo methods. This post combines the ideas from the first two parts. Overall, the posts sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We will only examine Markov chains with countable state spaces. In this setting, we only need standard probability and matrix knowledge. But the results extend to general state spaces, such as Euclidean space.

By the way, I doubt that Andrey A. Markov ever went to Monte Carlo in the small country of Monaco.1But Herbie did.

Monte Carlo conditions

Sum

In the first post we covered Markov chains with countable state spaces, because the mathematics is more transparent. Consequently, we can use such Markov chains to estimate the sum of a function \(f\) over a countable (possibly infinite) set \(\mathbb{S}\). We can write the sum as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \\&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\), gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum, namely

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

The closer the function \(q\) is to \(|f|\), the better the estimate.

Integral

We can write an integral over some region \(\mathbb{S}\) as

$$\begin{align}\int_{\mathbb{S}} f(x) dx &= \int_{\mathbb{S}} \frac{f(x)}{p(x)}p(x)dx\\ & =\mathbb{E}_p[\frac{f(Y)}{p(Y)}]\,,\end{align}$$

where \(p\) is the probability density of a random variable (or point) \(Y\) with support \(A\). Then by sampling \(Y\), the resulting samples \(y_1, \dots, y_n\in [0,1]\) give the unbiased Monte Carlo estimate

$$I_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{p(y_i)}\,.$$

Ergodic sequence

To estimate the sum or integral, we’ll need a sequence of ergodic random variables \(\{Y_i\}_{i\in\mathbb{N}}\), which we can achieve with an ergodic Markov chain with a state space \(\mathbb{S}\).

Markov chain conditions

In the context of Markov chains and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers. A Markov chain with a countable state space needs some conditions to ensure ergodicity.

Regularity conditions for countable a Markov chain

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Postive recurrence

In general, most of the needed structure for the Markov chain comes from irreducibility and a stationary distribution existing. Sometimes the above conditions are redundant. For example, an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent. We look at ways on how to satisfy these conditions in the Monte Carlo setting when the underlying state space is countable.

Stationary distribution \(\pi\)

A stationary distribution \(\pi\) satisfies equation

$$ \pi=P\pi.$$

We know exactly what our stationary distribution \(\pi\) needs to be. It is the one used in the Monte Carlo method, namely the distribution of \(Y\). The challenge is constructing a Markov chain with such a stationary distribution. This will of course depend on the Markov (transition) kernel \(P\) of the Markov chain.

We’ll look at the other conditions on the Markov kernel \(P\).

Aperiodicity

We recall that the period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P(x,x)^n>0\). We need every state of the countable Markov chain to be aperiodic, meaning \(d_x=1\) for all \(x\in\mathbb{X}\).

One sufficient way to achieve this requirement is to first have the countable Markov chain to be irreducible, which we’ll cover in the next section. Then if there exists at least one state \(x\in \mathbb{x}\) such that \(P(x,x)>0\), then the Markov chain is aperiodic.2Lemma 1.8.2 in Markov Chains by Norris or Remark 6.2.5 in Probability and Stochastic Processes by Brémaud.

Clearly the last condition is not difficult to satisfy, putting the focus now on the irreducibility requirement.

Irreducibility

The irreducibility property says that it is possible for the Marko chain to get from any point to another point in a finite number of steps. More precisely, for an irreducible Markov chain with countable state space, there must exist for all \(x,y\in\mathbb{X}\) a natural number \(s\) (possibly depending \(x\) and \(y\)) such that \(P(x,y)^s>0\).

We can achieve this requirement by introducing a stronger requirement, which is easier to verify. A countable Markov chain with a transition \(P\) will be irreducible if for all \(x,y\in\mathbb{X}\) the condition \(P(x,y)>0\) holds. We have simply required that the above number \(s=1\) for all \(x,y\in\mathbb{X}\).

Furthermore, if \(P(x,x)>0\) for all \(x\in\mathbb{X}\), then our irreducible countable Markov chain is also aperiodic.

Positive recurrence

For a point or state \(x\in\mathbb{X}\), we recall its first return time being defined as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). For a countable Markov chain, if all the states in the state space are positive recurrent, then we say the Markov chain is positive recurrent.

A countable irreducible Markov chain is positive recurrent if (and only if) it has a stationary distribution \(\pi\). Furthermore, that stationary distribution will be unique and nonzero, meaning \(\pi(x)>0\) for all \(x\in\mathbb{X}\).3Theorem 1.7.7 in Markov Chains by Norris or Theorem 6.3.14 in Probability and Stochastic Processes by Brémaud.

For positive recurrence, we need to prove that the Markov chain has a stationary distribution.

Ergodicity

A countable Markov chain that is aperiodic and irreducible is ergodic. In other words, an ergodic Markov process \(X\) with stationary distribution \(\pi\) is a random sequence \(X_0, X_1,\dots\) such that the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i) \rightarrow \sum_{x\in\mathbb{X}}g(x)\pi(x)$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). Conveniently, wllle can use such an ergodic Markov chain to make statistical estimates.

Sampling with a Markov chain

For the Monte Carlo sample \(y_1,\dots,y_n\), we simply use the Markov chain samples \(x_0,x_1,\dots,x_{n-1}\), giving the Monte Carlo estimate. In the discrete case, this is simply

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n g(x_{i-1}).$$

which, due to ergodicity, converges to \(S(f)=\mathbb{E}_{Y}[g(Y)]\).

Summary

In summary, a sufficient way to have a countable ergodic Markov chain is for the transition kernel to be such that \(P(x,y)>0\) for all \(x\in\mathbb{X}\) and there exists a (stationary) distribution \(\pi=P\pi\). For the first requirement, one imagines using a non-negative function such as \(e^{-(x-y)^2}/C\), where \(C>0\) is a suitable normalization constant.

The requirements seem possible, but to bring them all together is still a great challenge for the uninitiated. Fortunately, some great minds almost seven decades ago proposed the first Markov chain that meets all the requirements for a specific stationary distribution, which was used for a Monte Carlo estimate.

The first such Markov chain Monte Carlo method appeared in a paper written by five scientists, including two husband-wife pairs. It became known as the Metropolis or Metropolis-Hastings algorithm, and it will be the subject of a future post.

Further reading

Books

Markov chains

Any stochastic process book will include a couple of chapters on Markov chains. For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section detailing a Markov chain Monte Carlo method, such as the Metropolis-Hastings algorithm and the Gibbs sampler.

Monte Carlo methods

Books dedicated to Monte Carlo approaches include:

For a statistics context, there’s also the good book:

  • Casella and Robert – Monte Carlo Statistical Methods.
Waw

The first MC in MCMC methods

Markov chains form a fundamentally important class of stochastic processes. It would be hard to over stress their importance in probability, statistics, and, more broadly, science and technology. They are indispensable in random simulations, particularly those based on the Markov chain Monte Carlo methods. In this post, we’ll have a look some Markov chain basics needed for such simulation methods.

This is the first part of a series of posts on Mark chain Monte Carlo methods. This post covers the basics of Markov chains, which is the more involved part. The second part will cover Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

Markov chains vs Markov processes

All Markov chains are Markov processes. Some people use the term Markov chain to refer to discrete-time Markov processes with general state spaces. Other people prefer the term Markov chain for continuous-time Markov processes with countable state spaces.1In his book Applied Probability and Queues Asmussen writes:

In this book, we use the terminology that a Markov chain has discrete time and a Markov process has continuous time (the state space may be discrete or general). However, one should note that it is equally common to let “chain” refer to a discrete state space and “process” to a general one (time may be discrete or continuous).

Nevertheless, the first MC in the MCMC suggests the Markov chain Monte Carlo crowd prefers the former sense of Markov chain, given the use of discrete-time Markov processes in their simulations.

Markov the frog

Some writers introduce Markov chains with a mental image of a frog jumping around lily pads scattered over a pond. (Presumably the frog never misses a lily pad.) We assume the frog randomly chooses the next lily pad through some random mechanism. Perhaps the distances between lily pads or their sizes influence the chances that frog will jump between them.

We further assume that the frog is a bit particular, preferring to jump in certain directions more than others. More precisely, the probability of our frog jumping from a lily pad labelled \(x\) to another labelled \(y\) is \(P(x,y)\). But jumping in the opposite direction happens with probability \(P(y,x)\), which in general is not equal to \(P(x,y)\).

I typically use the term points, but the Markov literature usually says that the Markov chain visits states.

State space

We can interpret a Markov chain, a type of stochastic process, as a collection or sequence of random variables. 2I briefly detailed stochastic processes in a previous post.The values of the random variables are points in some mathematical space \(\mathbb{X}\). This space can be quite abstract, but in practice it’s usually the lattice \(\mathbb{Z}^n\), Euclidean space \(\mathbb{R}^n\), or a subset of one of these two spaces. For our frog example, all the lily pads in the pond form the state space.

We’ll only consider countable Markov chains where the number of points in the state space \(\mathbb{X}\) is countable. Although the results and theory generally hold for more general state spaces, the accompanying work requires more technical mathematics. For finite and countable state spaces, we can use standard probability and matrix knowledge. But when we use uncountable state spaces such as \(\mathbb{R}^n\), we enter the world of measure theory and functional analysis.

I will often write a point \(x\) in a (state) space \(\mathbb{X}\). But you can say an element \(x\) of a set \(\mathbb{X}\). Many authors refers to the points or elements as states of the Markov chain. In the frog example, each lily pad is a different state.

Markov property

A discrete-time countable Markov chain is a random process that jumps between points of some countable mathematical space \(\mathbb{X}\) such that, when at point \(x \in \mathbb{X}\), the next position is chosen according to a probability distribution \(P(x,·)\) depending only on \(x\).

More specifically, a sequence of random variables \((X_0, X_1, . . .)\) is a discrete-time Markov chain \(X\) with a countable state space
\(\mathbb{X}\) and kernel \(P\) if for all \(x,y \in \mathbb{X}\) and all \(t \geq 1\) satisfying \(\mathbb{P}[X_{t−1}=x_{t-1},\dots,X_0=x_0]>0\), we have

$$ \begin{align}\mathbb{P}[X_{t+1} =y|&X_{t}=x,X_{t−1}=x_{t-1},\dots,X_0=x_0]\\&=\mathbb{P}[X_{t+1} =y|X_t =x]\\&=P(x,y)\,.\end{align}$$

This equation is often called the Markov property.

The Markov property says that the conditional probability of jumping from point \(x\) to \(y\) remains the same, regardless of which points or states \(x_0,x_1,\dots,x_{t-1}\) were previously visited. This is precisely why the kernel \(P\) contains all the information needed to describe the future random evolution of the Markov chain.

We have assumed the probabilities given by \(P\) are fixed, meaning we have described a homogeneous Markov chain.

Markov kernel

The kernel \(P\) is called the Markov (transition) kernel or probability kernel. Assuming a countable state space \(\mathbb{X}\), we can reference any probability value of the kernel \(P\) with two variables \(x,y\in\mathbb{X}\). If we assume a finite state space \(\mathbb{X}\), then the kernel \(P\) becomes a regular matrix taught in linear algebra. An infinite but countable state space gives an infinite matrix \(P\). The rows of the kernel matrix \(P\) must add up to one, because each row is a probability measure.

A more general space, such as Euclidean space \(\mathbb{R}^n\), results in a more general kernel with respect to a suitable measure. In this setting, \(P(x,·)\) is no longer a probability mass function, but a general probability measure.

Initial distribution

At time \(t=0\) we describe the random initial configuration of a Markov process with a probability distribution \(\mu_0\). For a finite or countable Markov chain, this initial distribution \(\mu_0\) corresponds to a probability mass function encoded as a row vector.

Jumping from \(x\) to \(y\)

The probability distribution \(\mu_0\) gives the probability of starting in state (or at point) \(x\in\mathbb{X}\). After one time step, we can write down the probability distribution \(\mu_1\) that gives us the different probabilities of the Markov chain being at different states. At \(n=1\), basic matrix algebra and probability rules give us the matrix equation

$$\mu_1=\mu_0 P$$

By induction, after \(t\) time steps we have the expression

$$\mu_n=\mu_0 P^n\,.$$

where the superscript \(n\) denotes matrix power. We can write the \(n\)-time step kernel as \(P_{(n)}\), which for a finite Markov chain is given by the matrix equation \(P_{(n)}=P^n\).

Seeing how \(P_{(n)}\) behaves as \(n\) approaches infinity forms part of work that studies the convergence and ergodicity properties of Markov chains. I’ll make these concepts clearer below. But first I’ll give some conditions that are typically needed.

Regularity conditions

A Markov chain with a countable state space needs some conditions to ensure convergence and ergodicity.

Regularity conditions

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Postive recurrence

The nature of the state space and the kernel will dictate these conditions. These conditions are also not necessarily logically distinct. For example, on a finite state space, you’ll get positive recurrence for free, because an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent.

We now briefly detail these conditions and in another post I’ll give examples how the conditions can be met.

Stationary distribution \(\pi\)

It’s possible to encounter a probability distribution \(\pi\) where applying the kernel \(P\) returns the same distribution \(\pi\), meaning

$$ \pi=\pi P\,.$$

This (fixed-point) equation is called the balance equation.

The distribution \(\pi\) is called the stationary, invariant or steady-state distribution. A Markov chain does not need to have a stationary distribution. And if a Markov chain does have one, it may not be unique. Its existence and uniqueness will depend on the Markov kernel \(P\).

Showing that a unique stationary distribution exists and it is possible to reach it with probability one is the stuff of Markov convergence results. Markov chain Monte Carlo methods hinge upon these results .

Aperiodicity

It is possible for a Markov chain to get trapped in a loop, periodically visiting the same states. The period \(d_x\) of a state \(x\in \mathbb{x}\) is the greatest common divisor of all \(n\) values such that \(P(x,x)^n>0\). If the period of a point is \(d_x=1\), then we say it’s aperiodic. If every state of a Markov chain is aperiodic, we says it’s an aperiodic Markov chain.

Aperiodicity means there are no loops to trap the Markov chain. This property is typically needed for convergence results.

Irreducibility

A Markov chain with a countable state space \(\mathbb{X}\) is irreducible if the Markov chain can go from any point \(x\in\mathbb{X}\) to another other point \(x\in\mathbb{X}\) with a positive probability in a finite number of time steps. In other words, there exists a natural number \(s\) such that \(P(x,y)^s>0\) for all \(x,y\in\mathbb{X}\).

Irreducibility ensures that a Markov chain will visit all the states in its state space. This property is also needed for convergence results.

Recurrence

When studying Markov processes, a quantity of interest is how much time it takes to return to a state or point. For a point \(x\in\mathbb{X}\), we define its first return time as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

As the name suggests, this random variable is the number of time steps for the Markov process return to state \(x\), taking whichever path, conditioned on it starting at \(x\).

We call a state \(x\) recurrent if the probability of its first return time being finite is one, meaning \(\mathbb{P}_x(T_x^+<\infty)=1\). Otherwise the state \(x\) is said to be transient.

Positive recurrence

We can classify different types of recurrence based on the expected value of the first return times. A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). Otherwise state \(x\) is null recurrent.

For a countable Markov chain, if all the states in the state space are (positive) recurrent, so \(\mathbb{E}_x(T_x^+)<\infty\) for all \(x\in\mathbb{X}\), then we say the Markov chain is (positive) recurrent.

Again, the concept of positive recurrence is needed for convergence results.

Ergodicity

We say a countable Markov chain is ergodic if it is irreducible, aperiodic and positive recurrent.3See, for example, Basics of Applied Stochastic Process by Serfozo (page 26) or Probability Theory and Stochastic Processes by Bremaud (page 262). Ergodicity allows one to find averages by employing a more general form of the law of large numbers, which Monte Carlo methods rely upon. We stress that definitions of ergodicity vary somewhat, but in general it means convergence and laws of large numbers exists.

Further reading

Any stochastic process book will include a couple of chapters on Markov chains such as:

    • Brémaud – Probability Theory and Stochastic Processes;
    • Serfozo -Basics of Applied Stochastic Processes.

For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section on Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm or the Gibbs sampler.