The Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm

Consider a collection of random variables described by a joint probability distribution. Often, in any field with probability or statistics, one faces the task of simulating these random variables, which typically depend on each other in some fashion.

A now standard way of simulating or sampling such random variables is to use the Metropolis-Hastings algorithm, undoubtedly the cornerstone of Markov chain Monte Carlo methods. This method creates a discrete-time Markov chain whose stationary or invariant distribution is the aforementioned distribution.

The algorithm was born out of a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method. It is typically called the Metropolis-Hastings or the Metropolis algorithm. And some have called it the \(M(RT)^2 H\) algorithm.

(The history is a bit complicated, but perhaps we should drop the name Metropolis. The late Arianna (née Wright) Rosenbluth did most of the work. She was also a dab hand at fencing.)

Although the algorithm’s initial adoption was slow, taking decades, partly due to the limited computing power of the time, the Metropolis-Hastings algorithm is now a widely used method for simulating collections of random variables. This in turn gives fast ways for exploring, integrating, and optimizing otherwise unwieldy mathematical functions, such as those found in Bayesian statistics, machine learning, statistical physics, and combinatorial optimization. The algorithm serves as the foundation for other random simulation methods, such as the Gibbs sampler, hence it’s been called the workhorse of Markov chain Monte Carlo methods.

There are many books, articles, lecture notes, and websites describing the Metropolis-Hastings algorithm; see the Further reading section below. But I’ll detail the core ideas here. This post is designed to be somewhat self-contained, but it arose from a series of posts, starting with this one and ending with this particularly relevant one.

Constructing a Markov process

Take a collection of \(n\) random variables \(X_1,\dots,X_n\) with a (joint) probability distribution \(\pi(x)=\pi(x_1,\dots,x_n)\). This distribution will either be a (joint) probability mass function or (joint) probability density for discrete or continuous random variables, respectively.

We wish to construct a Markov chain on an abstract mathematical space \(\mathbb{X}\). We assume we can write a point \(x\in \mathbb{X}\) as \(x=(x_1,\dots,x_n)\). More specifically, the space \(\mathbb{X}\) is a Cartesian product of spaces \(\mathbb{X}_1,\dots,\mathbb{X}_n\) on which the variables are defined.

Which mathematical space is \(\mathbb{X}\)? That will, of course, depend on the random variables you’re trying to simulate. But it’s usually the lattice \(\mathbb{Z}^n\), Euclidean space \(\mathbb{R}^n\), or a subset of one of these two spaces.

For this post, we’ll assume that the space \(\mathbb{X}\) is discrete, which makes things simpler. But the mathematics is similar for continuous spaces, with small changes such as replacing probabilities and sums with probability densities and integrals, respectively. We’ll further assume the space is finite, so we can use matrices to describe the Markov transition kernels. But everything covered here works on infinite spaces such as \(\mathbb{R}^n\), which is the most common space used in practice.

Again, our overall aim is to construct a Markov chain with a stationary distribution \(\pi\) that is the same as the distribution we want to sample from.

Jumper process

There’s a random jumper that wants to jump around the space \(\mathbb{X}\). The jumper randomly jumps from one point in this mathematical space \(x\in \mathbb{X}\) to another point \(y\in \mathbb{X}\) according to the probability \(J(x,y)\). If the state space \(\mathbb{X}\) is finite, then \(J\) becomes a matrix. The matrix row \(J(x,\cdot)\) is a probability mass function for each \(x\in \mathbb{X}\), so it sums to one. By definition, this random jumping forms a Markov chain.

The only thing we ask is that, for our jumper, every point \(y\in \mathbb{X}\) where \(\pi(y)>0\) is reachable with positive probability in a single step. This amounts to the easy-to-achieve condition \(J(x,y)>0\) whenever \(\pi(y)>0\).

Now we have a Markovian jumper on the space \(\mathbb{X}\). But this turns out to be too much jumping for our jumper. Furthermore, the jumper is jumping more in certain directions than others. Occasionally the jumper wants to stay put (and have a rest) with the aim of balancing the jump directions.

The jumper still wants to jump sometimes from a point \(x\in \mathbb{X}\) to another point \(y\in \mathbb{X}\) based on \(J(x,y)\). But now at each time step, after choosing the jump direction but before jumping, the jumper flips a biased coin whose success probability \(\alpha(x,y)\) depends on the current position \(x\in \mathbb{X}\) and the (potential) next position \(y\in \mathbb{X}\). For the coin, the acceptance probability, which allows (or not) the jumper to move from \(x\) to \(y\), is given by

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{J(y,x)}{J(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The function \(\alpha(x,y)\) is clearly never negative. The minimum in the above expression ensures that \(\alpha(x,y)\) is a valid probability.

Metropolis-Hastings ratio

The ratio in the expression for \(\alpha(x,y)\) is sometimes called the Metropolis-Hastings ratio, which we’ll soon see is designed specifically to balance the jump directions. The ratio means that a constant factor in the target distribution \(\pi(x)\) will vanish due to cancellation.

More specifically, if we can write the target distribution as \(\pi(x)=f(x)/C\), where \(C>0\) is some constant and \(f(x)\) is a non-negative function, then the ratio term \(\pi(y)/\pi(x)=f(y)/f(x)\). The same reasoning applies to a constant factor in the jumper kernel \(J\).

The constant factor being irrelevant in the target distribution \(\pi(x)\) is very useful. It is particularly important for posterior distributions in Bayesian statistics and the Gibbs distributions in statistical physics, as these distributions typically have difficult-to-calculate constants.
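
To make the cancellation concrete, here is a minimal Python (NumPy) sketch of the acceptance probability computed directly from an unnormalized function \(f\), so the unknown constant \(C\) never appears. The particular \(f\) and jumper kernel \(J\) below are made up purely for illustration.

```python
import numpy as np

def acceptance_probability(x, y, f, J):
    """Metropolis-Hastings acceptance probability alpha(x, y), computed from an
    unnormalized target f (so pi = f / C with C unknown) and a jumper kernel J.
    Only called for proposed jumps, so J(x, y) > 0."""
    return min(1.0, (f(y) / f(x)) * (J(y, x) / J(x, y)))

# Made-up example: an unnormalized target on the integers and a simple random walk.
f = lambda x: np.exp(-abs(x))                      # pi(x) = f(x) / C, with C unknown
J = lambda x, y: 0.5 if abs(x - y) == 1 else 0.0   # jump to a neighbouring integer

print(acceptance_probability(2, 3, f, J))   # about 0.37: proposal to a less likely point
print(acceptance_probability(3, 2, f, J))   # 1.0: proposal to a more likely point
```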

Metropolis-Hastings process

The pairing of the original jumper Markov chain with the coin flipping creates a new Markov chain, which we call the Metropolis-Hastings process. What’s remarkable is that its stationary distribution is the target distribution \(\pi(x)\).

Transition kernel (matrix)

For the Metropolis-Hastings process, we can readily reason the structure of the transition kernel (matrix) \(M\) that describes this Markov chain. First we’ll look at the off-diagonal entries of \(M\).

Jumping from \(x\) to \(y\neq x\)

To jump from point \(x\) to another point \(y\neq x\), the probability is simply the jump probability \(J(x,y)\) multiplied by the probability of that proposed jump being accepted, which is \(\alpha(x,y)\), giving

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

Now we examine the diagonal entries of \(M\).

Jumping from \(x\) to \(x\)

There are two different ways for the jumper to remain at point \(x\). The first way is that the jumper simply jumps from \(x\) to \(x\), which happens with probability \(J(x,x)\). This proposed jump, so to speak, is accepted with probability one because \(\alpha(x,x)=1\). Consequently, we can write this probability as \(\alpha(x,x)J(x,x)\) or \(J(x,x)\).

The second way consists of all the possible jumps from \(x\) to \(z\), but then for each of those proposed jumps to be rejected, which happens (for each jump) with probability \([1-\alpha(x,z)]\). (I am using here the dummy or bound variable \(z\) instead of \(y\) for clarity.) Adding up the probabilities of these events gives the probability of the second way being the sum \(\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.\)

Consequently, for a single time step, the probability that the jumper starts at point \(x\) and remains at \(x\) is

$$ M(x,x) = \alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$$

The transition matrix \(M\) should be stochastic, meaning its rows sum to one, which is indeed the case:

$$ \sum_{y\in\mathbb{X}}M(x,y)=1\,.$$

Of course, we could have derived the diagonal entry \(M(x,x)\) immediately by starting with the above sum, but that approach lacks intuition into what’s happening.

Expression for the transition kernel \(M\)

$$M(x,y) = \begin{cases}
\alpha(x,y) J(x,y) & \text{if } x\neq y\,,\\
\alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) & \text{if } x=y\,.
\end{cases}$$

Often the above expression is written as a single line by placing an indicator function or similar in front of the sum for the diagonal entries. (In the continuous case, where the sum is replaced with an integral, a Dirac delta distribution is often used instead.)
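
To make the expression concrete, here is a short Python (NumPy) sketch that builds the matrix \(M\) on a small finite state space from a jumper matrix \(J\) and a target \(\pi\) (both made up for illustration) and checks that each row sums to one.

```python
import numpy as np

# A made-up target distribution pi and jumper matrix J on three states.
pi = np.array([0.2, 0.3, 0.5])
J = np.array([[0.1, 0.4, 0.5],
              [0.3, 0.3, 0.4],
              [0.2, 0.6, 0.2]])   # each row of J sums to one

n = len(pi)
alpha = np.ones((n, n))
for x in range(n):
    for y in range(n):
        alpha[x, y] = min(1.0, (pi[y] * J[y, x]) / (pi[x] * J[x, y]))

# Off-diagonal entries: alpha(x, y) * J(x, y).
M = alpha * J
# Diagonal entries: alpha(x, x) J(x, x) + sum over z of [1 - alpha(x, z)] J(x, z).
for x in range(n):
    M[x, x] = alpha[x, x] * J[x, x] + np.sum((1 - alpha[x, :]) * J[x, :])

print(M.sum(axis=1))   # each row sums to one (up to floating-point error)
```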

Reversibility

A Markov process on \(\mathbb{X}\) with kernel (matrix) \(K\) is (time) reversible with respect to the distribution \(\mu\) if the following holds

$$ \mu(x)K(x,y) = \mu(y) K(y,x)\,,\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution \(\mu\). By summing over \(x\), we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as \(\mu=\mu K\), which says that the transition kernel \(K\) has a stationary distribution \(\mu\).

(Strictly speaking, we don’t necessarily need reversibility, as long as the Markov chain has a stationary distribution. Reversibility just makes the algebra easier.)

The Metropolis-Hastings process is reversible

We can show that the Metropolis-Hastings process with the transition kernel \(M\) satisfies the reversibility condition. The proof essentially just requires the swapping of rows and columns. Clearly then we only need to look at the off-diagonal entries, so we assume \(x\neq y\).

We further assume that \(\pi(y)J(y,x)\geq \pi(x) J(x,y)\). Then \(\alpha(x,y)=1\), so \(M(x,y)=J(x,y)\). We use this last fact in the final line of the algebra that follows:

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= \pi(y)J(y,x) \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}\\ &= \pi(x)J(x,y)\\&= \pi(x)M(x,y)\,,\end{aligned}$$

which shows that the reversibility condition holds with the last assumption.

Of course, by symmetry, we can reverse that assumption, so \(\pi(y)J(y,x)\leq \pi(x) J(x,y)\), in which case \(\alpha(x,y)=[\pi(y)J(y,x)]/[\pi(x) J(x,y)]\) and \(\alpha(y,x)=1\), and the same argument works with the roles of \(x\) and \(y\) swapped. The reversibility condition therefore holds in all cases.

We could have shown this more quickly by observing that

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x)\alpha(y,x)\\ &= \pi(y)J(y,x) \min[1, \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}]\\ &= \min[\pi(y)J(y,x) , \pi(x)J(x,y)]\,,\end{aligned}$$

which is symmetric in \(x\) and \(y\).

In summary, the Metropolis-Hastings process is reversible and has the stationary distribution \(\pi(x)\).
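
Continuing in the same spirit, a small numerical check is reassuring. The Python (NumPy) sketch below, again with a made-up \(\pi\) and \(J\), verifies both the reversibility condition \(\pi(x)M(x,y)=\pi(y)M(y,x)\) and the stationarity \(\pi M=\pi\).

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
J = np.array([[0.1, 0.4, 0.5],
              [0.3, 0.3, 0.4],
              [0.2, 0.6, 0.2]])

# Acceptance probabilities alpha(x, y) = min(1, pi(y) J(y, x) / (pi(x) J(x, y))).
alpha = np.minimum(1.0, np.outer(1 / pi, pi) * (J.T / J))
M = alpha * J
np.fill_diagonal(M, np.diag(J) + ((1 - alpha) * J).sum(axis=1))

# Detailed balance: pi(x) M(x, y) equals pi(y) M(y, x) for all pairs x, y.
D = pi[:, None] * M
print(np.allclose(D, D.T))          # True

# Stationarity: pi M equals pi (row-vector convention).
print(np.allclose(pi @ M, pi))      # True
```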

Continuous case

As mentioned earlier, the continuous case is similar to the discrete case, with a few modifications. For example, assuming now that \(\mathbb{X}\) is continuous, the stationary distribution is described by a probability density \(\pi(x)\). The jumper process will be described by a transition kernel (function) \(j(x,y)\), where \(j(x,\cdot)\) is now a probability density for each \(x\in \mathbb{X}\).

It follows that the acceptance probability is

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{j(y,x)}{j(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The transition kernel (function) is

$$m(x,y) = \begin{cases}
\alpha(x,y) j(x,y) & \text{if } x\neq y\,,\\
\alpha(x,x) j(x,x)+\int_{\mathbb{X}}[1-\alpha(x,z)]\, j(x,z)\, dz & \text{if } x=y\,.
\end{cases}$$

For more details, there are derivations online such as this one, as well as the sources cited in the Further reading section below.

Libraries

Of course, before writing your own code, I would check out any pre-written functions in your favourite language that already implement the Metropolis-Hastings algorithm.

For example, in MATLAB there’s the function mhsample. Again, I would use this before writing my own code, unless it’s for illustration or educational purposes.

In Python there’s the library pymcmcstat. For those with a machine learning bent, there’s also a TensorFlow function MetropolisHastings.

In R, which is a language designed for statistics, implementations of any Markov chain Monte Carlo methods will be couched in the language and notation of (Bayesian) statistics. For a specific library, I can’t say I know which is the best, but you can check out the MCMCPack library. For R libraries, be wary of using the ones that were published and are (or were?) maintained by a single contributor.

Code

For a couple of simple examples in both one dimension and two dimensions, I’ve implemented the Metropolis-Hastings algorithm in the programming languages R, MATLAB, Python (NumPy) and Julia. The code can be found here.
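
For the gist of such an implementation, here is a minimal one-dimensional sketch in Python (NumPy), separate from the linked code. It uses a Gaussian random-walk proposal, which is symmetric, so the proposal ratio \(j(y,x)/j(x,y)\) cancels, and an unnormalized target chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target density f(x), so pi(x) = f(x) / C with C unknown.
f = lambda x: np.exp(-0.5 * x**2) * (1 + np.sin(3 * x)**2)

num_steps = 10_000
sigma = 1.0          # standard deviation of the random-walk proposal
x = 0.0              # starting point
samples = np.empty(num_steps)

for t in range(num_steps):
    y = x + sigma * rng.standard_normal()     # propose a jump from x to y
    if rng.random() < min(1.0, f(y) / f(x)):  # accept with probability alpha(x, y)
        x = y
    samples[t] = x                            # record the current position

print(samples.mean(), samples.std())          # crude summaries of the target
```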

Further reading

There are good articles explaining the Metropolis-Hastings approach, as well as its history. On this topic, the articles are probably better resources than the books.

Articles

Historical

The two important papers behind the Metropolis-Hastings algorithm are:

  • 1953 – Metropolis, Rosenbluth, Rosenbluth, Teller, Teller – Equation of state calculations by fast computing machines;
  • 1970 – Hastings – Monte Carlo sampling methods using Markov chains and their applications.
Introductory

There are several tutorial and historical articles covering the Metropolis-Hastings algorithm. An earlier explanatory article is

  • 1995 – Chib and Greenberg – Understanding the Metropolis-Hastings Algorithm.

I also recommend:

  • 1994 – Tierney – Markov chains for exploring posterior distributions;
  • 1998 – Diaconis and Saloff-Coste – What do we know about the Metropolis algorithm?;
  • 2001 – Billera and Diaconis – A Geometric Interpretation of the Metropolis-Hastings Algorithm;
  • 2015 – Minh and Minh – Understanding the Hastings Algorithm;
  • 2016 – Robert – The Metropolis-Hastings algorithm.

The above article by Billera and Diaconis shows that the Metropolis-Hastings algorithm is actually a minimization (over the space of possible samplers) with respect to an \(L^1\) norm. (Note that \(K(x,x)\) should be \(K(x,y)\) in equation (1.3), so that it agrees with equation (2.1).)

The following article covers the Metropolis-Hastings algorithm in the context of machine learning (particularly Bayesian statistics):

  • 2003 – Andrieu, de Freitas, Doucet, and Jordan – An Introduction to MCMC for Machine Learning.

For a surprising real world application, in the following article, Diaconis briefly recounts a story about a couple of graduate students using the Metropolis-Hastings algorithm to decipher a letter from a prison inmate who had used a simple substitution cipher:

  • 2009 – Diaconis – The Markov chain Monte Carlo revolution.

For Markov chains on general state spaces, the following paper gives a survey of results (often with new proofs):

  • 2004 – Roberts and Rosenthal – General state space Markov chains and MCMC algorithms.
History

The history of this algorithm and related methods are described in the following articles:

  • 2003 – Hitchcock – A History of the Metropolis-Hastings Algorithm;
  • 2005 – Gubernatis – Marshall Rosenbluth and the Metropolis algorithm;
  • 2011 – Robert and Casella – A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data.

Incidentally, the first author of the third paper, Christian P. Robert, posts regularly on Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm, as well as many other topics.

Books

Modern books on stochastic simulations and Monte Carlo methods will often detail this method. For example, see the Handbook of Monte Carlo Methods (Section 6.1) by Kroese, Taimre and Botev. The book Stochastic Simulation: Algorithms and Analysis by Asmussen and Glynn also covers the method in Chapter XIII, Section 3. (For my remark on reversibility, see Remark 3.2 in Asmussen and Glynn.) There is also the book Monte Carlo Strategies in Scientific Computing by Liu; see Chapter 5.

Websites

There are many, many websites covering this topic. Searching around, the following link is probably my favourite, as it gives a detailed explanation of how the Metropolis-Hastings algorithm works and includes Python code:

This post also covers it with Python code:

This site details the algorithm with R code:

Here’s a quick introduction with a one-dimensional example implemented in R:

Markov goes to Monte Carlo

Numerical and scientific fields such as statistics, computational physics, and optimization rely heavily upon Markov chain Monte Carlo methods. These simulation techniques use the power of Markov chains to sample general probability distributions, which in turn gives Monte Carlo methods for estimating integrals and optimal solutions.

This is the third part of a multi-part series of posts. The first part covers Markov chains. The second part covers the basics of Monte Carlo methods. This post combines the ideas from the first two parts. Overall, the posts sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

We will only examine Markov chains with countable state spaces. In this setting, we only need standard probability and matrix knowledge. But the results extend to general state spaces, such as Euclidean space.

By the way, I doubt that Andrey A. Markov ever went to Monte Carlo in the small country of Monaco. (But Herbie did.)

Monte Carlo conditions

Sum

In the first post we covered Markov chains with countable state spaces, because the mathematics is more transparent. Consequently, we can use such Markov chains to estimate the sum of a function \(f\) over a countable (possibly infinite) set \(\mathbb{S}\). We can write the sum as

$$\begin{align}S(f)&=\sum_{x\in \mathbb{S}} f(x) \\&= \sum_{x\in\mathbb{S}} \frac{f(x)q(x)}{q(x)}\\& = \mathbb{E}_{q}[ \frac{f(Y)}{q(Y)}]\\& = \mathbb{E}_{q}[ g(Y)]\,,\end{align}$$

where \(g=f/q\) and \(Y\) is a suitable random variable that has a probability mass function \(q\) with support \(\mathbb{S}\).

Sampling the random variable (or point) \(Y\) gives the samples \(y_1,\dots,y_n\), which yields an unbiased Monte Carlo estimate of the sum, namely

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{q(y_i)}.$$

The closer the function \(q\) is to being proportional to \(|f|\), the better the estimate.
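
As a quick illustration, the following Python (NumPy) sketch estimates the sum \(S(f)=\sum_{x\geq 0}e^{-x}\), whose exact value is \(1/(1-e^{-1})\), by sampling \(Y\) from a geometric distribution \(q\); both \(f\) and \(q\) are chosen here just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.exp(-x)                  # function to sum over x = 0, 1, 2, ...
p = 0.5
q = lambda x: p * (1 - p)**x              # geometric pmf with support 0, 1, 2, ...

n = 10**5
y = rng.geometric(p, size=n) - 1          # NumPy's geometric is supported on 1, 2, ...
estimate = np.mean(f(y) / q(y))           # unbiased estimate of the sum

print(estimate)                           # close to 1 / (1 - np.exp(-1)), about 1.582
```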

Integral

We can write an integral over some region \(\mathbb{S}\) as

$$\begin{align}\int_{\mathbb{S}} f(x) dx &= \int_{\mathbb{S}} \frac{f(x)}{p(x)}p(x)dx\\ & =\mathbb{E}_p[\frac{f(Y)}{p(Y)}]\,,\end{align}$$

where \(p\) is the probability density of a random variable (or point) \(Y\) with support \(\mathbb{S}\). Then by sampling \(Y\), the resulting samples \(y_1, \dots, y_n\in \mathbb{S}\) give the unbiased Monte Carlo estimate

$$I_n (f)=\frac{1}{n}\sum_{i=1}^n \frac{f(y_i)}{p(y_i)}\,.$$
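
For example, the following Python (NumPy) sketch estimates \(\int_0^1 e^{x}\,dx\), whose exact value is \(e-1\), by sampling \(Y\) uniformly on \([0,1]\), so \(p(y)=1\); the integrand is chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.exp(x)              # integrand on [0, 1]
n = 10**5
y = rng.uniform(0.0, 1.0, size=n)    # samples of Y with density p(y) = 1 on [0, 1]
estimate = np.mean(f(y) / 1.0)       # unbiased estimate of the integral

print(estimate, np.exp(1) - 1)       # estimate versus the exact value e - 1
```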

Ergodic sequence

To estimate the sum or integral, we’ll need a sequence of ergodic random variables \(\{Y_i\}_{i\in\mathbb{N}}\), which we can achieve with an ergodic Markov chain with a state space \(\mathbb{S}\).

Markov chain conditions

In the context of Markov chains and, more generally, stochastic processes, we can think of ergodicity as convergence and a slightly more general version of the law of large numbers. A Markov chain with a countable state space needs some conditions to ensure ergodicity.

Regularity conditions for a countable Markov chain

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Positive recurrence

In general, most of the needed structure for the Markov chain comes from irreducibility and a stationary distribution existing. Sometimes the above conditions are redundant. For example, an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent. We now look at ways to satisfy these conditions in the Monte Carlo setting when the underlying state space is countable.

Stationary distribution \(\pi\)

A stationary distribution \(\pi\) satisfies the equation

$$ \pi=\pi P\,.$$

We know exactly what our stationary distribution \(\pi\) needs to be. It is the one used in the Monte Carlo method, namely the distribution of \(Y\). The challenge is constructing a Markov chain with such a stationary distribution. This will of course depend on the Markov (transition) kernel \(P\) of the Markov chain.

We’ll look at the other conditions on the Markov kernel \(P\).

Aperiodicity

We recall that the period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P^n(x,x)>0\). We need every state of the countable Markov chain to be aperiodic, meaning \(d_x=1\) for all \(x\in\mathbb{X}\).

One sufficient way to achieve this requirement is to first have the countable Markov chain be irreducible, which we’ll cover in the next section. Then if there exists at least one state \(x\in \mathbb{X}\) such that \(P(x,x)>0\), the Markov chain is aperiodic. (See Lemma 1.8.2 in Markov Chains by Norris or Remark 6.2.5 in Probability Theory and Stochastic Processes by Brémaud.)

Clearly the last condition is not difficult to satisfy, putting the focus now on the irreducibility requirement.

Irreducibility

The irreducibility property says that it is possible for the Markov chain to get from any point to any other point in a finite number of steps. More precisely, for an irreducible Markov chain with a countable state space, there must exist for all \(x,y\in\mathbb{X}\) a natural number \(s\) (possibly depending on \(x\) and \(y\)) such that \(P^s(x,y)>0\).

We can achieve this requirement by introducing a stronger requirement, which is easier to verify. A countable Markov chain with a transition kernel \(P\) will be irreducible if for all \(x,y\in\mathbb{X}\) the condition \(P(x,y)>0\) holds. We have simply required that the above number \(s=1\) for all \(x,y\in\mathbb{X}\).

Furthermore, since then \(P(x,x)>0\) for all \(x\in\mathbb{X}\), our irreducible countable Markov chain is also aperiodic.

Positive recurrence

For a point or state \(x\in\mathbb{X}\), we recall its first return time being defined as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). For a countable Markov chain, if all the states in the state space are positive recurrent, then we say the Markov chain is positive recurrent.

A countable irreducible Markov chain is positive recurrent if (and only if) it has a stationary distribution \(\pi\). Furthermore, that stationary distribution will be unique and nonzero, meaning \(\pi(x)>0\) for all \(x\in\mathbb{X}\). (See Theorem 1.7.7 in Markov Chains by Norris or Theorem 6.3.14 in Probability Theory and Stochastic Processes by Brémaud.)

In other words, for positive recurrence, we need to prove that the Markov chain has a stationary distribution.

Ergodicity

A countable Markov chain that is aperiodic, irreducible, and positive recurrent is ergodic. In other words, an ergodic Markov process \(X\) with stationary distribution \(\pi\) is a random sequence \(X_0, X_1,\dots\) such that the following holds almost surely

$$ \frac{1}{t}\sum_{i=0}^{t-1} g(X_i) \rightarrow \sum_{x\in\mathbb{X}}g(x)\pi(x)$$

as \(t\rightarrow\infty\) for all bounded, nonnegative functions \(g\). Consequently, we can use such an ergodic Markov chain to make statistical estimates.
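
To see the ergodic average in action, here is a small Python (NumPy) sketch with a made-up two-state chain: the time average of a function \(g\) along one long realization approaches the stationary average \(\sum_{x}g(x)\pi(x)\).

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up two-state Markov kernel P and a function g on the states {0, 1}.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
g = np.array([1.0, 5.0])

# Stationary distribution of P (it solves pi = pi P): here pi = (2/3, 1/3).
pi = np.array([2 / 3, 1 / 3])

t = 10**5
x = 0
total = 0.0
for i in range(t):
    total += g[x]
    x = rng.choice(2, p=P[x])          # jump according to the row P(x, .)

print(total / t, g @ pi)               # time average versus stationary average
```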

Sampling with a Markov chain

For the Monte Carlo sample \(y_1,\dots,y_n\), we simply use the Markov chain samples \(x_0,x_1,\dots,x_{n-1}\), giving the Monte Carlo estimate. In the discrete case, this is simply

$$S_n (f)=\frac{1}{n}\sum_{i=1}^n g(x_{i-1})\,,$$

which, due to ergodicity, converges to \(S(f)=\mathbb{E}_{Y}[g(Y)]\).

Summary

In summary, a sufficient way to have a countable ergodic Markov chain is for the transition kernel to be such that \(P(x,y)>0\) for all \(x,y\in\mathbb{X}\) and for there to exist a (stationary) distribution \(\pi=\pi P\). For the first requirement, one imagines using a non-negative function such as \(e^{-(x-y)^2}/C\), where \(C>0\) is a suitable normalization constant.

The requirements seem possible, but to bring them all together is still a great challenge for the uninitiated. Fortunately, some great minds almost seven decades ago proposed the first Markov chain that meets all the requirements for a specific stationary distribution, which was used for a Monte Carlo estimate.

The first such Markov chain Monte Carlo method appeared in a paper written by five scientists, including two husband-wife pairs. It became known as the Metropolis or Metropolis-Hastings algorithm, and it will be the subject of a future post.

Further reading

Books

Markov chains

Any stochastic process book will include a couple of chapters on Markov chains. For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section detailing a Markov chain Monte Carlo method, such as the Metropolis-Hastings algorithm and the Gibbs sampler.

Monte Carlo methods

Books dedicated to Monte Carlo approaches include:

For a statistics context, there’s also the good book:

  • Casella and Robert – Monte Carlo Statistical Methods.

The first MC in MCMC methods

Markov chains form a fundamentally important class of stochastic processes. It would be hard to overstate their importance in probability, statistics, and, more broadly, science and technology. They are indispensable in random simulations, particularly those based on Markov chain Monte Carlo methods. In this post, we’ll have a look at some Markov chain basics needed for such simulation methods.

This is the first part of a series of posts on Markov chain Monte Carlo methods. This post covers the basics of Markov chains, which is the more involved part. The second part will cover Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods, whose importance and applications I detailed in a previous post.

Markov chains vs Markov processes

All Markov chains are Markov processes. Some people use the term Markov chain to refer to discrete-time Markov processes with general state spaces. Other people prefer the term Markov chain for Markov processes with countable state spaces, whether in discrete or continuous time. In his book Applied Probability and Queues, Asmussen writes:

In this book, we use the terminology that a Markov chain has discrete time and a Markov process has continuous time (the state space may be discrete or general). However, one should note that it is equally common to let “chain” refer to a discrete state space and “process” to a general one (time may be discrete or continuous).

Nevertheless, the first MC in the MCMC suggests the Markov chain Monte Carlo crowd prefers the former sense of Markov chain, given the use of discrete-time Markov processes in their simulations.

Markov the frog

Some writers introduce Markov chains with a mental image of a frog jumping around lily pads scattered over a pond. (Presumably the frog never misses a lily pad.) We assume the frog randomly chooses the next lily pad through some random mechanism. Perhaps the distances between lily pads or their sizes influence the chances that the frog will jump between them.

We further assume that the frog is a bit particular, preferring to jump in certain directions more than others. More precisely, the probability of our frog jumping from a lily pad labelled \(x\) to another labelled \(y\) is \(P(x,y)\). But jumping in the opposite direction happens with probability \(P(y,x)\), which in general is not equal to \(P(x,y)\).

I typically use the term points, but the Markov literature usually says that the Markov chain visits states.

State space

We can interpret a Markov chain, a type of stochastic process, as a collection or sequence of random variables. (I briefly detailed stochastic processes in a previous post.) The values of the random variables are points in some mathematical space \(\mathbb{X}\). This space can be quite abstract, but in practice it’s usually the lattice \(\mathbb{Z}^n\), Euclidean space \(\mathbb{R}^n\), or a subset of one of these two spaces. For our frog example, all the lily pads in the pond form the state space.

We’ll only consider countable Markov chains where the number of points in the state space \(\mathbb{X}\) is countable. Although the results and theory generally hold for more general state spaces, the accompanying work requires more technical mathematics. For finite and countable state spaces, we can use standard probability and matrix knowledge. But when we use uncountable state spaces such as \(\mathbb{R}^n\), we enter the world of measure theory and functional analysis.

I will often write a point \(x\) in a (state) space \(\mathbb{X}\). But you can say an element \(x\) of a set \(\mathbb{X}\). Many authors refer to the points or elements as states of the Markov chain. In the frog example, each lily pad is a different state.

Markov property

A discrete-time countable Markov chain is a random process that jumps between points of some countable mathematical space \(\mathbb{X}\) such that, when at point \(x \in \mathbb{X}\), the next position is chosen according to a probability distribution \(P(x,·)\) depending only on \(x\).

More specifically, a sequence of random variables \((X_0, X_1, \dots)\) is a discrete-time Markov chain \(X\) with a countable state space \(\mathbb{X}\) and kernel \(P\) if for all \(x,y \in \mathbb{X}\) and all \(t \geq 1\) satisfying \(\mathbb{P}[X_{t-1}=x_{t-1},\dots,X_0=x_0]>0\), we have

$$ \begin{align}\mathbb{P}[X_{t+1} =y|&X_{t}=x,X_{t-1}=x_{t-1},\dots,X_0=x_0]\\&=\mathbb{P}[X_{t+1} =y|X_t =x]\\&=P(x,y)\,.\end{align}$$

This equation is often called the Markov property.

The Markov property says that the conditional probability of jumping from point \(x\) to \(y\) remains the same, regardless of which points or states \(x_0,x_1,\dots,x_{t-1}\) were previously visited. This is precisely why the kernel \(P\) contains all the information needed to describe the future random evolution of the Markov chain.

We have assumed the probabilities given by \(P\) are fixed, meaning we have described a homogeneous Markov chain.
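
In code, the Markov property and homogeneity mean that a single line chooses the next state: sample from the row \(P(x,\cdot)\) of the current state. Below is a minimal Python (NumPy) sketch with a made-up three-state kernel.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up Markov kernel on the state space {0, 1, 2}; each row sums to one.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

num_steps = 10
x = 0                                   # initial state
path = [x]
for _ in range(num_steps):
    x = rng.choice(3, p=P[x])           # next state depends only on the current one
    path.append(x)

print(path)                             # one realization of the chain
```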

Markov kernel

The kernel \(P\) is called the Markov (transition) kernel or probability kernel. Assuming a countable state space \(\mathbb{X}\), we can reference any probability value of the kernel \(P\) with two variables \(x,y\in\mathbb{X}\). If we assume a finite state space \(\mathbb{X}\), then the kernel \(P\) becomes a regular matrix taught in linear algebra. An infinite but countable state space gives an infinite matrix \(P\). The rows of the kernel matrix \(P\) must add up to one, because each row is a probability measure.

A more general space, such as Euclidean space \(\mathbb{R}^n\), results in a more general kernel with respect to a suitable measure. In this setting, \(P(x,·)\) is no longer a probability mass function, but a general probability measure.

Initial distribution

At time \(t=0\) we describe the random initial configuration of a Markov process with a probability distribution \(\mu_0\). For a finite or countable Markov chain, this initial distribution \(\mu_0\) corresponds to a probability mass function encoded as a row vector.

Jumping from \(x\) to \(y\)

The probability distribution \(\mu_0\) gives the probability of starting in state (or at point) \(x\in\mathbb{X}\). After one time step, we can write down the probability distribution \(\mu_1\) that gives us the different probabilities of the Markov chain being at different states. At \(n=1\), basic matrix algebra and probability rules give us the matrix equation

$$\mu_1=\mu_0 P\,.$$

By induction, after \(n\) time steps we have the expression

$$\mu_n=\mu_0 P^n\,,$$

where the superscript \(n\) denotes matrix power. We can write the \(n\)-time step kernel as \(P_{(n)}\), which for a finite Markov chain is given by the matrix equation \(P_{(n)}=P^n\).

Seeing how \(P_{(n)}\) behaves as \(n\) approaches infinity forms part of work that studies the convergence and ergodicity properties of Markov chains. I’ll make these concepts clearer below. But first I’ll give some conditions that are typically needed.
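
The matrix powers can be computed directly. The Python (NumPy) sketch below, using a made-up kernel, shows \(\mu_n=\mu_0 P^n\) settling down as \(n\) grows.

```python
import numpy as np

# A made-up Markov kernel P and initial distribution mu_0 (row vectors).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
mu = np.array([1.0, 0.0, 0.0])          # start in state 0 with probability one

for n in range(1, 21):
    mu = mu @ P                         # mu_n = mu_{n-1} P = mu_0 P^n
    if n in (1, 5, 20):
        print(n, mu)                    # mu_n barely changes once n is large
```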

Regularity conditions

A Markov chain with a countable state space needs some conditions to ensure convergence and ergodicity.

Regularity conditions

  1. A stationary distribution \(\pi\)
  2. Aperiodicity
  3. Irreducibility
  4. Positive recurrence

The nature of the state space and the kernel will dictate these conditions. These conditions are also not necessarily logically distinct. For example, on a finite state space, you’ll get positive recurrence for free, because an aperiodic, irreducible Markov chain with a finite state space is always positive recurrent.

We now briefly detail these conditions, and in another post I’ll give examples of how the conditions can be met.

Stationary distribution \(\pi\)

It’s possible to encounter a probability distribution \(\pi\) where applying the kernel \(P\) returns the same distribution \(\pi\), meaning

$$ \pi=\pi P\,.$$

This (fixed-point) equation is called the balance equation.

The distribution \(\pi\) is called the stationary, invariant or steady-state distribution. A Markov chain does not need to have a stationary distribution. And if a Markov chain does have one, it may not be unique. Its existence and uniqueness will depend on the Markov kernel \(P\).

Showing that a unique stationary distribution exists and that it can be reached with probability one is the stuff of Markov chain convergence results. Markov chain Monte Carlo methods hinge upon these results.
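
For a finite kernel, one standard way to find a stationary distribution numerically is to take a left eigenvector of \(P\) with eigenvalue one and normalize it, as in this Python (NumPy) sketch with a made-up kernel.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

# Left eigenvectors of P are right eigenvectors of P transpose.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
i = np.argmin(np.abs(eigenvalues - 1.0))      # eigenvalue closest to one
pi = np.real(eigenvectors[:, i])
pi = pi / pi.sum()                            # normalize to a probability vector

print(pi)                                     # stationary distribution
print(np.allclose(pi @ P, pi))                # check that pi = pi P
```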

Aperiodicity

It is possible for a Markov chain to get trapped in a loop, periodically visiting the same states. The period \(d_x\) of a state \(x\in \mathbb{X}\) is the greatest common divisor of all \(n\) values such that \(P^n(x,x)>0\). If the period of a point is \(d_x=1\), then we say it’s aperiodic. If every state of a Markov chain is aperiodic, we say it’s an aperiodic Markov chain.

Aperiodicity means there are no loops to trap the Markov chain. This property is typically needed for convergence results.

Irreducibility

A Markov chain with a countable state space \(\mathbb{X}\) is irreducible if the Markov chain can go from any point \(x\in\mathbb{X}\) to any other point \(y\in\mathbb{X}\) with positive probability in a finite number of time steps. In other words, for all \(x,y\in\mathbb{X}\) there exists a natural number \(s\) (possibly depending on \(x\) and \(y\)) such that \(P^s(x,y)>0\).

Irreducibility ensures that a Markov chain will visit all the states in its state space. This property is also needed for convergence results.
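
For a small finite kernel, irreducibility can be checked mechanically: in an irreducible chain with \(m\) states, every state can reach every state (including itself) within at most \(m\) steps, so it suffices that \(P+P^2+\dots+P^m\) has all entries positive. A quick Python (NumPy) sketch with a made-up kernel:

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])   # a made-up kernel with some zero entries

m = P.shape[0]
# Sum the first m matrix powers and check that every entry is positive.
total = sum(np.linalg.matrix_power(P, s) for s in range(1, m + 1))
print(np.all(total > 0))          # True means the chain is irreducible
```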

Recurrence

When studying Markov processes, a quantity of interest is how much time it takes to return to a state or point. For a point \(x\in\mathbb{X}\), we define its first return time as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

As the name suggests, this random variable is the number of time steps needed for the Markov process to return to state \(x\), whichever path it takes, conditioned on it starting at \(x\).

We call a state \(x\) recurrent if the probability of its first return time being finite is one, meaning \(\mathbb{P}_x(T_x^+<\infty)=1\). Otherwise the state \(x\) is said to be transient.

Positive recurrence

We can classify different types of recurrence based on the expected value of the first return times. A state \(x\) is called positive recurrent if the expected value of its first return time is finite, meaning \(\mathbb{E}_x(T_x^+)<\infty\). Otherwise state \(x\) is null recurrent.

For a countable Markov chain, if all the states in the state space are (positive) recurrent, so \(\mathbb{E}_x(T_x^+)<\infty\) for all \(x\in\mathbb{X}\), then we say the Markov chain is (positive) recurrent.

Again, the concept of positive recurrence is needed for convergence results.

Ergodicity

We say a countable Markov chain is ergodic if it is irreducible, aperiodic and positive recurrent. (See, for example, Basics of Applied Stochastic Processes by Serfozo (page 26) or Probability Theory and Stochastic Processes by Brémaud (page 262).) Ergodicity allows one to find averages by employing a more general form of the law of large numbers, which Monte Carlo methods rely upon. We stress that definitions of ergodicity vary somewhat, but in general it means that convergence results and laws of large numbers exist.

Further reading

Any stochastic process book will include a couple of chapters on Markov chains such as:

    • Brémaud – Probability Theory and Stochastic Processes;
    • Serfozo – Basics of Applied Stochastic Processes.

For more details, there are many books dedicated entirely to the subject of Markov chains. For example, introductory books include:

  • Brémaud – Markov Chains, Gibbs Fields, Monte Carlo Simulation and Queues;
  • Levin, Peres, and Wilmer – Markov Chains and Mixing Times;
  • Norris – Markov Chains.

Those books cover Markov chains with countable state spaces. If you want to read about discrete-time Markov chains with general state spaces, try the book:

  • Meyn, Tweedie – Markov chains and stochastic stability

All the above books have a section on Markov chain Monte Carlo methods, such as the Metropolis-Hastings algorithm or the Gibbs sampler.

Poisson (stochastic) process

One of the most important stochastic processes is the Poisson stochastic process, often called simply the Poisson process. In a previous post I gave the definition of a stochastic process (also called a random process) alongside some examples of this important random object, including counting processes. The Poisson (stochastic) process is a counting process. This continuous-time stochastic process is a highly studied and used object. It plays a key role in different probability fields, particularly those focused on stochastic processes such as stochastic calculus (with jumps) and the theories of Markov processes, queueing, point processes (on the real line), and Lévy processes.

The points in time when a Poisson stochastic process increases form a Poisson point process on the real line. In this setting the stochastic process and the point process can be considered two interpretations of the same random object. The Poisson point process is often just called the Poisson process, but a Poisson point process can be defined on more general spaces. In some literature, such as the theory of Lévy processes, a Poisson point process is called a Poisson random measure, differentiating the Poisson point process from the Poisson stochastic process. Due to the connection with the Poisson distribution, the two mathematical objects are named after Siméon Poisson, but he never studied these random objects.

The other important stochastic process is the Wiener process or Brownian (motion) process, which I cover in another post. The Wiener process is arguably the most important stochastic process. I have written that post and the current one with the same structure and style, reflecting and emphasizing the similarities between these two fundamental stochastic processes.

In this post I will give a definition of the homogeneous Poisson process. I will also describe some of its key properties and importance. In future posts I will cover the history and generalizations of this stochastic process.

Definition

In the stochastic processes literature there are different definitions of the Poisson process. These depend on the settings such as the level of mathematical rigour. I give a mathematical definition which captures the main characteristics of this stochastic process.

Definition: Homogeneous Poisson (stochastic) process

An integer-valued stochastic process \(\{N_t:t\geq 0 \}\) defined on a probability space \((\Omega,\mathcal{A},\mathbb{P})\) is a homogeneous Poisson (stochastic) process if it has the following properties:

  1. The initial value of the stochastic process \(\{N_t:t\geq 0 \}\) is zero with probability one, meaning \(P(N_0=0)=1\).
  2. The increment \(N_t-N_s\) is independent of the past, that is, \(N_u\), where \(0\leq u\leq s\).
  3. For \(0\leq s\leq t\), the increment \(N_t-N_s\) is a Poisson random variable with mean \(\lambda (t-s)\), where the constant \(\lambda>0\) is the rate of the process.

In some literature, the initial value of the stochastic process may not be given. Alternatively, it is simply stated as \(N_0=0\) instead of the more precise (probabilistic) statement given above.

Also, some definitions of this stochastic process include an extra property or two. For example, from the above definition, we can infer that increments of the homogeneous Poisson process are stationary due to the properties of the Poisson distribution. But a definition may include something like the following property, which explicitly states that this stochastic process is stationary.

  1. For \(0\leq s\leq t\), the increment \(N_t-N_s\) is equal in distribution to \(N_{t-s}\).

The definitions may also describe the continuity of the realizations of the stochastic process, known as sample paths, which we will cover in the next section.

It’s interesting to compare these defining properties with the corresponding ones of the standard Wiener stochastic process. Both stochastic processes build upon infinitely divisible probability distributions. Using this property, Lévy processes generalize these two stochastic processes.

Properties

The definition of the Poisson (stochastic) process means that it has stationary and independent increments. These are arguably the most important properties as they lead to the great tractability of this stochastic process. The increments are Poisson random variables, implying they can take only non-negative (integer) values.

The Poisson (stochastic) process exhibits closure properties, meaning that if you apply certain operations, you get another Poisson (stochastic) process. For example, if we sum two independent Poisson processes \(X= \{X_t:t\geq 0 \}\) and \(Y= \{Y_t:t\geq 0 \}\), then the resulting stochastic process \(Z=X+Y = \{X_t+Y_t:t\geq 0 \}\) is also a Poisson (stochastic) process. Such properties are useful for proving mathematical results.

A single realization of a (homogeneous) Poisson stochastic process, where the blue marks show where the process jumps to the next value. In any finite time interval, there are a finite number of jumps.

Properties such as independence and stationarity of the increments are so-called distributional properties. But the sample paths of this stochastic process are also interesting. A sample path of a Poisson stochastic process is almost surely non-decreasing, being constant except for jumps of size one. (The term almost surely comes from measure theory, but it means with probability one.) There are only finitely many jumps in each finite time interval.

The homogeneous Poisson (stochastic) process has the Markov property, making it an example of a Markov process. The homogeneous Poisson process \(N=\{ N_t\}_{t\geq 0}\) is not a martingale. But interestingly, the stochastic process \(\{ N_t - \lambda t\}_{t\geq 0}\) is a martingale. (Such relations have been used to study these stochastic processes with tools from martingale theory.)

Stochastic or point process?

The Poisson (stochastic) process is a discrete-valued stochastic process in continuous time. The relation between these types of stochastic processes and point processes is a subtle one. For example, David Cox and Valerie Isham write on page 3 of their monograph:

 The borderline between point processes and a number of other kinds of stochastic process is not sharply defined. In particular, any stochastic process in continuous time in which the sample paths are step functions, and therefore any process with a discrete state space, is associated with a point process, where a point is a time of transition or, more generally, a time of entry into a pre-assigned state or set of states. Whether it is useful to look at a particular process in this way depends on the purpose of the analysis.

For the Poisson case, this association is presented in the diagram below. We can see the Poisson point process (in red) associated with the Poisson (stochastic) process (in blue) by simply looking at the time points where jumps occur.

A single realization of a (homogeneous) Poisson stochastic process (in blue). The jumps of the process form a (homogeneous) Poisson point process (in red) on the real line representing time.

Importance

Playing a prominent role in the theory of probability, the Poisson (stochastic) process is a highly important and studied stochastic process. It has connections to other stochastic processes and is central in queueing theory and random measures.

The Poisson process is a building block for more complex continuous-time Markov processes with discrete state spaces, which are used as mathematical models.  It is also essential in the study of jump processes and subordinators.

The Poisson (stochastic) process is a member of some important families of stochastic processes, including Markov processes, Lévy processes, and birth-death processes. This stochastic process also has many applications. For example, it plays a central role in quantitative finance. It is also used in the physical sciences as well as some branches of social sciences, as a mathematical model for various random phenomena.

Generalizations and modifications

For the Poisson (stochastic) process, the index set and state space are respectively the non-negative real numbers and the counting numbers, that is \(T=[0,\infty)\) and \(S=\{0, 1, \dots\}\), so it has a continuous index set but a discrete state space. Consequently, changing the state space, index set, or both offers ways of generalizing and modifying the Poisson (stochastic) process.

Simulation

The defining properties of the Poisson stochastic process, namely independence and stationarity of increments, result in it being easy to simulate. The Poisson stochastic process can be simulated provided random variables can be simulated or sampled according to a Poisson distribution, which I have covered in this and this post.

Simulating a Poisson stochastic process is similar to simulating a Poisson point process. (Basically, it is the same method in a one-dimensional setting.) But I will leave the details of sampling this stochastic process for another post.
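
Leaving that post aside, here is a minimal Python (NumPy) sketch that uses the independent, stationary increments directly: simulate Poisson-distributed increments on a fixed time grid and take cumulative sums. The rate and grid values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

rate = 2.0                     # lambda, the rate of the homogeneous Poisson process
t_end = 10.0
num_points = 1000
dt = t_end / num_points

# Independent Poisson increments with mean rate * dt on each small interval.
increments = rng.poisson(rate * dt, size=num_points)
N = np.concatenate(([0], np.cumsum(increments)))   # N_0 = 0, then cumulative counts
t = np.linspace(0.0, t_end, num_points + 1)

print(N[-1], rate * t_end)     # final count versus its mean lambda * t_end
```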

Further reading

Here are some related links:

A very quick history of the Wiener process and the Poisson (point and stochastic) process is covered in this talk by me.

In terms of books, the Poisson process has not received as much attention as the Wiener process, which is typically just called the Brownian (motion) process.  That said, any book covering queueing theory will cover the Poisson (stochastic) process.

More advanced readers can read about the Poisson (stochastic) process, the Wiener (or Brownian (motion)) process, and other Lévy processes:

On this topic, I recommend the introductory article:

  • 2004, Applebaum, Lévy Processes – From Probability to Finance and Quantum Groups.

This stochastic process is of course also covered in general books on stochastic processes such as:

 

Wiener or Brownian (motion) process

One of the most important stochastic processes is the Wiener process or Brownian (motion) process. In a previous post I gave the definition of a stochastic process (also called a random process) with some examples of this important random object, including random walks. The Wiener process can be considered a continuous version of the simple random walk. This continuous-time stochastic process is a highly studied and used object. It plays a key role in different probability fields, particularly those focused on stochastic processes such as stochastic calculus and the theories of Markov processes, martingales, Gaussian processes, and Lévy processes.

The Wiener process is named after Norbert Wiener, but it is called the Brownian motion process or often just Brownian motion due to its historical connection as a model for Brownian movement in liquids, a physical phenomenon observed by Robert Brown. But the physical process is not truly a Wiener process, which can be treated as an idealized model. I will use the terms Wiener process or Brownian (motion) process to differentiate the stochastic process from the physical phenomenon known as Brownian movement or Brownian process.

The Wiener process is arguably the most important stochastic process. The other important stochastic process is the Poisson (stochastic) process, which I cover in another post. I have written that post and the current one with the same structure and style, reflecting and emphasizing the similarities between these two fundamental stochastic processes.

In this post I will give a definition of the standard Wiener process. I will also describe some of its key properties and importance. In future posts I will cover the history and generalizations of this stochastic process.

Definition

In the stochastic processes literature there are different definitions of the Wiener process. These depend on the settings such as the level of mathematical rigour. I give a mathematical definition which captures the main characteristics of this stochastic process.

Definition: Standard Wiener or Brownian (motion) process

A real-valued stochastic process \(\{W_t:t\geq 0 \}\) defined on a probability space \((\Omega,\mathcal{A},\mathbb{P})\) is a standard Wiener (or Brownian motion) process if it has the following properties:

  1. The initial value of the stochastic process \(\{W_t:t\geq 0 \}\) is zero with probability one, meaning \(P(W_0=0)=1\).
  2. The increment \(W_t-W_s\) is independent of the past, that is, \(W_u\), where \(0\leq u\leq s\).
  3. For \(0\leq s\leq t\), the increment \(W_t-W_s\) is a normal random variable with mean \(0\) and variance \(t-s\).

In some literature, the initial value of the stochastic process may not be given. Alternatively, it is simply stated as \(W_0=0\) instead of the more precise (probabilistic) statement given above.

Also, some definitions of this stochastic process include an extra property or two. For example, from the above definition, we can infer that increments of the standard Wiener process are stationary due to the properties of the normal distribution. But a definition may include something like the following property, which explicitly states that this stochastic process is stationary.

  1. For \(0\leq s\leq t\), the increment \(W_t-W_s\) is equal in distribution to \(W_{t-s}\).

The definitions may also describe the continuity of the realizations of the stochastic process, known as sample paths, which we will cover in the next section.

It’s interesting to compare these defining properties with the corresponding ones of the homogeneous Poisson stochastic process. Both stochastic processes build upon infinitely divisible probability distributions. Using this property, Lévy processes generalize these two stochastic processes.

Properties

The definition of the Wiener process means that it has stationary and independent increments. These are arguably the most important properties as they lead to the great tractability of this stochastic process. The increments are normal random variables, implying they can have both positive and negative (real) values.

The Wiener process exhibits closure properties, meaning that if you apply certain operations, you get another Wiener process. For example, if \(W= \{W_t:t\geq 0 \}\) is a Wiener process, then for a scaling constant \(c>0\), the resulting stochastic process \(\{W_{ct}/\sqrt{c}:t \geq 0 \}\) is also a Wiener process. Such properties are useful for proving mathematical results.

Two realizations of a Wiener (or Brownian motion) process. The sample paths are continuous (but non-differentiable) almost everywhere.

Properties such as independence and stationarity of the increments are so-called distributional properties. But the sample paths of this stochastic process are also interesting. A sample path of a Wiener process is almost surely continuous everywhere. (The term almost surely comes from measure theory, but it means with probability one.) Despite the continuity of the sample paths, they are almost surely nowhere differentiable. (Historically, it was a challenge to find such a function, but a classic example is the Weierstrass function.)

The standard Wiener process has the Markov property, making it an example of a Markov process. The standard Wiener process \(W=\{ W_t\}_{t\geq 0}\) is a martingale. Interestingly, the stochastic process \(\{ W_t^2-t\}_{t\geq 0}\) is also a martingale. The Wiener process is a fundamental object in martingale theory.

There are many other properties of the Brownian motion process; see the Further reading section for, well, further reading.

Importance

Playing a central role in the theory of probability, the Wiener process is considered the most important and studied stochastic process. It has connections to other stochastic processes and is central in stochastic calculus and martingales. Its discovery led to the development of a family of Markov processes known as diffusion processes.

The Wiener process also arises as the mathematical limit of other stochastic processes such as random walks, which is the subject of Donsker’s theorem or invariance principle, also known as the functional central limit theorem.

The Wiener process is a member of some important families of stochastic processes, including Markov processes, Lévy processes, and Gaussian processes. This stochastic process also has many applications. For example, it plays a central role in quantitative finance. It is also used in the physical sciences as well as some branches of social sciences, as a mathematical model for various random phenomena.

Generalizations and modifications

For the Brownian motion process, the index set and state space are respectively the non-negative real numbers and the real numbers, that is \(T=[0,\infty)\) and \(S=(-\infty,\infty)\), so it has both a continuous index set and a continuous state space. Consequently, changing the state space, index set, or both offers ways of generalizing or modifying the Wiener (stochastic) process.

A single realization of a two-dimensional Wiener (or Brownian motion) process. Each vector component is an independent standard Wiener process.

Simulating

The defining properties of the Wiener process, namely independence and stationarity of increments, result in it being easy to simulate. The Wiener process can be simulated provided random variables can be simulated or sampled according to a normal distribution. The main challenge is that the Wiener process is a continuous-time stochastic process, but computer simulations run in a discrete universe.

I will leave the details of sampling this stochastic process for another post.
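
Until then, here is a minimal Python (NumPy) sketch along the same lines as the Poisson case: on a discrete time grid, the increments are independent normal variables with variance equal to the step size, and cumulative sums give an approximate sample path. The grid values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

t_end = 1.0
num_points = 1000
dt = t_end / num_points

# Independent normal increments with mean 0 and variance dt.
increments = np.sqrt(dt) * rng.standard_normal(num_points)
W = np.concatenate(([0.0], np.cumsum(increments)))   # W_0 = 0
t = np.linspace(0.0, t_end, num_points + 1)

print(W[-1])          # value at time t_end, distributed as Normal(0, t_end)
```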

Further reading

A very quick history of the Wiener process and the Poisson (point) process is covered in this talk by me.

There are books almost entirely dedicated to the subject of the Wiener or Brownian (motion) process, including:

Of course the stochastic process is also covered in any book on stochastic calculus:

More advanced readers can read about the Wiener process, its discrete-valued cousin, the Poisson (stochastic) process, as well as other Lévy processes:

On this topic, I recommend the introductory article:

  • 2004, Applebaum, Lévy Processes – From Probability to Finance and Quantum Groups.

The Wiener process is of course also covered in general books on stochastic processes such as:

Stochastic processes

I have written a few posts about point processes, which are mathematical objects that seek to represent points randomly scattered over some space. Arguably a much more popular random object is something called a stochastic process. This type of mathematical object, also frequently called a random process, is studied in mathematics. But the origins of stochastic processes stem from various phenomena in the real world.

Stochastic processes find applications in representing some type of seemingly random change in a system (usually with respect to time). Examples include the growth of some population, the emission of radioactive particles, or the movements of financial markets. There are many types of stochastic processes with applications in various fields outside of mathematics, including the physical sciences, social sciences, finance, and engineering.

In this post I will cover the standard definition of a stochastic process. But first a quick reminder of some probability basics.

Probability basics

Random experiment

The mathematical field of probability arose from trying to understand games of chance. In these games, some random experiment is performed. A coin is flipped. A die is cast. A card is drawn. These random experiments give the initial intuition behind probability. Such experiments can be considered in more general or abstract terms.

A random experiment has the following components:

  1. Sample space: A sample space, denoted here by \(\Omega\), is the set of all (conceptually) possible outcomes of the random experiment;
  2. Outcomes: An outcome, denoted here by \(\omega\), is an element of the sample space \(\Omega\), meaning \(\omega \in \Omega\), and it is called a sample point or realization;
  3. Events: An event is a subset of the sample space \(\Omega\) for which probability is defined.

Examples

One die

Consider rolling a traditional six-sided die with the sides numbered from \(1\) to \(6\). Its sample space is \(\Omega=\{1, 2, 3, 4, 5, 6\}\). A possible event is an even number appearing, which corresponds to the outcomes \(2\), \(4\), and \(6\), that is, the subset \(\{2, 4, 6\}\).

Two coins

Consider flipping two identical coins, where each coin has a head appearing on one side and a tail on the other. We denote the head and tail respectively by \(H\) and \(T\). Then the sample space \(\Omega\) is the set of all possible outcomes, meaning \(\Omega=\{HH, TT, HT, TH\}\). A possible event is at least one head appearing, which corresponds to the outcomes \(HH\), \(HT\), and \(TH\), that is, the subset \(\{HH, HT, TH\}\).

In contrast, three heads \(HHH\), the number \(5\), or the queen of diamonds appearing are clearly not possible outcomes of flipping two coins, which means they are not elements of the sample space.

Modern probability approach

For a random experiment, we formalize which events are possible (or not) with a mathematical object called a \(\sigma\)-algebra. (It is also called a \(\sigma\)-field.) This object is a family of sets satisfying certain properties with respect to set operations. It is a fundamental concept in measure theory, which is the standard approach to the theory of integration. Measure theory serves as the foundation of modern probability theory.

In modern probability theory, if we want to define a random mathematical object, such as a random variable, we start with a random experiment in the context of a probability space or probability triple \((\Omega,\mathcal{A},\mathbb{P})\), where:

  1. \(\Omega\) is a sample space, which is the set of all (conceptually) possible outcomes;
  2. \(\mathcal{A}\) is a \(\sigma\)-algebra or \(\sigma\)-field, which is a family of events (subsets of \(\Omega\));
  3. \(\mathbb{P}\) is a probability measure, which assigns probability to each event in \(\mathcal{A}\).

To give some intuition behind this approach, David Williams says to imagine that Tyche, Goddess of Chance, chooses a point \(\omega\in\Omega\) at random according to the law \(\mathbb{P}\) such that an event \(A\in \mathcal{A}\) has a probability given by \(\mathbb{P}(A)\), where we understand probability with our own intuition. We can also choose \(\omega\in\Omega\) by using some physical experiment, as long as it is random.

With this formalism, mathematicians define random objects by using a certain measurable function or mapping that maps to a suitable space of mathematical objects. For example, a real-valued random variable is a measurable function from \(\Omega\) to the real line. To consider other random mathematical objects, we just need to define a measurable mapping from \(\Omega\) to a suitable mathematical space.

Definition

Stochastic process

Mathematically, a stochastic process is usually defined as a collection of random variables indexed by some set, often representing time. (Other interpretations exist, such as a stochastic process being a random function.)

More formally, a stochastic process is defined as a collection of random variables defined on a common probability space \((\Omega,{\cal A}, \mathbb{P} )\), where \(\Omega\) is a sample space, \({\cal A}\) is a \(\sigma\)-algebra, and \(\mathbb{P}\) is a probability measure, and the random variables, indexed by some set \(T\), all take values in the same mathematical space \(S\), which must be measurable with respect to some \(\sigma\)-algebra \(\Sigma\).

Put another way, for a given probability space \((\Omega, {\cal A}, \mathbb{P})\) and a measurable space \((S, \Sigma)\), a stochastic process is a collection of \(S\)-valued random variables, which we can write as:

$$\{X(t):t\in T \}.$$

For each \(t\in T\), \(X(t)\) is a random variable. Historically, a point \(t\in T\) was interpreted as time, so \(X(t)\) is a random variable representing a value observed at time \(t\).

Often the collection of random variables \(\{X(t):t\in T \}\) is denoted simply by a single letter such as \(X\). There are different notations for stochastic processes. For example, a stochastic process can also be written as \(\{X(t,\omega):t\in T \}\), reflecting that it is a function of the two variables \(t\in T\) and \(\omega\in \Omega\).

Index set

The set \(T\) is called the index set or parameter set of the stochastic process. Typically this set is some subset of the real line, such as the natural numbers or an interval. If the set is countable, such as the natural numbers, then it is a discrete-time stochastic process. Conversely, an interval for the index set gives a continuous-time stochastic process.

(If the index set is some two or higher dimensional Euclidean space or manifold, then typically the resulting stochastic or random process is called a random field.)

State space

The mathematical space \(S\) is called the state space of the stochastic process. The precise mathematical space can be any one of many different mathematical sets, such as the integers, the real line, \(n\)-dimensional Euclidean space, the complex plane, or more abstract mathematical spaces. The different spaces reflect the different values that the stochastic process can take.

Sample function

A single outcome of a stochastic process is called a sample function, a sample path, or a realization. It is formed by taking a single value of each random variable of the stochastic process. More precisely, if \(\{X(t,\omega):t\in T \}\) is a stochastic process, then for any point \(\omega\in\Omega\), the mapping
\[
X(\cdot,\omega): T \rightarrow S,
\]
is a sample function of the stochastic process \(\{X(t,\omega):t\in T \}\). Other names exist, such as trajectory and path function.
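As a rough computational analogy (only a sketch, not part of the formal definition), fixing an outcome \(\omega\) is like fixing the seed of a pseudorandom number generator: once the seed is fixed, the resulting sample path is an ordinary (deterministic) function of the index \(t\). A small Python illustration, with hypothetical function and variable names:

import numpy as np

def sample_function(omega_seed, n_steps=10):
    # Fixing the seed plays the role of fixing the outcome omega;
    # the result is then a deterministic function of the index t = 1, ..., n_steps.
    rng = np.random.default_rng(omega_seed)
    return rng.normal(size=n_steps)

# Two different outcomes omega give (in general) two different sample paths
path_1 = sample_function(omega_seed=1)
path_2 = sample_function(omega_seed=2)

# The same outcome omega always gives the same sample path
assert np.array_equal(path_1, sample_function(omega_seed=1))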

Examples

The range of stochastic processes is limitless, as stochastic processes can be used to construct new ones. Broadly speaking, stochastic processes can be classified by their index set and their state space. For example, we can consider discrete-time and continuous-time stochastic processes.

There are some commonly used stochastic processes. I’ll give the details of a couple of very simple ones.

Bernoulli process

A very simple stochastic process is the Bernoulli process, which is a sequence of independent and identically distributed (iid) random variables. The value of each random variable can be one of two values, typically \(0\) and \(1\), but they could also be \(-1\) and \(+1\) or \(H\) and \(T\). To generate this stochastic process, each random variable takes one value, say, \(1\), with probability \(p\) or the other value, say, \(0\), with probability \(1-p\).

We can liken this stochastic process to flipping a coin, where the probability of a head is \(p\) and its value is \(1\), while the value of a tail is \(0\). In other words, a Bernoulli process is a sequence of iid Bernoulli random variables. The Bernoulli process has the counting numbers (that is, the positive integers) as its index set, meaning \(T=\{1,2,\dots\}\), while in this example the state space is simply \(S=\{0,1\}\).

A single realization of a Bernoulli process, one of the simplest stochastic processes. This discrete-time stochastic process only takes two values such as 0 and 1.

(We can easily generalize the Bernoulli process by taking any sequence of iid random variables defined on the same probability space.)
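For those curious, here is a minimal sketch in Python using NumPy for simulating a Bernoulli process with parameter \(p\); the variable names are illustrative and not from my code repository.

import numpy as np

p = 0.5        # probability of a one (a head)
n_vars = 20    # number of random variables (coin flips)

# Each random variable is an independent Bernoulli variable taking 1 with probability p
rng = np.random.default_rng()
bernoulli_process = rng.binomial(n=1, p=p, size=n_vars)
print(bernoulli_process)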

Random walks

A random walk is a type of stochastic process that is usually defined as the sum of a sequence of iid random variables or random vectors in Euclidean space. Since random walks are formed from sums, they are stochastic processes that evolve in discrete time. (But some also use the term to refer to stochastic processes that change in continuous time.)

A classic example of this stochastic process is the simple random walk, which is based on a Bernoulli process, where each iid Bernoulli variable takes either the value positive one or negative one. More specifically, the simple random walk increases by one with probability, say, \(p\), or decreases by one with probability \(1-p\). The index set of this stochastic process is the natural numbers, while its state space is the integers.

A single realization of a simple (symmetric) random walk. This discrete-time stochastic process is a fundamental object in the history of probability theory.

Random walks can be defined in more general settings such as \(n\)-dimensional Euclidean space. There are other types of random walks, defined on different mathematical objects such as lattices and groups. In general, random walks are highly studied and have many applications in different disciplines.
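A simple (possibly asymmetric) random walk can be sketched in a couple of lines of Python, again with illustrative variable names: generate a Bernoulli process, map its values to \(+1\) and \(-1\), and take the cumulative sum.

import numpy as np

p = 0.5          # probability of a step up
n_steps = 100    # number of steps

rng = np.random.default_rng()
steps = 2 * rng.binomial(n=1, p=p, size=n_steps) - 1      # each step is +1 or -1
random_walk = np.concatenate(([0], np.cumsum(steps)))     # the walk starts at zero
print(random_walk)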

Markov processes

One important way of classifying stochastic processes is the stochastic dependence between the random variables. For the Bernoulli process, there is no dependence between the random variables, giving a very simple, but not particularly interesting, stochastic process.

A more interesting (and typically useful) stochastic process is one in which the random variables depend on each other in some way. For example, the next position of a random walk depends on the current position, which in turn depends on the previous position.

A large family of stochastic processes in which the next value depends on the current value is that of Markov processes or Markov chains. (Both names are used. The term Markov chain is largely used when either the state space or the index set is discrete, but there does not seem to be an agreed-upon convention. When I think Markov chain, I think discrete time.) The defining property of a Markov process constrains the dependence between the random variables: the next random variable depends only on the current random variable, and not on all the previous random variables. This constraint typically renders Markov processes more tractable than general stochastic processes.

It would be difficult to overstate the importance of Markov processes. Their study and application appear throughout probability, science, and technology.

Counting processes

A counting process is a stochastic process that takes values in the non-negative integers, meaning its state space is the set \(\{0,1,2,\dots\}\), and is non-decreasing. A simple example of a counting process is an asymmetric random walk that increases by one with some probability \(p\) or remains at the same value with probability \(1-p\). In other words, it is the cumulative sum of a Bernoulli process. This is an example of a discrete-time counting process, but continuous-time ones also exist.

A counting process can also be interpreted as a random counting measure on the index set.
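Continuing the earlier sketches (with illustrative parameter values), this discrete-time counting process is just the cumulative sum of a Bernoulli process:

import numpy as np

p = 0.3       # probability of increasing by one at each step
n_steps = 50  # number of steps

rng = np.random.default_rng()
counting_process = np.cumsum(rng.binomial(n=1, p=p, size=n_steps))  # non-decreasing integer values
print(counting_process)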

Two important stochastic processes

The two most important stochastic processes are the Poisson process and the Wiener process (often called the Brownian motion process or just Brownian motion). They are important for both applied and theoretical reasons, playing fundamental roles in the theory of stochastic processes. In future posts I'll cover both the Wiener process and the Poisson process.

Code

The code used to create the plots in this post is found here on my code repository. The code exists in both MATLAB and Python.

Further reading

There are many, many books covering the fundamentals of modern probability theory, including those (in roughly increasing order of difficulty) by Grimmett and Stirzaker, Karr, Rosenthal, Shiryaev, Durrett, and Billingsley. A very quick introduction is given in this web article.

The development of stochastic processes is one of the great achievements in modern mathematics. Researchers and practitioners have both studied them in great depth and found many applications for them. Consequently, there is no shortage of literature on stochastic processes. For example:

Finally, one of the main pioneers of stochastic processes was Joseph Doob. His seminal book was simply called Stochastic Processes.