Python

Learning Theory from First Principles

A new book on statistical or machine learning has been published:

Bach – Learning Theory from First Principles, MIT Press.

The author, Francis Bach, is a well-known researcher at Inria in Paris. There’s a free copy of his book on his website.

In a typical French way, the book is focused on theory, especially the proofs of essential results in statistical learning. But I stress that I’ve only glanced at the book.

Bach has also uploaded the code used in the book, written in both Python and MATLAB, which is something I often do.

The Metropolis-Hastings algorithm in Python, Julia, R, and MATLAB

I wrote this code a little while ago, and I thought now would be a good time to present it, as I have covered the topic of Markov chain Monte Carlo (MCMC) methods, particularly the central workhorse the Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm. For more details on these methods, I have written about these methods in a couple posts, starting with this one and ending with this particularly relevant one.

Update: I re-wrote this code in C, located here, where I used my own code for generating Gaussian (or normal) variables, to keep the code entirely self-contained, but in practice you should never do that.

What the code does

The code basically does the same thing in four different (scientific) programming languages, namely Python, Julia, C, and MATLAB. It performs a Metropolis-Hastings algorithm, simulating random variables (or more correctly, generating random variates) for two respective probabilities densities in one dimension and two dimensions.

Two examples

The one-dimensional case is particularly simple, as the code only simulates an exponential variable. But for this random variable in practice, you would never ever do with any MCMC method, because simply simulate exponential random variables directly. I discussed in a previous post how this direct approach is used to simulate Poisson variables.

The two-dimensional case is slightly more complicated than the the classic joint Gaussian (or normal) probability density, which you would use other methods to simulate. But the idea can be extended to $n$ dimensions, which is often the case when dealing with the probability distributions and their corresponding integrals that arise in Bayesian statistics.

Implementation considerations

Variance of the walk

To create the random walk, the code uses normal (or Gaussian) random variables, where the mean is simply the current position of the random walk. This is a standard approach due to the convenient properties of normal distribution.

There’s also the standard deviation $\sigma>0$ of the normal variables. In machine learning circles, this is what they call a hyperparameter. The random Metropolis-Hastings algorithm will, in theory, work regardless of the value, but some values result in faster results than others. This issue is covered briefly in this question on Stats Exchange. And it raises an important question when using MCMC methods in general.

Convergence tests

How many simulations steps are enough to ensure that the random variables being simulated behave closely enough to the desired random variables? In other words, how long does it take for the algorithm to target distribution?

The simulation time elapsed before the algorithm has reached a certain level of sufficient called the burn-in period. Due to its vital importance, it is a central topic in the development and implementation of MCMC methods.

There are tests for assessing the degree of convergence during the simulation run such as the Gelman-Rubin test. This particular test involves taking simple empirical means and variances over the last $m$ samples, across all simulation runs, and between simulation runs. Then a ratio is calculated of variances is calculated, which should be close to one if the algorithm is sufficiently converged. I haven’t implemented any such tests here, but it’s something that one should do in practice.

Perhaps I’ll cover that topic in a future post.

Testing the results

To test the results, you can empirical calculate the first two moments (that is, the mean and variance) of the simulated random variables. That acts as a very good sanity check.

But if you need more convincing, you can perform a histogram on the results, which effectively a way to empirically estimate the probability density of the simulated random variables. Fortunately, all four programming languages have built in histogram (counting) functions for one and two dimensions.

I already covered this topic in a previous posts on checking Poisson point process simulations.

Code

I’ll only include the code for Python and Julia, and refefr the reader to the rest of the code found here.

I used the Matplotlib library to plot the probability density and its estimate.

For the histogram section, I used the histogram and histogram2d functions respectively to estimate the distribution of the number of points and the intensity function. I used the pdf option.

import numpy as np;  # NumPy package for arrays, random number generation, etc
import matplotlib.pyplot as plt  # for plotting
from matplotlib import cm  # for heatmap plotting
from mpl_toolkits import mplot3d  # for 3-D plots
from scipy import integrate  # for integrating

plt.close("all");  # close all previous plots

# Simulation window parameters
xMin = -1;
xMax = 1;
yMin = -1;
yMax = 1;

numbSim = 10 ** 4;  # number of random variables simulated
numbSteps = 200;  # number of steps for the Markov process
numbBins = 50;  # number of bins for histogram
sigma = 2;  # standard deviation for normal random steps

# probability density parameters
s = .5;  # scale parameter for distribution to be simulated

def fun_lambda(x, y):
    return np.exp(-(x ** 4 + x*y + y ** 2) / s ** 2);

# normalization constant
consNorm = integrate.dblquad(fun_lambda, xMin, xMax, lambda x: yMin, lambda y: yMax)[0];
#un-normalized joint density of variables to be simulated
def fun_p(x, y):
    return (fun_lambda(x, y) ) * (x >= xMin) * (y >= yMin) * (x <= xMax) * (y <= yMax);

xRand = np.random.uniform(xMin, xMax, numbSim);  # random initial values
yRand = np.random.uniform(yMin, yMax, numbSim);  # random initial values

pdfCurrent = fun_p(xRand, yRand);  # current transition (probability) densities

for jj in range(numbSteps):
    zxRand = xRand + sigma * np.random.normal(0, 1, numbSim);  # take a (normally distributed) random step
    zyRand = yRand + sigma * np.random.normal(0, 1, numbSim);  # take a (normally distributed) random step
    # Conditional random step needs to be symmetric in x and y
    # For example: Z|x ~ N(x,1) (or Y=x+N(0,1)) with probability density
    # p(z|x)=e(-(z-x)^2/2)/sqrt(2*pi)
    pdfProposal = fun_p(zxRand, zyRand);  # proposed probability

    # acceptance rejection step
    booleAccept = np.random.uniform(0, 1, numbSim) < pdfProposal / pdfCurrent;
    # update state of random walk/Markov chain
    xRand[booleAccept] = zxRand[booleAccept];
    yRand[booleAccept] = zyRand[booleAccept];
    # update transition (probability) densities
    pdfCurrent[booleAccept] = pdfProposal[booleAccept];

# for histogram, need to reshape as vectors
xRand = np.reshape(xRand, numbSim);
yRand = np.reshape(yRand, numbSim);

p_Estimate, xxEdges, yyEdges = np.histogram2d(xRand, yRand, bins=numbSteps, density=True);
xValues = (xxEdges[1:] + xxEdges[0:xxEdges.size - 1]) / 2;  # mid-points of bins
yValues = (yyEdges[1:] + yyEdges[0:yyEdges.size - 1]) / 2;  # mid-points of bins
X, Y = np.meshgrid(xValues, yValues);  # create x/y matrices for plotting

# analytic solution of (normalized) joint probability density
p_Exact = fun_p(X, Y) / consNorm;

# Plotting
# Plot empirical estimate
fig1 = plt.figure();
ax = plt.axes(projection="3d");
#plt.rc("text", usetex=True);
#plt.rc("font", family="serif");
surf = ax.plot_surface(X, Y, p_Estimate, cmap=plt.cm.plasma);
plt.xlabel("x");
plt.ylabel("y");
plt.title("p(x,y) Estimate");

# Plot exact expression
fig2 = plt.figure();
#plt.rc("text", usetex=True);
#plt.rc("font", family="serif")
ax = plt.axes(projection="3d");
surf = ax.plot_surface(X, Y, p_Exact, cmap=plt.cm.plasma);
plt.xlabel("x");
plt.ylabel("y");
plt.title("p(x,y) Exact Expression");

Julia

using Distributions #for random simulations
using PyPlot #for plotting
using StatsBase #for histograms etc
using Random
using LinearAlgebra
using HCubature #for numerical integration
#using LaTeXStrings #for LateX in labels/titles etc
PyPlot.close("all");  # close all PyPlot figures

#set random seed for reproducibility
#Random.seed!(1234)

# Simulation window parameters
xMin = -1;
xMax = 1;
yMin = -1;
yMax = 1;

numbSim = 10 ^ 5;  # number of random variables simulated
numbSteps = 25;  # number of steps for the Markov process
numbBins = 50;  # number of bins for histogram
sigma = 2;  # standard deviation for normal random steps

# probability density parameters
s = .5;  # scale parameter for distribution to be simulated

function fun_lambda(x,y)
    return (exp.(-(x.^4+x.*y+y.^2)./s^2));
end

#normalization constant -- UNDER CONSTRUCTION
consNorm,errorCub=hcubature(x -> fun_lambda(x[1],x[2]), [xMin, yMin], [xMax, yMax]);
#un-normalized joint density of variables to be simulated
function fun_p(x,y)
    return((fun_lambda(x,y)).*(x.>=xMin).*(y.>=yMin).*(x.<=xMax).*(y.<=yMax));
end
xRand=(xMax-xMin).*rand(numbSim).+xMin; #random initial values
yRand=(yMax-yMin).*rand(numbSim).+yMin; #random initial values

pdfCurrent=fun_p(xRand,yRand); #current transition (probability) densities
for jj=1:numbSteps
    zxRand= xRand.+sigma.*rand(Normal(),numbSim);#take a (normally distributed) random step
    zyRand= yRand.+sigma.*rand(Normal(),numbSim);#take a (normally distributed) random step

    # Conditional random step needs to be symmetric in x and y
    # For example: Z|x ~ N(x,1) (or Y=x+N(0,1)) with probability density
    # p(z|x)=e(-(z-x)^2/2)/sqrt(2*pi)
    pdfProposal = fun_p(zxRand, zyRand);  # proposed probability

    # acceptance rejection step
    booleAccept=rand(numbSim) .< pdfProposal./pdfCurrent;
    # update state of random walk/Markov chain
    xRand[booleAccept] = zxRand[booleAccept];
    yRand[booleAccept] = zyRand[booleAccept];
    # update transition (probability) densities
    pdfCurrent[booleAccept] = pdfProposal[booleAccept];
end

#histogram section: empirical probability density
histXY=fit(Histogram, (xRand,yRand),nbins=numbBins); #find histogram data
histXY=normalize(histXY,mode=:pdf); #normalize histogram
binEdges=histXY.edges; #retrieve bin edges
xValues=(binEdges[1][2:end]+binEdges[1][1:end-1])./2; #mid-points of bins
yValues=(binEdges[2][2:end]+binEdges[2][1:end-1])./2; #mid-points of bins
p_Estimate=(histXY.weights)
#create a meshgrid
X=[xValues[ii] for ii=1:length(xValues), jj=1:length(yValues)];
Y=[yValues[jj] for ii=1:length(xValues), jj=1:length(yValues)];

#analytic solution of (normalized) joint probability density
p_Exact = fun_p(X, Y)./consNorm;

# Plotting
# Plot empirical estimate
fig1 = PyPlot.figure();
PyPlot.rc("text", usetex=true);
PyPlot.rc("font", family="serif");
surf(X, Y, p_Estimate, cmap=PyPlot.cm.plasma);
PyPlot.xlabel("x");
PyPlot.ylabel("y");
PyPlot.title("p(x,y) Estimate");

# Plot exact expression
fig2 = PyPlot.figure();
PyPlot.rc("text", usetex=true);
PyPlot.rc("font", family="serif")
surf(X, Y, p_Exact, cmap=PyPlot.cm.plasma);
PyPlot.xlabel("x");
PyPlot.ylabel("y");
PyPlot.title("p(x,y) Exact Expression");

The Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm

Consider a collection of random variables described by a joint probability distribution. Often, in any field with probability or statistics, one faces the task of simulating these random variables, which typically depend on each other in some fashion.

A now standard way for simulating or sampling such random variables is to use the Metropolis-Hastings algorithm, undoubtedly the cornerstone of Markov chain Monte Carlo methods. This method creates a discrete-time Markov chain that has a stationary or invariant distribution being the aforementioned distribution.

The algorithm was born out of a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method. It is typically called the Metropolis-Hastings or the Metropolis algorithm. And some have called it the M(RT)²H algorithm.

(The history is a bit complicated, but perhaps we should drop the name Metropolis. The late Arianna (née Wright) Rosenbluth did most of the work. She was also a dab hand at fencing.)

Although the algorithm’s initial adoption and use was slow, taking decades partly due to slower computers, the Metropolis-Hastings algorithm is now a widely used method for simulating collections of random variables. This in turn gives fast ways for exploring, integrating, and optimizing otherwise unwieldy mathematical functions, such as those found in Bayesian statistics, machine learning, statistical physics, and combinatorial optimization. The algorithm serves as the foundation for other random simulation methods, such as the Gibbs sampler, hence it’s been called the workhorse of Marko chain Monte Carlo methods.

There are many books, articles, lecture notes, and websites describing the Metropolis-Hastings algorithm; see the Further reading section below. But I’ll detail the core ideas here. This post is designed to be somewhat self-contained, but it arose from a series of posts, starting with this one and ending with this particularly relevant one.

Constructing a Markov process

Take a collection of $n$ random variables $X_1,\dots,X_n$ with a (joint) probability distribution $\pi(x)=\pi(x_1,\dots,x_n)$. This distribution will either be a (joint) probability mass function or (joint) probability density for discrete or continuous random variables, respectively.

We wish to construct a Markov chain on an abstract mathematical space $\mathbb{X}$. We assume we can write a point $x\in \mathbb{X}$ as $x=(x_1,\dots,x_n)$. More specifically, the space $\mathbb{X}$ is a Cartesian product of spaces $\mathbb{X}_1,\dots,\mathbb{X}_n$ on which the variables are defined.

Which mathematical space is $\mathbb{X}$? That will, of course, depend on the random variables you’re trying to simulate. But it’s usually the lattice $\mathbb{Z}^n$, Euclidean space $\mathbb{R}^n$, or a subset of one of these two spaces.

For this post, we’ll assume that the space $\mathbb{X}$ is discrete, which makes the things simpler. But the mathematics is similar for continuous spaces, with small changes such as replacing probabilities and sums with probability densities and integrals, respectively. We’ll further assume the space is finite, so we can use matrices to describe the Markov transition kernels. But everything covered here will work on infinite spaces such as $\mathbb{R}^n$, which is the most common space used in practice.

Again, our overall aim is to construct a Markov chain with a stationary $\pi$ being the same as the distribution that we want to sample.

Jumper process

There’s a random jumper that wants to jump around the space $\mathbb{X}$. The jumper randomly jumps from one point in this mathematical space $x\in \mathbb{X}$ to another point $y\in \mathbb{X}$ according to the probability $J(x,y)$. If the the state space $\mathbb{X}$ is finite, then $J$ becomes a matrix. The matrix row $J(x,\cdot)$ is a probability mass function for each $x\in \mathbb{X}$, so it sums to one. By definition, this random jumping forms a Markov chain.

The only thing we ask is that, for our jumper, every point $x$ in $\mathbb{X}$ where $\pi(x)>0$ is reachable with positive probability in a single step. This implies the easy-to-achieve condition $J(x,y)>0$ where $\pi(x)>0$ for all points $x,y\in\mathbb{X}$.

Now we have a Markovian jumper on the space $\mathbb{X}$. But this turns out to be too much jumping for our jumper. Furthermore, the jumper is jumping more in certain directions than others. Occasionally the jumper wants to stay put (and have a rest) with the aim of balancing the jump directions.

The jumper still wants to jump sometimes from a point $x\in \mathbb{X}$ to another point $y\in \mathbb{X}$ based on $J(x,y)$. But now at each time step, after choosing the jump direction but before jumping, the jumper flips a biased coin whose success probability $\alpha(x,y)$ depends on the current position $x\in \mathbb{X}$ and the (potential) next position $y\in \mathbb{X}$. For the coin, the acceptance probability, which allows (or not) the jumper to move from $x$ to $y$, is given by

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{J(y,x)}{J(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The function $\alpha(x,y)$ is clearly never negative. The minimum in the above expression ensures that $\alpha(x,y)$ is a valid probability.

Metropolis-Hastings ratio

The ratio in the expression for $\alpha(x,y)$ is sometimes called the Metropolis-Hastings ratio, which we’ll soon see is designed specifically to balance the jump directions. The ratio means that a constant factor in the target distribution $\pi(x)$ will vanish due to cancellation.

More specifically, if we can write the target distribution as $\pi(x)=f(x)/C$, where $C>0$ is some constant and $f(x)$ is a non-negative function, then the ratio term $\pi(y)/\pi(x)=f(y)/f(x)$. This reasoning also applies to a constant factor in the transition kernel $M$.

The constant factor being irrelevant in the target distribution $\pi(x)$ is very useful. It is particularly important for posterior distributions in Bayesian statistics and the Gibbs distributions in statistical physics, as these distributions typically have difficult-to-calculate constants.

Metropolis-Hastings process

The pairing of the original jumper Markov chain with the coin flipping creates a new Markov chain, which we call the Metropolis-Hastings process. What’s remarkable is its stationary distribution will be the target distribution $\pi(x)$.

Transition kernel (matrix)

For the Metropolis-Hastings process, we can readily reason the structure of the transition kernel (matrix) $M$ that describes this Markov chain. First we’ll look at the off-diagonal entries of $M$.

Jumping from $x$ to $y\neq x$

To jump from point $x$ and to another point $y\neq x$, the probability is simply the previous probability $J(x,y)$ multiplied by the probability of that proposed jump being accepted, which is $\alpha(x,y)$, giving

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

Now we examine the diagonal entries of $M$.

Jumping from $x$ to $x$

There are two different ways for the jumper to remain at point $x$. The first way is that the jumper simply jumps from $x$ to $x$, which happens with probability $J(x,x)$. This proposed jump, so to speak, is accepted with probability one because $\alpha(x,x)=1$. Consequently, we can write this probability as $\alpha(x,x)J(x,x)$ or $J(x,x)$.

The second way consists of all the possible jumps from $x$ to $z$, but then for each of those proposed jumps to be rejected, which happens (for each jump) with probability $[1-\alpha(x,z)]$. (I am using here the dummy or bound variable $z$ instead of $y$ for clarity.) Adding up the probabilities of these events gives the probability of the second way being the sum $\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$

Consequently, for a single time step, the probability that the jumper starts at point $x$ and remains at $x$ is

$$ M(x,x) = \alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$$

The transition matrix $M$ should be stochastic, so the rows sum to one, which we see is the case

$$ \sum_{y\in\mathbb{X}}M(x,y)=1\,.$$

Of course, we could have derived the diagonal entry $M(x,x)$ immediately by starting with the above sum, but that approach lacks intuition into what’s happening.

Expression for the transition kernel $M$

$$M(x,y) = \begin{cases}
\alpha(x,y) J(x,y) & \text{if }
\begin{aligned}[t]
x&\neq y
\end{aligned}\\
\alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) & \text{if } x=y
\end{cases}$$

Often the above expression is written as a single line by placing an indicator function or similar in front of the sum for the diagonal entries. (In the continuous case, where the sum is replaced with an integral, a Dirac delta distribution is often used instead.)

Reversibility

A Markov process on $\mathbb{X}$ with kernel (matrix) $K$ is (time) reversible with respect to the distribution $\mu$ if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution $\mu$. By summing over $x$, we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as $\mu=K\mu$, which says that the transition kernel $K$ has a stationary distribution $\mu $.

(Strictly speaking, we don’t necessarily need reversibility, as long as the Markov chain has a stationary distribution. Reversibility just makes the algebra easier.)

The Metropolis-Hastings process is reversible

We can show that the Metropolis-Hastings process with the transition kernel $M$ satisfies the reversibility condition. The proof essentially just requires the swapping of rows and columns. Clearly then we only need to look at the off-diagonal entries, so we assume $x\neq y$.

We further assume that $\pi(y)J(y,x)\geq \pi(x) J(x,y)$. Then $\alpha(x,y)=1$, so $M(x,y)=J(x,y)$. Now we use this last fact in the last line of algebra that follow:

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= \pi(y)J(y,x) \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}\\ &= \pi(x)J(x,y)\\&= \pi(x)M(x,y)\,,\end{aligned}$$

which shows that the reversibility condition holds with the last assumption.

But of course we can reverse that assumption, due to symmetry, so $\pi(y)J(y,x)\leq \pi(x) J(x,y)$, then $\alpha(x,y)=[\pi(y)J(y,x)]/[\pi(x) J(x,y)]$ and $\alpha(y,x)=1$, implying this way also works. Then the reversibility condition holds for all cases.

We could have shown this more quickly by observing that

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x)\alpha(y,x)\\ &= \pi(y)J(y,x) \min[1, \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}]\\ &= \min[\pi(y)J(y,x) , \pi(x)J(x,y)]\,,\end{aligned}$$

which is symmetric in $x$ and $y$.

In summary, the Metropolis-Hastings process is reversible and has the stationary distribution $\pi(x)$.

Continuous case

As mentioned earlier, the continuous case is similar to the discrete-case with a few modifications. For example, assuming now $\mathbb{X}$ is continuous, then the stationary distribution is described by a probability density $\pi(x)$. The jumper process will be described by transition kernel (function) $j(x,y)$, where $j(x,\cdot)$ is now a probability density for each $x\in \mathbb{X}$.

It follows that the acceptance probability is

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{j(y,x)}{j(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The transition kernel (function) is

$$m(x,y) = \begin{cases}
\alpha(x,y) j(x,y) & \text{if }
\begin{aligned}[t]
x&\neq y
\end{aligned}\\
\alpha(x,x) j(x,x)+\int_{\mathbb{X}}[1-\alpha(x,z)]j(x,z) dz & \text{if } x=y
\end{cases}$$

For more details, there are derivations online such as this one, as well as the sources cited in the Further reading section below.

Libraries

Of course, before writing your own code, I would check out any pre-written functions in your favourite language that already implement the Metropolis-Hastings algorithm.

For example, in MATLAB there’s the function mhsample. Again I would be using this before writing my own code, unless it’s for illustration or educational purposes.

In Python there’s the library pymcmcstat. For those with a machine learning bent, there’s also a TensorFlow function MetropolisHastings.

In R, which is a language designed for statistics, implementations of any Markov chain Monte Carlo methods will be couched in the language and notation of (Bayesian) statistics. For a specific library, I can’t say I know which is the best, but you can check out the MCMCPack library. For R libraries, be wary of using the ones that were published and are (or were?) maintained by a single contributor.

Code

For a couple of simple examples in both one dimension and two dimensions, I’ve implemented the Metropolis-Hastings algorithm in the programming languages R, MATLAB, Python (NumPy) and Julia. The code can be found here.

The Newman-Ziff algorithm for simulating percolation models

Imagine you’re trying to estimate some statistics about a certain percolation system. For example, estimating the probability of a given bond being open, which means connected, when a giant component forms. The system or model is too complicated for analytic results, which is true for all percolation models with the exception of a handful.

So you code up the percolation model in your favourite programming language, run a large number of stochastic simulations, collect the statistics, and then look at the results. Percolation systems are, by definition, large, but with fast computers, the simulation method works. You get your statistics.

Easy.

But that’s just for one parameter. What if you want to see what happens with a range of parameter values? Well, you do the same thing, but now just run the large number of simulations for a range of different parameter values, right?

No.

You could do that, and it would work. But it would be slow. How to make it faster?

Newman and Ziff proposed and implemented a fast percolation algorithm based on the simple ideas of keeping old simulations and using conditional probabilities.

Algorithm

The algorithm was presented in this paper:

2000 – Newman, Ziff – Efficient Monte Carlo Algorithm and High-Precision Results for Percolation

There exists a preprint with a slightly different title.

2001 – Newman, Ziff – A fast Monte Carlo algorithm for site or bond percolation.

Overview

The simulation method consists of three components.

Components of the Newman-Ziff algorithm

Re-use parts of previous simulation outcomes by shuffling.

Use fast union-find algorithms.

Find a smooth curve of results by using conditional probabilities.

None of them are entirely revolutionary, but put together they form a fast (and popular) simulation method for studying (monotonic) percolation models.

1. Randomly sampling sequences

The main insight was to sample in an incremental fashion, adding an open bond to the percolation model, such as a lattice, during each simulation run, without throwing away the previous simulation. More specifically, let’s say they sample a bond model with $k$ open bonds during one simulation run. Then in the next simulation, use the same configuration, and simply add one bond.

This is a nice trick, but there’s no free lunch here. On one hand, they don’t waste results. But on the other hand, they do throw away independence (and ergodicity) by doing this. But perhaps the numerical results don’t care.

To do this step, Newman and Ziff use a random shuffling algorithm called the Fisher-Yates shuffle. But it’s not clear if Newmand and Ziff thought of this shuffling algorithm independently or not. They never call it by its name nor do they cite any work on this well-known shuffling method. Both MATLAB and Python (SciPy) have functions for doing random shuffles.

2. Using union-find

Imagine you have a collection of things. Some of these things are connected to other things in your collection. You want to understand how these clusters of connected things behave.

To find the clusters, looking for ultimately the biggest one, Newman and Ziff use a union-find method. There are different types of these alogrithms, often hinging upon the use of recursion. They were developed mostly in the 1980s, particularly in work by Robert Tarjan.

These methods are very efficient at finding clusters. The algorithm speed or, rather, complexity is given in relation to the inverse of an Ackermann function, which is a famously fast growing function. These methods rely upon the monotonicity of growing unions.

(If you have a non-monotonic percolating system, then unfortunately you can’t use these algorithms, which is the case for percolation based on signal-to-interference-plus-noise ratio (SINR). )

3. Filling in the parameter gaps

For any random sample of a percolation model, the number of open bonds or sites is a natural number. But the parameter of the percolation model, such as the probability of a site or bond being open, is a real number.

Newman and Ziff use a simple conditioning argument to fill the gaps.

For any statistic $S$ of of a given percolation model, the Newman-Ziff algorithm finds the expected statistic conditioned on $n$ number of open bonds (or equivalent objects in other models). In other words, the algorithm finds the expected conditional statistics $E[S_n|N=n]$. Then we just need the probability of $N$ being $n$ as a function of the parameter we’re trying to vary, such as the bond probability $p$. With this probability $P(N=n)$, we immediately arrive at the expression for the statistic $S$ with introductory probability

$$ E(S_n)=E_N [E(S_n|N=n)].$$

For a nice discrete model with $m$ total, we get the expression

$$ E(S_n)=\sum _{n=1}^{m}[P(N=n)S_n].$$

But we know the probability (distribution) of $N$, as it’s simply the binomial variable. Assuming there are $m$ total bonds (or equivalent objects), then we arrive at

$$P(N=n)={m\choose n} p^n(1-p)^{m-n} .$$

Then the final statistic or result is simply

$$E(S_n)=\sum _{n=1}^{m} {m\choose n} p^n(1-p)^{m-n} S_n.$$

Code

I wrote the algorithm in MATLAB and Python. The code can be found here. I’ve included a script for plotting the results (for small square lattices).

MATLAB

It turns out that MATLAB doesn’t like recursion too much. Well, at least that’s my explanation for the slow results when I use the union-find method with recursion. So I used the union-find method without recursion.

Python

My Python code is mostly for illustration purposes. Check out this Python library. That website warns of the traps that multithreading in NumPy can pose.

C

In the other original paper, Newman and Ziff included C code of their algorithm. You can find it on Newman’s website, but I have also a copy of it (with added comments from me) located here.

Future work

I plan to implement this algorithm using the NVIDIA CUDA library. This is exactly the type of algorithm that we can implement and run using graphical processing units (GPUs), which is the entire point of the CUDA library.

At first glance, the spatial dependence of the problem suggests using texture memory to speed up the memory accessing on the GPUs. But I will explain why that’s probably not a good idea for this specific algorithm

News: Accelerated PyTorch learning on Mac

The PyTorch website reveals that I can now do accelerated model training with PyTorch on my Mac (which has one of the new Silicon M series chips):

In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac. Until now, PyTorch training on Mac only leveraged the CPU, but with the upcoming PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.

As the website says, this new PyTorch acceleration capability is due to Metal, which is Apple’s graphics processing (or GPU) technology.

This means I no longer need to use remote machines when training models. I can do it all on my trusty home computer. Let’s check.

The easiest way to check is, if you haven’t already, first install PyTorch. I would use Anaconda:

conda install pytorch torchvision -c pytorch

With PyTorch installed, run Python and then run the commands:

import torch
print(torch.backends.mps.is_available()) 
print(torch.backends.mps.is_built())

If the last commands works without a hitch, you have PyTorch installed. The last two commands should return True, meaning that PyTorch can use your graphics card for (accelerated) calculations.

New link: Dataflowr – Deep Learning DIY

Want to learn some deep learning? I recommend the Dataflowr website:

https://dataflowr.github.io/website/

It’s a good resource for learning the basics of neural networks using the PyTorch library in Python. The focus is on writing and running code. You can even play around with GPUs (graphical processing units) by running them on Google’s Colab, though that’s usually not needed.

It’s mostly run by a researcher who is a former colleague of mine and, while I was at Inria, was indirectly the reason I started using PyTorch for my machine learning work.

Summary: Poisson simulations

Here’s a lazy summary post where I list all the posts on various Poisson simulations. I’ve also linked to code, which is found on this online repository. The code is typically written in MATLAB and Python.

Posts

Poisson point processes

Simulating a homogeneous Poisson point process on a rectangle (See code.)
Simulating a Poisson point process on a disk (See code.)
Simulating a Poisson point process on a triangle (See code.)
Simulating an inhomogeneous Poisson point process (See code.)
Simulating a Poisson point process on a circle (See code.)
Simulating a Poisson point process on a sphere (See code.)
Simulating a Poisson point process on a n-dimensional sphere (See code.)

Some simulations of Poisson point processes are also covered in this post on the Julia programming language:

Testing the Julia language with point process simulations (See code.)

Frozen code

My simulation code has been frozen and buried in Norway. Well, some of my code that I keep on a GitHub repository has become part of a code preservation project. Consequently, beneath my profile it reads:

Arctic Code Vault Contributor

This is part of what is called the GitHub Archive Program. The people behind it aim to preserve large amounts of (open source) code for future generations in thousands and thousands of years time. But how do they do that?

Well, basically, the good people at GitHub chose and converted many, many, many lines of code into certain error-resistant formats, such as QR code. They then printed it all out and buried it deep in an abandoned mine shaft in frozen Norway. (The frozen and stable Norway is also home to a famous seed bank.)

My code in this project includes most of the code that has appeared in these posts. Of course my contribution is just a drop in the vast code ocean of this project. In fact, at least two or three of my colleagues have also had their code put into deep freeze.

Still, it’s a nice thought to know that stuff I wrote, including code for these very posts, will be potentially around for a very long time.

Simulating a Poisson point process on a n-dimensional sphere

In the previous post I outlined how to simulate or sample a homogeneous Poisson point process on the surface of a sphere. Now I will consider a homogeneous Poisson point process on the $(n-1)-$ sphere, which is the surface of the Euclidean ball in $n$ dimensions.

This is a short post because it immediately builds off the previous post. For positioning the points uniformly, I will use Method 2 from that post, which uses normal random variables, as it immediately gives a fast method in $n$ dimensions.

I wrote this post and the code more for curiosity than any immediate application. But simulating a Poisson point process in this setting requires placing points uniformly on a sphere. And there are applications in that, such as Monte Carlo integration methods, as mentioned in this post, which nicely details different sampling methods.

Steps

As is the case for other shapes, simulating a Poisson point process requires two steps.

Number of points

The number of points of a Poisson point process on the surface of a sphere of radius $r>0$ is a Poisson random variable. The mean of this random variable is $\lambda S_{n-1}$, where $S_{n-1}$ is the surface area of the sphere. For a ball embedded in $n$ dimension, the area of the corresponding sphere is given by

$$S_{n-1} = \frac{2 \pi ^{n/2} }{\Gamma(n/2)} r^{n-1},$$

where $\Gamma$ is the gamma function, which is a natural generalization of the factorial. In MATLAB, we can simply use the function gamma. In Python, we need to use the SciPy function scipy.special. gamma.

Locations of points

For each point on the sphere, we generate $n$ standard normal or Gaussian random variables, say, $W_1, \dots, W_n$, which are independent of each other. These random variables are the Cartesian components of the random point. We rescale the components by the Euclidean norm, then multiply by the radius $r$.

For $i=1,\dots, n$, we obtain

$$X_i=\frac{rW_i}{(W_1^2+\cdots+W_n^2)^{1/2}}.$$

These are the Cartesian coordinates of a point uniformly scattered on a sphere with radius $r$ and a centre at the origin.

How does it work?

In the post on the circle setting, I gave a more detailed outline of the proof, where I said the method is like the Box-Muller transform in reverse. The joint density of the normal random variables is from a multivariate normal distribution with zero correlation. This joint density a function of the Cartesian equation for a sphere. This means the density is constant on the sphere, implying that the angle of the point $(W_1,\dots, W_n)$ will be uniformly distributed.

The vector formed from the normal variables $(W_1,\dots,W_n)$ is a random variable with a chi distribution. But the final vector, which stretches from the origin to the point $(X_1,\dots,X_n)$, has length one, because we rescaled it with the Euclidean norm.

Code

The code for all my posts is located online here. For this post, the code in MATLAB and Python is here.

Simulating a Poisson point process on a sphere

In this post I’ll describe how to simulate or sample a homogeneous Poisson point process on the surface of a sphere. I have already simulated this point process on a rectangle, triangle disk, and circle.

Of course, by sphere, I mean the everyday object that is the surface of a three-dimensional ball, where this two-dimensional object is often denoted by $S^2$. Mathematically, this is a generalization from a Poisson point process on a circle, which is slightly simpler than randomly positioning points on a disk. I recommend reading those two posts first, as a lot of the material presented here builds off them.

I have not needed such a simulation in my own work, but I imagine there are many reasons why you would want to simulate a Poisson point process on a sphere. As some motivation, we can imagine these points on a sphere representing, say, meteorites or lightning hitting the Earth.

The generating the number of points is not difficult. The trick is positioning the points on the sphere in a uniform way. As is often the case, there are various ways to do this, and I recommend this post, which lists the main ones. I will use two methods that I consider the most natural and intuitive ones, namely using spherical coordinates and normal random variables, which is what I did in the post on the circle.

Incidentally, a simple modification allows you to scatter the points uniformly inside the sphere, but you would typically say ball in mathematics, giving a Poisson point process inside a ball; see below for details.

Steps

As always, simulating a Poisson point process requires two steps.

Number of points

The number of points of a Poisson point process on the surface of a sphere of radius $r>0$ is a Poisson random variable with mean $\lambda S_2$, where $S_2=4\pi r^2$ is the surface area of the sphere. (In this post I give some details for simulating or sampling Poisson random variables or, more accurately, variates.)

Locations of points

For any homogeneous Poisson point process, we need to position the points uniformly on the underlying space, which is in this case is the sphere. I now outline two different methods for doing this.

Method 1: Spherical coordinates

The first method is based on spherical coordinates $(\rho, \theta,\phi)$, where the radial coordinate $\rho\geq 0$, and the angular coordinates $0 \leq \theta\leq 2\pi$ and $0\leq \phi \leq \pi$. The change of coordinates gives $x=\rho\sin(\theta)\cos(\phi)$, $y=\rho\sin(\theta)\sin(\phi)$, and $z=\rho\cos(\phi)$.

Now we use Proposition 1.1 in this paper by Pitman and Ross. For each point, we generate two uniform variables $V$ and $\Theta$ on the respective intervals $(-1,1)$ and $(0,2\pi)$. Then we place the point with the Cartesian coordinates

$$X = r \sqrt{1-V^2} \cos\Theta, $$

$$Y = r \sqrt{1-V^2}\sin\Theta, $$

$$ Z=r V. $$

This method places a uniform point on a sphere with a radius $r$.

How does it work?

I’ll skip the precise details, but give some interpretation of this method. The random variable $\Phi := \arccos V$ is the $\phi$-coordinate of the uniform point, which implies $\sin \Phi=\sqrt{1-V^2}$, due to basic trigonometric identities. The area element in polar coordinates is $dA = \rho^2 \sin\phi d\phi d\theta $, which is constant with respect to $\theta$. After integrating with respect to $\phi$, we see that the random variable $V=\cos\Phi$ needs to be uniform (instead of $\Phi$) to ensure the point is uniformly located on the surface.

Method 2: Normal random variables

For each point, we generate three standard normal or Gaussian random variables, say, $W_x$, $W_y$, and $W_z$, which are independent of each other. (The term standard here means the normal random variables have mean $\mu =0$ and standard deviation $\sigma=1$.) The three random variables are the Cartesian components of the random point. We rescale the components by the Euclidean norm, then multiply by the radius $r$, giving

$$X=\frac{rW_x}{(W_x^2+W_y^2+W_z^2)^{1/2}},$$

$$Y=\frac{rW_y}{(W_x^2+W_y^2+W_z^2)^{1/2}},$$

$$Z=\frac{rW_z}{(W_x^2+W_y^2+W_z^2)^{1/2}}.$$

These are the Cartesian coordinates of a point uniformly scattered on a sphere with radius $r$ and a centre at the origin.

How does it work?

The procedure is somewhat like the Box-Muller transform in reverse. In the post on the circle setting, I gave an outline of the proof, which I recommend reading. The joint density of the normal random variables is from a multivariate normal distribution with zero correlation. This joint density is constant on the sphere, implying that the angle of the point $(W_x, W_y, W_z)$ will be uniformly distributed.

The vector formed from the normal variables $(W_x, W_y,W_z)$ is a random variable with a chi distribution. We can see that the vector from the origin to the point $(X,Y,Z)$ has length one, because we rescaled it with the Euclidean norm.

Plotting

Depending on your plotting software, the points may more resemble points on an ellipsoid than a sphere due to the different scaling of the x, y and z axes. To fix this in MATLAB, run the command: axis square. In Python, it’s not straightforward to do this, as it seems to lack an automatic function, so I have skipped it.

Results

Here are some results produced by code written in MATLAB and Python. The blue points are the Poisson points on the sphere. I have used a surface plot (with clear faces) to illustrate the underling sphere in black.

MATLAB

Python

Note: The aspect ratio in 3-D Python plots tends to squash the sphere slightly, but it is a sphere.

Code

The code for all my posts is located online here. For this post, the code in MATLAB and Python is here. In Python I used the library mpl_toolkits for doing 3-D plots.

Poisson point process inside the sphere

Perhaps you want to simulate a Poisson point process inside the ball. There are different ways we can do this, but I will describe just one way, which builds off Method 1 for positioning the points uniformly. (In a later post, I will modify Method 2, giving a way to uniformly position points inside the ball.)

For this simulation method, you need to make two simple modifications to the simulation procedure.

Number of points

The number of points of a Poisson point process inside a sphere of radius $r>0$ is a Poisson random variable with mean $\lambda V_3$, where $V_3=4\pi r^3$ is the volume of the sphere.

Locations of points

We will modify Method 1 as outlined above. To sample the points uniformly in the sphere, you need to generate uniform variables on the unit interval $(0,1)$, take their cubic roots, and then, multiply them by the radius $r$. (This is akin to the step of taking the square root in the disk setting.) The random variables for the angular coordinates are generated as before.

What the code does

Two examples

Implementation considerations

Variance of the walk

Convergence tests

Testing the results

Code

Python

Julia

Constructing a Markov process

Jumper process

Metropolis-Hastings ratio

Metropolis-Hastings process

Transition kernel (matrix)

Jumping from \(x\) to \(y\neq x\)

Jumping from \(x\) to \(x\)

Expression for the transition kernel \(M\)

Reversibility

The Metropolis-Hastings process is reversible

Continuous case

Libraries

Code

Further reading

Articles

Historical

Introductory

History

Books

Websites

Algorithm

Overview

1. Randomly sampling sequences

2. Using union-find

3. Filling in the parameter gaps

Code

MATLAB

Python

C

Future work

Further reading

Posts

Poisson point processes

Checking simulations

Poisson line process

Poisson random variables

Steps

Number of points

Locations of points

How does it work?

Code

Further reading

Steps

Number of points

Locations of points

Method 1: Spherical coordinates

How does it work?

Method 2: Normal random variables

How does it work?

Plotting

Results

MATLAB

Python

Code

Poisson point process inside the sphere

Number of points

Locations of points

Further reading