H. Paul Keeler

The first MC in MCMC methods

Markov chains form a fundamentally important class of stochastic processes. It would be hard to overstress their importance in probability, statistics, and, more broadly, science and technology. They are indispensable in random simulations, particularly those based on Markov chain Monte Carlo methods. In this post, we’ll have a look at some Markov chain basics needed for such simulation methods.

This is the first part of a series of posts on Markov chain Monte Carlo methods. This post covers the basics of Markov chains, which is the more involved part. The second part will cover Monte Carlo methods. The third part will combine the ideas from the first two parts. Overall, the three posts will sketch the mechanics of Markov chain Monte Carlo (MCMC) methods.

Markov chains vs Markov processes

All Markov chains are Markov processes. Some people use the term Markov chain to refer to discrete-time Markov processes with general state spaces. Other people prefer the term Markov chain for continuous-time Markov processes with countable state spaces.¹

Nevertheless, the first MC in MCMC suggests that the Markov chain Monte Carlo crowd prefers the former sense of Markov chain, given the use of discrete-time Markov processes in their simulations.

Markov the frog

Some writers introduce Markov chains with a mental image of a frog jumping around lily pads scattered over a pond. (Presumably the frog never misses a lily pad.) We assume the frog chooses the next lily pad through some random mechanism. Perhaps the distances between lily pads or their sizes influence the chances that the frog will jump between them.

We further assume that the frog is a bit particular, preferring to jump in certain directions more than others. More precisely, the probability of our frog jumping from a lily pad labelled $x$ to another labelled $y$ is $P(x,y)$. But jumping in the opposite direction happens with probability $P(y,x)$, which in general is not equal to $P(x,y)$.

I typically use the term points, but the Markov literature usually says that the Markov chain visits states.

State space

We can interpret a Markov chain, a type of stochastic process, as a collection or sequence of random variables.² The values of the random variables are points in some mathematical space $\mathbb{X}$. This space can be quite abstract, but in practice it’s usually the lattice $\mathbb{Z}^n$, Euclidean space $\mathbb{R}^n$, or a subset of one of these two spaces. For our frog example, all the lily pads in the pond form the state space.

We’ll only consider countable Markov chains where the number of points in the state space $\mathbb{X}$ is countable. Although the results and theory generally hold for more general state spaces, the accompanying work requires more technical mathematics. For finite and countable state spaces, we can use standard probability and matrix knowledge. But when we use uncountable state spaces such as $\mathbb{R}^n$, we enter the world of measure theory and functional analysis.

I will often write a point $x$ in a (state) space $\mathbb{X}$. But you can say an element $x$ of a set $\mathbb{X}$. Many authors refer to the points or elements as states of the Markov chain. In the frog example, each lily pad is a different state.

Markov property

A discrete-time countable Markov chain is a random process that jumps between points of some countable mathematical space $\mathbb{X}$ such that, when at point $x \in \mathbb{X}$, the next position is chosen according to a probability distribution $P(x,\cdot)$ depending only on $x$.

More specifically, a sequence of random variables $(X_0, X_1, \dots)$ is a discrete-time Markov chain $X$ with a countable state space $\mathbb{X}$ and kernel $P$ if for all $x,y \in \mathbb{X}$ and all $t \geq 1$ satisfying $\mathbb{P}[X_t=x,X_{t-1}=x_{t-1},\dots,X_0=x_0]>0$, we have

$$ \begin{align}\mathbb{P}[X_{t+1} =y|&X_{t}=x,X_{t-1}=x_{t-1},\dots,X_0=x_0]\\&=\mathbb{P}[X_{t+1} =y|X_t =x]\\&=P(x,y)\,.\end{align}$$

This equation is often called the Markov property.

The Markov property says that the conditional probability of jumping from point $x$ to $y$ remains the same, regardless of which points or states $x_0,x_1,\dots,x_{t-1}$ were previously visited. This is precisely why the kernel $P$ contains all the information needed to describe the future random evolution of the Markov chain.

We have assumed the probabilities given by $P$ are fixed, meaning we have described a homogeneous Markov chain.

Markov kernel

The kernel $P$ is called the Markov (transition) kernel or probability kernel. Assuming a countable state space $\mathbb{X}$, we can reference any probability value of the kernel $P$ with two variables $x,y\in\mathbb{X}$. If we assume a finite state space $\mathbb{X}$, then the kernel $P$ becomes an ordinary matrix of the kind taught in linear algebra. An infinite but countable state space gives an infinite matrix $P$. Each row of the kernel matrix $P$ must sum to one, because each row is a probability measure.

A more general space, such as Euclidean space $\mathbb{R}^n$, results in a more general kernel with respect to a suitable measure. In this setting, $P(x,\cdot)$ is no longer a probability mass function, but a general probability measure.

Initial distribution

At time $t=0$ we describe the random initial configuration of a Markov process with a probability distribution $\mu_0$. For a finite or countable Markov chain, this initial distribution $\mu_0$ corresponds to a probability mass function encoded as a row vector.

Jumping from $x$ to $y$

The probability distribution $\mu_0$ gives the probability of starting in state (or at point) $x\in\mathbb{X}$. After one time step, we can write down the probability distribution $\mu_1$ that gives us the different probabilities of the Markov chain being at different states. At $n=1$, basic matrix algebra and probability rules give us the matrix equation

$$\mu_1=\mu_0 P$$

By induction, after $n$ time steps we have the expression

$$\mu_n=\mu_0 P^n\,,$$

where the superscript $n$ denotes matrix power. We can write the $n$-time step kernel as $P_{(n)}$, which for a finite Markov chain is given by the matrix equation $P_{(n)}=P^n$.
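To make the matrix equation concrete, here is a minimal C sketch for a finite state space. It assumes the kernel $P$ is stored row by row in a flat array and applies $\mu \mapsto \mu P$ repeatedly instead of forming $P^n$. The function names `step_distribution` and `evolve_distribution` are my own, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One step of the chain: out = mu * P, where mu is a row vector of length n
 * and P is an n-by-n row-stochastic matrix stored row by row in a flat array. */
void step_distribution(const double *mu, const double *P, size_t n, double *out)
{
    for (size_t y = 0; y < n; y++)
    {
        out[y] = 0.0;
        for (size_t x = 0; x < n; x++)
        {
            // probability of being at x, then jumping from x to y
            out[y] += mu[x] * P[x * n + y];
        }
    }
}

/* Overwrites mu with mu * P^numbSteps by applying the kernel numbSteps times.
 * work is caller-supplied scratch space holding n doubles. */
void evolve_distribution(double *mu, const double *P, size_t n,
                         unsigned numbSteps, double *work)
{
    assert(n > 0);
    for (unsigned t = 0; t < numbSteps; t++)
    {
        step_distribution(mu, P, n, work);
        for (size_t x = 0; x < n; x++)
        {
            mu[x] = work[x]; // copy the new distribution back
        }
    }
}
```

For example, starting in the first state of a two-state chain whose kernel has rows $(0.9, 0.1)$ and $(0.5, 0.5)$, two steps give the distribution $(0.86, 0.14)$.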

Seeing how $P_{(n)}$ behaves as $n$ approaches infinity forms part of work that studies the convergence and ergodicity properties of Markov chains. I’ll make these concepts clearer below. But first I’ll give some conditions that are typically needed.

Regularity conditions

A Markov chain with a countable state space needs some conditions to ensure convergence and ergodicity.

A stationary distribution $\pi$

Aperiodicity

Irreducibility

Positive recurrence

The nature of the state space and the kernel will dictate these conditions. These conditions are also not necessarily logically distinct. For example, on a finite state space, you’ll get positive recurrence for free, because an irreducible Markov chain with a finite state space is always positive recurrent, whether or not it is aperiodic.

We now briefly detail these conditions, and in another post I’ll give examples of how the conditions can be met.

Stationary distribution $\pi$

It’s possible to encounter a probability distribution $\pi$ where applying the kernel $P$ returns the same distribution $\pi$, meaning

$$ \pi=\pi P\,.$$

This (fixed-point) equation is called the balance equation.

The distribution $\pi$ is called the stationary, invariant or steady-state distribution. A Markov chain does not need to have a stationary distribution. And if a Markov chain does have one, it may not be unique. Its existence and uniqueness will depend on the Markov kernel $P$.

Showing that a unique stationary distribution exists and that it is possible to reach it with probability one is the stuff of Markov convergence results. Markov chain Monte Carlo methods hinge upon these results.
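As a small worked example (my own illustration, under the assumption $0<a,b\leq 1$), take a two-state Markov chain that jumps from the first state to the second with probability $a$ and back with probability $b$. The kernel and a candidate stationary distribution are

$$ P=\begin{pmatrix}1-a & a\\ b & 1-b\end{pmatrix},\qquad \pi=\left(\frac{b}{a+b},\frac{a}{a+b}\right). $$

Checking the balance equation entry by entry gives

$$ \begin{align}(\pi P)_1&=\frac{b(1-a)+ab}{a+b}=\frac{b}{a+b}\,,\\(\pi P)_2&=\frac{ab+a(1-b)}{a+b}=\frac{a}{a+b}\,,\end{align}$$

so $\pi=\pi P$. If instead $a=b=0$, the kernel is the identity matrix and every probability distribution is stationary, which illustrates how uniqueness can fail.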

Aperiodicity

It is possible for a Markov chain to get trapped in a loop, periodically visiting the same states. The period $d_x$ of a state $x\in \mathbb{X}$ is the greatest common divisor of all $n\geq 1$ such that $P^n(x,x)>0$, where $P^n(x,x)$ is the $(x,x)$ entry of the $n$-step kernel $P^n$. If the period of a point is $d_x=1$, then we say it’s aperiodic. If every state of a Markov chain is aperiodic, we say it’s an aperiodic Markov chain.

Aperiodicity means there are no loops to trap the Markov chain. This property is typically needed for convergence results.
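The definition of the period can be sketched in C for a finite state space. This is only an illustration: it assumes the kernel is stored row by row in a flat array, and it only looks at return times up to a cutoff `maxSteps` chosen by the user, so it finds the greatest common divisor of the return times seen up to that horizon. The names `gcd_unsigned` and `state_period` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor, with the convention gcd(0, b) = b. */
unsigned gcd_unsigned(unsigned a, unsigned b)
{
    while (b != 0)
    {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Period of state x for an n-state chain with kernel P (flat, row by row),
 * using only return times t <= maxSteps. Returns 0 if no return is found. */
unsigned state_period(const double *P, size_t n, size_t x, unsigned maxSteps)
{
    assert(n > 0 && x < n);
    bool *reach = calloc(n, sizeof(bool)); // states reachable in exactly t steps
    bool *next = calloc(n, sizeof(bool));
    if (reach == NULL || next == NULL)
    {
        abort(); // sketch only: give up if allocation fails
    }
    reach[x] = true; // at t = 0 the chain sits at x
    unsigned period = 0;
    for (unsigned t = 1; t <= maxSteps; t++)
    {
        for (size_t v = 0; v < n; v++)
        {
            next[v] = false;
        }
        for (size_t u = 0; u < n; u++)
        {
            if (!reach[u])
            {
                continue;
            }
            for (size_t v = 0; v < n; v++)
            {
                if (P[u * n + v] > 0.0)
                {
                    next[v] = true; // v is reachable one step after u
                }
            }
        }
        for (size_t v = 0; v < n; v++)
        {
            reach[v] = next[v];
        }
        if (reach[x])
        {
            period = gcd_unsigned(period, t); // a return to x is possible at time t
        }
    }
    free(reach);
    free(next);
    return period;
}
```

For a two-state chain that always flips state, the function returns a period of 2. For a chain that can stay put with positive probability, it returns 1.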

Irreducibility

A Markov chain with a countable state space $\mathbb{X}$ is irreducible if the Markov chain can go from any point $x\in\mathbb{X}$ to any other point $y\in\mathbb{X}$ with a positive probability in a finite number of time steps. In other words, for all $x,y\in\mathbb{X}$ there exists a natural number $s$, which may depend on $x$ and $y$, such that $P^s(x,y)>0$.

Irreducibility ensures that every state in the state space can be reached from every other state, so the Markov chain cannot get stuck in just one part of its state space. This property is also needed for convergence results.
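For a finite state space, irreducibility is a reachability question on the directed graph that has an edge from $x$ to $y$ whenever $P(x,y)>0$. The following C sketch checks it with Warshall's transitive closure algorithm, which is my choice for illustration. The function name `is_irreducible` and the flat row-by-row storage of the kernel are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if every state of the n-state chain with kernel P
 * (flat array, row by row) can reach every state in one or more steps. */
bool is_irreducible(const double *P, size_t n)
{
    assert(n > 0);
    bool *reach = malloc(n * n * sizeof(bool));
    if (reach == NULL)
    {
        abort(); // sketch only: give up if allocation fails
    }
    // one-step reachability: x -> y if P(x,y) > 0
    for (size_t i = 0; i < n * n; i++)
    {
        reach[i] = (P[i] > 0.0);
    }
    // Warshall's algorithm: allow paths passing through intermediate state m
    for (size_t m = 0; m < n; m++)
    {
        for (size_t x = 0; x < n; x++)
        {
            for (size_t y = 0; y < n; y++)
            {
                if (reach[x * n + m] && reach[m * n + y])
                {
                    reach[x * n + y] = true;
                }
            }
        }
    }
    bool allReachable = true;
    for (size_t i = 0; i < n * n; i++)
    {
        if (!reach[i])
        {
            allReachable = false;
        }
    }
    free(reach);
    return allReachable;
}
```

A chain with an absorbing state, such as one whose kernel has rows $(1, 0)$ and $(0.5, 0.5)$, fails the check, because the second state cannot be reached from the first.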

Recurrence

When studying Markov processes, a quantity of interest is how much time it takes to return to a state or point. For a point $x\in\mathbb{X}$, we define its first return time as

$$ T_x^+=\min\{ t\geq 1: X_t=x\} \,.$$

As the name suggests, this random variable is the number of time steps for the Markov process to return to state $x$, taking whichever path, conditioned on it starting at $x$.

We call a state $x$ recurrent if the probability of its first return time being finite is one, meaning $\mathbb{P}_x(T_x^+<\infty)=1$. Otherwise the state $x$ is said to be transient.

Positive recurrence

We can classify different types of recurrence based on the expected value of the first return times. A state $x$ is called positive recurrent if the expected value of its first return time is finite, meaning $\mathbb{E}_x(T_x^+)<\infty$. A recurrent state $x$ whose expected first return time is infinite is called null recurrent.

For a countable Markov chain, if all the states in the state space are (positive) recurrent, so $\mathbb{E}_x(T_x^+)<\infty$ for all $x\in\mathbb{X}$, then we say the Markov chain is (positive) recurrent.

Again, the concept of positive recurrence is needed for convergence results.

Ergodicity

We say a countable Markov chain is ergodic if it is irreducible, aperiodic and positive recurrent.³ Ergodicity allows one to find averages by employing a more general form of the law of large numbers, which Monte Carlo methods rely upon. We stress that definitions of ergodicity vary somewhat, but in general it means that convergence results and laws of large numbers hold.

The Metropolis-Hastings algorithm in C with multi-variable densities

Here’s a C implementation of the Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm that can handle joint multi-variable probability densities, thus simulating a finite number of random variables. The Metropolis-Hastings algorithm is the central piece in Markov chain Monte Carlo (MCMC) methods. These methods have become essential in Bayesian statistics, where they are used to tackle high-dimensional integration. I have written a couple of posts about these methods, starting with this one and ending with this one.

My previous C code only works with densities of a couple of variables. That code is covered in this post, which in turn is based on a previous post, in which I presented the algorithm implemented in the scientific programming languages Python, Julia, MATLAB and R. Those examples were mostly for illustration purposes, as there are already good pre-written libraries in those languages.

Code considerations

I’ll describe some considerations for implementing this algorithm in C. Some of this overlaps with what I wrote in the previous post. In writing this more general version of the code, I arguably cleaned up the old code, which is often the case when you generalize things in coding and mathematics.

Storing multi-dimensional values in C

In C, when handling sets of numbers, such as vectors and matrices, one has to use pointers and the malloc function more often than not, which can create headaches. For this algorithm, probably the most important data object is the one for storing the (current) positions of the multi-dimensional random walks across all the simulations run. This conceptually results in a rectangular grid or table of numbers. In mathematics, you would just use a $d \times n$ matrix, where $d$ is the number of dimensions and $n$ is the number of simulation runs.

In most programming languages, this table of numbers typically corresponds to a $d \times n$ array. More specifically in C, you could do this with a multi-dimensional array or an array of pointers. But in this case, you don’t need to use the inherent structure of a matrix. Besides, any multi-dimensional array in C will be stored in memory as a one-dimensional array, so you can store the numbers in a simple one-dimensional array or vector with $d \cdot n$ elements in total.¹ The code does this using a single pointer.

Then using this vector, you just need to index (or map) between the vector and the matrix appropriately, which the code does with the (integer) variable indexSimDim. There are simple one-to-one mappings between elements in the vector and the elements in the $d \times n$ matrix. The code stores the $d$ coordinates of each simulation run next to each other, so coordinate $k$ of simulation run $i$ (both counted from zero) corresponds to the element number $m=i\cdot d+k$ in the vector; see the code below.

unsigned indexSimDim; // index for keeping track of two-dimensional data as a one-dimensional array
// indexSimDim = i * numbDim + k, where i is simulation number and k is dimension number (minus one)
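To see the mapping in isolation, here is a small self-contained sketch. The helper names `flat_index` and `make_demo_table` are hypothetical and do not appear in the post's code; the demo table simply stores the value $10i+k$ at position $(i,k)$ so the layout is easy to check by eye.

```c
#include <assert.h>
#include <stdlib.h>

/* Map (simulation number i, dimension number k) to a position in the
 * flat array, with both i and k counted from zero. */
size_t flat_index(size_t i, size_t k, size_t numbDim)
{
    assert(k < numbDim);
    return i * numbDim + k;
}

/* Allocate a flat numbSim-by-numbDim table in which entry (i,k) holds
 * the value 10*i + k. The caller must free the returned pointer. */
double *make_demo_table(size_t numbSim, size_t numbDim)
{
    double *table = malloc(numbSim * numbDim * sizeof(double));
    if (table == NULL)
    {
        return NULL; // allocation failed
    }
    for (size_t i = 0; i < numbSim; i++)
    {
        for (size_t k = 0; k < numbDim; k++)
        {
            table[flat_index(i, k, numbDim)] = 10.0 * (double)i + (double)k;
        }
    }
    return table;
}
```

With three dimensions, simulation number 2 and dimension number 1 land at element 7 of the flat array.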

This is a standard trick. In fact, you’ll see this type of line of code as necessary boilerplate in CUDA-based (and similar) code in kernels (routines) due to the inherent grid hierarchy of graphics processing units (GPUs). But that discussion deserves a post on its own.

Randomness in C

In previous posts, I’ve remarked that the standard uniform random number generator in C, called rand, is not good enough for research-level randomness. That said, it works fine for regular simulations. The Mersenne Twister is a widely recommended and used algorithm for producing higher-quality random numbers.

To simulate normal (or Gaussian) random variables, I wrote my own simple function using the Box-Muller transform, which I covered in a previous post, so the code would be self-contained. But in practice, you should always use pre-written and tested functions.

Code

The code can be found here, whereas other MCMC code can be found here and here. The joint probability density is defined in the function pdf_single.

/***********************************************************************
 * Runs a simple Metropolis-Hastings (ie MCMC) algorithm to simulate n
 * jointly distributed random variables with probability density p(x,y,z).
 * For example:
 * p(x,y,z)=exp(-(x^4+x*y+y^2+y*z+z^4)/s^2)/consNorm, where s>0 and consNorm is a
 * normalization constant. The probability density function is defined in
 * the function pdf_single.
 *
 * NOTE: In practice, the value of the normalization constant is not needed, as it cancels out in the algorithm.
 *
 * NOTE: This code will *create* a local file (see variable strFilename) to store results. It will *overwrite* that file if it already exists.
 *
 * WARNING: This code uses the default C random number generator, which is known for failing various tests of randomness.
 * Strongly recommended to use another generator for purposes beyond simple illustration.
 *
 * Author: H. Paul Keeler, 2024.
 * Website: hpaulkeeler.com
 * Repository: github.com/hpaulkeeler/posts
 *
 ***********************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <stdbool.h>
#include <string.h>

const long double pi = 3.14159265358979323846; // constant pi for generating polar coordinates

// helper function declarations; see below for definitions
static double *unirand(double *randValues, unsigned numbRand);                           // generate  uniform random variables on (0,1)
static double *normrand(double *randValues, unsigned numbRand, double mu, double sigma); // generate normal random variables
static double pdf_single(double *x_input, unsigned numbDim, double *parameters);           // define probability density to be simulated
static double mean_var(double *set_sample, unsigned numbSim, double *varX);              // calculate mean and variance

int main(int argc, char *argv[])
{

    if (argc > 1)
    {
        fprintf(stderr, "This program takes no arguments...\n");
        exit(1);
    }
    else
    {
        char strFilename[] = "MCMCData_ND.csv"; // filename for storing simulated random variates

        // initializes (pseudo)-random number generator
        time_t timeCPU; // use CPU time for seed
        srand((unsigned)time(&timeCPU));
        // srand(42); //to reproduce results

        bool booleWriteData = true; // write data to file
        bool booleStats = true;     // perform simple mean/std stats
        unsigned numbDimMax = 3; //upper bound on number of dimensions for which stats are calculated and printed out

        // simulation parameters
        unsigned numbSim = 1e4;   // number of random variables simulated
        unsigned numbSteps = 200; // number of steps for the Markov process
        double sigma = 2;         // standard deviation for normal random steps

        // probability density parameters
        double s = .5; // scale parameter for distribution to be simulated
        unsigned numbDim = 3;

        // Metropolis-Hastings variables
        // proposal for a new position in the random walk
        double *tRandProposal = (double *)malloc(numbDim * sizeof(double));
        double pdfProposal;      // density for proposed position
        double pdfCurrent;       // density of current position
        double ratioAccept;      // ratio of densities (ie acceptance probability)
        double uRand;            // uniform variable for Bernoulli trial (ie a coin flip)
        // random step (normally distributed)
        double *p_numbNormT = (double *)malloc(1 * sizeof(double));
        // positions of the random walk (ie the simulated random variables after numbSteps)
        double *p_tRand = (double *)malloc(numbDim * numbSim * sizeof(double));

        unsigned i, j, k;     // loop variables
        unsigned indexSimDim; // index for keeping track of two-dimensional data as a one-dimensional array
        // Typically indexSimDim = i * numbDim + k, where i is simulation number and k is dimension number (minus one)

        double *p_tRandCurrent = (double *)malloc(numbDim * sizeof(double));
        (void)unirand(p_tRand, numbDim * numbSim); // random initial values

        for (i = 0; i < numbSim; i++)
        {
            // loop through each random walk instance (or random variable to be simulated)
            for (k = 0; k < numbDim; k++)
            {
                // loop through dimensions
                indexSimDim = i * numbDim + k;
                // update state of random walk / Markov chain
                *(p_tRandCurrent + k) = *(p_tRand + indexSimDim);
            }

            pdfCurrent = pdf_single(p_tRandCurrent, numbDim, & s); // current probability density

            for (j = 0; j < numbSteps; j++)
            {
                // loop through each step of the random walk
                for (k = 0; k < numbDim; k++)
                {
                    // loop through dimensions
                    indexSimDim = i * numbDim + k;
                    (void)normrand(p_numbNormT, 1, 0, sigma);
                    // take a (normally distributed) random step in x, y and z
                    *(tRandProposal+k) = *(p_tRand + indexSimDim) + *(p_numbNormT);
                }
                pdfProposal = pdf_single(tRandProposal, numbDim, & s); // proposed probability density

                // acceptance rejection step
                (void)unirand(&uRand, 1);
                ratioAccept = pdfProposal / pdfCurrent;
                if (uRand < ratioAccept)
                {
                    for (k = 0; k < numbDim; k++)
                    {
                        // loop through dimensions
                        indexSimDim = i * numbDim + k;
                        // update state of random walk / Markov chain
                        *(p_tRand + indexSimDim) = tRandProposal[k];
                    }
                    pdfCurrent = pdfProposal;
                }
            }
        }

        // release working memory that is no longer needed
        free(p_numbNormT);
        free(tRandProposal);
        free(p_tRandCurrent);

        if (booleStats)
        {
            // initialize statistics variables (for testing results)
            double *p_AllRand = (double *)malloc(numbSim * sizeof(double));
            double meanTemp = 0;
            double varTemp = 0;
            double stdTemp = 0;
            unsigned numbDimStats = (numbDim < numbDimMax) ? numbDim : numbDimMax; // number of dimensions for which stats are calculated and printed out
            for (k = 0; k < numbDimStats; k++)
            {
                // loop through all the dimensions
                for (i = 0; i < numbSim; i++)
                {
                    // collect variables for dimension k+1
                    indexSimDim = i * numbDim + k;
                    *(p_AllRand + i) = *(p_tRand + indexSimDim);
                }
                meanTemp = mean_var(p_AllRand, numbSim, &varTemp);
                stdTemp = sqrt(varTemp);
                printf("The average of dimension %u random variables is %lf.\n", k + 1, meanTemp);
                printf("The standard deviation of dimension %u random variables is %lf.\n", k + 1, stdTemp);
            }
            free(p_AllRand);
        }

        if (booleWriteData)
        {
            // print to file
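            FILE *outputFile = fopen(strFilename, "w");
            if (outputFile == NULL)
            {
                // stop if the file cannot be created or overwritten
                fprintf(stderr, "Could not open file %s for writing.\n", strFilename);
                return 1;
            }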

            // create string of spacers (ie commas and newlines)
            char *strSpacer = (char *)malloc((numbDim + 1) * sizeof(char));
            for (k = 0; k < numbDim - 1; k++)
            {
                *(strSpacer + k) = ',';
            }
            strSpacer[numbDim - 1] = '\n';
            strSpacer[numbDim] = '\0';
            for (i = 0; i < numbSim; i++)
            {
                for (k = 0; k < numbDim; k++)
                {
                    indexSimDim = i * numbDim + k;
                    fprintf(outputFile, "%lf%c", *(p_tRand + indexSimDim), strSpacer[k]);
                }
            }

            free(strSpacer);
            fclose(outputFile);
            printf("Data printed to file.\n");
        }
        free(p_tRand);

        return (0);
    }
}

static double pdf_single(double *x_input, unsigned numbDim, double *parameters)
{
    // returns the probability density of a single point inside a simulation window defined below
    
    double pdf_output = 0; //probability density at a single point
    
    // non-zero density square window parameters
    double xMin = -1;
    double xMax = 1;

    // retrieve variables
    double x = *(x_input + 0);
    double y = *(x_input + 1);
    double z = *(x_input + 2);
    // retrieve scale parameter
    double s = *(parameters + 0);

    unsigned i;
    // check point is inside simulation window
    bool booleInsideWindow = true;
    for (i = 0; i < numbDim; i++)
    {
        booleInsideWindow = booleInsideWindow && (x_input[i] >= xMin) && (x_input[i] <= xMax);
    }

    // define probability density
    if (booleInsideWindow)
    {
        // evaluate probability density
        pdf_output = exp(-((pow(x, 4) + x * y + pow(y, 2) + y * z + pow(z, 4)) / (s * s)));
    }
    else
    {
        pdf_output = 0;
    }
    return pdf_output;
}

static double *normrand(double *randValues, unsigned numbRand, double mu, double sigma)
{
    // simulate pairs of iid normal variables using Box-Muller transform
    // https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

    double U1, U2, thetaTemp, rhoTemp, Z1, Z2;
    unsigned i = 0;
    while (i < numbRand)
    {
        // simulate variables in polar coordinates (theta, rho)
        (void)unirand(&U1, 1);
        thetaTemp = 2 * pi * U1; // create uniform theta values
        (void)unirand(&U2, 1);
        rhoTemp = sqrt(-2 * log(U2)); // create Rayleigh rho values

        // change to Cartesian coordinates (Z1, Z2)
        Z1 = rhoTemp * cos(thetaTemp);
        Z1 = sigma * Z1 + mu;
        randValues[i] = Z1; // assign first of random variable pair
        i++;
        if (i < numbRand)
        {
            // if more variables are needed, generate second value of random pair
            Z2 = rhoTemp * sin(thetaTemp);
            Z2 = sigma * Z2 + mu;
            randValues[i] = Z2; // assign second of random variable pair
            i++;
        }
        else
        {
            break;
        }
    }
    return randValues;
}

static double *unirand(double *randValues, unsigned numbRand)
{ // simulate numbRand uniform random variables on the unit interval
  // storing them in randValues which must be allocated by the caller
  // with enough space for numbRand doubles

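    for (unsigned i = 0; i < numbRand; i++)
    {
        // shift and rescale so values lie strictly inside (0,1),
        // which keeps log(U2) finite in normrand
        randValues[i] = ((double)rand() + 1.0) / ((double)RAND_MAX + 2.0);
    }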
    return randValues;
}

static double mean_var(double *set_sample, unsigned numbSim, double *varX)
{
    // mean and variance of set_sample
    unsigned i;
    // initialize statistics variables (for testing results)
    double meanX = 0;
    double meanXSquared = 0;
    double tempX;
    for (i = 0; i < numbSim; i++)
    {
        tempX = *(set_sample + i);
        meanX += tempX / ((double)numbSim);
        meanXSquared += tempX * tempX / ((double)numbSim);
    }

    *varX = meanXSquared - meanX * meanX;
    return meanX;
}

Quantum-enhanced Markov chain Monte Carlo

The not-so-mathematical journal Nature recently published a paper proposing a new Markov chain Monte Carlo method:

2023 – Layden, Mazzola, Mishmash, Motta, Wocjan, Kim, and Sheldon – Quantum-enhanced Markov chain Monte Carlo.

Appearing earlier as this preprint, the paper’s publication in such a journal is a rare event indeed. This post notes this, as well as the fact that we can already simulate perfectly¹ the paper’s test model, the Ising or Potts model.² But this is a quantum algorithm, which is exciting and explains how it can end up in that journal.

The algorithm

The paper’s proposed algorithm adds a quantum mechanical edge or enhancement to the classic Metropolis-Hastings algorithm.³ The original algorithm uses a Markov chain defined on some mathematical space. Running it on a traditional or classical computer, at each time step, the algorithm consists of proposing a random jump and then accepting the proposed jump or not. Owing to the magic of Markov chains, in the long run, the algorithm simulates a desired probability distribution; also see my post.

The new quantum version of the algorithm uses a quantum computer to propose the jump, while still using a classical computer to accept the proposal or not.⁴ The quantum jump proposals are driven by a time-independent Hamiltonian, which is a central object in quantum and, in fact, all physics. This leads to a Boltzmann (or Gibbs) probability distribution for the jumping process.

Then, running the quantum part on a quantum computer, the algorithm will hopefully outperform its classical counterpart. The paper nurtures this hope by giving empirical evidence of the algorithm’s convergence speed. The researchers performed the numerical experiments on a 27-qubit quantum processor at IBM using the platform Qiskit.

Quantum is so hot right now

In recent years researchers have been focusing on such algorithms that exploit the strangeness and spookiness of quantum mechanics. You will see more and more quantum versions of algorithms that appear in statistics, machine learning, and related fields, as suggested by this survey paper, which also appeared in Nature.

Quantum lite

Sometimes quantum mechanics only loosely inspires algorithms and models. In this setting, some of my machine learning work uses determinantal point processes. This kernel-based random model draws direct inspiration from the wave function, a standard object in quantum mechanics. Under suitable simplifying conditions, the model describes the locations of particles known as fermions such as electrons and protons. Still, it’s fascinating that a quantum physics model inspired an interesting random object that has found applications in spatial statistics and machine learning.

Connectivity in device-to-device networks in Poisson-Voronoi cities

Here’s a recently uploaded manuscript:

2023 – Keeler, Błaszczyszyn, Cali – Connectivity and interference in device-to-device networks in Poisson-Voronoi cities.

https://arxiv.org/abs/2309.02137

This work presents numerical results complementing mathematical work carried out by us. The work concerns (continuum) percolation results for a special network model based on Poisson-Voronoi tessellations.

The most relevant works are these two papers (the first being somewhat seminal):

Dousse, Franceschetti, Macris, Meester, Thiran, Percolation in the signal to interference ratio graph, 2006.
Le Gall, Błaszczyszyn, Cali, and En-Najjary, Continuum line-of-sight percolation on Poisson-Voronoi tessellations, 2021.

Our work effectively seeks to combine these two papers. We obtain the equivalent results from the first paper by coupling its connectivity model with the connectivity model and network model (based on a Cox point process) presented in the second paper.

If you want a more detailed version, here’s the abstract:

To study the overall connectivity in device-to-device networks in cities, we incorporate a signal-to-interference-plus-noise connectivity model into a Poisson-Voronoi tessellation model representing the streets of a city. Relays are located at crossroads (or street intersections), whereas (user) devices are scattered along streets. Between any two adjacent relays, we assume data can be transmitted either directly between the relays or through users, given they share a common street. Our simulation results reveal that the network connectivity is ensured when the density of users (on the streets) exceeds a certain critical value. But then the network connectivity disappears when the user density exceeds a second critical value. The intuition is that for longer streets, where direct relay-to-relay communication is not possible, users are needed to transmit data between relays, but with too many users the interference becomes too strong, eventually reducing the overall network connectivity. This observation on the user density evokes previous results based on another wireless network model, where transmitter-receivers were scattered across the plane. This effect disappears when interference is removed from the model, giving a variation of the classic Gilbert model and recalling the lesson that neglecting interference in such network models can give overly optimistic results. For physically reasonable model parameters, we show that crowded streets (with more than six users on a typical street) lead to a sudden drop in connectivity. We also give numerical results outlining a relationship between the user density and the strength of any interference reduction techniques.

In future posts I’ll detail the above work as well as our more mathematical work on this type of percolation model.

The acceptance(-rejection) method for simulating random variables

In a previous post, I covered a simple but much used method for simulating random variables or, rather, generating random variates. To simulate a random variable, that method requires you to be able to easily calculate the inverse of its cumulative distribution function. But you cannot always do that.

In lieu of this, the great John von Neumann wrote in a 1951 paper that you can sample a sequence of values from another probability distribution, accepting only the values that meet a certain condition based on this other distribution and the desired distribution, while rejecting all the others. The accepted values will follow the desired probability distribution. This method of simulation or sampling is called the rejection method or the acceptance method, and it even has the double-barrelled name of the acceptance-rejection (AR) method.

Details

Let $X$ be a continuous random variable with a (probability) density $p(x)$, which is the derivative of its cumulative probability distribution $P(X\leq x)$. The density $p(x)$ corresponds to the desired or target distribution from which we want to sample. For whatever reason, we cannot directly simulate the random variable $X$. (Maybe we cannot use the inverse method because $P(X\leq x)$ is too complicated.)

The idea that von Neumann had was to assume that we can easily simulate another random variable, say, $Y$ with the (probability) density $q(x)$. The density $q(x)$ corresponds to a proposal distribution that we can sample (by using, for example, the inverse method).

Now we further assume that there exists some finite constant $M>0$ such that we can bound $p(x)$ by $Mq(x)$, meaning

$$ p(x) \leq M q(x), \text{ for all } x . $$

Provided this, we can then sample the random variable $Y$ and accept a value of it (for a value of $X$) with probability

$$\alpha = \frac{p(Y)}{Mq(Y)}.$$

If the sampled value of $Y$ is not accepted (which happens with probability $1-\alpha$), then we must repeat this random experiment until a sampled value of $Y$ is accepted.

Algorithm

We give the pseudo-code for the acceptance-rejection method suggested by von Neumann.

Random variable $X$ with density $p(x)$

1. Sample a random variable $Y$ with density $q(x)$, giving a sample value $y$.

2. Calculate the acceptance probability $\alpha = \frac{p(y)}{Mq(y)}$.

3. Sample a uniform random variable $U\sim U(0,1)$, giving a sample value $u$.

4. Return the value $y$ (for the value of $X$) if $u\leq \alpha$, otherwise go to Step 1 and repeat.

As covered in a previous post, Steps 3 and 4 are equivalent to accepting the value $y$ with probability $\alpha$.
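The steps above can be sketched in Python. This is only an illustration, not code from the original paper: the target density $p(x)=6x(1-x)$ on $[0,1]$, the uniform proposal density and the constant $M=1.5$ are all assumptions chosen so that the bound $p(x)\leq Mq(x)$ holds.

```python
import random

def target_density(x):
    """Hypothetical target density p(x) = 6x(1-x) on [0, 1]."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

def proposal_density(x):
    """Hypothetical proposal density q(x) = 1 on [0, 1] (uniform)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

M = 1.5  # largest value of p(x)/q(x), attained at x = 1/2

def sample_acceptance_rejection(rng):
    """Return one value sampled from the target density p."""
    while True:
        y = rng.random()  # Step 1: sample Y from the proposal q
        alpha = target_density(y) / (M * proposal_density(y))  # Step 2
        u = rng.random()  # Step 3: sample a uniform variable
        if u <= alpha:  # Step 4: accept, otherwise repeat
            return y

rng = random.Random(42)  # seeded only to make the run reproducible
samples = [sample_acceptance_rejection(rng) for _ in range(20000)]
mean_sample = sum(samples) / len(samples)
print(mean_sample)  # should be near 1/2, the mean of the target density
```

The choice of a uniform proposal keeps the example short. For a target density on the whole real line, you would need a proposal with heavier tails than the target.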

Point process application

In the context of point processes, this method is akin to thinning point processes independently. This gives a method for positioning points non-uniformly by first placing the points uniformly. The method then thins points based on the desired intensity function. As I covered in a previous post, this is one way to simulate an inhomogeneous (or nonhomogeneous) Poisson point process.
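Here is a minimal Python sketch of the thinning idea on the unit interval. The intensity function $\lambda(x)=100x$ and its upper bound of $100$ are assumptions made for this illustration. The homogeneous Poisson process is generated with exponential gaps between points.

```python
import random

def intensity(x):
    """Hypothetical intensity function on [0, 1]."""
    return 100.0 * x

INTENSITY_MAX = 100.0  # upper bound of intensity(x) on [0, 1]

def simulate_thinned_poisson(rng):
    """Simulate an inhomogeneous Poisson point process on [0, 1] by thinning."""
    # Homogeneous Poisson process on [0, 1]: exponential gaps between points.
    points = []
    x = rng.expovariate(INTENSITY_MAX)
    while x <= 1.0:
        points.append(x)
        x += rng.expovariate(INTENSITY_MAX)
    # Independent thinning: keep each point with probability intensity(x)/max.
    return [x for x in points if rng.random() <= intensity(x) / INTENSITY_MAX]

rng = random.Random(1)  # seeded only to make the run reproducible
counts = [len(simulate_thinned_poisson(rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
print(mean_count)  # should be near 50, the integral of the intensity
```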

Efficiency

Basic probability theory tells us that the number of experiment runs (Steps 1 to 4) until acceptance is a geometric variable. Averaging the acceptance probability $\alpha=p(Y)/(Mq(Y))$ over the proposal density $q$ gives $\int \frac{p(y)}{Mq(y)}q(y)\,dy=1/M$, so the parameter of this geometric variable is $1/M$. On average the acceptance(-rejection) method will take $M$ simulations to sample one value of the random variable $X$ of the target distribution. The key then is to make the constant $M$ as small as possible by choosing a proposal density $q(x)$ that resembles the target density, while still keeping the inequality $p(x) \leq M q(x)$.

Higher dimensions

The difficulty of the acceptance(-rejection) method is finding a good proposal distribution such that the product $Mq(x)$ is not much larger than the target density $p(x)$. In one dimension, this can often be done, but in higher dimensions this becomes increasingly difficult. Consequently, this method is typically not used in higher dimensions.

Another approach with an acceptance step is the Metropolis-Hastings method, which is the quintessential Markov chain Monte Carlo (MCMC) method. This method and its cousins have become exceedingly popular, as they give ways to simulate collections of dependent random variables that have complicated (joint) distributions.

The Box-Muller method for simulating normal variables

In the previous post, I covered a simple but much used method for simulating random variables or, rather, generating random variates. To simulate a random variable, the method requires writing down, in a tractable manner, the inverse of its cumulative distribution function.

But in the case of the normal (or Gaussian) distribution, there is no closed-form expression for its cumulative distribution function nor its inverse. This means you cannot, in an elegant and fast way at least, generate with the inverse method a single normal random variable using a single uniform random variable.

Interestingly, however, you can generate two (independent) normal variables with two (independent) uniform variables using the Box-Muller method, originally proposed by George Box and Mervin E. Muller. This approach uses the inverse method, but in practice it’s not used much (see below). I detail this method because I find it neat and it highlights the connection between the normal distribution and rotational symmetry, which has been the subject of some recent 3Blue1Brown videos on YouTube.

(This method was also used to simulate the Thomas point process, which I covered in a previous post.)

Incidentally, this connection is also mentioned in a previous post on simulating a Poisson point process on the surface of a sphere. In that post, Method 2 uses an observation by Muller that normal random variables can be used to position points uniformly on spheres.

I imagine this method was first observed by transforming two normal variables, instead of guessing various distribution pairs that would work. So I’ll sketch the proof in that direction, which is opposite to that of the algorithm, though it works in both directions.

Proof outline

The joint probability density of two independent variables is simply the product of the two individual probability densities. Then the joint density of two standard normal variables is

$$\begin{align}f_{X,Y}(x,y)&=\left[\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\right]\left[\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\right]\\&=\frac{1}{{2\pi}}e^{-(x^2+y^2)/2}\,.\end{align}$$

Now it requires a change of coordinates in two dimensions (from Cartesian to polar) using a Jacobian determinant, which in this case is $|J(\theta,r)|=r$,¹ giving a new joint probability density

$$f_{\Theta,R}(\theta,r)=\left[\frac{1}{2\pi}\right]\left[ r\,e^{-r^2/2}\right]\,.$$

Now we just identify the two probability densities. The first probability density corresponds to a uniform variable on $[0, 2\pi]$, whereas the second is that of a Rayleigh variable with parameter $\sigma=1$. Of course the proof works in the opposite direction because the transformation (between Cartesian and polar coordinates) is a one-to-one function.

Algorithm

Here’s the Box-Muller method for simulating two (independent) standard normal variables with two (independent) uniform random variables.

Two (independent) standard normal random variables $Z_1$ and $Z_2$

Generate two (independent) uniform random variables $U_1\sim U(0,1)$ and $U_2\sim U(0,1)$.

Return $Z_1=\sqrt{-2\ln U_1}\cos(2\pi U_2)$ and $Z_2=\sqrt{-2\ln U_1}\sin(2\pi U_2)$.

The method effectively samples a uniform angular variable $\Theta=2\pi U_2$ on the interval $[0,2\pi]$ and a radial variable $R=\sqrt{-2\ln U_1}$ with a Rayleigh distribution.

The algorithm produces two independent standard normal variables. Of course, as many of us learn in high school, if $Z$ is a standard normal variable, then the random variable $X=\sigma Z +\mu$ is a normal variable with mean $\mu$ and standard deviation $\sigma>0$.
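The two steps of the Box-Muller method translate almost directly into code. Here is a minimal Python sketch using only the standard library. The function name `box_muller` and the trick of using $1-U_1$ to avoid taking the logarithm of zero are my own choices for this illustration.

```python
import math
import random

def box_muller(rng):
    """Return two independent standard normal values from two uniform values."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))  # Rayleigh radial variable
    theta = 2.0 * math.pi * u2  # uniform angular variable
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(7)  # seeded only to make the run reproducible
pairs = [box_muller(rng) for _ in range(50000)]
z = [value for pair in pairs for value in pair]
mean_z = sum(z) / len(z)
var_z = sum((value - mean_z) ** 2 for value in z) / len(z)
print(mean_z, var_z)  # should be near 0 and 1 respectively
```

To get a normal variable with mean $\mu$ and standard deviation $\sigma$, you would then compute `mu + sigma * z1` for a returned value `z1`.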

The fall of the Box-Muller method

Sadly this method was typically not used, as historically computer processors were slow at doing calculations involving the necessary mathematical functions. To avoid these functions researchers developed and employed other methods such as the ziggurat algorithm.

Also, although processors can now do such calculations much faster, many languages, not just scientific ones, come with functions for generating normal variables. Consequently, there had not been much need to implement this method.

Update: The return of the Box-Muller method

The above conventional wisdom has changed in recent years as processors can now (on a hardware level) readily evaluate such functions. (I had been waiting to see if certain libraries would be re-written by using the Box-Muller method, but why bother if the old ones work so well?) When I used the term “processors”, I had central processing units (CPUs) in mind, but in recent years graphics processing units (GPUs) have become widely popular.

In a comment on this post, it was pointed out that the Box-Muller method is the preferred choice for GPUs, as evidenced by its implementation in NVIDIA’s CUDA library. The reason is that GPUs do not handle loops and branches in algorithms well, so you should use methods that avoid these algorithmic steps. And the Box-Muller method is one that does just that.

The NVIDIA website says:

Because GPUs are so sensitive to looping and branching, it turns out that the best choice for the Gaussian transform is actually the venerable Box-Muller transform

The inverse method for simulating random variables

We will cover a simple but much used method for simulating random variables or, rather, random variates. Although the material here is found in introductory probability courses, it frequently forms the foundation of more advanced stochastic simulation techniques, such as Markov chain Monte Carlo methods.

Details

The basics of probability theory tell us that any random variable can, in theory, be written as a function of a uniform random variable $U$ distributed on the interval $(0,1)$, which is usually written as $U\sim U(0,1)$. All one needs is the inverse of the cumulative distribution function of the desired random variable.

More specifically, let $X$ be a random variable with a cumulative distribution function $F(x)=\mathbb{P}(X\leq x)$. The function $F$ is nondecreasing in $x$, so its inverse can be defined as $F^{-1}(y)=\inf\{x:F(x)\geq y\}$, which is known as the generalized inverse of $F(x)$.

Some authors assume the minimum is attained so the infimum is replaced with the minimum, giving $F^{-1}(y)=\min\{x:F(x)\geq y\}$.

In short, the following result is all that we need.

Transform of a uniform variable $U\sim U(0,1)$

For a uniform random variable $U\sim U(0,1)$, the random variable $F^{-1}(U)$ has the cumulative distribution function $\mathbb{P}(F^{-1}(U)\leq x)=\mathbb{P}(U\leq F(x))=F(x)$.

Algorithm

The above observation gives a method, which I like to call the direct method, for exactly simulating a random variable $X$ with the (cumulative) distribution (function) $F$.

Random variable $X$ with distribution $F$

Sample a uniform random variable $U\sim U(0,1)$, giving a value $u$.

Return the value $x=F^{-1}(u)$ as the sampled value of $X$.

But this approach only works if we can write down (in a relatively straightforward way) the inverse $F^{-1}$, which is usually not the case. This means you cannot, for example, simulate a normal random variable with a single uniform random variable by using just the inverse method, as we cannot write down the inverse of its cumulative distribution function.

(Interestingly, with two (independent) uniform random variables, we can use the transform method to simulate two (independent) normal (or Gaussian) random variables. This approach is called the Box-Muller transform, which I’ll cover in another post.)

Nevertheless, we can apply the inverse method to some useful distributions.

Examples

Warning: The following examples are only for illustration purposes. Except for the Bernoulli example, you would never use them in standard scientific languages such as MATLAB, Python (with NumPy), R or Julia, because those languages already have much better functions for simulating these and many other random variables (or variates). If you are writing a function in a language that lacks such functions, I would consult one of the references mentioned below. Although the inverse method is usually intuitive and elegant, it is often not the fastest method.

Bernoulli distribution

The simplest random variable is that with the Bernoulli distribution. With probability $p$, a Bernoulli random variable $X$ takes the value one. Otherwise, $X$ takes the value zero (with probability $1-p$). This gives the (cumulative) distribution (function):

$$ F_B(x)=\begin{cases}
0 & \text{if } x < 0 \\
1 - p & \text{if } 0 \leq x < 1 \\
1 & \text{if } x \geq 1
\end{cases}$$

This gives a very simple way to simulate (or sample) a Bernoulli variable $X$ with parameter $p$.

Bernoulli random variable $X$ with parameter $p$

Sample a uniform random variable $U\sim U(0,1)$, giving a value $u$.

If $u\leq p$, return $x=1$; otherwise return $x=0$.
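In Python, these two steps amount to a single comparison. Here is a minimal sketch with the standard library, where the function name `sample_bernoulli` and the value $p=0.3$ are assumptions for illustration.

```python
import random

def sample_bernoulli(p, rng):
    """Return 1 with probability p and 0 otherwise."""
    u = rng.random()  # uniform value on [0, 1)
    return 1 if u <= p else 0

rng = random.Random(3)  # seeded only to make the run reproducible
p = 0.3
draws = [sample_bernoulli(p, rng) for _ in range(20000)]
frac_ones = sum(draws) / len(draws)
print(frac_ones)  # should be near p
```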

Application: Acceptance simulation methods

In random simulation code, whenever you do something (or not) with some probability $p$ (or probability $1-p$), then the code will perform the above step. Consequently, you see this in the (pseudo-)code of many stochastic simulations with random binary choices, particularly schemes that have an acceptance step such as the Metropolis-Hastings method and other Markov chain Monte Carlo (MCMC) methods.

In MCMC schemes, a random (binary) choice is proposed and it is accepted with a certain probability, say, $\alpha $. This is the equivalent of accepting the proposed choice if some uniform random variable $U$ meets the condition $U\leq \alpha$.

This explains why the pseudo-code of the same algorithm can vary. Some pseudo-code will say accept with probability $\alpha$, while other pseudo-code will say do if $U\leq \alpha$. These are two equivalent formulations.

Exponential distribution

The cumulative distribution function of an exponential variable with mean $1/\lambda$ is $F_E(x)= 1-e^{-\lambda x}$, which has the inverse $F^{-1}_E(y)=-(1/\lambda)\ln[1-y]$. We can use the fact that on the interval $(0,1)$, a uniform variable $U\sim U(0,1)$ and $1-U$ have the same distribution. Consequently, the random variables $\ln [1-U]$ and $\ln U$ are equal in distribution.

This gives a method for simulating exponential random variables.

Exponential random variable $X$ with mean $1/\lambda$

Sample a uniform random variable $U\sim U(0,1)$, giving a value $u$.

Return $x=-(1/\lambda)\ln u$.
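Here is a minimal Python sketch of this exponential sampler. The function name `sample_exponential` and the rate $\lambda=2$ are assumptions for illustration. Python's `random()` returns values in $[0,1)$, so the sketch uses $1-u$ to keep the argument of the logarithm positive.

```python
import math
import random

def sample_exponential(lam, rng):
    """Return one exponential value with mean 1/lam by the inverse method."""
    u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    return -math.log(u) / lam

rng = random.Random(11)  # seeded only to make the run reproducible
lam = 2.0
values = [sample_exponential(lam, rng) for _ in range(20000)]
mean_value = sum(values) / len(values)
print(mean_value)  # should be near 1/lam
```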

Application: Poisson simulation method

Of course you can use this method to simulate exponential random variables, but it has another application. In a previous post on simulating Poisson variables, I mentioned that exponential random variables can be used to simulate a Poisson random variable in a direct (or exact) manner. That method is based on the distances between the points of a homogeneous Poisson point process (on the real line) being exponential random variables.

But this method is only suitable for low values of $\lambda$, less than say fifteen.
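One common way to turn this observation into code is to add up exponential gaps with rate $\lambda$ and count how many points land in the unit interval. The Python sketch below does that. It is my own illustration under that assumption, so the details may differ from the method in the earlier post.

```python
import math
import random

def sample_poisson(lam, rng):
    """Return one Poisson value with mean lam by counting exponential gaps."""
    count = 0
    # Position of the first point of a homogeneous Poisson process.
    position = -math.log(1.0 - rng.random()) / lam
    while position <= 1.0:
        count += 1  # this point lies in the unit interval
        position += -math.log(1.0 - rng.random()) / lam
    return count

rng = random.Random(5)  # seeded only to make the run reproducible
lam = 4.0
counts = [sample_poisson(lam, rng) for _ in range(20000)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / len(counts)
print(mean_count, var_count)  # both should be near lam
```

The loop runs about $\lambda$ times per sample, which is why this approach becomes slow for large values of $\lambda$.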

Rayleigh distribution

The Rayleigh distribution is $\mathbb{P}(X\leq x)= 1-e^{-x^2/(2\sigma^2)}$ for $x\geq 0$, where $\sigma>0$ is its scale parameter. The square root of an exponential variable with mean $1/\lambda$ has a Rayleigh distribution with scale parameter $\sigma=1/\sqrt{2\lambda}$.

Consequently, the generation method is similar to the previous example.

Rayleigh random variable $Y$ with scale parameter $\sigma>0$

Sample a uniform random variable $U\sim U(0,1)$, giving a value $u$.

Return $y=\sigma\sqrt{-2\ln u}$.
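A minimal Python sketch of the Rayleigh sampler follows. The function name `sample_rayleigh` and the scale $\sigma=2$ are assumptions for illustration. The check uses the known mean $\sigma\sqrt{\pi/2}$ of a Rayleigh variable.

```python
import math
import random

def sample_rayleigh(sigma, rng):
    """Return one Rayleigh value with scale sigma by the inverse method."""
    u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    return sigma * math.sqrt(-2.0 * math.log(u))

rng = random.Random(13)  # seeded only to make the run reproducible
sigma = 2.0
values = [sample_rayleigh(sigma, rng) for _ in range(20000)]
mean_value = sum(values) / len(values)
print(mean_value, sigma * math.sqrt(math.pi / 2.0))  # the two should be close
```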

Other methods

The inverse method is intuitive and often succinct. But most functions for simulating random variables (or, more correctly, generating random variates) do not use these methods, as they are not fast under certain parameter regimes, such as large means. Consequently, other methods are used, such as approximations (with, say, normal random variables) like the ones I outlined in this post on simulating Poisson random variables.

More complicated random systems, such as collections of dependent variables, can be simulated using Markov chain Monte Carlo methods, which is the direction we’ll take in a couple posts after this one.

New link – math3ma.com

In my internet wanderings, I stumbled upon this blog:

https://www.math3ma.com/

The writer, Tai-Danae Bradley, wrote a PhD applying category theory (a field of mathematics that strives for abstraction) to problems in machine learning.

The Newman-Ziff algorithm for simulating percolation models

Imagine you’re trying to estimate some statistics about a certain percolation system. For example, estimating the probability of a given bond being open, which means connected, when a giant component forms. The system or model is too complicated for analytic results, which is true for all percolation models with the exception of a handful.

So you code up the percolation model in your favourite programming language, run a large number of stochastic simulations, collect the statistics, and then look at the results. Percolation systems are, by definition, large, but with fast computers, the simulation method works. You get your statistics.

Easy.

But that’s just for one parameter. What if you want to see what happens with a range of parameter values? Well, you do the same thing, but now just run the large number of simulations for a range of different parameter values, right?

No.

You could do that, and it would work. But it would be slow. How to make it faster?

Newman and Ziff proposed and implemented a fast percolation algorithm based on the simple ideas of keeping old simulations and using conditional probabilities.

Algorithm

The algorithm was presented in this paper:

2000 – Newman, Ziff – Efficient Monte Carlo Algorithm and High-Precision Results for Percolation

There exists a preprint with a slightly different title.

2001 – Newman, Ziff – A fast Monte Carlo algorithm for site or bond percolation.

Overview

The simulation method consists of three components.

Components of the Newman-Ziff algorithm

Re-use parts of previous simulation outcomes by shuffling.

Use fast union-find algorithms.

Find a smooth curve of results by using conditional probabilities.

None of them are entirely revolutionary, but put together they form a fast (and popular) simulation method for studying (monotonic) percolation models.

1. Randomly sampling sequences

The main insight was to sample in an incremental fashion, adding an open bond to the percolation model, such as a lattice, during each simulation run, without throwing away the previous simulation. More specifically, let’s say they sample a bond model with $k$ open bonds during one simulation run. Then in the next simulation, use the same configuration, and simply add one bond.

This is a nice trick, but there’s no free lunch here. On one hand, they don’t waste results. But on the other hand, they do throw away independence (and ergodicity) by doing this. But perhaps the numerical results don’t care.

To do this step, Newman and Ziff use a random shuffling algorithm called the Fisher-Yates shuffle. But it’s not clear if Newman and Ziff thought of this shuffling algorithm independently or not. They never call it by its name nor do they cite any work on this well-known shuffling method. Both MATLAB and Python (NumPy) have functions for doing random shuffles.
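For illustration, here is a minimal Python sketch of the Fisher-Yates shuffle applied to bond labels. It is not the code of Newman and Ziff. The function name and the ten bond labels are assumptions. In practice you would use a library function such as `random.shuffle`.

```python
import random

def fisher_yates_shuffle(items, rng):
    """Return a uniformly random permutation of items."""
    items = list(items)  # copy, so the input is left untouched
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # uniform index from 0 to i, both inclusive
        items[i], items[j] = items[j], items[i]
    return items

rng = random.Random(17)  # seeded only to make the run reproducible
bond_order = fisher_yates_shuffle(range(10), rng)
print(bond_order)
```

The first $k$ entries of `bond_order` then give a configuration with $k$ open bonds, and taking the first $k+1$ entries adds one more bond to the same configuration.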

2. Using union-find

Imagine you have a collection of things. Some of these things are connected to other things in your collection. You want to understand how these clusters of connected things behave.

To find the clusters, looking for ultimately the biggest one, Newman and Ziff use a union-find method. There are different types of these algorithms, often hinging upon the use of recursion. They were developed mostly in the 1980s, particularly in work by Robert Tarjan.

These methods are very efficient at finding clusters. The algorithm speed or, rather, complexity is given in relation to the inverse of an Ackermann function, which is a famously fast growing function. These methods rely upon the monotonicity of growing unions.

(If you have a non-monotonic percolating system, then unfortunately you can’t use these algorithms, which is the case for percolation based on signal-to-interference-plus-noise ratio (SINR). )
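To make the idea concrete, here is a minimal Python sketch of a union-find structure that tracks the size of the largest cluster. It is one of several variants and not the code of Newman and Ziff: it assumes union by size and a non-recursive find with path halving, and the class name and the six sites are hypothetical.

```python
class UnionFind:
    """Clusters of n sites, merged one bond at a time."""

    def __init__(self, n):
        self.parent = list(range(n))  # every site starts as its own root
        self.size = [1] * n  # cluster size, valid at root sites
        self.largest = 1 if n > 0 else 0  # size of the biggest cluster

    def find(self, x):
        """Return the root of the cluster containing x (no recursion)."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the clusters of a and b, attaching the smaller to the larger."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a  # already in the same cluster
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        self.largest = max(self.largest, self.size[root_a])
        return root_a

uf = UnionFind(6)
for a, b in [(0, 1), (1, 2), (3, 4)]:  # three open bonds
    uf.union(a, b)
print(uf.largest)  # size of the biggest cluster after adding the bonds
```

Clusters only ever merge here, which is the monotonicity that these methods rely upon.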

3. Filling in the parameter gaps

For any random sample of a percolation model, the number of open bonds or sites is a natural number. But the parameter of the percolation model, such as the probability of a site or bond being open, is a real number.

Newman and Ziff use a simple conditioning argument to fill the gaps.

For any statistic $S$ of a given percolation model, the Newman-Ziff algorithm finds the expected statistic conditioned on $n$ open bonds (or equivalent objects in other models). In other words, the algorithm finds the conditional expectation $S_n=E[S|N=n]$, where $N$ is the random number of open bonds. Then we just need the probability of $N$ being $n$ as a function of the parameter we’re trying to vary, such as the bond probability $p$. With this probability $P(N=n)$, we immediately arrive at the expression for the expected statistic with introductory probability

$$ E(S)=E_N [E(S|N)].$$

For a nice discrete model with $m$ total bonds, we get the expression

$$ E(S)=\sum _{n=0}^{m}P(N=n)S_n.$$

But we know the probability (distribution) of $N$, as $N$ is simply a binomial variable. Assuming there are $m$ total bonds (or equivalent objects), then we arrive at

$$P(N=n)={m\choose n} p^n(1-p)^{m-n} .$$

Then the final statistic or result is simply

$$E(S)=\sum _{n=0}^{m} {m\choose n} p^n(1-p)^{m-n} S_n.$$
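This binomial weighting is simple to code. The Python sketch below assumes the conditional statistics have already been estimated and stored in a list indexed by the number of open bonds, starting from zero so that the index matches $n$. The function names and the toy statistic are hypothetical.

```python
import math

def binomial_pmf(m, n, p):
    """Probability that a binomial variable with parameters m and p equals n."""
    return math.comb(m, n) * p**n * (1.0 - p) ** (m - n)

def statistic_at_p(conditional_stats, p):
    """Weight conditional_stats[n], for n = 0, ..., m, by binomial probabilities."""
    m = len(conditional_stats) - 1
    return sum(binomial_pmf(m, n, p) * s for n, s in enumerate(conditional_stats))

# Toy check: if the statistic is the number of open bonds itself,
# then the conditional value for n open bonds is n and the result is m*p.
m = 20
conditional_stats = [float(n) for n in range(m + 1)]
value = statistic_at_p(conditional_stats, 0.3)
print(value)  # should be close to m*p = 6
```

Once the conditional statistics are stored, evaluating the result at a new value of $p$ needs no further simulations. For very large $m$, the binomial probabilities should be computed in a numerically stable way rather than with this direct formula.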

Code

I wrote the algorithm in MATLAB and Python. The code can be found here. I’ve included a script for plotting the results (for small square lattices).

MATLAB

It turns out that MATLAB doesn’t like recursion too much. Well, at least that’s my explanation for the slow results when I use the union-find method with recursion. So I used the union-find method without recursion.

Python

My Python code is mostly for illustration purposes. Check out this Python library. That website warns of the traps that multithreading in NumPy can pose.

C

In the other original paper, Newman and Ziff included C code of their algorithm. You can find it on Newman’s website, but I also have a copy of it (with added comments from me) located here.

Future work

I plan to implement this algorithm using the NVIDIA CUDA library. This is exactly the type of algorithm that we can implement and run using graphics processing units (GPUs), which is the entire point of the CUDA library.

At first glance, the spatial dependence of the problem suggests using texture memory to speed up the memory accessing on the GPUs. But I will explain why that’s probably not a good idea for this specific algorithm.

New link: gregorygundersen.com

I came across this blog:

http://gregorygundersen.com/blog

The writer now focuses mostly on financial models and techniques, but earlier posts cover topics in probability and statistics.
