machine learning

Determinantal learning for selecting subsets in wireless networks

We can’t all talk all the time, while still being heard. That’s true for people and it’s true for wireless networks.

Networks need to schedule when their network nodes can transmit and receive data. In other words, a collection or subset of network nodes is chosen to transmit, while another subset is chosen to receive. (In telecommunication language, this is part of what is called the medium access control or MAC layer.)

The classic scheduler (spatial) Aloha involves each node flipping a biased coin with probability $p$ of heads occurring. If heads comes up, a node can transmit. If not, it must stand ready to receive data. Simple.

But can we do better?

Determinantal scheduling

A few years ago, a couple collaborators and I become interested in this problem. To develop an intelligent (or, at least, not stupid) scheduler, we adapted a machine (or statistical) learning framework and published the results in a conference paper:

2020 – Błaszczyszyn, Brochard, and Keeler, Coverage probability in wireless networks with determinantal scheduling.

The hint is in the title. Using a determinantal scheduler, we derived mathematical expressions for the coverage probability. I discussed that paper in an earlier post. I produced the numerical results with code located here and here.

From fermions to generative model

We used a learning framework based on determinantal point process, which were originally used to model subatomic particles called fermions such as electrons. A key property is that determinantal points, like fermions, repel each other, so this is a random model for diversity, anti-clustering or negative correlations.

Defined with a kernel matrix and determinants, these random models have convenient mathematical properties. For example, you can both simulate them (exactly) and train (or fit) them with maximum likelihood methods. Using this fermion model, Kulesa and Taskar proposed and pioneered a machine learning framework.

When properly trained (or fitted), determinantal models can act as generative artificial intelligence (or Gen AI) models. In some sense, these point processes are special types of Gibbsian point processes, which are defined with a general energy function called a Hamiltonian. In the field of AI and computer science more broadly, there are many energy-based models, some of which are used for generative models such as Boltzmann machines.

Recent work

A few days ago I was curious what was happening in this research area, and by chance, I noticed this recently uploaded preprint:

2025 – Tu, Saha, and Dhillon – Determinantal Learning for Subset Selection in Wireless Networks.

I feel more work remains to be done in this area.

I should stress that none of the above work was possible without the pioneering efforts by Kulesza and Taskar in a series of papers, much of which appears in their book:

2012 – Kulesza and Taskar – Determinantal point processes for machine learning, Now Publishers.

You can find a version of the book here.

Lecture slides based on Baccelli, Błaszczyszyn, and Karray

For those interested in the theory of point processes, Mohamed Karray uploaded some lecture slides, which are based on the book manuscript:

2024 – Baccelli, Błaszczyszyn, and Karray – Random Measures, Point Processes, and Stochastic Geometry.

I have written about this manuscript in a previous post.

Although both the slides and manuscript lean heavily on the side of mathematical rigour, applications are the driving motivation. For example, the third set of slides cover Palm calculus, which hinges upon the important concept of conditioning on a point (of the random point process) existing at a specific location. This idea is frequently used in mathematical models of wireless telecommunication networks based on stochastic geometry. (Unfortunately the Wikipedia article on Palm calculus is sorely lacking and only looks at the temporal, not spatial, setting.)

Karray, a researcher at Orange Labs in Paris, also has slides and (mostly forthcoming) manuscripts on machine learning and statistics, which take a rather rigorous route. It’s a French touch, mathematically, as measure theory appears.

The Metropolis(-Rosenbluth-Rosenbluth-Teller-Teller)-Hastings algorithm

Consider a collection of random variables described by a joint probability distribution. Often, in any field with probability or statistics, one faces the task of simulating these random variables, which typically depend on each other in some fashion.

A now standard way for simulating or sampling such random variables is to use the Metropolis-Hastings algorithm, undoubtedly the cornerstone of Markov chain Monte Carlo methods. This method creates a discrete-time Markov chain that has a stationary or invariant distribution being the aforementioned distribution.

The algorithm was born out of a 1953 paper by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller (two husband-wife pairs), who looked at a special case, and a 1970 paper by W.K. Hastings, who generalized the method. It is typically called the Metropolis-Hastings or the Metropolis algorithm. And some have called it the M(RT)²H algorithm.

(The history is a bit complicated, but perhaps we should drop the name Metropolis. The late Arianna (née Wright) Rosenbluth did most of the work. She was also a dab hand at fencing.)

Although the algorithm’s initial adoption and use was slow, taking decades partly due to slower computers, the Metropolis-Hastings algorithm is now a widely used method for simulating collections of random variables. This in turn gives fast ways for exploring, integrating, and optimizing otherwise unwieldy mathematical functions, such as those found in Bayesian statistics, machine learning, statistical physics, and combinatorial optimization. The algorithm serves as the foundation for other random simulation methods, such as the Gibbs sampler, hence it’s been called the workhorse of Marko chain Monte Carlo methods.

There are many books, articles, lecture notes, and websites describing the Metropolis-Hastings algorithm; see the Further reading section below. But I’ll detail the core ideas here. This post is designed to be somewhat self-contained, but it arose from a series of posts, starting with this one and ending with this particularly relevant one.

Constructing a Markov process

Take a collection of $n$ random variables $X_1,\dots,X_n$ with a (joint) probability distribution $\pi(x)=\pi(x_1,\dots,x_n)$. This distribution will either be a (joint) probability mass function or (joint) probability density for discrete or continuous random variables, respectively.

We wish to construct a Markov chain on an abstract mathematical space $\mathbb{X}$. We assume we can write a point $x\in \mathbb{X}$ as $x=(x_1,\dots,x_n)$. More specifically, the space $\mathbb{X}$ is a Cartesian product of spaces $\mathbb{X}_1,\dots,\mathbb{X}_n$ on which the variables are defined.

Which mathematical space is $\mathbb{X}$? That will, of course, depend on the random variables you’re trying to simulate. But it’s usually the lattice $\mathbb{Z}^n$, Euclidean space $\mathbb{R}^n$, or a subset of one of these two spaces.

For this post, we’ll assume that the space $\mathbb{X}$ is discrete, which makes the things simpler. But the mathematics is similar for continuous spaces, with small changes such as replacing probabilities and sums with probability densities and integrals, respectively. We’ll further assume the space is finite, so we can use matrices to describe the Markov transition kernels. But everything covered here will work on infinite spaces such as $\mathbb{R}^n$, which is the most common space used in practice.

Again, our overall aim is to construct a Markov chain with a stationary $\pi$ being the same as the distribution that we want to sample.

Jumper process

There’s a random jumper that wants to jump around the space $\mathbb{X}$. The jumper randomly jumps from one point in this mathematical space $x\in \mathbb{X}$ to another point $y\in \mathbb{X}$ according to the probability $J(x,y)$. If the the state space $\mathbb{X}$ is finite, then $J$ becomes a matrix. The matrix row $J(x,\cdot)$ is a probability mass function for each $x\in \mathbb{X}$, so it sums to one. By definition, this random jumping forms a Markov chain.

The only thing we ask is that, for our jumper, every point $x$ in $\mathbb{X}$ where $\pi(x)>0$ is reachable with positive probability in a single step. This implies the easy-to-achieve condition $J(x,y)>0$ where $\pi(x)>0$ for all points $x,y\in\mathbb{X}$.

Now we have a Markovian jumper on the space $\mathbb{X}$. But this turns out to be too much jumping for our jumper. Furthermore, the jumper is jumping more in certain directions than others. Occasionally the jumper wants to stay put (and have a rest) with the aim of balancing the jump directions.

The jumper still wants to jump sometimes from a point $x\in \mathbb{X}$ to another point $y\in \mathbb{X}$ based on $J(x,y)$. But now at each time step, after choosing the jump direction but before jumping, the jumper flips a biased coin whose success probability $\alpha(x,y)$ depends on the current position $x\in \mathbb{X}$ and the (potential) next position $y\in \mathbb{X}$. For the coin, the acceptance probability, which allows (or not) the jumper to move from $x$ to $y$, is given by

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{J(y,x)}{J(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The function $\alpha(x,y)$ is clearly never negative. The minimum in the above expression ensures that $\alpha(x,y)$ is a valid probability.

Metropolis-Hastings ratio

The ratio in the expression for $\alpha(x,y)$ is sometimes called the Metropolis-Hastings ratio, which we’ll soon see is designed specifically to balance the jump directions. The ratio means that a constant factor in the target distribution $\pi(x)$ will vanish due to cancellation.

More specifically, if we can write the target distribution as $\pi(x)=f(x)/C$, where $C>0$ is some constant and $f(x)$ is a non-negative function, then the ratio term $\pi(y)/\pi(x)=f(y)/f(x)$. This reasoning also applies to a constant factor in the transition kernel $M$.

The constant factor being irrelevant in the target distribution $\pi(x)$ is very useful. It is particularly important for posterior distributions in Bayesian statistics and the Gibbs distributions in statistical physics, as these distributions typically have difficult-to-calculate constants.

Metropolis-Hastings process

The pairing of the original jumper Markov chain with the coin flipping creates a new Markov chain, which we call the Metropolis-Hastings process. What’s remarkable is its stationary distribution will be the target distribution $\pi(x)$.

Transition kernel (matrix)

For the Metropolis-Hastings process, we can readily reason the structure of the transition kernel (matrix) $M$ that describes this Markov chain. First we’ll look at the off-diagonal entries of $M$.

Jumping from $x$ to $y\neq x$

To jump from point $x$ and to another point $y\neq x$, the probability is simply the previous probability $J(x,y)$ multiplied by the probability of that proposed jump being accepted, which is $\alpha(x,y)$, giving

$$ M(x,y) = \alpha(x,y) J(x,y), \quad x\neq y\,.$$

Now we examine the diagonal entries of $M$.

Jumping from $x$ to $x$

There are two different ways for the jumper to remain at point $x$. The first way is that the jumper simply jumps from $x$ to $x$, which happens with probability $J(x,x)$. This proposed jump, so to speak, is accepted with probability one because $\alpha(x,x)=1$. Consequently, we can write this probability as $\alpha(x,x)J(x,x)$ or $J(x,x)$.

The second way consists of all the possible jumps from $x$ to $z$, but then for each of those proposed jumps to be rejected, which happens (for each jump) with probability $[1-\alpha(x,z)]$. (I am using here the dummy or bound variable $z$ instead of $y$ for clarity.) Adding up the probabilities of these events gives the probability of the second way being the sum $\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$

Consequently, for a single time step, the probability that the jumper starts at point $x$ and remains at $x$ is

$$ M(x,x) = \alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) \,.$$

The transition matrix $M$ should be stochastic, so the rows sum to one, which we see is the case

$$ \sum_{y\in\mathbb{X}}M(x,y)=1\,.$$

Of course, we could have derived the diagonal entry $M(x,x)$ immediately by starting with the above sum, but that approach lacks intuition into what’s happening.

Expression for the transition kernel $M$

$$M(x,y) = \begin{cases}
\alpha(x,y) J(x,y) & \text{if }
\begin{aligned}[t]
x&\neq y
\end{aligned}\\
\alpha(x,x)J(x,x)+\sum_{z\in \mathbb{X}}[1-\alpha(x,z)] J(x,z) & \text{if } x=y
\end{cases}$$

Often the above expression is written as a single line by placing an indicator function or similar in front of the sum for the diagonal entries. (In the continuous case, where the sum is replaced with an integral, a Dirac delta distribution is often used instead.)

Reversibility

A Markov process on $\mathbb{X}$ with kernel (matrix) $K$ is (time) reversible with respect to the distribution $\mu$ if the following holds

$$ \mu(x)K(x,y) = \mu (y) K(y,x)\quad x,y\in\mathbb{X}\,.$$

This reversibility condition is also called the detailed balance equation. If this condition is met, then the Markov process will have a stationary distribution $\mu$. By summing over $x$, we can verify this because we obtain

$$ \sum_{x\in\mathbb{X}}\mu(x)K(x,y) =\mu(y)\sum_{x\in\mathbb{X}} K(y,x)=\mu(y)\,.$$

This is just the balance equation, often written as $\mu=K\mu$, which says that the transition kernel $K$ has a stationary distribution $\mu $.

(Strictly speaking, we don’t necessarily need reversibility, as long as the Markov chain has a stationary distribution. Reversibility just makes the algebra easier.)

The Metropolis-Hastings process is reversible

We can show that the Metropolis-Hastings process with the transition kernel $M$ satisfies the reversibility condition. The proof essentially just requires the swapping of rows and columns. Clearly then we only need to look at the off-diagonal entries, so we assume $x\neq y$.

We further assume that $\pi(y)J(y,x)\geq \pi(x) J(x,y)$. Then $\alpha(x,y)=1$, so $M(x,y)=J(x,y)$. Now we use this last fact in the last line of algebra that follow:

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x) \alpha(y,x)\\ &= \pi(y)J(y,x) \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}\\ &= \pi(x)J(x,y)\\&= \pi(x)M(x,y)\,,\end{aligned}$$

which shows that the reversibility condition holds with the last assumption.

But of course we can reverse that assumption, due to symmetry, so $\pi(y)J(y,x)\leq \pi(x) J(x,y)$, then $\alpha(x,y)=[\pi(y)J(y,x)]/[\pi(x) J(x,y)]$ and $\alpha(y,x)=1$, implying this way also works. Then the reversibility condition holds for all cases.

We could have shown this more quickly by observing that

$$\begin{aligned}\pi(y) M(y,x)&=\pi(y)J(y,x)\alpha(y,x)\\ &= \pi(y)J(y,x) \min[1, \frac{\pi(x)}{\pi(y)}\frac{J(x,y)}{J(y,x)}]\\ &= \min[\pi(y)J(y,x) , \pi(x)J(x,y)]\,,\end{aligned}$$

which is symmetric in $x$ and $y$.

In summary, the Metropolis-Hastings process is reversible and has the stationary distribution $\pi(x)$.

Continuous case

As mentioned earlier, the continuous case is similar to the discrete-case with a few modifications. For example, assuming now $\mathbb{X}$ is continuous, then the stationary distribution is described by a probability density $\pi(x)$. The jumper process will be described by transition kernel (function) $j(x,y)$, where $j(x,\cdot)$ is now a probability density for each $x\in \mathbb{X}$.

It follows that the acceptance probability is

$$\alpha(x,y)=\min[1,\frac{\pi(y)}{\pi(x)}\frac{j(y,x)}{j(x,y)} ]\,,\quad x, y\in \mathbb{X}\,.$$

The transition kernel (function) is

$$m(x,y) = \begin{cases}
\alpha(x,y) j(x,y) & \text{if }
\begin{aligned}[t]
x&\neq y
\end{aligned}\\
\alpha(x,x) j(x,x)+\int_{\mathbb{X}}[1-\alpha(x,z)]j(x,z) dz & \text{if } x=y
\end{cases}$$

For more details, there are derivations online such as this one, as well as the sources cited in the Further reading section below.

Libraries

Of course, before writing your own code, I would check out any pre-written functions in your favourite language that already implement the Metropolis-Hastings algorithm.

For example, in MATLAB there’s the function mhsample. Again I would be using this before writing my own code, unless it’s for illustration or educational purposes.

In Python there’s the library pymcmcstat. For those with a machine learning bent, there’s also a TensorFlow function MetropolisHastings.

In R, which is a language designed for statistics, implementations of any Markov chain Monte Carlo methods will be couched in the language and notation of (Bayesian) statistics. For a specific library, I can’t say I know which is the best, but you can check out the MCMCPack library. For R libraries, be wary of using the ones that were published and are (or were?) maintained by a single contributor.

Code

For a couple of simple examples in both one dimension and two dimensions, I’ve implemented the Metropolis-Hastings algorithm in the programming languages R, MATLAB, Python (NumPy) and Julia. The code can be found here.

Quantum-enhanced Markov chain Monte Carlo

The not-so-mathematical journal Nature recently published a paper proposing a new Markov chain Monte Carlo method:

2023 – Layden, Mazzola, Mishmash, Motta, Wocjan, Kim, and Sheldon – Quantum-enhanced Markov chain Monte Carlo.

Appearing earlier as this preprint, the paper’s publication in such a journal is a rare event indeed. This post notes this, as well as the fact that we can already simulate perfectly¹ the paper’s test model, the Ising or Potts model.² But this is a quantum algorithm, which is exciting and explains how it can end up in that journal.

The algorithm

The paper’s proposed algorithm adds a quantum mechanical edge or enhancement to the classic Metropolis-Hastings algorithm.³ The original algorithm uses a Markov chain defined on some mathematical space. Running it on a traditional or classical computer, at each time step, the algorithm consists of proposing a random jump and then accepting the proposed jump or not. Owing to the magic of Markov chains, in the long run, the algorithm simulates a desired probability distribution; also see my post.

The new quantum version of the algorithm uses a quantum computer to propose the jump, while still using a classical computer to accept the proposal or not.⁴ The quantum jump proposals are driven by a time-independent Hamiltonian, which is a central object in quantum and, in fact, all physics. This leads to a Boltzmann (or Gibbs) probability distribution for the jumping process.

Then, running the quantum part on a quantum computer, the algorithm will hopefully outperform its classical counterpart. The paper nurtures this hope by giving empirical evidence of the algorithm’s convergence speed. The researchers performed the numerical experiments on a 27-qubit quantum processor at IBM using the platform Qiskit.

Quantum is so hot right now

In recent years researchers have been focusing on such algorithms that exploit the strangeness and spookiness of quantum mechanics. You will see more and more quantum versions of algorithms that appear in statistics, machine learning, and related fields, as suggested by this survey paper, which also appeared in Nature.

Quantum lite

Sometimes quantum mechanics only loosely inspires algorithms and models. In this setting, some of my machine learning work uses determinantal point processes. This kernel-based random model draws direct inspiration from the wave function, a standard object in quantum mechanics. Under suitable simplifying conditions, the model describes the locations of particles known as fermions such as electrons and protons. Still, it’s fascinating that a quantum physics model inspired an interesting random object that has found applications in spatial statistics and machine learning.

New link – math3ma.com

In my internet wanderings, I stumbled upon this blog:

https://almostsuremath.com/

The writer, Tae-Danae Bradley, wrote a PhD applying category theory (a field of mathematics that strives for abstraction) to problems in machine learning.

Book manuscript: Random Measures, Point Processes, and Stochastic Geometry

Interested in the theory of point processes? There is the freely available online book manuscript:

Baccelli, Błaszczyszyn, and Karray – Random Measures, Point Processes, and Stochastic Geometry.

The authors are established names in the field having written many papers on using point processes to model wireless networks.

(Disclaimer: I have worked and published with them. I have even co-authored a book with one of them.)

Of course there are already books on point process, including the two-volume classic by the Daryl Daley and David Vere-Jones. Although that work still serves as a good reference, since its publication researchers have produced many new results. This explains the publication of two more recent books on point processes and their generalization random measures:

Last and Penrose – Lectures on the Poisson process, Cambridge University Press.
Kallenberg – Random Measures, Theory and Applications, Springer.

There’s a free online version of the book by Last and Penrose.

(Disclaimer: Günter Last graciously mentioned my name in the acknowledgements for my very minor contribution to the historical section.)

The manuscript by Baccelli, Błaszczyszyn, and Karray covers newer topics and results by using mathematically rigorous proofs. For example, there are interesting results on determinantal point processes, also called fermion point processes, which started off as a model for subatomic particles in seminal work by Odile Macchi, but they have found applications in wireless network models and machine learning. (I mentioned these later applications in this and this post.)

Despite applications being a strong motivation, the book is not for the faint of hearted in terms of mathematics. In the tradition of French probability, there is plenty of rigour, which means mathematical abstraction, analysis and measure theory.

Update: Mohamed Karray uploaded some lecture slides based on these lectures. Karray is a researcher at Orange Labs in Paris who has research interests in telecommunications (naturally), as well as machine (or statistical) learning.

News: Accelerated PyTorch learning on Mac

The PyTorch website reveals that I can now do accelerated model training with PyTorch on my Mac (which has one of the new Silicon M series chips):

In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac. Until now, PyTorch training on Mac only leveraged the CPU, but with the upcoming PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.

As the website says, this new PyTorch acceleration capability is due to Metal, which is Apple’s graphics processing (or GPU) technology.

This means I no longer need to use remote machines when training models. I can do it all on my trusty home computer. Let’s check.

The easiest way to check is, if you haven’t already, first install PyTorch. I would use Anaconda:

conda install pytorch torchvision -c pytorch

With PyTorch installed, run Python and then run the commands:

import torch
print(torch.backends.mps.is_available()) 
print(torch.backends.mps.is_built())

If the last commands works without a hitch, you have PyTorch installed. The last two commands should return True, meaning that PyTorch can use your graphics card for (accelerated) calculations.

Coverage probability in wireless networks with determinantal scheduling

My collaborators and I uploaded a manuscript:

Błaszczyszyn, Brochard, and Keeler, Coverage probability in wireless networks with determinantal scheduling.

https://arxiv.org/abs/2006.05038

Details

The paper builds off some previous work by us that uses a (relatively) new model in machine learning:

Błaszczyszyn and Keeler, Determinantal thinning of point processes with network learning applications.

https://arxiv.org/abs/1810.08672

The new machine learning model is based on a special type of point process called a determinantal point process. It was originally called a Fermion point process. These are useful point processes as the exhibit certain closure properties under certain operations such independent thinning.

Determinantal point processes are random objects and, when they are properly trained or fitted, can serve as generative artificial intelligence (or Gen AI) models. In fact, you can interpret these point processes as a special Gibbsian point processes, which have at their core a Hamiltonian function. A Hamiltonian is generalized type of energy function, which is also the basis for other generative (or stochastic) models such as Boltzmann machines.

Kulesza and Taskar introduced and developed the framework for using determinantal point processes for machine learning models.

Code

The MATLAB code for the producing the results in the paper can be found here:

https://github.com/hpaulkeeler/detcov_matlab

I also re-wrote the MATLAB code into Python:

https://github.com/hpaulkeeler/detcov_python

Deep Learning: An Introduction for Applied Mathematicians

When I studied neural networks during my undergraduate, they didn’t receive the attention and praise that they now enjoy, becoming synonymous with artificial intelligence (AI). Loosely inspired by the workings of the brain, these statistical models were conceived by back in the 1940s for classifying data. But then they were underwent the so-called AI winter, receiving little notice from the broader research community, except for some notable exceptions.

But now neural networks are back with gusto under the term deep learning.

(Strictly speaking, the term deep learning refers to several classes of statistical or machine learning algorithms with multiple layers making them deep, and neural networks is one class of these algorithms.)

To an outsider, which is most of the world, the term deep learning sounds perhaps a bit mysterious. What’s deep about them?

To dispel the mystery and shed light on these now ubiquitous statistical models, Catherine F. Higham and Desmond J. Higham wrote the tutorial:

2019 – Higham and Higham, Deep Learning: An Introduction for Applied Mathematicians.

Here’s my take on the paper.

I recommend this paper if you want understand the basics of deep learning. It looks at a simple feedforward (neural) network, which is also called a multilayer perceptron, but the paper uses neither term. Taking a step beyond simple linear regression, this is the original nonlinear neural network without any complications that they now possess.

The toy example of a neural network in the paper serves as an excellent starting point for, um, learning deep learning.

What the paper covers

When building up a neural network, a so-called activation function is needed. The working of the neural network involves this function being typically applied again and again (and again) to matrices.

The paper goes through the essential (for training neural networks) procedure of backpropagation, showing that it’s a clever, compact way based on the chain rule in calculus for getting the derivatives of this model. These derivatives are then used in the gradient-based optimization method for fitting the model. (Optimizing functions and fitting statistical models often results imply the same thing.)

Obviously written for (applied) mathematicians, the paper attempts to clarify some of the confusing terms, which have arisen because much of AI and machine learning in general have been developed by computer scientists. For example, what is called a cost function or loss function in the machine learning community is often called an objective function.

Other examples spring to mind. What is commonly called the sigmoid function may be better known as the logistic function. And what is linear in machine learning land is often actually affine.

Worked example with code

The paper includes a worked example with code (in MATLAB) of a simple 4-layer feedforward network (or perceptron). This model is then fitted or trained using a simple stochastic gradient descent method, hence the need for derivatives. For training, the so-called cost function or loss function is a root-square function, but now most neural networks use cost functions based on maximum likelihoods. For the activation function, the papers uses the sigmoid function, but often practitioners use the recitified linear unit (ReLU) activation function is often used.

The problem is a simple binary classification problem, identifying in which of the two regions points lie in a two-dimensional square.

In the code, the neural network is hard coded, so if you want to modify the structure of the network, you’ll have to work a bit. Such coding practices are usually frowned upon, but the authors have done so for brevity and clarity purposes.

The rest of the paper

The next section of the paper involves using a pre-written MATLAB library (but not an official one) that applies a convolution neural network. (The word convolution is used, but the neural networks actually hinge upon filters.) These networks are become essential for treating image data, being now found on (smart)phones and computers for identifying contents of photos. There is less to gain here in terms of intuition on how neural networks work.

The last section of the paper details what the authors didn’t cover, which includes regularization methods (to prevent overfitting) and why the steepest descent method works, despite it being advised against outside of this field.

Code one up yourself

If you want to learning how these networks work, I would strongly suggest coding up one yourself, preferably first without looking at the code given by Higham and Higham.

Of course if you want to develop a good, functioning neural network, you wouldn’t use the code found in the tutorial, which is just for educational purposes. There are libraries such as TensorFlow or (Py)Torch.

Determinantal thinning of point processes with network learning applications

My colleague and I uploaded a manuscript:

Błaszczyszyn and Keeler, Determinantal thinning of point processes with network learning applications.

https://arxiv.org/abs/1810.08672

Details

The paper uses a (relatively) new model framework in machine learning. This framework is based on a special type of point process called a determinantal point process, which is also called a fermion point process. (This particle model draw inspiration from the form of the wave function in quantum mechanics.) Kulesza and Taskar introduced and developed the framework for using determinantal point processes for machine learning models.

The paper covers a type of generative machine learning (or Gen AI) model, due to its using randomness to create network outcomes. In fact, you can interpret these point processes as a special Gibbsian point processes, which have at their core a Hamiltonian function. A Hamiltonian is generalized type of energy function, which is also the basis for other generative (or stochastic) models such as Boltzmann machines.

The training method for determinantal point processes is based on the maximum likelihood approach, which gives a convex optimization problem, allowing you to tackle it with your favourite gradient method, such as the BFGS or Adam method.

It turns out that point processes are amenable for looking at the signal-to-inference-plus-noise ratio (SINR) — see this post for details. More specifically, we found these point processes give tractable SINR-based expressions when you use (what I call) the standard (SINR) model, which I covered in a previous post.

Code

The MATLAB code for the producing the results in the paper can be found here:

https://github.com/hpaulkeeler/DetPoisson_MATLAB

I also re-wrote (or translated) the MATLAB code into Python:

https://github.com/hpaulkeeler/DetPoisson_Python

Determinantal scheduling

From fermions to generative model

Recent work

Further reading

Wireless networks