Sparse Spiking Gradient Descent

A preprint caught my eye:

  • 2022 – Perez-Nieves and Goodman – Sparse Spiking Gradient Descent.

I love myself a short title. But you wouldn’t guess from it that the paper is about artificial neural networks, and not just any type of neural network. Here’s the abstract:

There is an increasing interest in emulating Spiking Neural Networks (SNNs) on neuromorphic computing devices due to their low energy consumption. Recent advances have allowed training SNNs to a point where they start to compete with traditional Artificial Neural Networks (ANNs) in terms of accuracy, while at the same time being energy efficient when run on neuromorphic hardware. However, the process of training SNNs is still based on dense tensor operations originally developed for ANNs which do not leverage the spatiotemporally sparse nature of SNNs. We present here the first sparse SNN backpropagation algorithm which achieves the same or better accuracy as current state of the art methods while being significantly faster and more memory efficient. We show the effectiveness of our method on real datasets of varying complexity (Fashion-MNIST, Neuromorphic-MNIST and Spiking Heidelberg Digits) achieving a speedup in the backward pass of up to 150x, and 85% more memory efficient, without losing accuracy.

OK. I’ll bite. So what’s a spiking neural network? And why should we care?

Improving on artificial neural networks

Artificial neural networks get all the glory. They are now everywhere. You can’t open up a newspaper or your laptop without seeing a reference to or being pestered by some agent of artificial intelligence (AI), which usually implies an artificial neural network is working in the background. Despite this, they are far from ideal.

In some sense, mainstream artificial neural networks are rather brain-lite, as they only loosely draw inspiration from how brains actually function. These statistical models are mostly linear and continuous, which makes them well-behaved mathematically (or algorithmically) speaking.

But in terms of the energy and time required to train and use these computational tools, the carbon-based grey matter is winning. Kids don’t need to read every book, newspaper and sonnet ever written to master a language. And while understanding these words, our brains aren’t boiling volumes of water with excess heat output.

To make such statistical models more brain-heavy, researchers have proposed neural networks that run on spikes, drawing inspiration from the spiky electrical jolts that run through brain neurons. The proposal is something called a spiking neural network, which is also an artificial neural network, but one that runs on spikes with the aim of being more efficient at learning and doing tasks.

Spiking neural networks are not smooth

The problem is that these spiky models are hard to train (or fit): their well-behaved property of smoothness vanishes, because a spike is a discontinuity. You cannot simply run the forward pass and then the backward pass to find the gradients of your neural network model, as you do with regular neural networks when doing backpropagation.
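To see the problem concretely, here’s a minimal sketch of my own (not code from the paper) showing why the spiking nonlinearity defeats naive gradient computation: the step function that decides whether a neuron fires is flat almost everywhere, so its derivative carries no useful training signal.

```python
import numpy as np

def spike(v, threshold=1.0):
    # A neuron "fires" (outputs 1) when its membrane potential crosses a threshold.
    return np.where(v >= threshold, 1.0, 0.0)

# The step function is flat away from the threshold, so a numerical
# derivative is zero almost everywhere (and blows up at the threshold itself).
v = np.linspace(0.0, 2.0, 9)
h = 1e-6
numerical_grad = (spike(v + h) - spike(v - h)) / (2 * h)
print(numerical_grad)
```

Gradient descent multiplied through such zeros learns nothing, which is exactly why a workaround is needed.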

Surrogate gradients

Despite this obstacle, there have been some proposals for training these statistical models. The proposals often come down to choosing a continuous function that approximates the actual function being used in the spiking neural network. That function’s gradients are found using standard methods. We can then use these surrogate gradients to infer in which direction we should move to better train (or fit) the model.

I know. It sounds like cheating, using one function to guess something about another function. But there has been some success with this approach.
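As a hypothetical sketch of the trick: keep the true step function in the forward pass, but pretend, in the backward pass, that we had used a smooth function instead. The fast-sigmoid surrogate below is one common choice from the literature (it is not necessarily the one used in this paper).

```python
import numpy as np

def spike_forward(v, threshold=1.0):
    # Forward pass: the true, discontinuous step function.
    return (v >= threshold).astype(float)

def surrogate_grad(v, threshold=1.0, beta=10.0):
    # Backward pass: derivative of a smooth stand-in,
    # here a fast sigmoid, 1 / (1 + beta * |v - threshold|)^2.
    return 1.0 / (1.0 + beta * np.abs(v - threshold)) ** 2
```

The surrogate gradient peaks at the threshold and decays away from it, so neurons whose potential was close to firing still receive a training signal.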

Sparse Spiking Gradients

The training method proposed by Perez-Nieves and Goodman is a type of surrogate method. Using the leaky integrate-and-fire (LIF) model for neuron firing, they develop an approximation for the gradients of their spiking neural network. A key feature of their approach is that, much like our brains, their model is sparse in the sense that only a small fraction of neurons are ever firing.
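For the curious, here’s a textbook-style discrete-time LIF neuron (a generic sketch; the paper’s exact formulation and parameters may differ). The potential leaks toward rest, integrates its input, and emits a spike and resets whenever it crosses the threshold, so with modest input the spike train stays sparse.

```python
import numpy as np

def lif_simulate(input_current, dt=1e-3, tau=20e-3, threshold=1.0, v_reset=0.0):
    # Discrete-time leaky integrate-and-fire neuron (a common textbook form).
    v = 0.0
    spikes = []
    for i_t in input_current:
        # Leaky integration: the potential decays toward rest and integrates input.
        v = v + (dt / tau) * (-v + i_t)
        if v >= threshold:   # fire...
            spikes.append(1)
            v = v_reset      # ...and reset
        else:
            spikes.append(0)
    return np.array(spikes)

# With a constant input just above threshold, the neuron fires only occasionally,
# so the spike train is sparse.
spikes = lif_simulate(np.full(1000, 1.2))
print(spikes.mean())  # fraction of time steps with a spike
```

It’s this sparsity, most entries of the spike train being zero, that their backward pass exploits.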

Provided the spiking neural network attains a certain degree of sparsity, their sparse spiking gradient descent gives faster, less memory-hungry results.

Perez-Nieves and Goodman support their claims by giving numerical results, which they obtained by running their method on graphics processing units (GPUs). These ferociously fast video game chips have become the standard hardware for the big number-crunching tasks routinely required when working with models in machine learning and artificial intelligence.

Deep Learning: An Introduction for Applied Mathematicians

When I studied neural networks during my undergraduate degree, they didn’t receive the attention and praise that they now enjoy, having become synonymous with artificial intelligence (AI). Loosely inspired by the workings of the brain, these statistical models were conceived back in the 1940s for classifying data. But then they underwent the so-called AI winter, receiving little notice from the broader research community, with some notable exceptions.

But now neural networks are back with gusto under the term deep learning.

(Strictly speaking, the term deep learning refers to several classes of statistical or machine learning algorithms with multiple layers making them deep, and neural networks are one class of these algorithms.)

To an outsider, which is most of the world, the term deep learning sounds perhaps a bit mysterious. What’s deep about them?

To dispel the mystery and shed light on these now ubiquitous statistical models, Catherine F. Higham and Desmond J. Higham wrote the tutorial:

  • 2019 – Higham and Higham – Deep Learning: An Introduction for Applied Mathematicians.

Here’s my take on the paper.

Recommended

I recommend this paper if you want to understand the basics of deep learning. It looks at a simple feedforward (neural) network, which is also called a multilayer perceptron, though the paper uses neither term. Taking a step beyond simple linear regression, this is the original nonlinear neural network without any of the complications that modern networks possess.

The toy example of a neural network in the paper serves as an excellent starting point for, um, learning deep learning.

What the paper covers

When building up a neural network, a so-called activation function is needed. Running the network amounts to applying this function again and again (and again) to affine transformations of the data, layer after layer.
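That repeated application is shorter in code than in words. Here’s a minimal sketch (my own, not the paper’s MATLAB) of a forward pass through a sigmoid network:

```python
import numpy as np

def sigmoid(z):
    # The classic sigmoid (a.k.a. logistic) activation function.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # One forward pass: affine map, then activation, layer after layer.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a
```

Each layer is nothing more than `sigmoid(W @ a + b)`; the depth of the network is just the length of the `weights` list.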

The paper goes through the procedure of backpropagation, essential for training neural networks, showing that it’s a clever, compact way of getting the derivatives of this model, based on the chain rule in calculus. These derivatives are then used in the gradient-based optimization method for fitting the model. (Optimizing functions and fitting statistical models often amount to the same thing.)
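To make the chain rule concrete, here’s a generic sketch of backpropagation for a sigmoid network with a quadratic cost (my own illustration in Python, not the paper’s MATLAB code): the output error is propagated backward layer by layer, yielding every derivative in one sweep.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    # Gradients of the quadratic cost 0.5 * ||a_L - y||^2 via the chain rule.
    # Forward pass, storing each layer's activations.
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    # Backward pass: for the sigmoid, sigma'(z) = a * (1 - a).
    a = activations[-1]
    delta = (a - y) * a * (1 - a)
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, activations[l]))
        grads_b.insert(0, delta)
        if l > 0:
            a_prev = activations[l]
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
    return grads_W, grads_b
```

The whole backward pass reuses the activations computed in the forward pass, which is precisely the compactness the paper highlights.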

Obviously written for (applied) mathematicians, the paper attempts to clarify some of the confusing terms, which have arisen because much of AI and machine learning in general has been developed by computer scientists. For example, what is called a cost function or loss function in the machine learning community is what mathematicians would typically call an objective function.

Other examples spring to mind. What is commonly called the sigmoid function may be better known as the logistic function. And what is linear in machine learning land is often actually affine.

Worked example with code

The paper includes a worked example with code (in MATLAB) of a simple 4-layer feedforward network (or multilayer perceptron). This model is then fitted or trained using a simple stochastic gradient descent method, hence the need for derivatives. For training, the so-called cost function or loss function is a quadratic (least-squares) function, but most neural networks now use cost functions based on maximum likelihood. For the activation function, the paper uses the sigmoid function, but practitioners now often use the rectified linear unit (ReLU).

The problem is a simple binary classification problem, identifying in which of the two regions points lie in a two-dimensional square.
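In the same spirit, here’s a self-contained toy version in Python: a one-hidden-layer sigmoid network, quadratic cost, and plain stochastic gradient descent. The dividing curve below is hypothetical (the paper’s regions differ in detail), and this is a sketch of the general recipe, not a port of the authors’ MATLAB code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: classify points of the unit square by which side of a
# (hypothetical) sine curve they fall on.
X = rng.uniform(size=(500, 2))
y = (X[:, 1] > 0.5 + 0.3 * np.sin(2 * np.pi * X[:, 0])).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 10 sigmoid units.
W1 = rng.normal(scale=0.5, size=(10, 2)); b1 = np.zeros(10)
W2 = rng.normal(scale=0.5, size=(1, 10)); b2 = np.zeros(1)
eta = 0.5  # learning rate

for step in range(20000):
    i = rng.integers(len(X))           # stochastic: one random sample per step
    a0 = X[i]
    a1 = sigmoid(W1 @ a0 + b1)         # forward pass
    a2 = sigmoid(W2 @ a1 + b2)
    d2 = (a2 - y[i]) * a2 * (1 - a2)   # backward pass (chain rule)
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    W2 -= eta * np.outer(d2, a1); b2 -= eta * d2   # gradient descent step
    W1 -= eta * np.outer(d1, a0); b1 -= eta * d1

preds = (sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]) > 0.5).ravel()
accuracy = (preds == y.astype(bool)).mean()
print(accuracy)  # training accuracy
```

Coding a variant like this up yourself, before reading the authors’ version, is a good way to internalize the mechanics.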

In the code, the neural network is hard-coded, so if you want to modify the structure of the network, you’ll have to work a bit. Such coding practices are usually frowned upon, but the authors have done so for brevity and clarity.

The rest of the paper

The next section of the paper involves using a pre-written MATLAB library (though not an official one) that applies a convolutional neural network. (The word convolution is used, but these neural networks actually hinge upon filters.) Such networks have become essential for treating image data, now found on (smart)phones and computers for identifying the contents of photos. There is less to gain here in terms of intuition on how neural networks work.

The last section of the paper details what the authors didn’t cover, which includes regularization methods (to prevent overfitting) and why the steepest descent method works, despite it being advised against outside of this field.

Code one up yourself

If you want to learn how these networks work, I would strongly suggest coding one up yourself, preferably first without looking at the code given by Higham and Higham.

Of course, if you want to develop a good, functioning neural network, you wouldn’t use the code found in the tutorial, which is just for educational purposes. There are libraries for that, such as TensorFlow or (Py)Torch.

Further reading

There is just too much literature, especially on the internet, covering this topic. For a good mix of words and equations, I recommend this book:

  • 2016 – Goodfellow, Bengio and Courville – Deep Learning – MIT Press.

The book is fairly recent, published in 2016. It’s a fast (too fast?) moving field, but this text gets you up to speed with the main research topics in neural learning — sorry — deep learning.