When I studied neural networks as an undergraduate, they didn’t receive the attention and praise that they now enjoy, having become synonymous with artificial intelligence (AI). Loosely inspired by the workings of the brain, these statistical models were conceived back in the 1940s for classifying data. But then they underwent the so-called AI winter, receiving little notice from the broader research community, apart from a few notable exceptions.
But now neural networks are back with gusto under the term deep learning.
(Strictly speaking, the term deep learning refers to several classes of statistical or machine learning algorithms with multiple layers, which is what makes them deep, and neural networks form one of these classes.)
To an outsider, which is most of the world, the term deep learning perhaps sounds a bit mysterious. What’s so deep about these models?
To dispel the mystery and shed light on these now ubiquitous statistical models, Catherine F. Higham and Desmond J. Higham wrote the tutorial:
- 2019 – Higham and Higham, Deep Learning: An Introduction for Applied Mathematicians.
Here’s my take on the paper.
Recommended
I recommend this paper if you want to understand the basics of deep learning. It looks at a simple feedforward (neural) network, also called a multilayer perceptron, though the paper uses neither term. A step beyond simple linear regression, this is the original nonlinear neural network without any of the complications that modern networks possess.
The toy example of a neural network in the paper serves as an excellent starting point for, um, learning deep learning.
What the paper covers
When building up a neural network, a so-called activation function is needed. Running the network then amounts to applying this function again and again (and again), typically elementwise, to the results of matrix operations.
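To make this concrete, here’s a minimal Python sketch (the paper’s code is in MATLAB) of a single layer: an affine map followed by the activation applied elementwise. The weights, biases and sizes are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function, the activation used in the paper
    return 1.0 / (1.0 + np.exp(-z))

# One layer of a feedforward network: an affine map followed by
# the activation applied elementwise (made-up weights and sizes).
W = np.array([[0.1, -0.4],
              [0.7,  0.2],
              [-0.3, 0.5]])       # 3x2 weight matrix
b = np.array([0.1, 0.0, -0.2])    # bias vector
x = np.array([0.5, 0.9])          # input point

a = sigmoid(W @ x + b)            # output of the layer, fed into the next one
print(a)
```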
The paper goes through the procedure of backpropagation, which is essential for training neural networks, showing that it’s a clever, compact application of the chain rule from calculus for getting the derivatives of the model. These derivatives are then used in a gradient-based optimization method for fitting the model. (Optimizing a function and fitting a statistical model often amount to the same thing.)
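As a toy illustration of the idea, and nothing like the paper’s full multi-layer treatment, here’s the chain rule at work on a two-layer scalar “network”, with a finite-difference check at the end as a sanity test; all the numbers are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny "network": y = sigmoid(w2 * sigmoid(w1 * x + b1) + b2),
# with cost C = 0.5 * (y - t)**2. Parameter values are made up.
w1, b1, w2, b2 = 0.8, -0.1, 1.5, 0.3
x, t = 0.6, 1.0

# Forward pass, keeping the intermediate values.
z1 = w1 * x + b1
a1 = sigmoid(z1)
z2 = w2 * a1 + b2
y = sigmoid(z2)
C = 0.5 * (y - t) ** 2

# Backward pass: the chain rule applied from the cost back to w1.
dC_dy = y - t
dy_dz2 = y * (1 - y)          # derivative of the sigmoid
dz2_da1 = w2
da1_dz1 = a1 * (1 - a1)
dz1_dw1 = x
dC_dw1 = dC_dy * dy_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Sanity check against a finite-difference approximation.
eps = 1e-6
Cp = 0.5 * (sigmoid(w2 * sigmoid((w1 + eps) * x + b1) + b2) - t) ** 2
print(dC_dw1, (Cp - C) / eps)  # the two numbers should nearly agree
```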
Obviously written for (applied) mathematicians, the paper attempts to clarify some of the confusing terminology, which has arisen because much of AI and machine learning in general has been developed by computer scientists. For example, what the machine learning community calls a cost function or loss function is what a mathematician would usually call an objective function.
Other examples spring to mind. What is commonly called the sigmoid function may be better known as the logistic function. And what is linear in machine learning land is often actually affine.
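In symbols, those two remarks amount to

$$ \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad x \mapsto W x + b, $$

where the map on the right is affine rather than linear precisely because of the bias term b (the symbols here are just the standard generic ones, not necessarily the paper’s notation).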
Worked example with code
The paper includes a worked example with code (in MATLAB) of a simple four-layer feedforward network (or multilayer perceptron). This model is then fitted, or trained, using a simple stochastic gradient descent method, hence the need for derivatives. For training, the so-called cost function or loss function is a least-squares (squared-error) cost, whereas most neural networks now use cost functions based on maximum likelihood, such as the cross-entropy. For the activation function, the paper uses the sigmoid function, but practitioners now often use the rectified linear unit (ReLU) instead.
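For reference, and purely as standard definitions rather than anything taken from the paper, those more common modern choices look like this:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: the activation most practitioners now default to.
    return np.maximum(0.0, z)

def cross_entropy(y, t, eps=1e-12):
    # Binary cross-entropy, the maximum-likelihood-based cost for a
    # two-class problem; y is the predicted probability, t is the 0/1 label.
    y = np.clip(y, eps, 1.0 - eps)
    return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

print(relu(np.array([-1.0, 0.5])))   # [0.  0.5]
print(cross_entropy(0.9, 1.0))       # small loss for a confident, correct prediction
```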
The problem is a simple binary classification problem: given a point in a two-dimensional square, identify which of two regions it lies in.
In the code, the neural network is hard coded, so if you want to modify the structure of the network, you’ll have to do a bit of work. Such coding practices are usually frowned upon, but the authors have done this for brevity and clarity.
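To give a flavour of what such a hard-coded toy example looks like, here’s a rough Python analogue, not the authors’ MATLAB code, of a small fixed-architecture network trained with stochastic gradient descent on a least-squares cost. The data, layer sizes, learning rate and iteration count are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training data: points in the unit square, labelled by which
# side of the diagonal they fall on (a stand-in for the two regions).
N = 200
X = rng.random((2, N))                   # each column is a point
labels = (X[1] > X[0]).astype(float)
Y = np.vstack([labels, 1.0 - labels])    # one-hot targets, 2 x N

# A small, hard-coded network: 2 inputs -> 2 -> 3 -> 2 outputs.
W2, b2 = 0.5 * rng.standard_normal((2, 2)), np.zeros((2, 1))
W3, b3 = 0.5 * rng.standard_normal((3, 2)), np.zeros((3, 1))
W4, b4 = 0.5 * rng.standard_normal((2, 3)), np.zeros((2, 1))

eta = 0.5  # learning rate
for step in range(100_000):
    # Pick one training point at random (stochastic gradient descent).
    k = rng.integers(N)
    x, y = X[:, [k]], Y[:, [k]]

    # Forward pass.
    a2 = sigmoid(W2 @ x + b2)
    a3 = sigmoid(W3 @ a2 + b3)
    a4 = sigmoid(W4 @ a3 + b4)

    # Backward pass (chain rule) for the least-squares cost 0.5*||a4 - y||^2.
    d4 = a4 * (1 - a4) * (a4 - y)
    d3 = a3 * (1 - a3) * (W4.T @ d4)
    d2 = a2 * (1 - a2) * (W3.T @ d3)

    # Stochastic gradient descent update.
    W4 -= eta * d4 @ a3.T; b4 -= eta * d4
    W3 -= eta * d3 @ a2.T; b3 -= eta * d3
    W2 -= eta * d2 @ x.T;  b2 -= eta * d2

# Crude check: fraction of training points classified correctly.
A2 = sigmoid(W2 @ X + b2)
A3 = sigmoid(W3 @ A2 + b3)
A4 = sigmoid(W4 @ A3 + b4)
print("training accuracy:", np.mean((A4[0] > A4[1]) == (Y[0] > Y[1])))
```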
The rest of the paper
The next section of the paper uses a pre-written (though not official) MATLAB library that implements a convolutional neural network. (The word convolution is used, but these networks actually hinge upon filters.) Such networks have become essential for treating image data, now being found on (smart)phones and computers for identifying the contents of photos. There is less to gain here in terms of intuition on how neural networks work.
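Still, for a taste of what such a filter does (this sketch has nothing to do with the MATLAB library the paper uses), here is a single made-up 3x3 filter slid over a tiny “image”; a convolutional network learns many such filters. Strictly speaking, the sliding operation below is a cross-correlation, which is what these networks actually compute.

```python
import numpy as np

# A made-up 6x6 grayscale "image" and a 3x3 vertical-edge filter.
image = np.zeros((6, 6))
image[:, 3:] = 1.0                     # dark left half, bright right half
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

# Slide the filter over the image (no padding, stride 1); each output
# entry is a weighted sum of the 3x3 patch under the filter.
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
print(out)   # nonzero values where the filter straddles the dark/bright edge
```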
The last section of the paper details what the authors didn’t cover, which includes regularization methods (to prevent overfitting) and why the steepest descent method works, despite it being advised against outside of this field.
Code one up yourself
If you want to learn how these networks work, I would strongly suggest coding one up yourself, preferably at first without looking at the code given by Higham and Higham.
Of course, if you want to develop a good, functioning neural network, you wouldn’t use the code found in the tutorial, which is just for educational purposes. For that there are libraries such as TensorFlow and (Py)Torch.
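As a hint of what that looks like in practice, here’s a rough PyTorch sketch of a small feedforward network in the same spirit as the tutorial’s toy example; the layer sizes, data and training settings are made up, and a real project would add proper data handling, batching, validation and so on.

```python
import torch
from torch import nn

# A small feedforward (fully connected) network; layer sizes are made up.
model = nn.Sequential(
    nn.Linear(2, 8), nn.Sigmoid(),
    nn.Linear(8, 8), nn.Sigmoid(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()                    # a maximum-likelihood-style cost
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# Made-up data: points in the unit square, labelled by which side of
# the diagonal they fall on.
X = torch.rand(200, 2)
y = (X[:, 1] > X[:, 0]).float().unsqueeze(1)

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                       # backpropagation, done for you
    optimizer.step()                      # one gradient descent update
print(loss.item())
```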
Further reading
There is just too much literature, especially on the internet, covering this topic. For a good mix of words and equations, I recommend this book:
- 2016 – Goodfellow, Bengio, and Courville, Deep Learning, MIT Press.
As I said, the book is recent, being published in 2016. It’s a fast (too fast?) moving field, but this text gets you up to speed with the main research topics in neural learning — sorry — deep learning.