A guest lecture for BMEN Mathematical Modeling
(the slides move left-right as well as up-down — pressing the space bar moves through the slides in the correct order)
Quick remark before we get to the theoretical part of this presentation...
This lecture contains several live demos. Why?
Main example:
Task: Predict the nerve mask for an input ultrasound image.
5635 training images, 5508 test images.
420×580 resolution.
~47% of the images don't have a mask.
Data from a $100,000 competition held in 2016: https://www.kaggle.com/c/ultrasound-nerve-segmentation
I will live-execute the following code (more on the hardware & software requirements later):
https://github.com/agisga/ultrasound-nerve-segmentation/blob/master/jupyter/U-net_improved.ipynb
The code borrows from: [1], [2].
Hopefully, by the end of this lecture you will understand all of the components of this deep learning model.
The theoretical part of this presentation (including most figures) is largely based on:
Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015 (licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License)
To follow in spirit, this presentation is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Weighing multiple input factors to make a decision according to a decision threshold.
Binary inputs $x_1, x_2, \ldots \in \{0, 1\}$,
and binary output.
$$ \begin{eqnarray} \mathbf{x} \mapsto \left\{ \begin{array}{ll} 0 & \mbox{if } \mathbf{w}^T \mathbf{x} + b \leq 0 \\ 1 & \mbox{if } \mathbf{w}^T \mathbf{x} + b > 0 \end{array} \right. \end{eqnarray} $$
Example: NAND Gate.
$$ \begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } (-2x_1 - 2x_2 + 3) \leq 0 \\ 1 & \mbox{if } (-2x_1 - 2x_2 + 3) > 0 \end{array} \right. \end{eqnarray} $$
That is, 00 $\mapsto$ 1, 01 $\mapsto$ 1, 10 $\mapsto$ 1, 11 $\mapsto$ 0.
(The NAND gate is universal for computation. Analogy: a network of perceptrons is like a computational circuit.)
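As a quick sanity check, the NAND perceptron above can be written in a few lines of plain Python (weights and bias taken from the formula on this slide):

```python
def perceptron(x, w, b):
    """Binary threshold unit: output 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# NAND gate: w = (-2, -2), b = 3.
w, b = (-2, -2), 3
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(x, w, b))   # -> 1, 1, 1, 0
```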
Hidden layer: Making a decision at a more abstract level by weighing up the results of the first layer.
Array of 400 photocells, connected to the “neurons”. The weights ($w_i$) and biases ($b$) are potentiometers.
In 1958 The New York Times reported the perceptron to be “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” (Mikel Olazaran (1996) A Sociological Study of the Official History of the Perceptrons Controversy)
Use the sigmoid function as the activation function.
The sigmoid neuron: $$\mathbf{x} \mapsto \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x} - b)}$$
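A direct sketch of a sigmoid neuron in Python (the NAND weights are reused purely for illustration). Unlike the perceptron, a small change in the inputs now produces a small change in the output:

```python
import math

def sigmoid_neuron(x, w, b):
    """Sigmoid neuron: squashes w.x + b into the open interval (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid_neuron((0, 0), (-2, -2), 3))   # sigmoid(3)  ~ 0.9526
print(sigmoid_neuron((1, 1), (-2, -2), 3))   # sigmoid(-1) ~ 0.2689
```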
Neural networks can be used to approximate any continuous function to any desired precision.
(among other sources, see Chapter 4 in Michael Nielsen's book, or George Cybenko (1989) Approximation by superpositions of a sigmoidal function).
Let's see how well it works in a live demonstration!
State-of-the-art neural network models require a modern GPU (or GPUs).
In this talk I use the following setup on a Google Cloud virtual machine.
Instructions to reproduce the same exact setup can be found at https://github.com/agisga/coding_notes/blob/master/google_cloud.md.
Components:
I will execute the following code:
https://github.com/agisga/coding_notes/blob/master/Keras/MNIST_functional_API.ipynb
How do we find the optimal weights and biases?
Cost function: $$C = \frac{1}{2n} \sum_{i=1}^n \lVert y(\mathbf{x}_i) - a(\mathbf{x}_i) \rVert_2^2,$$ where $a(\mathbf{x}_i)$ is the output of the NN and $y(\mathbf{x}_i)$ is the desired output for the input $\mathbf{x}_i$.
$C = \frac{1}{2n} \sum_{i=1}^n \lVert y(\mathbf{x}_i) - a(\mathbf{x}_i) \rVert_2^2 = \frac{1}{n} \sum_{i=1}^n C_i$,
where $C_i := \lVert y(\mathbf{x}_i) - a(\mathbf{x}_i) \rVert_2^2 / 2$.
Randomly choose a subset $\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_m}$ of size $m\ll n$.
Then, $\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{i_j}$.
The set $\{\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_m}\}$ is called a mini-batch.
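One stochastic gradient descent step built from such a mini-batch can be sketched as follows (a minimal version; the name `grad_Ci` is a placeholder for whatever computes the per-example gradient, not part of the lecture code):

```python
import random

def sgd_step(params, grad_Ci, data, m, eta):
    """One SGD step: estimate grad C from a random mini-batch of size m,
    then move each parameter a small step (learning rate eta) against it.
    grad_Ci(params, xi) returns the per-example gradient (same shape
    as params)."""
    batch = random.sample(data, m)
    grad = [0.0] * len(params)
    for xi in batch:
        gi = grad_Ci(params, xi)
        grad = [g + gij / m for g, gij in zip(grad, gi)]
    return [p - eta * g for p, g in zip(params, grad)]
```

For example, with the toy cost $C_i = (p - x_i)^2 / 2$ in one parameter, repeated steps drive $p$ toward the mean of the data.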
Now we only need to figure out how to differentiate $C_{i_j}$ with respect to every weight and every bias in the neural network...
$$\Rightarrow \mathbf{a}^l = \sigma(W^l \mathbf{a}^{l-1} + \mathbf{b}^l) = \sigma(\mathbf{z}^l)$$
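The vectorized forward pass translates directly into code (a minimal NumPy sketch, assuming `weights` and `biases` are lists of per-layer arrays):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Forward pass: a^l = sigma(W^l a^{l-1} + b^l), layer by layer."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a
```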
Layer L (last layer): $$ \begin{eqnarray} \frac{\partial C}{\partial W_{ij}^L} &=& \nabla_{\mathbf{a}^L} C \cdot \frac{\partial \sigma(\mathbf{z}^L)}{\partial \mathbf{z}^L} \cdot \frac{\partial \mathbf{z}^L}{\partial W_{ij}^L} \nonumber \\ &=:& \delta^L \cdot \frac{\partial \mathbf{z}^L}{\partial W_{ij}^L} = \delta^L \cdot \mathbf{a}_j^{L-1} \nonumber \end{eqnarray} $$
Layer L-1: $$ \begin{eqnarray} \frac{\partial C}{\partial W_{ij}^{L-1}} &=& \nabla_{\mathbf{a}^L} C \cdot \frac{\partial \sigma(\mathbf{z}^L)}{\partial \mathbf{z}^L} \cdot \frac{\partial \mathbf{z}^L}{\partial \mathbf{z}^{L-1}} \cdot \frac{\partial \mathbf{z}^{L-1}}{\partial W_{ij}^{L-1}} \nonumber \\ &=& \delta^L \cdot W^L \cdot \frac{\partial \sigma(\mathbf{z}^{L-1})}{\partial \mathbf{z}^{L-1}} \cdot \frac{\partial \mathbf{z}^{L-1}}{\partial W_{ij}^{L-1}} \nonumber \\ &=:& \delta^{L-1} \cdot \frac{\partial \mathbf{z}^{L-1}}{\partial W_{ij}^{L-1}} \nonumber = \delta^{L-1} \cdot \mathbf{a}_j^{L-2} \nonumber \end{eqnarray} $$
Likewise, any other layer $l$:
$\frac{\partial C}{\partial W_{ij}^{l}} = \delta^{l} \cdot \mathbf{a}_j^{l-1}$,
where $\delta^{l}$ is determined by $\delta^{l+1}$, $W^{l+1}$, and $\partial \sigma(\mathbf{z}^{l}) / \partial \mathbf{z}^{l}$.
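This recursion can be sketched end to end in NumPy (a minimal version in the spirit of Nielsen's book, assuming the quadratic per-example cost $C_i$ defined earlier; not the lecture's Keras code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Gradients of C_i = ||y - a^L||^2 / 2 w.r.t. every W^l and b^l.
    Forward pass stores each z^l and a^l; the backward pass then
    propagates delta^l from delta^{l+1} as on the slide."""
    # Forward pass, keeping intermediate quantities.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Backward pass: delta^L = (a^L - y) * sigma'(z^L).
    sp = sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    delta = (activations[-1] - y) * sp
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    for l in range(2, len(weights) + 1):
        sp = sigmoid(zs[-l]) * (1 - sigmoid(zs[-l]))
        delta = (weights[-l + 1].T @ delta) * sp      # delta^l
        grads_W.insert(0, np.outer(delta, activations[-l - 1]))
        grads_b.insert(0, delta)
    return grads_W, grads_b
```

A finite-difference check on any single weight is a good way to convince yourself the recursion is right.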
Hinton (...) is now “deeply suspicious” of back-propagation (...). “My view is throw it all away and start again,” he said. (I may be wrong, but I think this was said mostly in relation to unsupervised learning.)
(...)
“Max Planck said, ‘Science progresses one funeral at a time.’ The future depends on some graduate student who is deeply suspicious of everything I have said.”
There are a lot of techniques, tricks, and best practices (some based on theory, some on empirical trial and error). Here is a small selection.
Least squares loss: learning slows down because $\nabla C$ depends on $\sigma^\prime(z)$, which is $\approx 0$ for $|z| > 5$.
Cross-entropy: $$ C = -\frac{1}{n} \sum_x [y \ln(a) + (1-y) \ln(1-a)]. $$
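A numerically guarded sketch of this cost (the clipping constant is an implementation choice, not from the slide). The point of switching costs: for a sigmoid output, $\partial C/\partial z = a - y$, so the $\sigma^\prime(z)$ factor — and the slowdown — disappears:

```python
import numpy as np

def cross_entropy(a, y):
    """Cross-entropy cost averaged over n examples.
    a: predicted outputs in (0, 1); y: targets in {0, 1}."""
    a = np.clip(a, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
```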
Reduce over-fitting to the training data.
Let's try these techniques on the handwritten digit recognition example.
(unrelated to Hinton's remarks on a previous slide)
With standard initialization $|w_j|<1$, and since $\sigma^\prime(z_j) \leq 1/4$, we have $|w_j \sigma^\prime(z_j)| < 1/4$. $\Rightarrow$ Vanishing gradient in the earlier layers of a deep model $\Rightarrow$ The earlier layers "learn" much more slowly than later layers.
Likewise: exploding gradient when all $|w_j \sigma^\prime(z_j)| \gg 1$.
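The vanishing-gradient claim is easy to check numerically: the gradient reaching an early layer picks up one factor of $w_j \sigma^\prime(z_j)$ per layer, and $\sigma^\prime$ peaks at $1/4$ (at $z = 0$). The values below are purely illustrative:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

w, z = 0.9, 0.0                # illustrative weight and pre-activation
factor = w * sigmoid_prime(z)  # 0.9 * 0.25 = 0.225 per layer
for depth in (5, 10, 20):
    print(depth, factor ** depth)   # shrinks geometrically with depth
```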
The deep convolutional network is the most widely used type of deep neural network.
Modern CNN: LeCun, Bottou, Bengio, Haffner (1998).
Each neuron in the hidden layer is connected to only $5 \times 5 = 25$ input activations (pixels).
A complete convolutional layer uses several convolutional filters to produce several different feature maps.
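What a single filter does can be sketched with a naive "valid" convolution loop (real libraries use much faster implementations, and — like most deep learning frameworks — this is technically cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output entry is the
    weighted sum of one local receptive field."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 28x28 image with a 5x5 kernel yields a 24x24 feature map;
# a complete layer applies one such kernel per feature map.
```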
Most common: max-pooling
Other options: L2-pooling, average pooling.
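A minimal max-pooling sketch in NumPy: keep the largest activation in each 2×2 block, halving each spatial dimension.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling with a size x size window."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    blocks = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return blocks.max(axis=(1, 3))
```

L2-pooling or average pooling would simply replace the `max` over each block with a root-mean-square or a mean.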
Putting it all together
Vincent Dumoulin, Francesco Visin - A guide to convolution arithmetic for deep learning
Animations source: https://github.com/vdumoulin/conv_arithmetic
Let's try it on the handwritten digit recognition example.
Modular nature.
Many building blocks.
Ronneberger, Fischer, Brox. 2015. "U-Net: Convolutional Networks for Biomedical Image Segmentation." Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol.9351: 234--241.
Back to the ultrasound image segmentation example.