A guest lecture for BMEN Mathematical Modeling
Main example:
Task: Predict the nerve mask for an input ultrasound image.
5635 training images, 5508 test images.
420×580 resolution.
~47% of the images don't have a mask.
Data from a $100,000 competition held in 2016: https://www.kaggle.com/c/ultrasound-nerve-segmentation
I will live-execute the following code (more on the hardware & software requirements later):
The code borrows from: [1], [2].
Hopefully, by the end of this lecture you will understand all of the components of this deep learning model.
The theoretical part of this presentation (including most figures) is largely based on:
Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015 (licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License)
To follow in spirit, this presentation is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Weighing multiple input factors to make a decision according to a decision threshold.
Binary inputs $x_1, x_2, \ldots \in \{0, 1\}$,
and binary output.
$$ \begin{eqnarray} \mathbf{x} \mapsto \left\{ \begin{array}{ll} 0 & \mbox{if } \mathbf{w}^T \mathbf{x} + b \leq 0 \\ 1 & \mbox{if } \mathbf{w}^T \mathbf{x} + b > 0 \end{array} \right. \end{eqnarray} $$
Example: NAND Gate.
That is, 00 $\mapsto$ 1, 01 $\mapsto$ 1, 10 $\mapsto$ 1, 11 $\mapsto$ 0.
(NAND gate is universal for computation. Analogy: a network of perceptrons and a computational circuit.)
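A minimal sketch of the perceptron rule and the NAND example. The weights $\mathbf{w} = (-2, -2)$ and bias $b = 3$ are assumed here for illustration (they do not appear on the slide), and the XOR construction is just one way to show that NAND perceptrons can be wired into other gates.

```python
def perceptron(x, w, b):
    """Return 1 if w^T x + b > 0, else 0 (the threshold rule above)."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if weighted_sum > 0 else 0

# Assumed parameters for a NAND perceptron (illustrative choice).
NAND_W = (-2, -2)
NAND_B = 3

def nand(x1, x2):
    return perceptron((x1, x2), NAND_W, NAND_B)

def xor(x1, x2):
    """XOR built from four NAND perceptrons, i.e. a small network."""
    n = nand(x1, x2)
    return nand(nand(x1, n), nand(x2, n))

nand_table = {(x1, x2): nand(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
xor_table = {(x1, x2): xor(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(nand_table)
print(xor_table)
```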
Hidden layer: Making a decision at a more abstract level by weighing up the results of the first layer.
Array of 400 photocells, connected to the “neurons”. The weights ($w_i$) and biases ($b$) are potentiometers.
In 1958 The New York Times reported the perceptron to be “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” (Mikel Olazaran (1996) A Sociological Study of the Official History of the Perceptrons Controversy)
Use the sigmoid function as the activation function.
The sigmoid neuron: $$\mathbf{x} \mapsto \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x} - b)}$$
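A small sketch of the sigmoid neuron, assuming the same NAND-like weights as a hypothetical example. Scaling the weights and bias by a factor `c` (a name introduced here) shows how the smooth output approaches the hard 0/1 decision of a perceptron.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output of a sigmoid neuron: sigma(w^T x + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical NAND-like parameters w = c * (-2, -2), b = c * 3.
for c in (1, 10, 100):
    w, b = (-2 * c, -2 * c), 3 * c
    outputs = [round(sigmoid_neuron(x, w, b), 4)
               for x in ((0, 0), (0, 1), (1, 0), (1, 1))]
    print(c, outputs)
```

Unlike the perceptron, a small change in a weight or bias produces a small change in the output, which is what makes gradient-based learning possible.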
A neural network with a single hidden layer of sigmoid neurons can approximate any continuous function on a compact domain to any desired precision, given enough hidden neurons.
(among other sources, see Chapter 4 in Michael Nielsen's book, or George Cybenko (1989) Approximation by superpositions of a sigmoidal function).
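A rough sketch of the idea behind this result, in the spirit of the "bump" construction from Nielsen's Chapter 4. Pairs of steep sigmoid neurons form approximate indicator functions, and a weighted sum of such bumps gives a piecewise-constant approximation. The target function $x^2$, the interval $[0, 1]$, the steepness, and the number of bumps are all illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bump(x, left, right, height, steepness=1000.0):
    """Difference of two steep sigmoid neurons: ~height on [left, right], ~0 elsewhere."""
    return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

def approximate(f, n_bumps):
    """One-hidden-layer network (2 * n_bumps sigmoid neurons) approximating f on [0, 1]."""
    width = 1.0 / n_bumps
    pieces = [(i * width, (i + 1) * width, f((i + 0.5) * width))
              for i in range(n_bumps)]
    return lambda x: sum(bump(x, lo, hi, h) for lo, hi, h in pieces)

def target(x):
    return x ** 2  # illustrative target function

net = approximate(target, n_bumps=50)
grid = [0.05 + 0.9 * i / 200 for i in range(201)]  # interior points of [0, 1]
max_error = max(abs(net(x) - target(x)) for x in grid)
print(max_error)
```

Increasing `n_bumps` shrinks the error in the interior of the interval. The weights here are set by hand rather than learned.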
Let's see how well it works in a live demonstration!
State-of-the-art neural network models require a modern GPU (or GPUs).
In this talk I use the following setup on a Google Cloud virtual machine.
Instructions to reproduce the same exact setup can be found at https://github.com/agisga/coding_notes/blob/master/google_cloud.md.
I will execute the following code:
How do we find the optimal weights and biases?
Cost function: $$C = \frac{1}{2n} \sum_{i=1}^n \lVert y(\mathbf{x}_i) - a(\mathbf{x}_i) \rVert_2^2,$$ where $a(\mathbf{x}_i)$ is the output of the NN and $y(\mathbf{x}_i)$ is the desired output for the input $\mathbf{x}_i$.
$C = \frac{1}{2n} \sum_{i=1}^n \lVert y(\mathbf{x}_i) - a(\mathbf{x}_i) \rVert_2^2 = \frac{1}{n} \sum_{i=1}^n C_i$,
where $C_i := \lVert y(\mathbf{x}_i) - a(\mathbf{x}_i) \rVert_2^2 / 2$.
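As a quick sanity check, the cost can be computed directly from its definition as the mean of the per-example costs $C_i$. The two-example data below is made up for illustration.

```python
def per_example_cost(y, a):
    """C_i = ||y - a||_2^2 / 2 for one desired output y and network output a."""
    return sum((y_k - a_k) ** 2 for y_k, a_k in zip(y, a)) / 2.0

def quadratic_cost(desired, outputs):
    """C = (1/n) * sum_i C_i over all n training examples."""
    n = len(desired)
    return sum(per_example_cost(y, a) for y, a in zip(desired, outputs)) / n

# Made-up example: the first output is perfect, the second misses one component.
desired = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[1.0, 0.0], [0.0, 0.0]]
print(quadratic_cost(desired, outputs))
```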
Randomly choose a subset $\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_m}$ of size $m\ll n$.
Then, $\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{i_j}$.
The set $\{\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_m}\}$ is called a mini-batch.
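A minimal sketch of mini-batch stochastic gradient descent on a toy model $a(x) = wx + b$ with the quadratic cost. The toy model, the noise-free data $y = 2x + 1$, the learning rate `eta`, and the update rule "parameter minus `eta` times the mini-batch gradient" are assumptions made for this illustration. Shuffling and slicing is one common way to draw the random mini-batches.

```python
import random

def grad_single(w, b, x, y):
    """Gradient of C_i = (y - a)^2 / 2 with a = w * x + b, w.r.t. (w, b)."""
    err = (w * x + b) - y
    return err * x, err

def minibatch_gradient(w, b, batch):
    """Average of the per-example gradients over the mini-batch."""
    m = len(batch)
    grads = [grad_single(w, b, x, y) for x, y in batch]
    return sum(g[0] for g in grads) / m, sum(g[1] for g in grads) / m

def sgd(data, m=10, eta=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # random mini-batches
        for start in range(0, len(data), m):
            gw, gb = minibatch_gradient(w, b, data[start:start + m])
            w -= eta * gw                      # gradient descent step
            b -= eta * gb
    return w, b

xs = [-1.0 + 2.0 * i / 199 for i in range(200)]
data = [(x, 2.0 * x + 1.0) for x in xs]        # toy data, n = 200
w_fit, b_fit = sgd(data)
print(w_fit, b_fit)
```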
Now we only need to figure out how to differentiate $C_{i_j}$ with respect to every weight and every bias in the neural network...
$$\Rightarrow \mathbf{a}^l = \sigma(W^l \mathbf{a}^{l-1} + \mathbf{b}^l) = \sigma(\mathbf{z}^l)$$
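The layer-by-layer rule above can be sketched as a short feedforward loop. The nested-list representation of $W^l$ and $\mathbf{b}^l$ and the 2-3-1 layer sizes are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(x, weights, biases):
    """Apply a^l = sigma(W^l a^{l-1} + b^l) for each layer; return a^L."""
    a = list(x)
    for W, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j
             for row, b_j in zip(W, b)]
        a = [sigmoid(z_j) for z_j in z]
    return a

# Hypothetical 2-3-1 network: W^1 is 3x2, W^2 is 1x3.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.2]]
output = feedforward([0.5, -1.0], weights, biases)
print(output)
```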
Layer L (last layer): $$ \begin{eqnarray} \frac{\partial C}{\partial W_{ij}^L} &=& \nabla_{\mathbf{a}^L} C \cdot \frac{\partial \sigma(\mathbf{z}^L)}{\partial \mathbf{z}^L} \cdot \frac{\partial \mathbf{z}^L}{\partial W_{ij}^L} \nonumber \\ &=:& \delta^L \cdot \frac{\partial \mathbf{z}^L}{\partial W_{ij}^L} = \delta^L \cdot \mathbf{a}_j^{L-1} \nonumber \end{eqnarray} $$
Layer L-1: $$ \begin{eqnarray} \frac{\partial C}{\partial W_{ij}^{L-1}} &=& \nabla_{\mathbf{a}^L} C \cdot \frac{\partial \sigma(\mathbf{z}^L)}{\partial \mathbf{z}^L} \cdot \frac{\partial \mathbf{z}^L}{\partial \mathbf{z}^{L-1}} \cdot \frac{\partial \mathbf{z}^{L-1}}{\partial W_{ij}^{L-1}} \nonumber \\ &=& \delta^L \cdot W^L \cdot \frac{\partial \sigma(\mathbf{z}^{L-1})}{\partial \mathbf{z}^{L-1}} \cdot \frac{\partial \mathbf{z}^{L-1}}{\partial W_{ij}^{L-1}} \nonumber \\ &=:& \delta^{L-1} \cdot \frac{\partial \mathbf{z}^{L-1}}{\partial W_{ij}^{L-1}} = \delta^{L-1} \cdot \mathbf{a}_j^{L-2} \nonumber \end{eqnarray} $$
Likewise, any other layer $l$:
$\frac{\partial C}{\partial W_{ij}^{l}} = \delta^{l} \cdot \mathbf{a}_j^{l-1}$,
where $\delta^{l}$ is determined by $\delta^{l+1}$, $W^{l+1}$, and $\partial \sigma(\mathbf{z}^{l}) / \partial \mathbf{z}^{l}$.
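The recursion can be sketched for a single training example as follows. This is a minimal illustration, not the lecture's code. It assumes the quadratic cost, so that $\delta^L = (\mathbf{a}^L - y) \odot \sigma^\prime(\mathbf{z}^L)$, and uses the identity $\sigma^\prime(z) = \sigma(z)(1 - \sigma(z))$. The 2-3-2 network and the finite-difference check are hypothetical additions for verification.

```python
import copy
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer (activations[0] is the input)."""
    activations = [list(x)]
    for W, b in zip(weights, biases):
        prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, prev)) + b_i for row, b_i in zip(W, b)]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

def cost(x, y, weights, biases):
    a_out = forward(x, weights, biases)[-1]
    return sum((y_i - a_i) ** 2 for y_i, a_i in zip(y, a_out)) / 2.0

def backprop(x, y, weights, biases):
    """Gradients of the single-example quadratic cost w.r.t. all weights and biases."""
    activations = forward(x, weights, biases)
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    a_out = activations[-1]
    # Last layer: delta^L = (a^L - y) * sigma'(z^L).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(a_out, y)]
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grads_W[l] = [[d_i * a_j for a_j in a_prev] for d_i in delta]  # delta_i * a_j
        grads_b[l] = list(delta)
        if l > 0:
            W = weights[l]
            # Previous delta from the current delta, W, and sigma'(z) of the previous layer.
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads_W, grads_b

def numerical_dC_dW(x, y, weights, biases, l, i, j, eps=1e-6):
    """Central finite difference for one weight, used only as a sanity check."""
    w_plus, w_minus = copy.deepcopy(weights), copy.deepcopy(weights)
    w_plus[l][i][j] += eps
    w_minus[l][i][j] -= eps
    return (cost(x, y, w_plus, biases) - cost(x, y, w_minus, biases)) / (2 * eps)

# Hypothetical 2-3-2 network and one training example.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]]]
biases = [[0.0, 0.1, -0.1], [0.2, -0.2]]
x, y = [0.5, -1.0], [1.0, 0.0]
grads_W, grads_b = backprop(x, y, weights, biases)
print(grads_W[0][0][0], numerical_dC_dW(x, y, weights, biases, 0, 0, 0))
```

One backward pass yields all partial derivatives at roughly the cost of two forward passes, whereas the finite-difference check needs two forward passes per weight.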
Hinton (...) is now “deeply suspicious” of back-propagation (...). “My view is throw it all away and start again,” he said. (I may be wrong, but I think this was said mostly in relation to unsupervised learning.)
“Max Planck said, ‘Science progresses one funeral at a time.’ The future depends on some graduate student who is deeply suspicious of everything I have said.”
There are a lot of techniques, tricks, and best practices (some based on theory, some on empirical trial and error). Here is a small selection.
Least squares loss: Learning slows down because $\nabla C$ depends on $\sigma^\prime(z)$ ($\approx 0$ for $|z| > 5$).
Cross-entropy (binary form): $$ C = -\frac{1}{n} \sum_x [y \ln(a) + (1-y) \ln(1-a)]. $$
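A small sketch comparing the two costs for a single sigmoid neuron, $a = \sigma(z)$. For the quadratic cost, $\partial C / \partial z = (a - y)\,\sigma^\prime(z)$. For the cross-entropy above, the $\sigma^\prime(z)$ factor cancels and $\partial C / \partial z = a - y$. The saturated example ($z = 5$, $y = 0$) is chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(ys, activations):
    """C = -(1/n) * sum [y ln(a) + (1 - y) ln(1 - a)]."""
    n = len(ys)
    return -sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for y, a in zip(ys, activations)) / n

def dquadratic_dz(z, y):
    """dC/dz for C = (y - a)^2 / 2: includes the factor sigma'(z) = a (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def dcross_entropy_dz(z, y):
    """dC/dz for the cross-entropy cost: the sigma'(z) factor cancels."""
    return sigmoid(z) - y

# A badly wrong, saturated neuron: output close to 1, desired output 0.
z, y = 5.0, 0.0
print(dquadratic_dz(z, y), dcross_entropy_dz(z, y))
```

The more wrong the neuron is, the larger the cross-entropy gradient, so learning does not stall when the neuron saturates.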
Reduce over-fitting to the training data.
Let's try these techniques on the handwritten digit recognition example.
(unrelated to Hinton's remarks on a previous slide)
With standard initialization $|w_j|<1$, and so $|w_j \sigma^\prime(z_j)| < 1/4$ (because $\sigma^\prime(z) \leq 1/4$). $\Rightarrow$ Vanishing gradient in the earlier layers of a deep model $\Rightarrow$ The earlier layers "learn" much slower than later layers.
Likewise: exploding gradient when all $|w_j \sigma^\prime(z_j)| \gg 1$.
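To make the effect concrete, here is a sketch for a chain with one sigmoid neuron per layer (a simplifying assumption). The derivative of the output with respect to the bias of layer $l$ is $\sigma^\prime(z_l)$ times the product of $w_j \sigma^\prime(z_j)$ over all later layers $j$. The depth, weights, and input are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_bias_gradients(x, weights, biases):
    """d a_L / d b_l for every layer l of a one-neuron-per-layer chain."""
    a = x
    local = []                      # sigma'(z_l) for each layer
    for w, b in zip(weights, biases):
        a = sigmoid(w * a + b)
        local.append(a * (1.0 - a))
    grads = [0.0] * len(weights)
    running = 1.0                   # product of w_j * sigma'(z_j) over layers j > l
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = local[l] * running
        running *= weights[l] * local[l]
    return grads

# Illustrative 5-layer chain with |w_j| < 1.
grads = chain_bias_gradients(0.5, [0.9] * 5, [0.0] * 5)
for layer, g in enumerate(grads, start=1):
    print(layer, g)
```

Each step toward the input multiplies the gradient by a factor of magnitude below 1/4, so it shrinks geometrically with depth.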
The deep convolutional network is the most widely used type of deep neural network.
Modern CNN: LeCun, Bottou, Bengio, Haffner (1998).
Each neuron in the hidden layer is connected to only $5 \times 5 = 25$ input activations (pixels).
A complete convolutional layer uses several convolutional filters to produce several different feature maps.
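A plain-Python sketch of such a layer, assuming a 28×28 input, stride 1, no padding, and a sigmoid activation. The three 5×5 filters are made up for illustration. Strictly speaking the loop computes a cross-correlation, which is what deep learning libraries usually call convolution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feature_map(image, kernel, bias):
    """Slide one shared k x k filter over the image (stride 1, no padding)."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Each hidden neuron sees only a k x k patch of input pixels.
            z = bias + sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k))
            row.append(sigmoid(z))
        out.append(row)
    return out

def conv_layer(image, kernels, biases):
    """Several filters produce several feature maps."""
    return [feature_map(image, k, b) for k, b in zip(kernels, biases)]

# Synthetic 28x28 "image" and three hypothetical 5x5 filters.
image = [[((r * 28 + c) % 7) / 7.0 for c in range(28)] for r in range(28)]
averaging = [[1.0 / 25] * 5 for _ in range(5)]
vertical_edge = [[-1.0, -1.0, 0.0, 1.0, 1.0] for _ in range(5)]
horizontal_edge = [[v] * 5 for v in (-1.0, -1.0, 0.0, 1.0, 1.0)]
maps = conv_layer(image, [averaging, vertical_edge, horizontal_edge], [0.0, 0.0, 0.0])
print(len(maps), len(maps[0]), len(maps[0][0]))
```

Because the weights are shared, each feature map needs only 25 weights and 1 bias, regardless of the image size.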
Most common: max-pooling
Other options: L2-pooling, average pooling.
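The three pooling variants can be sketched in one helper, assuming non-overlapping 2×2 blocks (a common choice, taken here as an assumption). L2-pooling is implemented as the square root of the sum of squares in each block.

```python
import math

def pool(feature_map, size=2, mode="max"):
    """Downsample by summarizing each non-overlapping size x size block."""
    rows, cols = len(feature_map) // size, len(feature_map[0]) // size
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [feature_map[r * size + i][c * size + j]
                     for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(block))
            elif mode == "average":
                row.append(sum(block) / len(block))
            elif mode == "l2":
                row.append(math.sqrt(sum(v * v for v in block)))
            else:
                raise ValueError(f"unknown pooling mode: {mode}")
        pooled.append(row)
    return pooled

fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
print(pool(fm, mode="max"))      # [[6, 8], [14, 16]]
print(pool(fm, mode="average"))  # [[3.5, 5.5], [11.5, 13.5]]
```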
Putting it all together
Vincent Dumoulin, Francesco Visin - A guide to convolution arithmetic for deep learning
Animations source: https://github.com/vdumoulin/conv_arithmetic
Let's try it on the handwritten digit recognition example.
Modular nature.
Many building blocks.
Ronneberger, Fischer, Brox. 2015. "U-Net: Convolutional Networks for Biomedical Image Segmentation." Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol.9351: 234--241.
Back to the ultrasound image segmentation example.