NOTE: This is just a summary of deep learning in the area of healthcare. I have skipped most of the trivial material.
Multi-Layer Perceptrons (MLP)
Universal Approximation Theorem
A multilayer network of sigmoid neurons with a single hidden layer can be used to approximate any continuous function to any desired precision.
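A minimal sketch of the idea, assuming PyTorch; the hidden width, target function and training settings are arbitrary illustrative choices, not from the notes.

```python
# Single hidden layer of sigmoid neurons fitting a continuous function.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 64), nn.Sigmoid(), nn.Linear(64, 1))

x = torch.linspace(-3, 3, 256).unsqueeze(1)   # inputs
y = torch.sin(x)                              # continuous target function
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.4f}")        # small error => good approximation
```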
Gradient Descent & Backpropagation
Choosing a loss function
| | Real Values | Probabilities |
|---|---|---|
| Output Activation Function | Linear | Softmax |
| Loss Function | MSE | Cross-entropy |
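A quick sketch of the two pairings in the table, assuming PyTorch; the tensor shapes and class count are arbitrary examples.

```python
import torch
import torch.nn as nn

# Real-valued targets: linear (identity) output + MSE
pred_real = torch.randn(8, 1)
target_real = torch.randn(8, 1)
mse = nn.MSELoss()(pred_real, target_real)

# Probabilistic targets: softmax output + cross-entropy
logits = torch.randn(8, 5)              # raw scores for 5 classes
target_cls = torch.randint(0, 5, (8,))
# nn.CrossEntropyLoss applies log-softmax to the logits internally
ce = nn.CrossEntropyLoss()(logits, target_cls)

print(mse.item(), ce.item())
```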
Gradient Descent & the Gradient Descent Algorithm
…
Optimisation
Hyperparameters
- Loss function
- Optimisation algorithm
- incl. choice of learning rate
- Activation function
- Number of hidden layers
- Number of iterations/epochs
- Batch size
- Train-test data split ratio
Optimisation Algorithms
Pitfalls of Vanilla (standard) gradient descent
- high-dimensional loss functions are highly non-convex
- risk of getting trapped in suboptimal local minima
- high sensitivity to the weight initialisation point
- saddle points and plateaus
Momentum-Based Gradient Descent
- Pros:
  - faster convergence
  - ability to escape local minima and plateaus
  - oscillates in and out of local minima because the momentum is able to propel it out
- Limitations:
  - the accumulated momentum can overshoot the minimum and cause oscillations around it
Nesterov Accelerated Gradient (NAG)
- intuition: look before you leap
- two-step computation: first move in the direction of the accumulated momentum, then compute the gradient at that look-ahead position and correct
Batch Gradient Descent
- For every parameter update, gradient descent parses the entire dataset
- Advantages:
  - conditions of convergence are well-understood
  - several acceleration techniques designed to operate in the batch GD setting
- Disadvantages:
  - each update requires a full pass over the data, which is slow for large datasets
Stochastic Gradient Descent (SGD)
- randomly shuffle the training set, and update parameters after gradients are computed for each training sample
Mini-Batch Stochastic Gradient Descent
- Update parameters after gradients are computed for a randomly drawn mini-batch of training samples
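A mini-batch SGD loop in NumPy for a simple linear regression; the data, batch size and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))                        # randomly shuffle the training set
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]                # randomly drawn mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of MSE on the mini-batch
        w -= lr * grad                                   # one parameter update per mini-batch

print(w)   # should be close to [1.5, -2.0, 0.5]
```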
Adaptive Methods
- choose a different learning rate for every weight in the network
AdaGrad
- adaptively scales the learning rate for each weight
- but it decays the learning rate very aggressively
- after a while, the frequently-updated parameters receive very small updates because of the decayed learning rate
RMSProp
- root-mean-squared propagation
- decay the denominator and prevent its rapid growth
Adam
- adaptive moment estimation
- does everything that RMSProp does
- but also uses a cumulative history of the gradients
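Update-rule sketch for the three adaptive methods above, in NumPy; the hyperparameter values are common defaults, assumed rather than taken from the notes.

```python
import numpy as np

def adagrad_step(w, g, state, lr=0.01, eps=1e-8):
    state["G"] = state.get("G", 0.0) + g**2               # accumulate squared gradients
    return w - lr * g / (np.sqrt(state["G"]) + eps)       # effective LR decays aggressively

def rmsprop_step(w, g, state, lr=0.001, beta=0.9, eps=1e-8):
    state["G"] = beta * state.get("G", 0.0) + (1 - beta) * g**2   # decayed denominator
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def adam_step(w, g, state, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g          # cumulative gradient history
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2       # RMSProp-style second moment
    m_hat = state["m"] / (1 - b1**t)                              # bias correction
    v_hat = state["v"] / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# usage: w = adam_step(w, grad, state, t) with t = 1, 2, ... and state = {} initially
```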
Which optimiser to use?
- SGD:
- manages to reach a minimum
- may take longer than other methods
- reliant on good initialisation and annealing schedule
- may get stuck in saddle points rather than local minima
- Adaptive methods
- Achieve fast convergence
- Able to train complex models
- Ideal if the input data is sparse
- No need to tune the learning rate itself
Which weight initialisation schema to use?
- Choose based on the activation function:
- Tanh / Sigmoid
- Xavier initialisation
- ReLU
- He initialisation
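The two schemes in PyTorch; the layer sizes are arbitrary examples.

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(tanh_layer.weight)                        # Xavier/Glorot for tanh / sigmoid

relu_layer = nn.Linear(256, 128)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')   # He initialisation for ReLU
```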
Normalisation Layers
Motivation
- Covariate shift
- A change in the data distribution between the training and test scenarios
- Problematic because the model needs to adapt to a new distribution
- Internal covariate shift
- Can also happen during the training process, e.g. from epoch to epoch
Batch Normalisation
- compute batch statistics (batch mean and variance)
- Normalise layer inputs
- Scale and shift
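A NumPy sketch of the training-time forward pass, mirroring the three steps above; the gamma/beta values and eps are assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features)
    mu = x.mean(axis=0)                      # 1. batch mean
    var = x.var(axis=0)                      #    batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # 2. normalise layer inputs
    return gamma * x_hat + beta              # 3. scale and shift (learnable parameters)

x = np.random.randn(32, 10)
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
```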
Why is BatchNorm helpful?
- BatchNorm is fully differentiable
- Lower covariate shift enables use of larger learning rates
- Lower risk of exploding / vanishing gradients
- Reduces training times
- Reduces sensitivity to weight initialisation
Convolutional neural networks (CNN)
- CNNs preserve the spatial structure
- sliding window-based filters
- invariant to translation, flipping, scaling…
Convolution Operation
- slide the filter (kernel) over the input
- the resulting output is called a feature map
- CONV layer is for feature extraction
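A NumPy sketch of sliding a filter over an input to produce a feature map; 'valid' convolution with stride 1, a single channel, and no kernel flipping are simplifying assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    H, W = image.shape
    F = kernel.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))                         # feature map dimensions
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)       # dot product at each position
    return out

edge_filter = np.array([[1., 0., -1.]] * 3)                        # simple vertical-edge detector
feature_map = conv2d_valid(np.random.rand(8, 8), edge_filter)
print(feature_map.shape)                                           # (6, 6)
```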
Understanding the (Hyper)parameters
Convolution (Hyper) parameters
- input dimensions: W1 x H1 x D1
- spatial extent of each filter: F
- depth of each filter = depth of input (D1)
- output dimensions: W2 x H2 x D2
- stride: S
- number of filters: K
- The output dimensions (no padding, stride 1) are:
  - W2 = W1 - F + 1, H2 = H1 - F + 1, D2 = K
Padding (P)
- padding allows the output feature map to be the same size as the input
- Pad the inputs with an appropriately sized border so that the filter kernel can be applied to the corners of the image
- The output dimensions are:
  - W2 = W1 - F + 2P + 1, H2 = H1 - F + 2P + 1, D2 = K
Stride (S)
- defines the interval at which the filter is applied
- S denotes the number of pixels by which the window moves after each operation
- The final output dimensions are:
  - W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = K
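A quick check of the output-size formula against an actual CONV layer, assuming PyTorch; the specific W1, F, P, S, K values are arbitrary examples.

```python
import torch
import torch.nn as nn

W1, F, P, S, K = 224, 7, 3, 2, 64
conv = nn.Conv2d(in_channels=3, out_channels=K, kernel_size=F, stride=S, padding=P)

W2 = (W1 - F + 2 * P) // S + 1                     # formula above (integer division = floor)
out = conv(torch.zeros(1, 3, W1, W1))
print(W2, out.shape)                               # 112, torch.Size([1, 64, 112, 112])
```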
Pooling
- Filters out details
- introduces invariance to local minor modifications
- Makes the representations smaller
- Operates over each feature map independently
- Makes the features translation- and scale-invariant
Dimensionality Reduction
- Using 1 x 1 convolutions reduces the number of computations by controlling the number of filters.
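An illustrative comparison, assuming PyTorch; the channel counts are arbitrary. Reducing 256 channels to 64 with a 1 x 1 CONV before a 3 x 3 CONV cuts the parameter (and computation) count substantially.

```python
import torch.nn as nn

direct = nn.Conv2d(256, 256, kernel_size=3, padding=1)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),              # 1x1 CONV controls the number of filters
    nn.Conv2d(64, 256, kernel_size=3, padding=1),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(direct), n_params(bottleneck))       # ~590k vs ~164k parameters
```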
Width vs. Depth
- Wide: more filters per CONV layer
- wide networks memorise
- Deep: more CONV layers
- deep networks generalise
CNN Architectures for Semantic Segmentation
Richer Visual Recognition Tasks
- Classification
- output: one label per image
- Semantic Segmentation
- output: one (category) label per pixel
- grouping together similar pixels
- Instance Segmentation
  - output: one label per pixel, additionally separating individual object instances
Semantic Segmentation Approach
- Goal: implement a pixel-level classifier
- High-resolution prediction
  - Preserve the image dimensions at the output
  - Requires an encoder-decoder network
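A minimal encoder-decoder sketch for pixel-level prediction, assuming PyTorch; the layer sizes and number of classes are arbitrary assumptions, not a named architecture from the notes.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        # Encoder: downsample to a compact representation
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Decoder: upsample back to the input resolution
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),   # one score per class per pixel
        )

    def forward(self, x):
        return self.dec(self.enc(x))

out = TinySegNet()(torch.zeros(1, 1, 64, 64))
print(out.shape)      # torch.Size([1, 4, 64, 64]) -- image dimensions preserved
```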
Working with Medical Data
What makes medical data unique?
Complexity
- Healthcare data resides in multiple places, in different formats
- Interpretation of medical imaging is non-trivial
- Requires specialist training
- Difficult to acquire labels
Size
- DL models are data-hungry
- medical datasets are relatively small
- individual medical samples are very large
Intensities
- reflect physical properties of anatomical tissues
Phenotypic Heterogeneity
- There is significant natural variation, even in healthy participants
Site Variability
- Differences in scanning sequences, scanners, point-spread function, noise, manufacturer
- Handling multiple sources of data
- Research vs. Clinical data
Multi-Modal Data
- a single patient may produce diverse data samples
- missing data
Sources of Datasets
Ethics
Data Preparation and Pre-processing
Spatial Normalisation
- Using rigid or affine registration to a common space
Intensity Normalisation
Histogram Normalisation
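Simple sketches for the two normalisation steps above, in NumPy; the z-score approach and the percentile values are common choices, assumed rather than prescribed here.

```python
import numpy as np

def zscore_normalise(volume, mask=None):
    # Intensity normalisation: zero mean / unit variance, optionally within a mask
    voxels = volume[mask] if mask is not None else volume
    return (volume - voxels.mean()) / (voxels.std() + 1e-8)

def percentile_normalise(volume, low=1, high=99):
    # Histogram-based rescaling: clip the tails, map the rest to [0, 1]
    lo, hi = np.percentile(volume, [low, high])
    return np.clip((volume - lo) / (hi - lo + 1e-8), 0.0, 1.0)

vol = np.random.gamma(2.0, 2.0, size=(64, 64, 64))     # fake scan intensities
print(zscore_normalise(vol).mean(), percentile_normalise(vol).max())
```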
Masking
- to mask
  - increased focus on the task
  - smaller memory footprint for the input
  - no parameters 'wasted' on processing the background
- not to mask
  - no image details/features are excluded
  - no resources spent on the border pixels
  - no mask calculation required
Small Datasets
Transfer Learning
- Learning new tasks relies on previous tasks
- Start with a pretrained model that produces good results
- Initialise a model with the pretrained model weights and fine-tune
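A fine-tuning sketch assuming a recent torchvision; ResNet-18 and the 2-class head are illustrative choices, not specified in the notes.

```python
import torch.nn as nn
from torchvision import models

# Start from a pretrained model (weights API as in torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for p in model.parameters():          # optionally freeze the pretrained backbone
    p.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)   # new task-specific head (trainable)
# ...then train as usual, updating only model.fc (or unfreeze everything for full fine-tuning)
```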
Domain Adaptation
- Domain shift
- when the training and test distributions are different
- Goal of domain adaptation
- train a NN on a source dataset, and achieve good accuracy on a target dataset that is significantly different from the source
- adversarial domain adaptation
Transfer Learning Overview
| Target data \ Source data | Labelled | Unlabelled |
|---|---|---|
| Labelled | Fine-tuning; Multi-task learning | Self-taught learning |
| Unlabelled | Adversarial domain adaptation; Few-/Low-shot learning | Clustering techniques |
Beyond Batch Norm
Batch Norm
- accelerates training
- improves generalisation
Normalisation without batches
- Layer Norm
- Normalise across the features/channels of each sample
- Instance Norm
- Normalise across the activations in each channel
- Group Norm
- Computes the mean and standard deviation over groups of channels, so it does not assume that all channels are of equal importance
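A NumPy sketch of which axes each variant averages over, for an input of shape (N, C, H, W); the group size for GroupNorm is an assumed example.

```python
import numpy as np

x = np.random.randn(8, 32, 16, 16)            # N, C, H, W

bn_mean = x.mean(axis=(0, 2, 3))              # BatchNorm: over the batch, per channel -> (C,)
ln_mean = x.mean(axis=(1, 2, 3))              # LayerNorm: over features/channels, per sample -> (N,)
in_mean = x.mean(axis=(2, 3))                 # InstanceNorm: per sample, per channel -> (N, C)

G = 8                                         # GroupNorm: per sample, per group of channels
gn_mean = x.reshape(8, G, 32 // G, 16, 16).mean(axis=(2, 3, 4))   # -> (N, G)
print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```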
Loss Functions for Medical Tasks
Regression Losses
- L1
- L2
- Huber loss
- Generalised regression loss
Segmentation Losses
- use pixel/voxel-wise cross entropy
- overlap-based losses have been proposed to address class imbalance
- Dice Coefficient
- estimates the regional overlap in a segmentation
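A soft Dice loss sketch for binary segmentation, assuming PyTorch; the smoothing term is a common convention, assumed here.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    # probs, target: (N, H, W) with probs in [0, 1] and target in {0, 1}
    intersection = (probs * target).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (union + eps)     # regional overlap per sample
    return 1 - dice.mean()                              # minimise 1 - Dice

loss = soft_dice_loss(torch.rand(2, 64, 64), (torch.rand(2, 64, 64) > 0.5).float())
```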
Sequence Learning
Vector-to-sequence models
- One to many
Sequence-to-vector models
- Many to one
Sequence-to-sequence models
- Many to many
To model sequences, we need to
- account for relationship between inputs
- handle sequences of variable length
- compute the same function at each time step
- track long-term dependencies
Recurrent Neural Networks (RNNs)
…
Long-Short Term Memory (LSTM)
Forget Gate
- Deciding which information to discard from the cell state
Input Gate
- Deciding which information to store in the cell state
Combining Forget + Input Gates
- Deciding which information to remove from / store in the cell state
Output Gate
- Deciding what information to output
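For reference, one common formulation of the gate equations described above (sigma is the sigmoid, * denotes elementwise product; W and b are the learned parameters):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{combine forget + input} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state / output}
\end{aligned}
```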
Gated Recurrent Units (GRU)
- Combine forget and input gates into a single update gate
- Merges cell state and hidden state
LSTMs vs GRUs
| | LSTM | GRU |
|---|---|---|
| Number of gates | 3 gates | 2 gates |
| Gate composition | Separate input and forget gates | Input and forget gates are coupled into an update gate; reset gate is applied directly to the previous hidden state |
| Memory/History | C_t serves as the internal memory of the network | No internal memory distinct from the exposed hidden state; only a hidden state; no output gate |
Self-Attention and Transformers
…