
Deep Learning in Healthcare (Summary)

NOTE: This is just a summary of deep learning in the area of healthcare; I have skipped much of the more trivial material.

Multi-Layer Perceptrons (MLP)

Universal Approximation Theorem

A multilayer network of sigmoid neurons with a single hidden layer can be used to approximate any continuous function to any desired precision.

Gradient Descent & Backpropagation

Choosing a loss function

| | Real Value | Probabilities |
|---|---|---|
| Output Activation Function | Linear | Softmax |
| Loss Function | MSE | Cross-entropy |
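
As a small illustration (a sketch, not part of the original notes; layer sizes are arbitrary), this pairing might look as follows in PyTorch:

```python
import torch.nn as nn

# Real-valued targets (regression): linear output + MSE loss.
regression_head = nn.Linear(64, 1)        # "linear activation" = no activation
regression_loss = nn.MSELoss()

# Probabilistic targets (classification): softmax output + cross-entropy.
# nn.CrossEntropyLoss applies log-softmax internally, so the head itself
# outputs raw logits.
classification_head = nn.Linear(64, 10)   # 10 classes
classification_loss = nn.CrossEntropyLoss()
```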

The Gradient Descent Algorithm

Optimisation

Hyperparameters

  • Loss function
  • Optimisation algorithm
    • incl. choice of learning rate
  • Activation function
  • Number of hidden layers
  • Number of iterations/epochs
  • Batch size
  • Train-test data split ratio

Optimisation Algorithms

Pitfalls of Vanilla (standard) gradient descent

  • high-dimensional loss functions are highly non-convex
    • risk of getting trapped in suboptimal local minima
  • high sensitivity to the weight initialisation point
  • saddle points and plateaus
    • points on the loss landscape where the curvature is highly non-spherical
    • regions of small weight updates
    • risk of divergence
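
For reference, a minimal sketch (not from the original notes) of the plain update rule these pitfalls refer to; `grad_fn` is an assumed function that returns the gradient of the loss:

```python
# Vanilla gradient descent: w <- w - lr * dL/dw
def gradient_descent(w, grad_fn, lr=0.1, n_steps=100):
    for _ in range(n_steps):
        w = w - lr * grad_fn(w)   # step against the gradient direction
    return w
```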

      Pros/Cons of Momentum

  • Pros:
    • faster convergence
    • ability to escape local minima and plateaus
    • oscillates in and out of local minima because the momentum is able to propel it out
  • Limitations:
    • the “ball” rolls downhill blindly, with no look-ahead (see the update sketch below)
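
A minimal sketch of the classical momentum update, assuming `grad` is the current gradient and `v` a running velocity (names are illustrative):

```python
# Momentum: v <- beta * v + grad;  w <- w - lr * v
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad   # accumulate a velocity from past gradients
    w = w - lr * v        # move along the accumulated velocity
    return w, v
```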

      Nesterov Accelerated Gradient (NAG)

  • intuition: look before you leap
  • two-step computation
    • instead of computing the gradient at the current point, compute it at what would be the next point
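
A sketch of the look-before-you-leap idea, with the same illustrative names as above:

```python
# NAG: evaluate the gradient at the look-ahead point, not the current point.
def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    lookahead = w - lr * beta * v       # where momentum alone would take us
    v = beta * v + grad_fn(lookahead)   # gradient at the look-ahead point
    w = w - lr * v
    return w, v
```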

      Batch Gradient Descent

  • For every parameter update, gradient descent parses the entire dataset
  • Advantages:
    • conditions of convergence are well-understood
    • several acceleration techniques designed to operate in the batch GD setting
  • Disadvantages:
    • Computationally slow

      Stochastic Gradient Descent

  • randomly shuffle the training set, and update parameters after gradients are computed for each training sample

    Mini-Batch Stochastic Gradient Descent

  • Update parameters after gradients are computed for a randomly drawn mini-batch of training samples
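
A minimal NumPy sketch of mini-batch SGD (setting `batch_size=1` recovers per-sample SGD); `grad_fn` is an assumed function returning the gradient over a mini-batch:

```python
import numpy as np

def minibatch_sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # shuffle the training set
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], y[batch])
    return w
```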

Adaptive Methods

  • choose different learning rate for every weight in the network

    AdaGrad

  • adaptively scales the learning rate for each weight
  • but it decays the learning rate very aggressively
  • after a while, the frequently-updated parameters receive very small updates due to the decayed learning rate
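
A sketch of the AdaGrad update for one parameter vector; `g_acc` is the running sum of squared gradients:

```python
import numpy as np

# AdaGrad: per-weight learning rate scaled by accumulated squared gradients.
# The accumulator only ever grows, which is what decays the learning rate.
def adagrad_step(w, g_acc, grad, lr=0.01, eps=1e-8):
    g_acc = g_acc + grad ** 2
    w = w - lr * grad / (np.sqrt(g_acc) + eps)
    return w, g_acc
```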

    RMSProp

  • root-mean-squared propagation
  • decay the denominator and prevent its rapid growth
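
The same sketch with RMSProp's decayed accumulator, which keeps the denominator from growing without bound:

```python
import numpy as np

def rmsprop_step(w, g_acc, grad, lr=0.001, beta=0.9, eps=1e-8):
    g_acc = beta * g_acc + (1 - beta) * grad ** 2   # decayed squared-gradient history
    w = w - lr * grad / (np.sqrt(g_acc) + eps)
    return w, g_acc
```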

    Adam

  • adaptive moment estimation
  • does everything that RMSProp does
  • but also uses a cumulative history of the gradients
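
A sketch of the Adam update, combining a decayed history of the gradients themselves (`m`) with an RMSProp-style squared-gradient history (`v`); `t` is the 1-based step count used for bias correction:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment (gradient history)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```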

Which optimiser to use?

  • SGD:
    • manages to reach a minimum
    • may take longer than other methods
    • reliant on good initialisation and annealing schedule
    • may get stuck in saddle points rather than local minima
  • Adaptive methods
    • Achieve fast convergence
    • Able to train complex models
    • Ideal if the input data is sparse
    • No need to tune the learning rate itself

Which weight initialisation schema to use?

  • Choose based on the activation function
  • Tanh / Sigmoid
    • Xavier initialisation
  • ReLU
    • He initialisation
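
In PyTorch, the two schemes can be applied like this (layer sizes are arbitrary):

```python
import torch.nn as nn

layer_tanh = nn.Linear(128, 64)
nn.init.xavier_uniform_(layer_tanh.weight)                        # Xavier for tanh/sigmoid

layer_relu = nn.Linear(128, 64)
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")   # He for ReLU
```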

Normalisation Layers

Motivation

  • Covariate shift
    • A change in the data distribution between the training and test scenarios
    • Problematic because the model needs to adapt to a new distribution
  • Internal covariate shift
    • A shift in the distribution of a layer's inputs that occurs during training itself, e.g. from epoch to epoch

      Batch Normalisation

    • compute batch statistics (batch mean and variance)
    • Normalise layer inputs
    • Scale and shift
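
A minimal NumPy sketch of those three steps for a batch of activations `x` of shape (batch, features), with learned scale `gamma` and shift `beta`:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # 1. batch statistics
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # 2. normalise layer inputs
    return gamma * x_hat + beta             # 3. scale and shift
```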

      Why is BatchNorm helpful

    • BatchNorm is fully differentiable
    • Lower covariate shift enables use of larger learning rates
      • Lower risk of exploding / vanishing gradients
    • Reduces training times
    • Reduces sensitivity to weight initialisation

Convolutional neural networks (CNN)

  • CNNs preserve the spatial structure
    • sliding window-based filters
    • invariant to translation, flipping, scaling…

Convolution Operation

  • slide the filter (kernel) over the input
  • the resulting output is called a feature map
  • CONV layer is for feature extraction
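
A small PyTorch illustration of the sliding-filter operation (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)             # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
feature_maps = conv(x)                    # 16 filters -> 16 feature maps
print(feature_maps.shape)                 # torch.Size([1, 16, 62, 62])
```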

Understanding the (Hyper)parameters

Convolution (Hyper) parameters

  • input dimensions: W1 x H1 x D1
  • spatial extent of each filter: F
    • depth of each filter = depth of input
  • output dimensions
    • W2 x H2 x D2
  • stride: S
  • Number of filters: K
  • The output dimensions are:
    • W2 = W1 - F + 1
    • H2 = H1 - F + 1

      Padding

  • padding allows the output feature map to be the same size as the input
  • Pad the inputs with an appropriately sized border so that the filter kernel can be applied to the corners of the image
  • The output dimensions are:
    • W2 = W1 - F + 2P + 1
    • H2 = H1 - F + 2P + 1

      Stride

  • defines the interval at which the filter is applied
  • S denotes the number of pixels by which the window moves after each operation
  • The final output dimensions are:
    • W2 = (W1 - F + 2P) / S + 1
    • H2 = (H1 - F + 2P) / S + 1
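
These formulas can be wrapped in a small helper to sanity-check layer shapes (a sketch, assuming integer division for the stride):

```python
def conv_output_size(w1, h1, f, p=0, s=1):
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2

# e.g. a 224x224 input with a 3x3 filter, padding 1 and stride 1 keeps its size:
print(conv_output_size(224, 224, f=3, p=1, s=1))   # (224, 224)
```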

      The Pooling Operation

  • Filter out details
    • introduces invariance to minor local modifications
  • Makes the representations smaller
  • Operates over each feature map independently
  • Makes the features translation and scale-invariant
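
For example, 2x2 max pooling in PyTorch halves each spatial dimension of every feature map independently:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 16, 62, 62)
print(pool(x).shape)                      # torch.Size([1, 16, 31, 31])
```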

    Dimensionality Reduction

  • Using 1 x 1 convolutions reduces the number of computations by controlling the number of filters.
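
A sketch of such a bottleneck: a 1x1 convolution shrinks the channel dimension while leaving the spatial dimensions untouched (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 32, 32)
bottleneck = nn.Conv2d(256, 64, kernel_size=1)    # 256 -> 64 channels
print(bottleneck(x).shape)                        # torch.Size([1, 64, 32, 32])
```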

    Width vs. Depth

  • Wide: more filters per CONV layer
    • wide networks memorise
  • Deep: more CONV layers
    • deep networks generalise

CNN Architectures for Semantic Segmentation

Richer Visual Recognition Tasks

  • Classification
    • output: one label per image
  • Semantic Segmentation
    • output: one (category) label per pixel
    • grouping together similar pixels
  • Instance Segmentation
    • output: category and instance labels for each pixel
    • distinguishes between different instances of an object

      Semantic Segmentation

  • Goal: implement a pixel-level classifier
  • High-resolution prediction
    • Preserve the image dimensions at the output
  • Requires an encoder-decoder network (see the sketch below)
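
A toy encoder-decoder sketch (not a full U-Net; layer sizes and class count are arbitrary) that downsamples to extract features and upsamples back to per-pixel class scores:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # halve the resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2),     # restore the resolution
            nn.ReLU(),
            nn.Conv2d(16, n_classes, 1),                 # per-pixel class scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

print(TinySegNet()(torch.randn(1, 1, 64, 64)).shape)     # torch.Size([1, 4, 64, 64])
```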

Working with Medical Data

What makes medical data unique?

Complexity

  • Healthcare data resides in multiple places, in different formats
  • Interpretation of medical imaging is non-trivial
  • Requires specialist training
  • Difficult to acquire labels

    Size

  • DL models are data-hungry
  • medical datasets are relatively small
  • individual medical samples are very large

    Intensities

  • reflect physical properties of anatomical tissues

    Phenotypic Heterogeneity

  • There is significant natural variation, even in healthy participants

    Site Variability

  • Differences in scanning sequences, scanners, point-spread function, noise, manufacturer
  • Handling multiple sources of data
  • Research vs. Clinical data

    Multi-Modal Data

  • a single patient may produce diverse data samples
  • missing data

    Source of dataset

    Ethics

Data Preparation and Pre-processing

Spatial Normalisation

  • Using rigid or affine registration to a common space

    Intensity Normalisation

    Histogram Normalisation

    Masking

  • to mask
    • increased focus on the task
    • reduced memory footprint of the input
    • no parameters ‘wasted’ on processing the background
  • not to mask
    • no image details/features are excluded
    • no resources spent on the border pixels
    • no mask calculation required
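
A sketch of two simple pre-processing steps on a NumPy image volume `img`, assuming a hypothetical binary `mask` array: z-score intensity normalisation and background masking:

```python
import numpy as np

def zscore_normalise(img, mask=None):
    voxels = img[mask > 0] if mask is not None else img
    return (img - voxels.mean()) / (voxels.std() + 1e-8)

def apply_mask(img, mask):
    return img * (mask > 0)      # zero out everything outside the mask
```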

Small Datasets

Transfer Learning

  • Learning new tasks relies on previous tasks
  • Start with a pretrained model that produces good results
  • Initialise a model with the pretrained model weights and fine-tune
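
A sketch of that recipe in PyTorch, assuming a recent torchvision with an ImageNet-pretrained ResNet-18 and a hypothetical 3-class target task:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone

for param in model.parameters():       # optionally freeze the pretrained features
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 3)   # new head, fine-tuned on the new task
```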

    Domain Adaptation

  • Domain shift
    • when the training and test distributions are different
  • Goal of domain adaptation
    • train a NN on a source dataset, and achieve good accuracy on a target dataset that is significantly different from the source
  • adversarial domain adaptation

    Transfer Learning Overview

    Rows: source data; columns: target data.

| Source \ Target | Labelled | Unlabelled |
|---|---|---|
| Labelled | Fine-tune; Multi-task learning | Self-taught learning |
| Unlabelled | Adversarial domain adaptation; Few-/Low-shot learning | Clustering techniques |

Beyond Batch Norm

Batch Norm

  • accelerates training
  • improves generalisation

    Normalisation without batches

  • Layer Norm
    • Normalise across the features/channels
  • Instance Norm
    • Normalise across the activations in each channel
  • Group Norm
    • Computes the mean and standard deviation over groups of channels, so it does not assume that all channels are equally important (see the sketch below)
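
In PyTorch these correspond roughly to the following layers (the channel count and group size here are illustrative):

```python
import torch.nn as nn

layer_norm    = nn.LayerNorm([32, 64, 64])                     # over features/channels
instance_norm = nn.InstanceNorm2d(32)                          # per-channel, per-sample
group_norm    = nn.GroupNorm(num_groups=8, num_channels=32)    # over groups of channels
```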

Loss Functions for Medical Tasks

Regression Losses

  • L1
  • L2
  • Huber loss
  • Generalised regression loss
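
Their PyTorch counterparts, for reference:

```python
import torch.nn as nn

l1_loss    = nn.L1Loss()              # mean absolute error
l2_loss    = nn.MSELoss()             # mean squared error
huber_loss = nn.HuberLoss(delta=1.0)  # quadratic near zero, linear for large errors
```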

    Segmentation Losses

  • use pixel/voxel-wise cross entropy
  • overlap-based losses have been proposed to address class imbalance
  • Dice Coefficient
    • estimates the regional overlap between the predicted and ground-truth segmentations
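
A minimal soft-Dice sketch for a binary segmentation, assuming `pred` holds per-pixel probabilities and `target` the binary ground-truth mask:

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

def dice_loss(pred, target):
    return 1 - dice_coefficient(pred, target)   # 1 - overlap, minimised during training
```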

Sequence Learning

Vector-to-sequence models

  • One to many

    Sequence-to-vector models

  • Many to one

    Sequence-to-sequence models

  • Many to many

    To model sequences, we need to

  • account for relationship between inputs
  • handle sequences of variable length
  • compute the same function at each time step
  • track long-term dependencies

    Recurrent Neural Networks (RNNs)

Long-Short Term Memory (LSTM)

Forget Gate

  • Deciding which information to discard from the cell state

    Input Gate

  • Deciding which information to store in the cell state

    Combining Forget + Input Gates

  • Deciding which information to remove/store in the cell state

    Output Gate

  • Deciding what information to output
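
In PyTorch, all four gates are bundled inside nn.LSTM; a minimal usage sketch (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=12, hidden_size=32, batch_first=True)
x = torch.randn(8, 20, 12)            # 8 sequences, 20 time steps, 12 features each
outputs, (h_n, c_n) = lstm(x)         # c_n is the cell state (internal memory)
print(outputs.shape)                  # torch.Size([8, 20, 32])
```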

Gated Recurrent Units (GRU)

  • Combine forget and input gates into a single update gate
  • Merges cell state and hidden state
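
The corresponding GRU has no separate cell state and returns only the hidden state:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=12, hidden_size=32, batch_first=True)
outputs, h_n = gru(torch.randn(8, 20, 12))   # no cell state, only h_n
print(h_n.shape)                             # torch.Size([1, 8, 32])
```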

LSTMs vs GRUs

| | LSTM | GRU |
|---|---|---|
| Number of gates | 3 gates | 2 gates |
| Gate composition | Separate input and forget gates | Input and forget gates are coupled into an update gate; the reset gate is applied directly to the previous hidden state |
| Memory/History | C_t serves as the internal memory of the network | No internal memory distinct from the exposed hidden state; only a hidden state and no output gate |

Self-Attention and Transformers

This post is licensed under CC BY 4.0 by the author.