MIT License

This repository contains a PyTorch implementation of the NeurIPS 2020 paper "Disentangling Human Error from the Ground Truth in Segmentation of Medical Images".

Mou-Cheng Xu is the main developer of the Python code; Le Zhang is the main developer of the data simulation code.

How to use our code for further research

We recommend trying the toy example in MNIST_example.ipynb first to understand the pipeline. It is a simplified main function for MNIST, similar to the other main functions in Train_GCM.py, Train_ours.py, Train_puunet.py and Train_unet.py.

  1. If you want to apply our code to your own medical datasets:

Following MNIST_example.ipynb, you might want to replace the data loader with your own for your preferred pre-processing. An example data loader, CustomDataset_punet, can be found in Utilis.py; a minimal illustrative sketch is given below.
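For reference, here is a minimal sketch of a dataset following the folder layout described later in this README (an Image folder plus one folder per simulated annotator). The class name, the file naming and the use of .npy files are our own assumptions for illustration; the actual implementation is CustomDataset_punet in Utilis.py.

```python
# Minimal illustrative dataset (not the repository's CustomDataset_punet): pairs each
# training image with the labels produced by the simulated annotators. File naming and
# the .npy format are assumptions; adapt the loading logic to your own pre-processing.
import glob
import os

import numpy as np
import torch
from torch.utils.data import Dataset


class MultiAnnotatorDataset(Dataset):
    def __init__(self, data_dir):
        # One sub-folder per annotator plus one for the images.
        self.data_dir = data_dir
        self.label_dirs = ['Over', 'Under', 'Wrong', 'Good']
        self.image_paths = sorted(glob.glob(os.path.join(data_dir, 'Image', '*.npy')))

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        image = np.load(self.image_paths[index]).astype(np.float32)
        file_name = os.path.basename(self.image_paths[index])
        # Stack the noisy labels from all simulated annotators along a new first axis.
        labels = np.stack([np.load(os.path.join(self.data_dir, d, file_name)).astype(np.int64)
                           for d in self.label_dirs])
        return torch.from_numpy(image), torch.from_numpy(labels)
```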

  2. If you want to plug in the proposed loss function and experiment with it:

The loss function is implemented in Loss.py as noisy_label_loss; a conceptual sketch of what it computes follows.
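To give a sense of what noisy_label_loss computes (the actual signature and tensor layout in Loss.py may differ; the shapes below are assumptions), here is a conceptual sketch: the estimated true-label distribution is mapped through each annotator's confusion matrix, a cross-entropy is taken against that annotator's noisy labels, and the trace of the confusion matrices is added, weighted by alpha, so that the estimated annotators are encouraged to be maximally unreliable.

```python
# Conceptual sketch only; the real noisy_label_loss in Loss.py may use a different layout.
# Assumed shapes:
#   pred_true    : (B, C, H, W)        softmax output, estimated true-label distribution
#   cms          : (B, R, C, C, H, W)  pixel-wise confusion matrix of each of R annotators
#   noisy_labels : (B, R, H, W)        integer labels observed from each annotator
import torch
import torch.nn.functional as F


def noisy_label_loss_sketch(pred_true, cms, noisy_labels, alpha=1.5):
    total_ce, total_trace = 0.0, 0.0
    n_annotators = cms.shape[1]
    for r in range(n_annotators):
        cm = cms[:, r]  # (B, C, C, H, W)
        # Map the estimated true-label distribution through annotator r's confusion matrix:
        # p_noisy[j] = sum_i cm[j, i] * p_true[i]
        pred_noisy = torch.einsum('bjihw,bihw->bjhw', cm, pred_true)
        total_ce += F.nll_loss(torch.log(pred_noisy + 1e-8), noisy_labels[:, r])
        # Trace regularisation: a smaller diagonal means a less reliable estimated annotator,
        # which is what the method encourages in order to disentangle annotators from the GT.
        total_trace += torch.einsum('biihw->b', cm).mean()
    return total_ce + alpha * total_trace
```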

  3. If you want to adapt our Run.py for your application, you need to prepare your data in the following layout:

  data_path = '/.../.../all_of_datasets'
  data_tag = 'some_data_set'

The full path to the data is then '/.../.../all_of_datasets/some_data_set'.

  4. Folder structure of our training data:

An example from BRATS as used in our experiments (a small layout-check sketch follows the directory tree):

your path
│
└───data sets
│   │
│   └───brats
│        │
│        └───train
│        │   │
│        │   └───Over # where all over-segmentation labels are stored for training
│        │   │
│        │   └───Under # where all under-segmentation labels are stored for training
│        │   │
│        │   └───Wrong # where all wrong-segmentation labels are stored for training
│        │   │
│        │   └───Good # where all good-segmentation labels are stored for training
│        │   │
│        │   └───Image # where all training images are stored
│        │
│        └───validate
│        │   │
│        │   └───Over
│        │   │
│        │   └───Under
│        │   │
│        │   └───Wrong
│        │   │
│        │   └───Good
│        │   │
│        │   └───Image # where all validation images are stored
│        │
│        └───test
│        │   │
│        │   └───Over
│        │   │
│        │   └───Under
│        │   │
│        │   └───Wrong
│        │   │
│        │   └───Good
│        │   │
│        │   └───Image # where all testing images are stored
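
As a sanity check (our own illustrative helper, not part of the repository), you can verify that a dataset folder matches this layout before training:

```python
# Hypothetical helper: checks that a dataset folder matches the layout sketched above.
# Folder names follow the tree in this README; adjust if your structure differs.
import os

EXPECTED_SPLITS = ['train', 'validate', 'test']
EXPECTED_SUBFOLDERS = ['Over', 'Under', 'Wrong', 'Good', 'Image']


def check_dataset_layout(data_path, data_tag):
    root = os.path.join(data_path, data_tag)
    missing = [os.path.join(root, split, sub)
               for split in EXPECTED_SPLITS
               for sub in EXPECTED_SUBFOLDERS
               if not os.path.isdir(os.path.join(root, split, sub))]
    if missing:
        raise FileNotFoundError('Missing folders:\n' + '\n'.join(missing))


# Example: check_dataset_layout('/.../all_of_datasets', 'brats')
```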


Introduction

We present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions, using two coupled CNNs. The separation of the two is achieved by encouraging the estimated annotators to be maximally unreliable while achieving high fidelity with the noisy training data.
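In rough notation (ours; the paper's exact formulation may differ), write \hat{p}_\theta(x) for the estimated true-label distribution, \hat{A}^{(r)}_\phi(x) for the estimated pixel-wise confusion matrix of annotator r, and \tilde{y}^{(r)} for that annotator's observed noisy labels. The training objective is then approximately

    \mathcal{L}(\theta,\phi) = \sum_{r} \mathrm{CE}\left( \hat{A}^{(r)}_{\phi}(x)\,\hat{p}_{\theta}(x),\ \tilde{y}^{(r)} \right) + \alpha \sum_{r} \mathrm{tr}\left( \hat{A}^{(r)}_{\phi}(x) \right)

Minimising the trace term pushes the diagonals of the estimated confusion matrices down, i.e., it favours maximally unreliable annotators; alpha is the weighting listed under Training below, and a conceptual implementation sketch is given in the usage section above.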

The architecture of our model is depicted below: <br> <img height="500" src="figures/NIPS.png" /> </br>

Installation

All required libraries can be installed via conda (Anaconda). We recommend creating a conda environment with all dependencies via the environment file, e.g.,

  conda env create -f conda_env.yml

Data Simulator (Morphological Transformations)

We generate synthetic annotations from an assumed GT on the MNIST, MS lesion and BraTS datasets, to verify the efficacy of the approach in an idealised situation where the GT is known. We simulate a group of 5 annotators with disparate characteristics by performing morphological transformations (e.g., thinning, thickening, fractures) on the ground-truth (GT) segmentation labels, using the Morpho-MNIST software. In particular, the first annotator provides faithful segmentation ("good-segmentation") approximating the GT, the second tends to over-segment ("over-segmentation"), the third tends to under-segment ("under-segmentation"), the fourth is prone to a combination of small fractures and over-segmentation ("wrong-segmentation"), and the fifth always annotates everything as background ("blank-segmentation"). To create synthetic noisy labels in the multi-class scenario, we first choose a target class and then apply morphological operations on the provided GT mask to create 4 synthetic noisy labels with different patterns, namely over-segmentation, under-segmentation, wrong segmentation and good segmentation. We create training data by deriving labels from the simulated annotators. Several example images are provided in data_simulation.

<br> <img height="500" src="figures/Morph.png" /> </br>
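
As a rough illustration of the idea (not the repository's actual simulator, which uses Morpho-MNIST and the scripts in data_simulation), over-, under- and wrong-segmenting annotators can be mimicked by dilating, eroding or corrupting a binary GT mask:

```python
# Rough illustration only: simulating over-/under-/wrong-segmenting annotators from a
# binary GT mask with simple morphology. The repository's own simulators (Morpho-MNIST
# and the scripts in data_simulation) are more elaborate.
import numpy as np
from scipy import ndimage


def simulate_annotators(gt_mask, iterations=2, hole_rate=0.1, seed=0):
    """gt_mask: 2D binary numpy array, 1 = foreground."""
    rng = np.random.default_rng(seed)
    over = ndimage.binary_dilation(gt_mask, iterations=iterations)   # over-segmentation
    under = ndimage.binary_erosion(gt_mask, iterations=iterations)   # under-segmentation
    # "Wrong" annotator: random holes punched into a dilated mask, loosely mimicking
    # small fractures combined with over-segmentation.
    wrong = ndimage.binary_dilation(gt_mask, iterations=iterations)
    wrong[rng.random(gt_mask.shape) < hole_rate] = False
    blank = np.zeros_like(gt_mask)                                   # blank-segmentation
    return {'good': gt_mask, 'over': over, 'under': under, 'wrong': wrong, 'blank': blank}
```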

Here we also introduce another simple method for simulating annotator data for both binary and multi-class masks. Before running the data simulator, please make sure you have installed FSL on your machine.

For binary mask:

  1. Change the folder path to your data folder and run ./data_simulation/over-segmentation.m to generate the simulated over-segmentation mask;
<br> <img height="130" src="figures/over-sample.png" /> </br>
  2. Change the folder path to your data folder and run ./data_simulation/under-segmentation.m to generate the simulated under-segmentation mask;
<br> <img height="130" src="figures/under-sample.png" /> </br>
  3. Change the folder path to your data folder and run ./data_simulation/artificial_wrong_mask.py to generate the simulated wrong-segmentation mask;
<br> <img height="130" src="figures/wrong-sample.png" /> </br>

For multi-class mask:

Change the folder path to your data folder and run ./data_simulation/multiclass_data_simulator.m to generate the over-segmentation, under-segmentation and wrong-segmentation masks simultaneously.

<br> <img height="150" src="figures/Multi-class.png" /> </br>

Download & preprocess the datasets

Download the example datasets in the following table as used in the paper, and pre-process them using the following steps for multi-class segmentation:

  1. Download the training dataset with annotations from the corresponding link (e.g., BraTS2019)

  2. Unzip the data; you will obtain two folders

  3. Extract the 2D images and annotations from nii.gz files by running

       - cd Brats
       - python ./preprocessing/Prepare_BRATS.py
    
| Dataset (with link) | Content | Resolution (pixels) | Number of classes |
|---|---|---|---|
| MNIST | Handwritten digits | 28 x 28 | 2 |
| ISBI2015 | Multiple sclerosis lesion | 181 x 217 x 181 | 2 |
| BraTS2019 | Multimodal brain tumor | 240 x 240 x 155 | 4 |
| LIDC-IDRI | Lung Image Database Consortium image collection | 180 x 180 | 2 |

Training

For the BraTS dataset, set the hyper-parameters in run.py (see the illustrative sketch after the run command below):

   - input_dim=4,
   - class_no=4,
   - repeat=1,
   - train_batchsize=2,
   - validate_batchsize=1,
   - num_epochs=30,
   - learning_rate=1e-4,
   - alpha=1.5,
   - width=16,
   - depth=4,
   - data_path=your path,
   - dataset_tag='brats',
   - label_mode='multi',
   - save_probability_map=True,
   - low_rank_mode=False

and run:

python run.py 
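
For orientation only, the settings above correspond to keyword arguments. A sketch of how they might be collected inside run.py is below; the final call is hypothetical, so check run.py for the actual entry point and signature.

```python
# Hypothetical illustration: how the hyper-parameters listed above might be grouped in run.py.
# The entry-point name at the bottom is made up; use whichever training function run.py calls.
hyper_parameters = dict(
    input_dim=4,              # 4 MRI modalities for BraTS
    class_no=4,               # number of segmentation classes
    repeat=1,
    train_batchsize=2,
    validate_batchsize=1,
    num_epochs=30,
    learning_rate=1e-4,
    alpha=1.5,                # weight of the trace regularisation
    width=16,
    depth=4,
    data_path='your path',    # root folder containing the dataset folders
    dataset_tag='brats',
    label_mode='multi',
    save_probability_map=True,
    low_rank_mode=False,
)

# train_entry_point(**hyper_parameters)  # hypothetical name for the repository's trainer
```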

Testing

To test our model, run segmentation.py with the following settings:

  1. change the model_path to your pre-trained model;
  2. change the test_path to your testing data.

Performance

  1. Comparison of segmentation accuracy and error of CM estimation for different methods trained with dense labels (mean ± standard deviation). The best results are shown in bold. Note that we exclude the Oracle from the model ranking as it forms a theoretical upper bound on performance, where true labels are known on the training data.
| Models | BraTS Dice (%) | BraTS CM estimation | LIDC-IDRI Dice (%) | LIDC-IDRI CM estimation |
|---|---|---|---|---|
| Naive CNN on mean labels | 29.42±0.58 | n/a | 56.72±0.61 | n/a |
| Naive CNN on mode labels | 34.12±0.45 | n/a | 58.64±0.47 | n/a |
| Probabilistic U-net | 40.53±0.75 | n/a | 61.26±0.69 | n/a |
| STAPLE | 46.73±0.17 | 0.2147±0.0103 | 69.34±0.58 | 0.0832±0.0043 |
| Spatial STAPLE | 47.31±0.21 | 0.1871±0.0094 | 70.92±0.18 | 0.0746±0.0057 |
| Ours without Trace | 49.03±0.34 | 0.1569±0.0072 | 71.25±0.12 | 0.0482±0.0038 |
| **Ours** | **53.47±0.24** | **0.1185±0.0056** | **74.12±0.19** | **0.0451±0.0025** |
| Oracle (Ours but with known CMs) | 67.13±0.14 | 0.0843±0.0029 | 79.41±0.17 | 0.0381±0.0021 |
  2. Comparison of segmentation accuracy and error of CM estimation for different methods trained with only one label available per image (mean ± standard deviation). The best results are shown in bold.
| Models | BraTS Dice (%) | BraTS CM estimation | LIDC-IDRI Dice (%) | LIDC-IDRI CM estimation |
|---|---|---|---|---|
| Naive CNN on mean & mode labels | 36.12±0.93 | n/a | 48.36±0.79 | n/a |
| STAPLE | 38.74±0.85 | 0.2956±0.1047 | 57.32±0.87 | 0.1715±0.0134 |
| Spatial STAPLE | 41.59±0.74 | 0.2543±0.0867 | 62.35±0.64 | 0.1419±0.0207 |
| Ours without Trace | 43.74±0.49 | 0.1825±0.0724 | 66.95±0.51 | 0.0921±0.0167 |
| **Ours** | **46.21±0.28** | **0.1576±0.0487** | **68.12±0.48** | **0.0587±0.0098** |
  3. The final segmentation of our model on BraTS and visualisations of each annotator network's predictions. (Best viewed in colour: the target label is red.)
<br> <img height="500" src="figures/Brats_1.jpg" /> </br>
  4. Visualisation of segmentation labels on the BraTS dataset: (a) GT and simulated annotators' segmentations (Annotator 1 - 5); (b) the predictions from the supervised models.
<br> <img height="340" src="figures/brats-compare.jpg" /> </br>
  5. Visualisation of segmentation labels on the LIDC-IDRI dataset: (a) GT and simulated annotators' segmentations (Annotator 1 - 5); (b) the predictions from the supervised models.
<br> <img height="340" src="figures/LIDC-compare.jpg" /> </br>

Citation

If you use this code or the dataset for your research, please cite our paper:

@inproceedings{HumanError2020,
  title={Disentangling Human Error from the Ground Truth in Segmentation of Medical Images},
  author={Zhang, Le and Tanno, Ryutaro and Xu, Mou-Cheng and Jacob, Joseph and Ciccarelli, Olga and Barkhof, Frederik and Alexander, Daniel C.},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}