Home

3/20/2018 Update

The Docker image is available below. We are still working to organize and document the data analysis scripts. We apologize for the delay.

Survival Convolutional Networks

This page contains software and data resources related to the paper

Mobadersany, Pooya, et al. "Predicting cancer outcomes from histology and genomics using convolutional networks." Proceedings of the National Academy of Sciences 115.13 (2018): E2970-E2979.

We provide scripts for formatting and analyzing data, along with a portable Docker container that encapsulates the executable software, documentation, and data used to generate the results shown in the paper.

Docker container

A Docker container that encapsulates the executables and data is posted on DockerHub.

Brief directions for deploying this Docker image are provided below. Consult the Docker Tutorial for additional guidance.

  1. Pull the Docker image to your system.

$docker pull cancerdatascience/scnn:1.0

  2. Confirm that the Docker image has downloaded. The image is larger than 10 GB due to the included data, so the download may take some time.

$docker images
REPOSITORY                     TAG                            IMAGE ID            CREATED             SIZE
cancerdatascience/scnn         1.0                            858d8c3d6af4        24 hours ago        13.2GB

  3. Start the Docker container and run the code on CPU or GPU.

CPU version

$docker run -it cancerdatascience/scnn:1.0 /bin/bash
root@97d439b58033:/# cd /root/scnn
root@97d439b58033:~/scnn# python model_train.py

GPU version (4 GPUs - see note below)

$docker run --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0:/dev/nvidia0 --device=/dev/nvidia1:/dev/nvidia1 --device=/dev/nvidia2:/dev/nvidia2 --device=/dev/nvidia3:/dev/nvidia3 -i -t cancerdatascience/scnn:1.0 /bin/bash
root@97d439b58033:/# cd /root/scnn
root@97d439b58033:~/scnn# python model_train.py

Note: this Docker image is built on CUDA 8.0 with cuDNN 5.1 and NVIDIA driver 367.57. The code was developed on a system with 4 NVIDIA K80 GPUs. GPU memory capacities vary widely, and running this Docker image on GPUs with inadequate memory may produce memory errors.

Executables for training and testing models

The Docker container provides executables for training SCNN/GSCNN models and for evaluating the accuracy of these models. Both executables consume the input .csv file of patient identifiers, clinical outcomes, and genomic features (the -f argument) and a folder of ROI images (the -i argument).

Note: the executables may not produce identical models or statistics across runs. We have taken steps to reduce variability wherever possible, but the non-associativity of floating-point arithmetic in distributed and GPU computing produces variations that accumulate as models are trained. Graph-level and operation-level seeding were addressed with TensorFlow's seeding functions to eliminate them as sources of randomness.
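
For reference, graph-level and operation-level seeding in TensorFlow 1.x (the API generation matching the CUDA 8.0/cuDNN 5.1 build above) look roughly like the sketch below. The seed value, tensor shapes, and variable names are illustrative and are not taken from the SCNN source.

# Minimal seeding sketch, assuming TensorFlow 1.x; values are illustrative.
import tensorflow as tf

SEED = 1

# Graph-level seed: fixes the sequence of seeds handed to ops that do not set one.
tf.set_random_seed(SEED)

# Operation-level seeds: pin the randomness of individual ops explicitly.
weights = tf.Variable(
    tf.truncated_normal([256, 1], stddev=0.1, seed=SEED), name="weights")
dropped = tf.nn.dropout(weights, keep_prob=0.95, seed=SEED)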

Training

The command

python model_train.py

will train a model using the default hyperparameters and data locations described below.

The input parameters and their default values can be listed using

python model_train.py -h

which generates the following output:

usage: model_train.py [-h] [-m M] [-f F] [-i I] [-r R] [-t T] [-d D] [--lr LR]
                      [--me ME] [--kp KP] [--bs BS] [--ic IC] [--ngf NGF]
                      [--nm NM]

Arguments for Training the SCNN/GSCNN model

optional arguments:
  -h, --help  show this help message and exit
  -m M        SCNN or GSCNN; SCNN for the case we just use histology images
              and GSCNN for the case we integrate the histology and genomic
              features. Default value = SCNN.
  -f F        Path to the file containing patient IDs, patient indexes, 
              clinical outcomes, and genomic features. The default path and 
              filename is ./inputs/all_dataset.csv
  -i I        Path to the Training ROIs. The default is ../images/train
  -r R        Path to the Training results (training loss and other outputs).
              The default path is ./train_results
  -t T        Path to save the trained models and their
              weights/biases/parameters values. The default is ./checkpoints
  -d D        Path to the temporary binary files for Training. The default path 
              is ./tmp
  --lr LR     Initial learning rate. Default value = 0.001.
  --me ME     Max number of epochs. Default value = 100.
  --kp KP     Keeping probability for training weight dropout. Default value = 0.95.
  --bs BS     Batch size. Default value = 14.
  --ic IC     Column containing patient indices in the -f input (0-indexed). 
              Default value = 1.
  --ngf NGF   Number of genomic features in the -f input. Default value = 2.
  --nm NM     Number models to save for test time model averaging. 
              Default value = 5.

The outputs generated by model_train.py are the training results (training loss and related outputs), written to the -r path (./train_results by default), and the trained models with their weights/biases/parameter values, written to the -t path (./checkpoints by default).

Example

python model_train.py -m GSCNN --ngf 80

This command will train the GSCNN model with 80 genomic features.

Testing

The command

python model_test.py

will generate risk values for the testing patients using a trained model. The testing input parameters and their default values for model_test.py can be listed using the command

python model_test.py -h

which generates the following output:

usage: model_test.py [-h] [-m M] [--kp KP] [--bs BS] [-i I] [-r R] [-t T]
                     [-d D] [-f F] [--ic IC] [--ngf NGF] [--nm NM]

Arguments for Testing the SCNN/GSCNN model

optional arguments:
  -h, --help  show this help message and exit
  -m M        SCNN or GSCNN; SCNN for the case we just use the clinical
              outcomes and GSCNN for the case we integrate the clinical
              outcomes with genomic features. Default value = SCNN.
  --kp KP     Keeping probability for test weight dropout. Default value = 1.
  --bs BS     Batch size. Default value = 14.
  -i I        Path to the folder containing ROI .pngs for testing. 
              The default path is ../images/test
  -r R        Path where testing results will be generated (final test C-Index and 
              patient risk values). The default path is ./test_results
  -t T        Path to folder containing the trained models and their
              weights/biases/parameters values. The default path is ./checkpoints
  -d D        Path where the binary files for testing will be generated. 
              The default path is ./tmp
  -f F        Path to the input .csv file. The default path and filename is
              ./inputs/all_dataset.csv
  --ic IC     Column containing patient indices in the -f input (0-indexed). 
              Default value = 1.
  --ngf NGF   Number of genomic features in the -f input file. Default value = 2.
  --nm NM     number of models used for model averaging during testing. 
              Default value = 5.

The outputs generated by model_test.py are the final test C-Index and the patient risk values, written to the -r path (./test_results by default).

Example

python model_test.py -m GSCNN --ngf 80

This command will test the GSCNN model with 80 genomic features.

Data

The results in this paper were generated using whole-slide .svs images of paraffin-embedded sections and clinical outcomes data from The Cancer Genome Atlas (TCGA). These images are publicly available and hosted at the NCI Genomic Data Commons (GDC) Legacy Archive. A full list of the whole-slide image files used in the paper is available in /data/rois.txt.

Downloading the data

The GDC does not currently support directly querying the TCGA diagnostic images for a specific project. To generate a list of files to download, first generate a manifest of all whole-slide images in TCGA (both frozen-section and diagnostic), filter out the frozen-section images, and then match the remaining identifiers against the sample identifiers (TCGA-##-####) for the project(s) of interest.

The manifest for all TCGA whole-slide images can be generated using the GDC Legacy Archive query.

Rows containing diagnostic image files can be identified using the Linux command line

cut -d$'\t' -f 2 gdc_manifest.txt | grep -E '\.*-DX[^-]\w*.'

After matching the slide filenames against the sample IDs from the clinical data for the project(s) of interest, the relevant filenames can be used with the GDC Data Transfer Tool or the GDC API.
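
As one possible way to script the matching step, the sketch below filters the manifest to diagnostic slides and keeps only rows whose TCGA sample ID (the first 12 characters of the slide filename) appears in the clinical data for the project(s) of interest. The "filename" column, file names, and example sample IDs are assumptions for illustration, not part of the published pipeline.

# Hedged sketch: filter a GDC legacy manifest to diagnostic slides for chosen samples.
import csv
import re

sample_ids = {"TCGA-02-0001", "TCGA-02-0003"}   # illustrative; load from your clinical table
diagnostic = re.compile(r"-DX[^-]")             # diagnostic (non-frozen) slide names

with open("gdc_manifest.txt") as manifest, \
     open("filtered_manifest.txt", "w", newline="") as out:
    reader = csv.DictReader(manifest, delimiter="\t")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        filename = row["filename"]
        # TCGA sample IDs (TCGA-##-####) are the first 12 characters of the slide name.
        if diagnostic.search(filename) and filename[:12] in sample_ids:
            writer.writerow(row)

The filtered manifest can then be passed to the GDC Data Transfer Tool, for example with gdc-client download -m filtered_manifest.txt.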

Extracting regions of interest

Regions of interest can be extracted using the Python script generate_rois.py. This script consumes a tab-delimited text file describing the whole-slide image files, the ROI coordinates, and the desired size and magnification of the extracted ROIs, and generates a collection of ROI .png images. These images are converted into the binary files used for model training and testing by the executables described above.

Note: region extraction depends on the OpenSlide library.
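
For orientation, the core OpenSlide operation behind region extraction looks like the sketch below; the file name, coordinates, and region size are placeholders, and generate_rois.py additionally handles the tab-delimited input file and the requested ROI size and magnification.

# Minimal OpenSlide sketch; the path, coordinates, and size are placeholders.
import openslide

slide = openslide.OpenSlide("example_whole_slide.svs")

# Read a 1024x1024 region at level 0 (base magnification) starting at pixel (x, y).
region = slide.read_region(location=(20000, 15000), level=0, size=(1024, 1024))

# read_region returns an RGBA PIL image; convert to RGB before saving the ROI .png.
region.convert("RGB").save("roi_0.png")
slide.close()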