SAT-DS

This is the official repository for building SAT-DS, a medical data collection of 72 public segmentation datasets, containing over 22K 3D images, 302K segmentation masks and 497 classes across 3 modalities (MRI, CT, PET) and 8 human body regions. 🚀

Based on this data collection, we build a universal segmentation model for 3D radiology scans driven by text prompts (check this repo and our paper).

The data collection will continue to grow, so stay tuned!

Highlights

🎉 To save you the time of downloading and preprocessing so many datasets, we offer shortcut download links for 42 of the 72 datasets in SAT-DS, whose licenses (such as CC BY-SA) allow redistribution. Find them on Dropbox.

All these datasets are preprocessed and packaged by us for your convenience, ready for immediate use upon download and extraction. Download the datasets you need and unzip them in data/nii; they can then be used immediately with the paired jsonl files in data/jsonl (check Step 3 below for how to use them). Note that we respect and adhere to the licenses of all the datasets; if we have redistributed any of them incorrectly, please contact us.
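After extraction, the layout is expected to look roughly like the sketch below (the dataset folder and archive names here are illustrative; use the actual names from the download links):

```shell
# Create the two top-level folders used throughout this repo
mkdir -p data/nii/AbdomenCT-1K data/jsonl
# unzip AbdomenCT-1K.zip -d data/nii/   # one folder per dataset
# each dataset folder pairs with a data/jsonl/<DatasetName>.jsonl file
ls data
</imports>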

What we have done in building SAT-DS:

What we offer in this repo:

This repo can be used to:

Check our paper "One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts" for more details.

ArXiv

Website

Example Figure

Step 1: Download datasets

This is the detailed list of all the datasets and their official download links. Their citation information can be found in citation.bib.

As a shortcut, we have preprocessed, packaged and redistributed some of them for your convenience. Download them here.

| Dataset Name | Modality | Region | Classes | Scans | Download link |
|---|---|---|---|---|---|
| AbdomenCT1K | CT | Abdomen | 4 | 988 | https://github.com/JunMa11/AbdomenCT-1K |
| ACDC | CT | Thorax | 4 | 300 | https://humanheart-project.creatis.insa-lyon.fr/database/ |
| AMOS CT | CT | Abdomen | 16 | 300 | https://zenodo.org/records/7262581 |
| AMOS MRI | MRI | Thorax | 16 | 60 | https://zenodo.org/records/7262581 |
| ATLASR2 | MRI | Brain | 1 | 654 | http://fcon_1000.projects.nitrc.org/indi/retro/atlas.html |
| ATLAS | MRI | Abdomen | 2 | 60 | https://atlas-challenge.u-bourgogne.fr |
| autoPET | PET | Whole Body | 1 | 501 | https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=93258287 |
| Brain Atlas | MRI | Brain | 108 | 30 | http://brain-development.org/ |
| BrainPTM | MRI | Brain | 7 | 60 | https://brainptm-2021.grand-challenge.org/ |
| BraTS2023 GLI | MRI | Brain | 4 | 5004 | https://www.synapse.org/#!Synapse:syn51514105 |
| BraTS2023 MEN | MRI | Brain | 4 | 4000 | https://www.synapse.org/#!Synapse:syn51514106 |
| BraTS2023 MET | MRI | Brain | 4 | 951 | https://www.synapse.org/#!Synapse:syn51514107 |
| BraTS2023 PED | MRI | Brain | 4 | 396 | https://www.synapse.org/#!Synapse:syn51514108 |
| BraTS2023 SSA | MRI | Brain | 4 | 240 | https://www.synapse.org/#!Synapse:syn51514109 |
| BTCV Abdomen | CT | Abdomen | 15 | 30 | https://www.synapse.org/#!Synapse:syn3193805/wiki/217789 |
| BTCV Cervix | CT | Abdomen | 4 | 30 | https://www.synapse.org/Synapse:syn3378972 |
| CHAOS CT | CT | Abdomen | 1 | 20 | https://chaos.grand-challenge.org/ |
| CHAOS MRI | MRI | Abdomen | 5 | 60 | https://chaos.grand-challenge.org/ |
| CMRxMotion | MRI | Thorax | 4 | 138 | https://www.synapse.org/#!Synapse:syn28503327/files/ |
| Couinaud | CT | Abdomen | 10 | 161 | https://github.com/GLCUnet/dataset |
| COVID-19 CT Seg | CT | Thorax | 4 | 20 | https://github.com/JunMa11/COVID-19-CT-Seg-Benchmark |
| CrossMoDA2021 | MRI | Head and Neck | 2 | 105 | https://crossmoda.grand-challenge.org/Data/ |
| CT-ORG | CT | Whole Body | 6 | 140 | https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=61080890 |
| CTPelvic1K | CT | Lower Limb | 5 | 117 | https://zenodo.org/record/4588403#.YEyLq_0zaCo |
| DAP Atlas | CT | Whole Body | 179 | 533 | https://github.com/alexanderjaus/AtlasDataset |
| FeTA2022 | MRI | Brain | 7 | 80 | https://feta.grand-challenge.org/data-download/ |
| FLARE22 | CT | Abdomen | 15 | 50 | https://flare22.grand-challenge.org/ |
| FUMPE | CT | Thorax | 1 | 35 | https://www.kaggle.com/datasets/andrewmvd/pulmonary-embolism-in-ct-images |
| HAN Seg | CT | Head and Neck | 41 | 41 | https://zenodo.org/record/ |
| HECKTOR2022 | PET | Head and Neck | 2 | 524 | https://hecktor.grand-challenge.org/Data/ |
| INSTANCE | CT | Brain | 1 | 100 | https://instance.grand-challenge.org/Dataset/ |
| ISLES2022 | MRI | Brain | 1 | 500 | http://www.isles-challenge.org/ |
| KiPA22 | CT | Abdomen | 4 | 70 | https://kipa22.grand-challenge.org/dataset/ |
| KiTS23 | CT | Abdomen | 3 | 489 | https://github.com/neheller/kits23 |
| LAScarQS2022 Task 1 | MRI | Thorax | 2 | 60 | https://zmiclab.github.io/projects/lascarqs22/data.html |
| LAScarQS2022 Task 2 | MRI | Thorax | 1 | 130 | https://zmiclab.github.io/projects/lascarqs22/data.html |
| LNDb | CT | Thorax | 1 | 236 | https://zenodo.org/record/7153205#.Yz_oVHbMJPZ |
| LUNA16 | CT | Thorax | 1 | 888 | https://luna16.grand-challenge.org/ |
| MM-WHS CT | CT | Thorax | 9 | 40 | https://mega.nz/folder/UNMF2YYI#1cqJVzo4p_wESv9P_pc8uA |
| MM-WHS MR | MRI | Thorax | 9 | 40 | https://mega.nz/folder/UNMF2YYI#1cqJVzo4p_wESv9P_pc8uA |
| MRSpineSeg | MRI | Spine | 23 | 91 | https://www.cg.informatik.uni-siegen.de/en/spine-segmentation-and-analysis |
| MSD Cardiac | MRI | Thorax | 1 | 20 | http://medicaldecathlon.com/ |
| MSD Colon | CT | Abdomen | 1 | 126 | http://medicaldecathlon.com/ |
| MSD HepaticVessel | CT | Abdomen | 2 | 303 | http://medicaldecathlon.com/ |
| MSD Hippocampus | MRI | Brain | 3 | 260 | http://medicaldecathlon.com/ |
| MSD Liver | CT | Abdomen | 2 | 131 | http://medicaldecathlon.com/ |
| MSD Lung | CT | Thorax | 1 | 63 | http://medicaldecathlon.com/ |
| MSD Pancreas | CT | Abdomen | 2 | 281 | http://medicaldecathlon.com/ |
| MSD Prostate | MRI | Pelvis | 2 | 64 | http://medicaldecathlon.com/ |
| MSD Spleen | CT | Abdomen | 1 | 41 | http://medicaldecathlon.com/ |
| MyoPS2020 | MRI | Thorax | 6 | 135 | https://mega.nz/folder/BRdnDISQ#FnCg9ykPlTWYe5hrRZxi-w |
| NSCLC | CT | Thorax | 2 | 85 | https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=68551327 |
| Pancreas CT | CT | Abdomen | 1 | 80 | https://wiki.cancerimagingarchive.net/display/public/pancreas-ct |
| Parse2022 | CT | Thorax | 1 | 100 | https://parse2022.grand-challenge.org/Dataset/ |
| PDDCA | CT | Head and Neck | 12 | 48 | https://www.imagenglab.com/newsite/pddca/ |
| PROMISE12 | MRI | Pelvis | 1 | 50 | https://promise12.grand-challenge.org/Details/ |
| SEGA | CT | Whole Body | 1 | 56 | https://multicenteraorta.grand-challenge.org/data/ |
| SegRap2023 Task1 | CT | Head and Neck | 61 | 120 | https://segrap2023.grand-challenge.org/ |
| SegRap2023 Task2 | CT | Thorax | 2 | 120 | https://segrap2023.grand-challenge.org/ |
| SegTHOR | CT | Thorax | 4 | 40 | https://competitions.codalab.org/competitions/21145#learn_the_details |
| SKI10 | CT | Upper Limb | 4 | 99 | https://ambellan.de/sharing/QjrntLwah |
| SLIVER07 | CT | Abdomen | 1 | 20 | https://sliver07.grand-challenge.org/ |
| ToothFairy | MRI | Head and Neck | 4 | 153 | https://ditto.ing.unimore.it/toothfairy/ |
| TotalSegmentator Cardiac | CT | Whole Body | 17 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Muscles | CT | Whole Body | 31 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Organs | CT | Whole Body | 24 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Ribs | CT | Whole Body | 39 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator Vertebrae | CT | Whole Body | 29 | 1202 | https://zenodo.org/record/6802614 |
| TotalSegmentator V2 | CT | Whole Body | 24 | 1202 | https://zenodo.org/record/6802614 |
| VerSe | CT | Whole Body | 29 | 96 | https://github.com/anjany/verse |
| WMH | MRI | Brain | 1 | 170 | https://wmh.isi.uu.nl/ |
| WORD | CT | Abdomen | 18 | 150 | https://github.com/HiLab-git/WORD |

Step 2: Preprocess datasets

For each dataset, we need to find all the image and mask pairs, plus five other pieces of basic information: dataset name, modality, label names, patient IDs (to split the train-test set) and the official split (if provided).
In processor.py, we customize the processing procedure for each dataset to generate a jsonl file containing this information for each sample.
Take AbdomenCT1K for instance: you need to run the following command:

```shell
python processor.py \
  --dataset_name AbdomenCT1K \
  --root_path 'SAT-DS/data/nii/AbdomenCT-1K' \
  --jsonl_dir 'SAT-DS/data/jsonl'
```

root_path should be where you downloaded and placed the data; jsonl_dir should be where you plan to place the jsonl files.
⚠️ Note that dataset_name and the name in the table might not be exactly the same. For specific details, please refer to each processing function in processor.py.
After processing, each sample in the jsonl file will look like:

```python
{
  'image': "SAT-DS/data/nii/AbdomenCT-1K/Images/Case_00558_0000.nii.gz",
  'mask': "SAT-DS/data/nii/AbdomenCT-1K/Masks/Case_00558.nii.gz",
  'label': ["liver", "kidney", "spleen", "pancreas"],
  'modality': 'CT',
  'dataset': 'AbdomenCT1K',
  'official_split': 'unknown',
  'patient_id': 'Case_00558_0000.nii.gz',
}
```

Note that in this step, we may convert the images and masks into new NIfTI files for some datasets, such as TotalSegmentator, so it may take some time.

Shortcut to skip Step 1 and 2: Download the preprocessed and packaged data for immediate use

We offer shortcut download links for 42 datasets on Dropbox. All these datasets are preprocessed and packaged in advance. Download the datasets you need and unzip them in data/nii; each dataset is paired with a jsonl file in data/jsonl.

Step 3: Load data with unified normalization

With the generated jsonl file, a dataset is now ready to be used.
However, when mixing all the datasets to train a universal segmentation model, we need to normalize image intensity, orientation and spacing across all the datasets, and adjust labels where necessary.
We implement this by customizing the load function for each dataset in loader.py. Here is a simple demo of how to use it in your code:

```python
import json

from loader import Loader_Wrapper

loader = Loader_Wrapper()

# load samples from a jsonl file generated in Step 2
with open('SAT-DS/data/jsonl/AbdomenCT1K.jsonl', 'r') as f:
    data = [json.loads(line) for line in f.readlines()]

# load each sample with its dataset-specific load function
for sample in data:
    func_name = sample['dataset']  # see loader.py for the exact function names
    batch = getattr(loader, func_name)(sample)
    img_tensor, mc_mask, text_ls, modality, image_path, mask_path = batch
```

For each sample, whichever dataset it comes from, the loader will give output in a normalized format:

```python
img_tensor  # tensor with shape (1, H, W, D)
mc_mask     # binary tensor with shape (N, H, W, D), one channel for each class
text_ls     # a list of N class names
modality    # MRI, CT or PET
image_path  # path to the loaded image file
mask_path   # path to the loaded mask file
```
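The invariants of this normalized format can be checked on any loaded sample; here is a minimal sketch using dummy stand-ins (shapes and class names are illustrative, not from a real dataset):

```python
import torch

# Dummy stand-ins for one loaded sample (illustrative values only)
H, W, D, N = 64, 64, 32, 4
img_tensor = torch.zeros(1, H, W, D)
mc_mask = torch.zeros(N, H, W, D, dtype=torch.bool)
text_ls = ["liver", "kidney", "spleen", "pancreas"]
modality = "CT"

# Invariants of the loader's normalized output format
assert img_tensor.shape == (1, H, W, D)   # single-channel image
assert mc_mask.shape[0] == len(text_ls)   # one binary mask channel per class
assert modality in ("MRI", "CT", "PET")
```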

⚠️ Note that we may merge and adjust labels in the loader. Therefore, the output text_ls may differ from the labels you see in the input jsonl file. Here is a case where we merge `left kidney` and `right kidney` into a new label `kidney` when loading samples from CHAOS_MRI:

```python
kidney = mask[1] + mask[2]  # left kidney + right kidney
mask = torch.cat((mask, kidney.unsqueeze(0)), dim=0)
labels.append("kidney")
```

And here is another case where we adjust the annotation of `kidney` by merging in the annotations of `kidney tumor` and `kidney cyst`:

```python
mc_masks[0] += mc_masks[1]  # kidney += kidney tumor
mc_masks[0] += mc_masks[2]  # kidney += kidney cyst
```
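This kind of adjustment can be reproduced in isolation; the sketch below applies it to tiny dummy masks (the channel layout and sizes are made up for illustration):

```python
import torch

# Illustrative channel layout: 0 = kidney, 1 = kidney tumor, 2 = kidney cyst
mc_masks = torch.zeros(3, 2, 2, 2)
mc_masks[1, 0, 0, 0] = 1  # one tumor voxel
mc_masks[2, 1, 1, 1] = 1  # one cyst voxel

# Tumor and cyst regions are part of the kidney, so fold them in
mc_masks[0] += mc_masks[1]
mc_masks[0] += mc_masks[2]
assert mc_masks[0].sum() == 2  # the kidney channel now covers both voxels
```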

We also offer a shortcut to visualize and check any sample in any dataset after normalization. For example, to visualize the first sample in AbdomenCT1K.jsonl, just run the following command:

```shell
python loader.py \
  --visualization_dir 'SAT-DS/data/visualization' \
  --path2jsonl 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl' \
  --i 0
```

(Optional) Step 4: Convert to npy files

For convenience, before training SAT, we normalize all the data as in Step 3 and convert the images and segmentation masks to npy files. If you want to use our training code, run this command for each dataset:

```shell
python convert_to_npy.py \
  --jsonl2load 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl' \
  --jsonl2save 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl'
```

The converted npy files will be saved in preprocessed_npy/dataset_name, and some new information will be added to the jsonl file for convenience in loading the npy files.

(Optional) Step 5: Split train and test set

We offer the train-test split used in our paper for each dataset in json files. To follow our split and benchmark your method, simply run this command:

```shell
python train_test_split.py \
  --jsonl2split 'SAT-DS/data/jsonl/AbdomenCT1K.jsonl' \
  --train_jsonl 'SAT-DS/data/trainset_jsonl/AbdomenCT1K.jsonl' \
  --test_jsonl 'SAT-DS/data/testset_jsonl/AbdomenCT1K.jsonl' \
  --split_json 'SAT-DS/data/split_json/AbdomenCT1K.json'
```

This will split the jsonl file into train and test sets.

Or, if you want to re-split them, customize your split by listing the patient_ids in the json file (the patient_id of each sample can be found in the dataset's jsonl file):

{'train':['train_patient_id1', ...], 'test':['test_patient_id1', ...]}
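Applying such a split dict to a dataset's jsonl is straightforward; here is a minimal sketch (the sample contents are illustrative, mimicking the jsonl format from Step 2):

```python
# Hypothetical samples mimicking the jsonl format from Step 2
samples = [
    {"patient_id": "Case_0001", "dataset": "AbdomenCT1K"},
    {"patient_id": "Case_0002", "dataset": "AbdomenCT1K"},
]
split = {"train": ["Case_0001"], "test": ["Case_0002"]}

# Route each sample by its patient_id
train = [s for s in samples if s["patient_id"] in split["train"]]
test = [s for s in samples if s["patient_id"] in split["test"]]
assert len(train) == 1 and len(test) == 1
```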

(Optional) Step 6: DIY your data collection

You may want to customize the data collection used to train your model; simply merge the train jsonls of the datasets you want to involve. For example, merge the jsonls of all 72 datasets into train.jsonl, and you can use them together to train SAT with the training code in this repo.

Similarly, you can customize a benchmark with any datasets you want by merging their test jsonls.
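Since jsonl is line-delimited, merging is just concatenation. A sketch with two made-up dataset files (paths follow the layout from Step 5):

```shell
# Create two tiny illustrative per-dataset train jsonls
mkdir -p data/trainset_jsonl
printf '{"dataset":"A"}\n' > data/trainset_jsonl/A.jsonl
printf '{"dataset":"B"}\n' > data/trainset_jsonl/B.jsonl

# Concatenate them into one training collection
cat data/trainset_jsonl/*.jsonl > train.jsonl
wc -l < train.jsonl
```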

Citation

If you use this code for your research or project, please cite:

```bibtex
@article{zhao2023model,
  title={One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts},
  author={Ziheng Zhao and Yao Zhang and Chaoyi Wu and Xiaoman Zhang and Ya Zhang and Yanfeng Wang and Weidi Xie},
  year={2023},
  journal={arXiv preprint arXiv:2312.17183},
}
```

And if you use any of the datasets in SAT-DS, please cite the corresponding papers. Summarized citation information can be found in citation.bib.