<div align="center"> <img src="utils/logo_white_bg.jpg" alt="Touchstone Benchmark" width="300"> </div> <h1 align="center" style="font-size: 60px; margin-bottom: 4px">Touchstone Benchmark</h1> <p align="center"> <a href='https://www.cs.jhu.edu/~zongwei/advert/TutorialBenchmarkV1.pdf'> <img src='https://img.shields.io/badge/Participate%20--%20Touchstone%201.0-%233F51B5?style=for-the-badge' alt='Participate - Touchstone 1.0'> </a> <a href='https://www.cs.jhu.edu/~zongwei/advert/Call4Benchmark.pdf'> <img src='https://img.shields.io/badge/Participate%20--%20Touchstone%202.0-%23F44336?style=for-the-badge' alt='Participate - Touchstone 2.0'> </a> <br/> <a href="https://github.com/MrGiovanni/Touchstone"> <img src="https://img.shields.io/github/stars/MrGiovanni/Touchstone?style=social" alt='GitHub Stars' /> </a> <a href="https://twitter.com/bodymaps317"> <img src="https://img.shields.io/twitter/follow/BodyMaps?style=social" alt="Follow on Twitter" /> </a> </p>

We present Touchstone, a large-scale medical segmentation benchmark based on 5,195 annotated CT volumes from 76 hospitals for training and 6,933 CT volumes from 8 additional hospitals for testing. We invite AI inventors to train their models on AbdomenAtlas, and we independently evaluate their algorithms. We have already collaborated with 14 influential research teams, and we continue to accept new submissions.

> [!NOTE]
> - Training set
> - Test set

Touchstone 1.0 Leaderboard

<b>Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?</b> <br/> Pedro R. A. S. Bassi<sup>1</sup>, Wenxuan Li<sup>1</sup>, Yucheng Tang<sup>2</sup>, Fabian Isensee<sup>3</sup>, ..., Alan Yuille<sup>1</sup>, Zongwei Zhou<sup>1</sup> <br/> <sup>1</sup>Johns Hopkins University, <sup>2</sup>NVIDIA, <sup>3</sup>DKFZ <br/> NeurIPS 2024 <br/> <a href='https://www.zongweiz.com/dataset'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://www.cs.jhu.edu/~alanlab/Pubs24/bassi2024touchstone.pdf'><img src='https://img.shields.io/badge/Paper-PDF-purple'></a> <a href='document/jhu_seminar_slides.pdf'><img src='https://img.shields.io/badge/Slides-Seminar-orange'></a> <a href='document/rsna2024_abstract.pdf'><img src='https://img.shields.io/badge/Abstract-RSNA-purple'></a> <a href='document/rsna2024_slides.pdf'><img src='https://img.shields.io/badge/Slides-RSNA-orange'></a>

| rank | model | organization | average DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | MedNeXt | DKFZ | 89.2 | arXiv | GitHub stars |
| 🏆 | STU-Net-B | Shanghai AI Lab | 89.0 | arXiv | GitHub stars |
| 🏆 | MedFormer | Rutgers | 89.0 | arXiv | GitHub stars |
| 🏆 | nnU-Net ResEncL | DKFZ | 88.8 | arXiv | GitHub stars |
| 🏆 | UniSeg | NPU | 88.8 | arXiv | GitHub stars |
| 🏆 | Diff-UNet | HKUST | 88.5 | arXiv | GitHub stars |
| 🏆 | LHU-Net | UR | 88.0 | arXiv | GitHub stars |
| 🏆 | NexToU | HIT | 87.8 | arXiv | GitHub stars |
| 9 | SegVol | BAAI | 87.1 | arXiv | GitHub stars |
| 10 | U-Net & CLIP | CityU | 87.1 | arXiv | GitHub stars |
| 11 | Swin UNETR & CLIP | CityU | 86.7 | arXiv | GitHub stars |
| 12 | Swin UNETR | NVIDIA | 80.1 | arXiv | GitHub stars |
| 13 | UNesT | NVIDIA | 79.1 | arXiv | GitHub stars |
| 14 | SAM-Adapter | Duke | 73.4 | arXiv | GitHub stars |
| 15 | UNETR | NVIDIA | 64.4 | arXiv | GitHub stars |
<details> <summary style="margin-left: 25px;">Aorta - NexToU 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | NexToU | HIT | 86.4 | arXiv | GitHub stars |
| 2 | MedNeXt | DKFZ | 83.1 | arXiv | GitHub stars |
| 3 | UniSeg | NPU | 82.3 | arXiv | GitHub stars |
| 4 | STU-Net-B | Shanghai AI Lab | 82.1 | arXiv | GitHub stars |
| 5 | nnU-Net ResEncL | DKFZ | 81.4 | arXiv | GitHub stars |
| 6 | Diff-UNet | HKUST | 81.2 | arXiv | GitHub stars |
| 7 | Swin UNETR | NVIDIA | 81.1 | arXiv | GitHub stars |
| 8 | SegVol | BAAI | 80.2 | arXiv | GitHub stars |
| 9 | UNesT | NVIDIA | 78.6 | arXiv | GitHub stars |
| 10 | Swin UNETR & CLIP | CityU | 78.1 | arXiv | GitHub stars |
| 11 | U-Net & CLIP | CityU | 77.1 | arXiv | GitHub stars |
| 12 | SAM-Adapter | Duke | 62.8 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 52.1 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">Gallbladder - STU-Net-B & MedNeXt 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | STU-Net-B | Shanghai AI Lab | 85.5 | arXiv | GitHub stars |
| 🏆 | MedNeXt | DKFZ | 85.3 | arXiv | GitHub stars |
| 3 | nnU-Net ResEncL | DKFZ | 84.9 | arXiv | GitHub stars |
| 4 | UniSeg | NPU | 84.7 | arXiv | GitHub stars |
| 5 | Diff-UNet | HKUST | 83.8 | arXiv | GitHub stars |
| 6 | NexToU | HIT | 82.3 | arXiv | GitHub stars |
| 7 | U-Net & CLIP | CityU | 82.1 | arXiv | GitHub stars |
| 8 | Swin UNETR & CLIP | CityU | 80.2 | arXiv | GitHub stars |
| 9 | SegVol | BAAI | 79.3 | arXiv | GitHub stars |
| 10 | Swin UNETR | NVIDIA | 69.2 | arXiv | GitHub stars |
| 11 | UNesT | NVIDIA | 62.1 | arXiv | GitHub stars |
| 12 | SAM-Adapter | Duke | 49.4 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 43.8 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">KidneyL - Diff-UNet 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | Diff-UNet | HKUST | 91.9 | arXiv | GitHub stars |
| 2 | nnU-Net ResEncL | DKFZ | 91.9 | arXiv | GitHub stars |
| 3 | STU-Net-B | Shanghai AI Lab | 91.9 | arXiv | GitHub stars |
| 4 | MedNeXt | DKFZ | 91.8 | arXiv | GitHub stars |
| 5 | SegVol | BAAI | 91.8 | arXiv | GitHub stars |
| 6 | UniSeg | NPU | 91.5 | arXiv | GitHub stars |
| 7 | U-Net & CLIP | CityU | 91.1 | arXiv | GitHub stars |
| 8 | Swin UNETR & CLIP | CityU | 91.0 | arXiv | GitHub stars |
| 9 | NexToU | HIT | 89.6 | arXiv | GitHub stars |
| 10 | SAM-Adapter | Duke | 87.3 | arXiv | GitHub stars |
| 11 | Swin UNETR | NVIDIA | 85.5 | arXiv | GitHub stars |
| 12 | UNesT | NVIDIA | 85.4 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 63.7 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">KidneyR - Diff-UNet 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | Diff-UNet | HKUST | 92.8 | arXiv | GitHub stars |
| 2 | MedNeXt | DKFZ | 92.6 | arXiv | GitHub stars |
| 3 | nnU-Net ResEncL | DKFZ | 92.6 | arXiv | GitHub stars |
| 4 | STU-Net-B | Shanghai AI Lab | 92.5 | arXiv | GitHub stars |
| 5 | SegVol | BAAI | 92.5 | arXiv | GitHub stars |
| 6 | UniSeg | NPU | 92.2 | arXiv | GitHub stars |
| 7 | U-Net & CLIP | CityU | 91.9 | arXiv | GitHub stars |
| 8 | Swin UNETR & CLIP | CityU | 91.7 | arXiv | GitHub stars |
| 9 | SAM-Adapter | Duke | 90.4 | arXiv | GitHub stars |
| 10 | NexToU | HIT | 90.1 | arXiv | GitHub stars |
| 11 | UNesT | NVIDIA | 83.6 | arXiv | GitHub stars |
| 12 | Swin UNETR | NVIDIA | 81.7 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 69.6 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">Liver - MedNeXt 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | MedNeXt | DKFZ | 96.3 | arXiv | GitHub stars |
| 2 | nnU-Net ResEncL | DKFZ | 96.3 | arXiv | GitHub stars |
| 3 | Diff-UNet | HKUST | 96.2 | arXiv | GitHub stars |
| 4 | STU-Net-B | Shanghai AI Lab | 96.2 | arXiv | GitHub stars |
| 5 | UniSeg | NPU | 96.1 | arXiv | GitHub stars |
| 6 | U-Net & CLIP | CityU | 96.0 | arXiv | GitHub stars |
| 7 | SegVol | BAAI | 96.0 | arXiv | GitHub stars |
| 8 | Swin UNETR & CLIP | CityU | 95.8 | arXiv | GitHub stars |
| 9 | NexToU | HIT | 95.7 | arXiv | GitHub stars |
| 10 | SAM-Adapter | Duke | 94.1 | arXiv | GitHub stars |
| 11 | UNesT | NVIDIA | 93.6 | arXiv | GitHub stars |
| 12 | Swin UNETR | NVIDIA | 93.5 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 90.5 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">Pancreas - MedNeXt 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | MedNeXt | DKFZ | 83.3 | arXiv | GitHub stars |
| 2 | STU-Net-B | Shanghai AI Lab | 83.2 | arXiv | GitHub stars |
| 3 | nnU-Net ResEncL | DKFZ | 82.9 | arXiv | GitHub stars |
| 4 | UniSeg | NPU | 82.7 | arXiv | GitHub stars |
| 5 | Diff-UNet | HKUST | 81.9 | arXiv | GitHub stars |
| 6 | U-Net & CLIP | CityU | 80.8 | arXiv | GitHub stars |
| 7 | Swin UNETR & CLIP | CityU | 80.2 | arXiv | GitHub stars |
| 8 | NexToU | HIT | 80.2 | arXiv | GitHub stars |
| 9 | SegVol | BAAI | 79.1 | arXiv | GitHub stars |
| 10 | Swin UNETR | NVIDIA | 68.5 | arXiv | GitHub stars |
| 11 | UNesT | NVIDIA | 68.3 | arXiv | GitHub stars |
| 12 | UNETR | NVIDIA | 55.1 | arXiv | GitHub stars |
| 13 | SAM-Adapter | Duke | 50.2 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">Postcava - STU-Net-B & MedNeXt 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | STU-Net-B | Shanghai AI Lab | 81.3 | arXiv | GitHub stars |
| 🏆 | MedNeXt | DKFZ | 81.3 | arXiv | GitHub stars |
| 3 | UniSeg | NPU | 81.2 | arXiv | GitHub stars |
| 4 | Diff-UNet | HKUST | 80.8 | arXiv | GitHub stars |
| 5 | nnU-Net ResEncL | DKFZ | 80.5 | arXiv | GitHub stars |
| 6 | U-Net & CLIP | CityU | 78.5 | arXiv | GitHub stars |
| 7 | NexToU | HIT | 78.1 | arXiv | GitHub stars |
| 8 | SegVol | BAAI | 77.8 | arXiv | GitHub stars |
| 9 | Swin UNETR & CLIP | CityU | 76.8 | arXiv | GitHub stars |
| 10 | Swin UNETR | NVIDIA | 69.9 | arXiv | GitHub stars |
| 11 | UNesT | NVIDIA | 66.2 | arXiv | GitHub stars |
| 12 | UNETR | NVIDIA | 53.9 | arXiv | GitHub stars |
| 13 | SAM-Adapter | Duke | 48.0 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">Spleen - nnU-Net ResEncL 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | nnU-Net ResEncL | DKFZ | 95.2 | arXiv | GitHub stars |
| 2 | MedNeXt | DKFZ | 95.2 | arXiv | GitHub stars |
| 3 | STU-Net-B | Shanghai AI Lab | 95.1 | arXiv | GitHub stars |
| 4 | Diff-UNet | HKUST | 95.0 | arXiv | GitHub stars |
| 5 | UniSeg | NPU | 94.9 | arXiv | GitHub stars |
| 6 | NexToU | HIT | 94.7 | arXiv | GitHub stars |
| 7 | SegVol | BAAI | 94.5 | arXiv | GitHub stars |
| 8 | U-Net & CLIP | CityU | 94.3 | arXiv | GitHub stars |
| 9 | Swin UNETR & CLIP | CityU | 94.1 | arXiv | GitHub stars |
| 10 | SAM-Adapter | Duke | 90.5 | arXiv | GitHub stars |
| 11 | Swin UNETR | NVIDIA | 87.9 | arXiv | GitHub stars |
| 12 | UNesT | NVIDIA | 86.7 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 76.5 | arXiv | GitHub stars |
</div> </details> <details> <summary style="margin-left: 25px;">Stomach - STU-Net-B & MedNeXt & nnU-Net ResEncL 🏆 </summary> <div style="margin-left: 25px;">

| rank | model | organization | DSC | paper | github |
|:---:|:---|:---|:---:|:---:|:---:|
| 🏆 | STU-Net-B | Shanghai AI Lab | 93.5 | arXiv | GitHub stars |
| 🏆 | MedNeXt | DKFZ | 93.5 | arXiv | GitHub stars |
| 🏆 | nnU-Net ResEncL | DKFZ | 93.4 | arXiv | GitHub stars |
| 4 | UniSeg | NPU | 93.3 | arXiv | GitHub stars |
| 5 | Diff-UNet | HKUST | 93.1 | arXiv | GitHub stars |
| 6 | NexToU | HIT | 92.7 | arXiv | GitHub stars |
| 7 | SegVol | BAAI | 92.5 | arXiv | GitHub stars |
| 8 | U-Net & CLIP | CityU | 92.4 | arXiv | GitHub stars |
| 9 | Swin UNETR & CLIP | CityU | 92.2 | arXiv | GitHub stars |
| 10 | SAM-Adapter | Duke | 88.0 | arXiv | GitHub stars |
| 11 | UNesT | NVIDIA | 87.6 | arXiv | GitHub stars |
| 12 | Swin UNETR | NVIDIA | 84.1 | arXiv | GitHub stars |
| 13 | UNETR | NVIDIA | 74.2 | arXiv | GitHub stars |
</div> </details>

In-depth Result Analysis

<div align="center"> <img src="utils/JHHAnalysiswlCaptions.png" alt="JHH Analysis" width="1200"> </div> <details> <summary style="margin-left: 25px;">* </summary> <div style="margin-left: 25px;">

Each cell in the significance heatmap above reports the outcome of a one-sided statistical test. Red indicates that the AI algorithm on the x-axis is significantly superior to the algorithm on the y-axis in terms of DSC for one organ.

</div> </details>
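The one-sided comparisons behind these heatmaps can be illustrated with a minimal sketch. The snippet below uses a paired Wilcoxon signed-rank test on synthetic per-scan DSC values for two hypothetical algorithms; it is only an illustration of the idea, not the repository's analysis code, and the exact test used in the paper may differ.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-scan DSC for two hypothetical algorithms A and B,
# where A is consistently about one point better than B.
dsc_a = rng.normal(0.89, 0.03, size=100)
dsc_b = dsc_a - rng.normal(0.01, 0.005, size=100)

# One-sided paired test: is A significantly superior to B?
stat, p = wilcoxon(dsc_a, dsc_b, alternative="greater")
print(p < 0.05)  # a significant result would color this cell red
```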

We provide DSC and NSD per CT scan for each checkpoint in test sets #2 and #3, along with a code tutorial for easy analysis.

You can easily modify our code to compare your custom model to our checkpoints, or to analyze segmentation performance in custom demographic groups (e.g., Hispanic men aged 20-25).
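For reference, the DSC used throughout the leaderboard is the standard Dice Similarity Coefficient. A minimal NumPy sketch on toy binary masks (real masks are 3D CT label maps):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

# Toy 2D example
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True            # 4 predicted voxels
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:4] = True              # 6 ground-truth voxels, 4 overlapping
print(round(dice(pred, gt), 3))  # 2*4 / (4+6) = 0.8
```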

<details> <summary style="margin-left: 25px;">Code tutorial </summary> <div style="margin-left: 25px;">

Per-sample results are in CSV files inside the folders totalsegmentator_results and dapatlas_results.

<details> <summary style="margin-left: 25px;">File structure </summary> <div style="margin-left: 25px;">
```
totalsegmentator_results
    ├── Diff-UNet
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── LHU-Net
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── MedNeXt
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── ...
dapatlas_results
    ├── Diff-UNet
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── LHU-Net
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── MedNeXt
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── ...
```
</div> </details>
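Assuming each dsc.csv holds one row per CT scan and one column per organ (the column names below are illustrative, not guaranteed to match the real files), the per-sample results can be aggregated with pandas along these lines:

```python
import io
import pandas as pd

# Hypothetical excerpt of one model's dsc.csv: one row per CT scan,
# one column per organ (actual column names may differ).
csv_text = """case_id,liver,spleen,pancreas
s0001,0.96,0.95,0.83
s0002,0.95,0.94,0.80
s0003,0.97,0.96,0.85
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="case_id")

per_organ_mean = df.mean()        # average DSC per organ
per_scan_mean = df.mean(axis=1)   # average DSC per scan
print(per_organ_mean.round(3))
```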

1. Clone the GitHub repository

```bash
git clone https://github.com/MrGiovanni/Touchstone
cd Touchstone
```

2. Create environments

```bash
conda env create -f environment.yml
source activate touchstone
python -m ipykernel install --user --name touchstone --display-name "touchstone"
```

3. Reproduce analysis figures in our paper

Figure 1 - Dataset statistics:

```bash
cd notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone TotalSegmentatorMetadata.ipynb
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone DAPAtlasMetadata.ipynb
#results: plots are saved inside Touchstone/outputs/plotsTotalSegmentator/ and Touchstone/outputs/plotsDAPAtlas/
```

Figure 2 - Potential confounders significantly impact AI performance:

```bash
cd ../plot
python AggregatedBoxplot.py --stats
#results: Touchstone/outputs/summary_groups.pdf
```

Appendix D.2.3 - Statistical significance maps:

```bash
python PlotAllSignificanceMaps.py
python PlotAllSignificanceMaps.py --organs second_half
python PlotAllSignificanceMaps.py --nsd
python PlotAllSignificanceMaps.py --organs second_half --nsd
#results: Touchstone/outputs/heatmaps
```

Appendix D.4 and D.5 - Box-plots for per-group and per-organ results, with statistical tests:

```bash
cd ../notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone GroupAnalysis.ipynb
#results: Touchstone/outputs/box_plots
```

4. Custom Analysis

<details> <summary style="margin-left: 25px;">Define custom demographic groups (e.g., hispanic men aged 20-25) and compare AI performance on them </summary> <div style="margin-left: 25px;">

The CSV result files in totalsegmentator_results/ and dapatlas_results/ contain per-sample DSC and NSD scores. Rich metadata for each of those samples (sex, age, scanner, diagnosis, ...) is available in metaTotalSeg.csv and 'Clinical Metadata FDG PET_CT Lesions.csv', for TotalSegmentator and DAP Atlas, respectively. The code in TotalSegmentatorMetadata.ipynb and DAPAtlasMetadata.ipynb extracts this metadata into simplified group lists (e.g., a list of all samples representing male patients) and saves these lists in the folders plotsTotalSegmentator/ and plotsDAPAtlas/. You can modify the code to generate custom sample lists (e.g., all men aged 30-35). To compare a set of groups, the filenames of all lists in the set should begin with the same prefix. For example, comp1_list_a.pt, comp1_list_b.pt, and comp1_list_c.pt can represent a set of 3 groups. Then, PlotGroup.py can draw boxplots and perform statistical tests comparing the AI algorithms' results (DSC and NSD) for the samples in the different custom lists you created. In our example, you just need to specify --group_name comp1 when running PlotGroup.py:

```bash
python utils/PlotGroup.py --ckpt_root totalsegmentator_results/ --group_root outputs/plotsTotalSegmentator/ --group_name comp1 --organ liver --stats
```
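Since the group lists are stored as .pt files, they can be written as plain Python lists of sample identifiers serialized with torch. A sketch under that assumption (the sample IDs and group definitions below are made up for illustration):

```python
import torch

# Hypothetical sample IDs for two demographic groups to compare.
# All filenames share the prefix "comp1" so PlotGroup.py can find
# every list in the set via --group_name comp1.
group_a = ["s0001", "s0007", "s0042"]   # e.g., men aged 30-35
group_b = ["s0003", "s0019", "s0027"]   # e.g., women aged 30-35

torch.save(group_a, "comp1_list_a.pt")
torch.save(group_b, "comp1_list_b.pt")

# Lists of strings round-trip unchanged.
print(torch.load("comp1_list_a.pt"))
```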
</div> </details> </div> </details>

Citation

Please cite the following papers if you find our leaderboard or dataset helpful.

```bibtex
@article{bassi2024touchstone,
  title={Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?},
  author={Bassi, Pedro RAS and Li, Wenxuan and Tang, Yucheng and Isensee, Fabian and Wang, Zifu and Chen, Jieneng and Chou, Yu-Cheng and Kirchhoff, Yannick and Rokuss, Maximilian and Huang, Ziyan and Ye, Jin and He, Junjun and Wald, Tassilo and Ulrich, Constantin and Baumgartner, Michael and Roy, Saikat and Maier-Hein, Klaus H. and Jaeger, Paul and Ye, Yiwen and Xie, Yutong and Zhang, Jianpeng and Chen, Ziyang and Xia, Yong and Xing, Zhaohu and Zhu, Lei and Sadegheih, Yousef and Bozorgpour, Afshin and Kumari, Pratibha and Azad, Reza and Merhof, Dorit and Shi, Pengcheng and Ma, Ting and Du, Yuxin and Bai, Fan and Huang, Tiejun and Zhao, Bo and Wang, Haonan and Li, Xiaomeng and Gu, Hanxue and Dong, Haoyu and Yang, Jichen and Mazurowski, Maciej A. and Gupta, Saumya and Wu, Linshan and Zhuang, Jiaxin and Chen, Hao and Roth, Holger and Xu, Daguang and Blaschko, Matthew B. and Decherchi, Sergio and Cavalli, Andrea and Yuille, Alan L. and Zhou, Zongwei},
  journal={Conference on Neural Information Processing Systems},
  year={2024},
  url={https://github.com/MrGiovanni/Touchstone}
}

@article{li2024abdomenatlas,
  title={AbdomenAtlas: A large-scale, detailed-annotated, \& multi-center dataset for efficient transfer learning and open algorithmic benchmarking},
  author={Li, Wenxuan and Qu, Chongyu and Chen, Xiaoxi and Bassi, Pedro RAS and Shi, Yijia and Lai, Yuxiang and Yu, Qian and Xue, Huimin and Chen, Yixiong and Lin, Xiaorui and others},
  journal={Medical Image Analysis},
  pages={103285},
  year={2024},
  publisher={Elsevier}
}

@inproceedings{li2024well,
  title={How Well Do Supervised Models Transfer to 3D Image Segmentation?},
  author={Li, Wenxuan and Yuille, Alan and Zhou, Zongwei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}

@article{qu2023abdomenatlas,
  title={AbdomenAtlas-8K: Annotating 8,000 CT volumes for multi-organ segmentation in three weeks},
  author={Qu, Chongyu and Zhang, Tiezheng and Qiao, Hualin and Tang, Yucheng and Yuille, Alan L and Zhou, Zongwei and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2023}
}
```

Acknowledgement

This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and the McGovern Foundation. Paper content is covered by patents pending.

<div align="center"> <img src="utils/partner_logo.jpg" alt="Touchstone Benchmark" width="1200"> </div>