xT: Nested Tokenization for Larger Context in Large Images
Ritwik Gupta*, Shufan Li*, Tyler Zhu*, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam
Paper: https://arxiv.org/abs/2403.01915

About

xT lets you model large images end-to-end on contemporary, memory-limited GPUs. It is a simple framework for vision transformers that effectively aggregates global context with local details.
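
To make the idea of nested tokenization concrete, here is a minimal sketch. It is not the repository's implementation. It assumes a two-level split: a square image is first cut into square regions, and each region is then cut into square patches. The helper name `nested_token_boxes` and the 16-pixel patch size are illustrative assumptions; the 512 and 256 sizes mirror the 512/256 pairing listed under Resolution in Pretrained Models.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, height, width) in pixels


def nested_token_boxes(image_size: int, region_size: int, patch_size: int) -> Dict[Box, List[Box]]:
    """Map each region box to the patch boxes it contains (square image assumed)."""
    if image_size % region_size or region_size % patch_size:
        raise ValueError("sizes must divide evenly")
    nested: Dict[Box, List[Box]] = {}
    # Outer level: tile the image with regions.
    for top in range(0, image_size, region_size):
        for left in range(0, image_size, region_size):
            region = (top, left, region_size, region_size)
            # Inner level: tile this region with patches.
            nested[region] = [
                (top + dy, left + dx, patch_size, patch_size)
                for dy in range(0, region_size, patch_size)
                for dx in range(0, region_size, patch_size)
            ]
    return nested


# Hypothetical sizes: 512-pixel image, 256-pixel regions, 16-pixel patches.
boxes = nested_token_boxes(image_size=512, region_size=256, patch_size=16)
print(len(boxes))                        # 4 regions
print(len(next(iter(boxes.values()))))   # 256 patches per region
```

In this sketch, each region holds a fixed, small number of patches regardless of the full image size, which is what keeps per-region work bounded; how xT relates the regions to one another is described in the paper.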

Installation

The code has been tested on Linux on NVIDIA A100 GPUs with PyTorch 2+. We use custom CUDA kernels as implemented by the Mamba and OpenAI Triton projects. Therefore, modifications may be required to use this repository on other operating systems or GPUs.

Training

Training can be launched through ./run_submit.sh <num GPUs> <port number> config=<path to config>
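
As an illustration of the argument order, the sketch below assembles the launch command in Python. The GPU count, port, and config path are placeholder values, and `build_launch_command` is a hypothetical helper that is not part of the repository.

```python
import shlex
from typing import List


def build_launch_command(num_gpus: int, port: int, config_path: str) -> List[str]:
    """Assemble argv for ./run_submit.sh in the order given above."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 1 <= port <= 65535:
        raise ValueError("port must be in 1..65535")
    return ["./run_submit.sh", str(num_gpus), str(port), f"config={config_path}"]


# Placeholder values: 4 GPUs, port 29500, and a made-up config path.
cmd = build_launch_command(4, 29500, "path/to/config.yaml")
print(shlex.join(cmd))
# To actually launch from the repository root: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting surprises if the config path contains spaces.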

We also provide SubmitIt scripts in launch_scripts to submit training jobs on Slurm clusters.

Pretrained Models

Weights and configs for our experiments are available on Hugging Face.

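| Name | Resolution | Top-1 Acc | Params | Memory (GB) | Throughput (regions/s) |
|---|---|---|---|---|---|
| Swin-T | 256 | 53.76 | 31M | 0.30 | 76.43 |
| Swin-T <xT> Hyper | 256/256 | 52.93 | 47M | 0.31 | 47.81 |
| Swin-T <xT> Hyper | 512/256 | 60.56 | 47M | 0.29 | 88.28 |
| Swin-T <xT> XL | 512/256 | 58.92 | 47M | 0.17 | 80.00 |
| Swin-T <xT> Mamba | 512/256 | 61.97 | 44M | 0.29 | 84.77 |
| Swin-S | 256 | 58.45 | 52M | 0.46 | 44.44 |
| Swin-S <xT> Hyper | 256/256 | 57.04 | 69M | 0.46 | 39.80 |
| Swin-S <xT> Hyper | 512/256 | 63.62 | 69M | 0.46 | 41.45 |
| Swin-S <xT> XL | 512/256 | 62.68 | 69M | 0.23 | 36.36 |
| Swin-B | 256 | 58.57 | 92M | 0.50 | 36.14 |
| Swin-B <xT> Hyper | 256/256 | 55.52 | 107M | 0.61 | 29.85 |
| Swin-B <xT> Hyper | 512/256 | 64.08 | 107M | 0.74 | 24.00 |
| Swin-B <xT> XL | 512/256 | 62.09 | 107M | 0.39 | 41.03 |
| Swin-B <xT> Mamba | 512/256 | 63.73 | 103M | 0.58 | 29.09 |
| Swin-L | 256 | 68.78 | 206M | 0.84 | 17.02 |
| Swin-L <xT> Hyper | 256/256 | 67.84 | 215M | 1.06 | 16.08 |
| Swin-L <xT> Hyper | 512/256 | 72.42 | 215M | 1.03 | 16.58 |
| Swin-L <xT> XL | 512/256 | 73.47 | 215M | 0.53 | 14.10 |
| Swin-L <xT> Mamba | 512/256 | 73.36 | 212M | 1.03 | 15.61 |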
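
As a worked example of reading the table, this short snippet computes the Top-1 gain of each 512/256 xT variant over its plain Swin baseline. The numbers are copied from the rows above; only the Swin-T and Swin-L groups are included for brevity.

```python
# Top-1 accuracy copied from the table: baseline Swin vs. 512/256 xT variants.
baseline = {"Swin-T": 53.76, "Swin-L": 68.78}
xt_512 = {
    "Swin-T": {"Hyper": 60.56, "XL": 58.92, "Mamba": 61.97},
    "Swin-L": {"Hyper": 72.42, "XL": 73.47, "Mamba": 73.36},
}

# Gain in Top-1 points of each variant over the baseline with the same backbone.
gains = {
    backbone: {name: round(acc - baseline[backbone], 2) for name, acc in variants.items()}
    for backbone, variants in xt_512.items()
}

for backbone, per_variant in gains.items():
    best = max(per_variant, key=per_variant.get)
    print(f"{backbone}: best variant {best}, +{per_variant[best]:.2f} Top-1")
# Swin-T: best variant Mamba, +8.21 Top-1
# Swin-L: best variant XL, +4.69 Top-1
```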

Citation

@article{xTLargeImageModeling,
  title={xT: Nested Tokenization for Larger Context in Large Images},
  author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
  journal={arXiv preprint arXiv:2403.01915},
  year={2024}
}