The first GPT-style general vision model that unifies various vision tasks with only a vanilla ViT. No negative transfer.


This repo is the official implementation of the ECCV 2024 <font color=Red>Oral</font> paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant on only minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$

<div align="center"> <img src="assets/Figure1.png" width="800"/> </div>

📣 News

💫 What we want to do

The Model Architectures across various AI domains are converging towards <font color=Red>Multi-Layer Plain Transformers</font>.

Reducing Human Bias in Model Architecture Design

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like point clouds and graphs.

🤔 What we achieve

Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

Overview

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | log | config |
|---|---|---|---|---|---|---|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---|---|---|---|---|---|---|---|---|---|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |

Task Synergy in Multi-Tasking Training

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---|---|---|---|---|---|---|
| GiT-B<sub>single-task</sub> | 131M | 45.1 | 31.4 | 47.7 | 33.7 | 83.3 |
| Improvement | | +1.6 | +0.5 | +0.1 | +1.6 | +2.5 |
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 |

Zero-shot Benchmark

| Model | Params | Cityscapes<br>(Det) | Cityscapes<br>(Ins Seg) | Cityscapes<br>(Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config |
|---|---|---|---|---|---|---|---|---|---|
| GiT-B<sub>multi-task</sub> | 131M | 21.8 | 14.3 | 34.4 | 30.9 | 9.2 | ckpt | log | config |
| GiT-B<sub>universal</sub> | 131M | 29.1 | 17.9 | 56.2 | 37.5 | 10.6 | ckpt | log | config |
| GiT-L<sub>universal</sub> | 387M | 32.3 | 20.3 | 58.0 | 39.9 | 11.6 | ckpt | log | config |
| GiT-H<sub>universal</sub> | 756M | 34.1 | 18.7 | 61.8 | 42.5 | 12.6 | ckpt | log | config |

Few-shot Benchmark

| Model | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config |
|---|---|---|---|---|---|---|---|
| GiT-B<sub>multi-task</sub> | 131M | 34.3 | 24.9 | 19.1 | 17.4 | 23.0 | config |
| GiT-B<sub>universal</sub> | 131M | 51.1 | 30.8 | 30.6 | 31.2 | 38.3 | config |
| GiT-L<sub>universal</sub> | 387M | 55.4 | 34.1 | 37.2 | 33.4 | 49.3 | config |
| GiT-H<sub>universal</sub> | 756M | 57.9 | 35.1 | 43.4 | 34.0 | 52.2 | config |

🛠️ Quick Start

Installation

```shell
conda create -n GiT python=3.8
conda activate GiT

# We only tested with torch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"

git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt

# If you run into a ChildFailedError, update yapf.
pip install yapf==0.40.1
```
The BERT embedding files should be placed in the GiT root directory as follows:

```
GiT
|──bert_embed.pt
|──bert_embed_large.pt
|──bert_embed_huge.pt
```

```shell
# The current path is ./GiT; install the LVIS API from its parent directory.
cd ..
pip install git+https://github.com/lvis-dataset/lvis-api.git
```
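
As an optional sanity check of the environment, the commands below only import the packages pinned above and print their versions:

```shell
# Optional sanity check: print the versions of the core dependencies installed above.
python -c "import torch; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"
python -c "import mmengine, mmcv, transformers; print('mmengine', mmengine.__version__, '| mmcv', mmcv.__version__, '| transformers', transformers.__version__)"
```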

Dataset Preparation

Multi-Tasking Dataset

The multi-tasking benchmark contains COCO 2017 for object detection and instance segmentation, ADE20K for semantic segmentation, COCO Caption for image captioning, and the RefCOCO series for visual grounding.

```
GiT
|──data
|  |──ade
|  |  |──ADEChallengeData2016
|  |  |  |──annotations
|  |  |  |  |──training & validation
|  |  |  |──images
|  |  |  |  |──training & validation
|  |  |  |──objectInfo150.txt
|  |  |  |──sceneCategories.txt
|  |──coco
|  |  |──annotations
|  |  |  |──*.json
|  |  |──train2017
|  |  |  |──*.jpg
|  |  |──val2017
|  |  |  |──*.jpg
|  |──coco_2014
|  |  |──annotations
|  |  |  |──*.json
|  |  |  |──coco_karpathy_test.json
|  |  |  |──coco_karpathy_train.json
|  |  |  |──coco_karpathy_val_gt.json
|  |  |  |──coco_karpathy_val.json
|  |  |──train2014
|  |  |  |──*.jpg
|  |  |──val2014
|  |  |  |──*.jpg
|  |  |──refcoco
|  |  |  |──*.p
```
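
As an illustration, the commands below populate the coco folder above with COCO 2017 from the official cocodataset.org links; the remaining datasets (ADE20K, COCO Caption 2014, RefCOCO) follow their own download instructions and are not covered by this sketch:

```shell
# Example only: download COCO 2017 images and annotations into data/coco
# using the official cocodataset.org links.
mkdir -p data/coco && cd data/coco
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q train2017.zip && unzip -q val2017.zip && unzip -q annotations_trainval2017.zip
rm *.zip
cd ../..
```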

Universal Dataset

We use 27 datasets in universal training. For more details about dataset preparation, please refer to here.

<div align="center"> <img src="assets/universal.png" width="800"/> </div> <br>

🚨 We only list part of the commands (GiT-B) below. For more detailed commands, please refer to here.

Training

Single Task

Detection

```shell
bash tools/dist_train.sh configs/GiT/single_detection_base.py ${GPU_NUM} --work-dir ${work_dir}
```
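
For example, a run on 8 GPUs writing outputs under ./work_dirs might look as follows; the GPU count and work directory are placeholders, not fixed values from this repo:

```shell
# Example invocation: 8 GPUs, outputs written to ./work_dirs/single_detection_base (both are placeholders).
bash tools/dist_train.sh configs/GiT/single_detection_base.py 8 --work-dir ./work_dirs/single_detection_base
```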

Multi Task

GiT-B

```shell
bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Universal Training

GiT-B

```shell
bash tools/dist_train.sh configs/GiT/universal_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Testing

Single Task

Detection

```shell
bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Multi Task

GiT-B

```shell
bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Zero-shot and Few-shot

Please download the universal pretraining weights from Hugging Face and organize the files as follows:

```
GiT
|──universal_base.pth
|──universal_large.pth
|──universal_huge.pth
```
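
One possible way to fetch these weights is through the Hugging Face CLI, as sketched below; the repository id is an assumption based on the checkpoint links used for this project and should be verified before use:

```shell
# Example download via the Hugging Face CLI (repo id assumed to be kanashi6/GiT; verify before use).
pip install -U "huggingface_hub[cli]"
huggingface-cli download kanashi6/GiT universal_base.pth --local-dir .
huggingface-cli download kanashi6/GiT universal_large.pth --local-dir .
huggingface-cli download kanashi6/GiT universal_huge.pth --local-dir .
```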

Zero-shot

```shell
bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```
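
As a concrete instance, evaluating the downloaded GiT-B universal checkpoint on Cityscapes detection could look like the command below; the GPU count and work directory are placeholders:

```shell
# Example: zero-shot Cityscapes detection with the base universal checkpoint (GPU count and work dir are placeholders).
bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py universal_base.pth 8 --work-dir ./work_dirs/zero_shot_cityscapes_det
```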

Few-shot

```shell
bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Customize Dataset

If you want to use GiT on your own dataset, please see here for more details.

🚀 Lightweight Version

If your GPU memory is insufficient, you can reduce the input resolution as done here, where we lower the detection resolution to 672. This setting requires about 20 hours of training and reaches about 41.5 mAP.

👀 Todo

👍 Acknowledgement

📘 Citation

Please consider citing our work as follows if you find it helpful.

```bibtex
@inproceedings{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  booktitle={ECCV},
  year={2024}
}
```

✨ Star History

Star History Chart
