
<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/icon.png" alt="Oryx icon" width=30%> </p>

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

<p align="left"> <a href='https://github.com/liuzuyan' target='_blank'>Zuyan Liu<sup>*,1,2</sup></a>&emsp; <a href='https://github.com/dongyh20/' target='_blank'>Yuhao Dong<sup>*,2,3</sup></a>&emsp; <a href='https://liuziwei7.github.io/' target='_blank'>Ziwei Liu<sup>3</sup></a>&emsp; Winston Hu<sup>2</sup>&emsp; <a href='https://scholar.google.com/citations?user=TN8uDQoAAAAJ' target='_blank'>Jiwen Lu<sup>1,&#x2709;</sup></a>&emsp; <a href='https://raoyongming.github.io/' target='_blank'>Yongming Rao<sup>2,1,&#x2709;</sup></a> </p> <p align="left"><sup>1</sup>Tsinghua University &ensp; <sup>2</sup>Tencent &ensp; <sup>3</sup>S-Lab, NTU</p> <p align="left"><sup>*</sup> Equal Contribution &ensp; <sup>&#x2709;</sup> Corresponding Author</p>

Project Page: https://oryx-mllm.github.io

arXiv Paper: https://arxiv.org/abs/2409.12961

Demo: Gradio

Model Checkpoints: oryx-checkpoints (Hugging Face)

Oryx SFT Data: oryx-SFT-Data (Hugging Face)

📢 News

๐Ÿ Introducing Oryx

Oryx is a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths. Our model achieves strong capabilities in image, video, and 3D multimodal understanding simultaneously.

Main idea of On-Demand Multimodal Understanding

<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/teaser.png" alt="teaser.png" width=90%> </p>

Overview of Oryx Architecture

<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/method.png" alt="method.png" width=90%> </p>

✅ TODO List

📃 Main Results

Results on General Temporal Understanding

<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/results1.png" alt="results1.png" width=80%> </p>

Results on Long-Form Temporal Understanding

<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/results2.png" alt="results2.png" width=80%> </p>

Results on Image Understanding

<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/results3.png" alt="results3.png" width=80%> </p>

Results on 3D Understanding

<p align="center" width="100%"> <img src="https://oryx-mllm.github.io/static/images/results4.png" alt="results4.png" width=80%> </p>

Model Zoo

We provide our checkpoints on Hugging Face:

| Model | Link | Size | Visual Encoder | LLM Type | Intermediate Model |
|---|---|---|---|---|---|
| Oryx-7B | Huggingface | 7B | Oryx-ViT | Qwen-2-7B | Oryx-7B-Image |
| Oryx-34B | Huggingface | 34B | Oryx-ViT | Yi-1.5-34B | Oryx-34B-Image |
| Oryx-1.5-7B | Huggingface | 7B | Oryx-ViT | Qwen-2.5-7B | Coming Soon |
| Oryx-1.5-32B | Huggingface | 32B | Oryx-ViT | Qwen-2.5-32B | Coming Soon |

Generation Demo

You can try generation with our Oryx model using the following steps:

1. Download the Oryx model from our huggingface collections.

2. Download the Oryx-ViT vision encoder.

3. Replace the "mm_vision_tower" path in config.json with your local path to Oryx-ViT.

4. Modify the model path and run the inference script with your own video to test our model.

python inference.py
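Step 3 above is a one-key edit of the checkpoint's config.json. A minimal sketch in Python (the helper name and the example paths are ours, not part of the repo):

```python
import json

def set_vision_tower(config_path: str, vit_path: str) -> None:
    """Rewrite the "mm_vision_tower" entry of a checkpoint's config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["mm_vision_tower"] = vit_path  # point at the local Oryx-ViT weights
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)

# Example with hypothetical paths (adjust to where you saved the files):
# set_vision_tower("Oryx-7B/config.json", "PATH/TO/Oryx-ViT")
```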

Evaluation

You can evaluate our model with the following steps:

1. Download the Oryx model from our huggingface collections.

2. Download the Oryx-ViT vision encoder.

3. Replace the "mm_vision_tower" path in config.json with your local path to Oryx-ViT.

4. Install the provided lmms-eval package:

cd ./lmms-eval
pip install -e .

5. Modify the model path and run the evaluation scripts to test our model.

bash ./scripts/eval_image.sh
bash ./scripts/eval_video.sh

Training Instructions

Installation

1. Clone this repository:

git clone https://github.com/Oryx-mllm/oryx
cd oryx

2. Install the required packages:

conda create -n oryx python=3.10 -y
conda activate oryx
pip install --upgrade pip
pip install -e .

Preparation

3. Prepare training data:

Please download the training data from our Hugging Face dataset repository.

Modify the DATA and FOLDER arguments in the training scripts to point to your download folder:

DATA="PATH/TO/Oryx-SFT-Data/data.json"
FOLDER="PATH/TO/Oryx-SFT-Data"

If you are interested in our long-form training data, you can download movienet_data.json and movienet_patch and mix an appropriate quantity (we recommend 30k samples) with the main training data.
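Assuming data.json and movienet_data.json are both JSON lists of samples (the exact schema may differ; the helper below is a sketch, not part of the repo), the mixing step could look like:

```python
import json
import random

def mix_sft_data(main_path: str, extra_path: str, out_path: str,
                 n_extra: int = 30000, seed: int = 0) -> int:
    """Mix a random subset of long-form samples into the main SFT data.

    Returns the number of samples written to out_path.
    """
    with open(main_path) as f:
        main = json.load(f)
    with open(extra_path) as f:
        extra = json.load(f)
    rng = random.Random(seed)          # fixed seed for a reproducible mix
    rng.shuffle(extra)
    mixed = main + extra[:n_extra]     # take the recommended subset
    rng.shuffle(mixed)                 # interleave main and long-form samples
    with open(out_path, "w") as f:
        json.dump(mixed, f)
    return len(mixed)
```

Point DATA at the resulting file to train on the combined mixture.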

Training

4. Train your own model:

Modify the following lines in the scripts to match your environment:

export PYTHONPATH=/PATH/TO/oryx:$PYTHONPATH
VISION_TOWER='oryx_vit:PATH/TO/oryx_vit_new.pth'
DATA="PATH/TO/Oryx-SFT-DATA/data.json"
MODEL_NAME_OR_PATH="PATH/TO/7B_MODEL"

Scripts for training Oryx-7B

bash scripts/train_oryx_7b.sh

Scripts for training Oryx-34B

bash scripts/train_oryx_34b.sh

Citation

If you find our work useful for your research and applications, please cite our paper using the following BibTeX:

@article{liu2024oryx,
  title={Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution},
  author={Liu, Zuyan and Dong, Yuhao and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
  journal={arXiv preprint arXiv:2409.12961},
  year={2024}
}

Acknowledgement