Awesome
Awesome Unified Multimodal Models <!-- omit in toc -->
This is a repository for organizing papers, codes and other resources related to unified multimodal models.
<p align="center"> <img src="assets/unified_model.webp" alt="TAX" style="display: block; margin: 0 auto;" width="480px" /> </p>:thinking: What are unified multimodal models?
Traditional multimodal models can be broadly categorized into two types: multimodal understanding and multimodal generation. Unified multimodal models aim to integrate these two tasks within a single framework. Such models are also referred to as Any-to-Any generation in the community. These models operate on the principle of multimodal input and multimodal output, enabling them to process and generate content across various modalities seamlessly.
:high_brightness: This project is still on-going, pull requests are welcomed!!
If you have any suggestions (missing papers, new papers, or typos), please feel free to edit and pull a request. Just letting us know the title of papers can also be a great contribution to us. You can do this by open issue or contact us directly via email.
:star: If you find this repo useful, please star it!!!
Table of Contents <!-- omit in toc -->
- Open-source Toolboxes and Foundation Models
- Evaluation Benchmarks and Metrics
- Single Model
- Multi Experts
- Tokenizer
Unified Multimodal Understanding and Generation
-
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding (Oct. 2024, arXiv)
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (Oct. 2024, arXiv)
-
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling (Oct. 2024, arXiv)
-
Emu3: Next-Token Prediction is All You Need (Sep. 2024, arXiv)
-
MIO: A Foundation Model on Multimodal Tokens (Sep. 2024, arXiv)
-
MonoFormer: One Transformer for Both Diffusion and Autoregression (Sep. 2024, arXiv)
-
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (Sep. 2024, arXiv)
-
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Aug. 2024, arXiv)
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Aug. 2024, arXiv)
-
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (Jul. 2024, arXiv)
-
X-VILA: Cross-Modality Alignment for Large Language Model (May. 2024, arXiv)
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models (May 2024, arXiv)
-
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Apr. 2024, arXiv)
-
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Mar. 2024, arXiv)
-
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (Feb. 2024, arXiv)
-
World Model on Million-Length Video And Language With Blockwise RingAttention (Feb. 2024, arXiv)
-
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (Feb. 2024, arXiv)
-
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (Jan. 2024, arXiv)
-
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Dec. 2023, arXiv)
-
Emu2: Generative Multimodal Models are In-Context Learners (Jul. 2023, CVPR)
-
Gemini: A Family of Highly Capable Multimodal Models (Dec. 2023, arXiv)
-
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (Dec. 2023, arXiv)
-
DreamLLM: Synergistic Multimodal Comprehension and Creation (Dec. 2023, ICLR)
-
Making LLaMA SEE and Draw with SEED Tokenizer (Oct. 2023, ICLR)
-
NExT-GPT: Any-to-Any Multimodal LLM (Sep. 2023, ICML)
-
LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (Sep. 2023, ICLR)
-
Planting a SEED of Vision in Large Language Model (Jul. 2023, arXiv)
-
Emu: Generative Pretraining in Multimodality (Jul. 2023, ICLR)
-
CoDi: Any-to-Any Generation via Composable Diffusion (May. 2023, NeurIPS)
-
Multimodal unified attention networks for vision-and-language interactions (Aug. 2019)
-
UniMuMo: Unified Text, Music, and Motion Generation (Oct. 2024, arXiv)
-
MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation (Oct. 2024, arXiv)
Multi Experts
- TaxaBind: A Unified Embedding Space for Ecological Applications (Nov. 2024, arXiv)
Tokenizer
- Cosmos Tokenizer: A suite of image and video neural tokenizers (Nov. 2024, arXiv)
Acknowledgements
This template is provided by Awesome-Video-Diffusion and Awesome-MLLM-Hallucination.