Awesome-Foundation-Models

A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.

Survey

2024

Language Agents (PhD thesis of Shunyu Yao, Princeton. Blog1, Blog2)
A Systematic Survey on Large Language Models for Algorithm Design (from City Univ. of Hong Kong)
Image Segmentation in Foundation Model Era: A Survey (from Beijing Institute of Technology)
Towards Vision-Language Geo-Foundation Model: A Survey (from Nanyang Technological University)
An Introduction to Vision-Language Modeling (from Meta)
The Evolution of Multimodal Model Architectures (from Purdue University)
Efficient Multimodal Large Language Models: A Survey (from Tencent)
Foundation Models for Video Understanding: A Survey (from Aalborg University)
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (from GigaAI)
Prospective Role of Foundation Models in Advancing Autonomous Vehicles (from Tongji University)
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (from Northeastern University)
A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (from Lehigh)
Large Multimodal Agents: A Survey (from CUHK)
The Uncanny Valley: A Comprehensive Analysis of Diffusion Models (from Mila)
Real-World Robot Applications of Foundation Models: A Review (from University of Tokyo)
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities (from Shanghai AI Lab)
Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey (from JHU)

Before 2024

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
Towards Generalist Foundation Model for Radiology (from SJTU)
Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
Towards Generalist Biomedical AI (from Google)
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from Oxford)
Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
A Survey on Multimodal Large Language Models (from USTC and Tencent)
Vision-Language Models for Vision Tasks: A Survey (from Nanyang Technological University)
Foundation Models for Generalist Medical Artificial Intelligence (from Stanford)
A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
Vision-language pre-training: Basics, recent advances, and future trends
On the Opportunities and Risks of Foundation Models (the survey that first popularized the concept of foundation models; from Stanford)

Papers by Date

2024

[12/19] Genesis: A Generative and Universal Physics Engine for Robotics and Beyond (from CMU)
[12/04] Navigation World Models (from Meta)
[12/03] HunyuanVideo: A Systematic Framework For Large Video Generative Models (from Tencent)
[11/21] DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding (from IDEA)
[11/14] Scaling Laws for Precision (from Harvard)
[11/13] NeuralFeels with neural fields: Visuotactile perception for in-hand manipulation (from Meta)
[11/07] DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (from NYU)
[10/31] Project Sid: Many-agent simulations toward AI civilization (from Altera.AI)
[10/30] TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (from Max Planck Institute for Informatics)
[10/30] Reward Centering (from Richard Sutton, University of Alberta)
[10/21] Long Term Memory: The Foundation of AI Self-Evolution (from Tianqiao and Chrissy Chen Institute)
[10/10] Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations (from CUHK)
[10/04] Movie Gen: A Cast of Media Foundation Models (from Meta)
[10/02] Were RNNs All We Needed? (from Mila)
[10/01] nGPT: Normalized Transformer with Representation Learning on the Hypersphere (from Nvidia)
[09/30] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (from Apple)
[09/27] Emu3: Next-Token Prediction is All You Need (from BAAI)
[09/25] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (from Allen AI)
[09/18] Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution (from Alibaba)
[09/18] Moshi: a speech-text foundation model for real-time dialogue (from Kyutai)
[08/27] Diffusion Models Are Real-Time Game Engines (from Google)
[08/22] Sapiens: Foundation for Human Vision Models (from Meta)
[08/14] Imagen 3 (from Google DeepMind)
[07/31] The Llama 3 Herd of Models (from Meta)
[07/29] SAM 2: Segment Anything in Images and Videos (from Meta)
[07/24] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects (from HUST and ByteDance)
[07/17] EVE: Unveiling Encoder-Free Vision-Language Models (from BAAI)
[07/12] Transformer Layers as Painters (from Sakana AI)
[06/24] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (from NYU)
[06/13] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (from EPFL and Apple)
[06/10] Merlin: A Vision Language Foundation Model for 3D Computed Tomography (from Stanford. Code will be available.)
[06/06] Vision-LSTM: xLSTM as Generic Vision Backbone (from LSTM authors)
[05/31] MeshXL: Neural Coordinate Field for Generative 3D Foundation Models (from Fudan)
[05/25] MoEUT: Mixture-of-Experts Universal Transformers (from Stanford)
[05/22] Attention as an RNN (from Mila & Borealis AI)
[05/22] GigaPath: A whole-slide foundation model for digital pathology from real-world data (published in Nature)
[05/21] BiomedParse: a biomedical foundation model for biomedical image parsing (from Microsoft. Journal version)
[05/20] Octo: An Open-Source Generalist Robot Policy (from UC Berkeley)
[05/17] Observational Scaling Laws and the Predictability of Language Model Performance (from Stanford)
[05/14] Understanding the performance gap between online and offline alignment algorithms (from Google)
[05/09] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (from Shanghai AI Lab)
[05/08] You Only Cache Once: Decoder-Decoder Architectures for Language Models
[05/07] xLSTM: Extended Long Short-Term Memory (from Sepp Hochreiter, the author of LSTM.)
[05/06] Advancing Multimodal Medical Capabilities of Gemini (from Google)
[05/04] U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers (from Peking University)
[05/03] Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
[04/30] KAN: Kolmogorov-Arnold Networks (A promising alternative to MLPs. from MIT)
[04/26] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (InternVL 1.5. from Shanghai AI Lab)
[04/14] TransformerFAM: Feedback attention is working memory (from Google. Efficient attention.)
[04/10] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (from Google)
[04/02] Octopus v2: On-device language model for super agent (from Stanford)
[04/02] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (from Google)
[04/02] Iterated Learning Improves Compositionality in Large Vision-Language Models (from U of Michigan)
[03/22] InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (from Shanghai AI Lab)
[03/18] Arc2Face: A Foundation Model of Human Faces (from Imperial College London)
[03/14] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (30B parameters. from Apple)
[03/09] uniGradICON: A Foundation Model for Medical Image Registration (from UNC-Chapel Hill)
[03/05] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3. from Stability AI)
[03/01] Learning and Leveraging World Models in Visual Representation Learning (from Meta)
[03/01] VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (from Meituan)
[02/28] CLLMs: Consistency Large Language Models (from SJTU)
[02/27] Transparent Image Layer Diffusion using Latent Transparency (from Stanford)
[02/22] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (from Meta)
[02/21] Beyond A∗: Better Planning with Transformers via Search Dynamics Bootstrapping (from Meta)
[02/20] Neural Network Diffusion (Generating network parameters via diffusion models. from NUS)
[02/20] VideoPrism: A Foundational Visual Encoder for Video Understanding (from Google)
[02/19] FiT: Flexible Vision Transformer for Diffusion Model (from Shanghai AI Lab)
[02/06] MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (from Meituan)
[01/30] YOLO-World: Real-Time Open-Vocabulary Object Detection (from Tencent and HUST)
[01/23] Lumiere: A Space-Time Diffusion Model for Video Generation (from Google)
[01/22] CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation (from Stanford)
[01/19] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (from TikTok)
[01/16] SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (from NYU)
[01/15] InstantID: Zero-shot Identity-Preserving Generation in Seconds (from Xiaohongshu)

2023

BioCLIP: A Vision Foundation Model for the Tree of Life (CVPR 2024 best student paper)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba appears to outperform similarly-sized Transformers while scaling linearly with sequence length. from CMU)
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
Tracking Everything Everywhere All at Once (from Cornell, ICCV 2023 best student paper)
Foundation Models for Generalist Geospatial Artificial Intelligence (from IBM and NASA)
LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (from Shanghai AI Lab)
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World (from Shanghai AI Lab)
Meta-Transformer: A Unified Framework for Multimodal Learning (from CUHK and Shanghai AI Lab)
Retentive Network: A Successor to Transformer for Large Language Models (from Microsoft and Tsinghua University)
Neural World Models for Computer Vision (PhD Thesis of Anthony Hu from University of Cambridge)
Recognize Anything: A Strong Image Tagging Model (a strong foundation model for image tagging. from OPPO)
Towards Visual Foundation Models of Physical Scenes (describes a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion; from AWS)
LIMA: Less Is More for Alignment (65B parameters, from Meta)
PaLM 2 Technical Report (from Google)
IMAGEBIND: One Embedding Space To Bind Them All (from Meta)
Visual Instruction Tuning (LLaVA, from U of Wisconsin-Madison and Microsoft)
SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
Images Speak in Images: A Generalist Painter for In-Context Visual Learning (from BAAI, ZJU, and PKU)
UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection (CVPR, from Tsinghua and BNRist)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models (from Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai AI Laboratory)
Visual Prompt Multi-Modal Tracking (from Dalian University of Technology and Peng Cheng Laboratory)
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks (from ByteDance)
EVA-CLIP: Improved Training Techniques for CLIP at Scale (from BAAI and HUST)
EVA-02: A Visual Representation for Neon Genesis (from BAAI and HUST)
EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale (CVPR, from BAAI and HUST)
LLaMA: Open and Efficient Foundation Language Models (A collection of foundation language models ranging from 7B to 65B parameters; from Meta)
The effectiveness of MAE pre-pretraining for billion-scale pretraining (from Meta)
BloombergGPT: A Large Language Model for Finance (50 billion parameters; from Bloomberg)
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (coordinated by BigScience, whose goal is to democratize LLMs)
FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (from Salesforce Research)
GPT-4 Technical Report (from OpenAI)
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (from Microsoft Research Asia)
UNINEXT: Universal Instance Perception as Object Discovery and Retrieval (a unified model for 10 instance perception tasks; CVPR, from ByteDance)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (from Shanghai AI Lab)
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (CVPR, from Shanghai AI Lab)
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning (from Harbin Institute of Technology and Microsoft Research Asia)

2022