Awesome

Ultimate-Awesome-Transformer-Attention

This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, code, and related websites. This list is maintained by Min-Hung Chen. (Actively updated)

If you find any missing papers, feel free to create pull requests, open issues, or email me. Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider citing and ★STARing this list. Feel free to share this list with others!

[Update: January, 2024] Added all the related papers from NeurIPS 2023!
[Update: December, 2023] Added all the related papers from ICCV 2023!
[Update: September, 2023] Split the multi-modal paper list to README_multimodal.md
[Update: June, 2023] Added all the related papers from ICML 2023!
[Update: June, 2023] Added all the related papers from CVPR 2023!
[Update: February, 2023] Added all the related papers from ICLR 2023!
[Update: December, 2022] Added attention-free papers from Networks Beyond Attention (GitHub) made by Jianwei Yang
[Update: November, 2022] Added all the related papers from NeurIPS 2022!
[Update: October, 2022] Split the 2nd half of the paper list to README_2.md
[Update: October, 2022] Added all the related papers from ECCV 2022!
[Update: September, 2022] Added the Transformer tutorial slides made by Lucas Beyer!
[Update: June, 2022] Added all the related papers from CVPR 2022!

------ (The following papers are moved to README_multimodal.md) ------

Multi-Modality

------ (The following papers are moved to README_2.md) ------

Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

Survey

"A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024 (Purdue). [Paper][GitHub]
"Efficient Multimodal Large Language Models: A Survey", arXiv, 2024 (Tencent). [Paper][GitHub]
"From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024 (Newcastle University, UK). [Paper][GitHub]
"When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024 (Oxford). [Paper][GitHub]
"Foundation Models for Video Understanding: A Survey", arXiv, 2024 (Aalborg University, Denmark). [Paper][GitHub]
"Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024 (Chongqing University). [Paper][GitHub]
"Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024 (GigaAI, China). [Paper][GitHub]
"Video Diffusion Models: A Survey", arXiv, 2024 (Bielefeld University, Germany). [Paper][GitHub]
"Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024 (Lehigh + UPenn). [Paper]
"Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024 (NUS). [Paper][GitHub]
"A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024 (HKUST). [Paper][GitHub]
"State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024 (Anhui University). [Paper][GitHub]
"Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024 (IIT Patna). [Paper]
"From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024 (UIUC). [Paper][GitHub]
"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024 (Northeastern). [Paper]
"Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024 (Kyung Hee University). [Paper]
"Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024 (Beijing University of Posts and Telecommunications). [Paper][GitHub]
"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024 (Lehigh University, Pennsylvania). [Paper][GitHub]
"Large Multimodal Agents: A Survey", arXiv, 2024 (CUHK). [Paper][GitHub]
"Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024 (BIGAI). [Paper][GitHub]
"Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024 (Qufu Normal University, China). [Paper]
"The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper]
"Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024 (Westlake University, China). [Paper][GitHub]
"Transformer for Object Re-Identification: A Survey", arXiv, 2024 (Wuhan University). [Paper]
"Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024 (Huawei). [Paper][GtiHub]
"MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024 (Tencent). [Paper]
"From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024 (Shanghai AI Lab). [Paper]
"A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024 (Huawei). [Paper]
"A Survey for Foundation Models in Autonomous Driving", arXiv, 2024 (Motional, Massachusetts). [Paper]
"A Survey on Transformer Compression", arXiv, 2024 (Huawei). [Paper]
"Vision + Language Applications: A Survey", CVPRW, 2023 (Ritsumeikan University, Japan). [Paper][GitHub]
"Multimodal Learning With Transformers: A Survey", TPAMI, 2023 (Tsinghua & Oxford). [Paper]
"A Survey of Visual Transformers", TNNLS, 2023 (CAS). [Paper][GitHub]
"Video Understanding with Large Language Models: A Survey", arXiv, 2023 (University of Rochester). [Paper][GitHub]
"Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023 (NTU, Singapore). [Paper]
"A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023 (Huawei). [Paper][GitHub]
"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023 (Tencent). [Paper]GitHub]
"Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023 (JHU). [Paper]
"Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023 (Institute for Research in Fundamental Sciences (IPM), Iran). [Paper]
"Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023 (Tencent). [Paper][GitHub (in construction)]
"Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023 (York University). [Paper]
"Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023 (valeo.ai, France). [Paper][GitHub]
"A Survey on Video Diffusion Models", arXiv, 2023 (Fudan). [Paper][GitHub]
"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023 (Microsoft). [Paper]
"Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023 (Microsoft). [Paper]
"Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023 (University of Western Australia). [Paper]
"RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 (University of Sydney). [Paper]
"A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 (The University of Sydney). [Paper]
"From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 (UESTC). [Paper]
"Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 (MBZUAI). [Paper][GitHub]
"A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 (Oxford). [Paper]
"Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 (Xi'an Jiaotong University). [Paper]
"A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 (HKUST). [Paper]
"Transformers in Reinforcement Learning: A Survey", arXiv, 2023 (Mila). [Paper]
"Vision Language Transformers: A Survey", arXiv, 2023 (Boise State University, Idaho). [Paper]
"Towards Open Vocabulary Learning: A Survey", arXiv, 2023 (Peking). [Paper][GitHub]
"Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 (Microsoft). [Paper]
"A Survey on Multimodal Large Language Models", arXiv, 2023 (USTC). [Paper][GitHub]
"2D Object Detection with Transformers: A Review", arXiv, 2023 (German Research Center for Artificial Intelligence, Germany). [Paper]
"Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 (Eldorado’s Institute of Technology, Brazil). [Paper]
"Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 (NYU). [Paper]
"Visual Tuning", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper]
"Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 (Fudan University). [Paper]
"Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 (University of Peradeniya, Sri Lanka). [Paper]
"A Review of Deep Learning for Video Captioning", arXiv, 2023 (Deakin University, Australia). [Paper]
"Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 (NTU, Singapore). [Paper][GitHub]
"Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 (?). [Paper][GitHub (in construction)]
"Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 (KAIST). [Paper]
"Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 (Berkeley + Google). [Paper]
"Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 (RWTH Aachen University, Germany). [Paper][GitHub]
"Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (IBM). [Paper][GitHub]
"Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (Indian Institute of Information Technology). [Paper]
"Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 (Pengcheng Laboratory). [Paper][GitHub]
"A Survey on Visual Transformer", TPAMI, 2022 (Huawei). [Paper]
"Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022 (Tsinghua University, China). [Paper][Springer][Github]
"A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
"Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (The University of Sydney). [Paper]
"Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (Covenant University, Nigeria). [Paper]
"A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (Sejong University). [Paper]
"Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft). [Paper]
"Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago). [Paper]
"Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia). [Paper]
"VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS). [Paper]
"Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI). [Paper][Github]
"Medical image analysis based on transformer: A Review", arXiv, 2022 (NUS, Singapore). [Paper]
"3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
"Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU). [Paper]
"Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
"Multimodal Learning with Transformers: A Survey", arXiv, 2022 (Oxford). [Paper]
"Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (CAS). [Paper]
"Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo). [Paper]
"A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (INESC TEC and University of Porto, Portugal). [Paper]
"Efficient Transformers: A Survey", arXiv, 2022 (Google). [Paper]
"Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (Tsinghua). [Paper]
"Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan). [Paper]
"Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain). [Paper]
"Transformers in Medical Image Analysis: A Review", arXiv, 2022 (Nanjing University). [Paper]
"Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (?). [Paper]
"Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University). [Paper]
"Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba). [Paper][GitHub]
"Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt). [Paper]
"Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI). [Paper]
"Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (Renmin University of China). [Paper]
"A Survey of Transformers", arXiv, 2021 (Fudan). [Paper]
"Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (University of Kashmir, India). [Paper]

Overview

Citation

Survey

Image Classification / Backbone

Replace Conv w/ Attention

Pure Attention

Conv-stem + Attention

Conv + Attention

Vision Transformer

General Vision Transformer

Efficient Vision Transformer

Conv + Transformer

Training + Transformer

Robustness + Transformer

Model Compression + Transformer

Attention-Free

MLP-Series

Other Attention-Free

Analysis for Transformer

Detection

Object Detection

3D Object Detection

Multi-Modal Detection

HOI Detection

Salient Object Detection

Other Detection Tasks

Segmentation

Semantic Segmentation

Depth Estimation

Object Segmentation

Other Segmentation Tasks

Video (High-level)

Action Recognition

Action Detection/Localization

Action Prediction/Anticipation

Video Object Segmentation

Video Instance Segmentation

Other Video Tasks

References