Awesome

Awesome-Multimodal-Chatbot

Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience. It is designed to assist users in performing various tasks, from simple information retrieval to complex multimedia reasoning.

Multimodal Instruction Tuning

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

arXiv 2022/12 [paper]
GPT-4

arXiv 2023/03 [paper] [blog]
Visual Instruction Tuning

arXiv 2023/04 [paper] [code] [project page] [demo]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

arXiv 2023/04 [paper] [code] [project page] [demo]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

arXiv 2023/04 [paper] [code] [demo]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

arXiv 2023/04 [paper] [code] [demo]
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding

[code]
LMEye: An Interactive Perception Network for Large Language Models
arXiv 2023/05 [paper] [code]
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

arXiv 2023/05 [paper] [code] [demo]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

arXiv 2023/05 [paper] [code] [project page]
Otter: A Multi-Modal Model with In-Context Instruction Tuning

arXiv 2023/05 [paper] [code] [demo]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

arXiv 2023/05 [paper] [code]
InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

arXiv 2023/05 [paper] [code] [demo]
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

arXiv 2023/05 [paper] [code]
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
arXiv 2023/05 [paper] [code] [project page]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

arXiv 2023/05 [paper] [code] [project page]
DetGPT: Detect What You Need via Reasoning

arXiv 2023/05 [paper] [code] [project page]
PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology

arXiv 2023/05 [paper] [code]
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

arXiv 2023/05 [paper] [code] [project page]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv 2023/06 [paper] [code]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

arXiv 2023/06 [paper]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

arXiv 2023/06 [paper] [project page]
VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY

arXiv 2023/06 [paper] [code]

LLM-Based Modularized Frameworks

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

arXiv 2023/03 [paper] [code] [demo]
ViperGPT: Visual Inference via Python Execution for Reasoning

arXiv 2023/03 [paper] [code] [project page]
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

arXiv 2023/03 [paper] [code]
Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions

arXiv 2023/03 [paper] [code]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

arXiv 2023/03 [paper] [code] [project page] [demo]
Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface

arXiv 2023/03 [paper] [code] [demo]
VLog: Video as a Long Document

[code] [demo]
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

arXiv 2023/04 [paper] [code]
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

arXiv 2023/04 [paper] [project page]
VideoChat: Chat-Centric Video Understanding

arXiv 2023/05 [paper] [code] [demo]