VITA: Towards Open-Source Interactive Omni Multimodal LLM

<p align="center"> <img src="./asset/vita_log2.png" width="100%" height="100%"> </p>

<font size=7><div align='center' > [🍎 Project Page] [📖 arXiv Paper] [🤗 Hugging Face] [💬 WeChat]</div></font>


<p align="center"> <img src="./asset/vita.png" width="85%" height="85%"> </p>

🔥 News

👀 VITA Overview

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first open-source Multimodal Large Language Model (MLLM) that can simultaneously process and analyze Video, Image, Text, and Audio modalities while also offering an advanced multimodal interactive experience. Our work differs from existing open-source MLLMs in three key features:

<p align="center"> <img src="./asset/VITA_features.png" width="88%" height="88%"> </p>

VITA can process pure text or audio inputs, as well as video or images combined with text or audio. In addition, two key techniques are adopted to advance the multimodal interactive experience:

<p align="center"> <img src="./asset/VITA_duplex.png" width="88%" height="88%"> </p>

📈 Experimental Results

<p align="center"> <img src="./asset/language_eval2.png" width="68%" height="50%"> </p> <p align="center"> <img src="./asset/audio_eval.jpg" width="96%" height="96%"> </p> <p align="center"> <img src="./asset/visual_eval.jpg" width="100%" height="100%"> </p>

⭐ Training

Requirements and Installation

git clone https://github.com/VITA-MLLM/VITA
cd VITA
conda create -n vita python=3.10 -y
conda activate vita
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Data Preparation

[
    ...
    {
        "set": "sharegpt4",
        "id": "000000000164",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\n<audio>\n"
            },
            {
                "from": "gpt",  // follow the setting of llave, "gpt" is only used to indicate that this is the ground truth of the model output
                "value": "This is a well-organized kitchen with a clean, modern aesthetic. The kitchen features a white countertop against a white wall, creating a bright and airy atmosphere. "
            }
        ],
        "image": "coco/images/train2017/000000000164.jpg",
        "audio": [
            "new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"
        ]
    },
    ...
]

AudioFolder = ""
FolderDict = {
    #### NaturalCap
    "sharegpt4": "",
}

#### NaturalCap
ShareGPT4V = {"chat_path": ""}

from .dataset_config import *

NaturalCap = [ShareGPT4V]

DataConfig = {
    "Pretrain_video": NaturalCap,
}
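
Before launching training, it can help to check that each entry is consistent with the format above. The following is a minimal sketch, not part of the VITA codebase: `check_entry` is a hypothetical helper, and the rules it enforces (one `<audio>` placeholder per file in the `audio` list, one `<image>` placeholder for a single `image` path) are assumptions read off the example entry, not documented requirements.

```python
import json

# Mirrors the FolderDict from the config above; the path is a placeholder.
FolderDict = {"sharegpt4": ""}


def check_entry(entry, folder_dict):
    """Return a list of problems found in one training entry (empty if none)."""
    problems = []
    if entry.get("set") not in folder_dict:
        problems.append(f"unknown set: {entry.get('set')!r}")

    # Concatenate the human turns and count modality placeholders in them.
    human_text = "".join(
        turn["value"] for turn in entry.get("conversations", [])
        if turn.get("from") == "human"
    )
    # Assumption: one <audio> tag per file listed under "audio".
    n_audio = len(entry.get("audio", []))
    if human_text.count("<audio>") != n_audio:
        problems.append("<audio> placeholders do not match the audio list")
    # Assumption: "image" is a single path, matched by one <image> tag.
    n_image = 1 if entry.get("image") else 0
    if human_text.count("<image>") != n_image:
        problems.append("<image> placeholders do not match the image field")
    return problems


entry = json.loads("""
{
  "set": "sharegpt4",
  "id": "000000000164",
  "conversations": [
    {"from": "human", "value": "<image>\\n<audio>\\n"},
    {"from": "gpt", "value": "This is a well-organized kitchen."}
  ],
  "image": "coco/images/train2017/000000000164.jpg",
  "audio": ["new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"]
}
""")
print(check_entry(entry, FolderDict))
```

Running such a check over the whole JSON file before training would surface entries whose `set` key is missing from `FolderDict`, which otherwise might only fail once data loading starts.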

Continual Training

    ...
    --model_name_or_path VITA_ckpt \
    ...
    --vision_tower InternViT-300M-448px \
    ...
    --audio_encoder audio-encoder-2wh_zh_en_audioset_Mixtral-8x7B_New-base-tunning \
    ...
export PYTHONPATH=./
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
OUTPUT_DIR=/mnt/cfs/lhj/videomllm_ckpt/outputs/vita_video_audio
bash script/train/finetuneTask_nodes.sh ${OUTPUT_DIR}

📐 Inference

Quick Start

CUDA_VISIBLE_DEVICES=0,1 python video_audio_demo.py \
    --model_path [vita/path] \
    --image_path asset/vita_log2.png \
    --model_type mixtral-8x7b \
    --conv_mode mixtral_two \
    --question "Please describe this picture."

CUDA_VISIBLE_DEVICES=0,1 python video_audio_demo.py \
    --model_path [vita/path] \
    --image_path asset/vita_log2.png \
    --model_type mixtral-8x7b \
    --conv_mode mixtral_two \
    --audio_path asset/q1.wav

CUDA_VISIBLE_DEVICES=0,1 python video_audio_demo.py \
    --model_path [vita/path] \
    --image_path asset/vita_log2.png \
    --model_type mixtral-8x7b \
    --conv_mode mixtral_two \
    --audio_path asset/q2.wav
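
The commands above differ only in how the query is supplied: the first passes text via `--question`, the other two pass a spoken query via `--audio_path`. A small sketch of a wrapper that assembles the argument list is shown below. `build_demo_command` is a hypothetical helper, and treating the two query flags as mutually exclusive is an assumption based on the examples, not on the script's argument parser.

```python
import shlex


def build_demo_command(model_path, image_path, question=None, audio_path=None):
    """Assemble the argv for video_audio_demo.py with a text or audio query."""
    # Assumption: exactly one of question / audio_path is supplied.
    if (question is None) == (audio_path is None):
        raise ValueError("pass exactly one of question or audio_path")
    cmd = [
        "python", "video_audio_demo.py",
        "--model_path", model_path,
        "--image_path", image_path,
        "--model_type", "mixtral-8x7b",
        "--conv_mode", "mixtral_two",
    ]
    if question is not None:
        cmd += ["--question", question]
    else:
        cmd += ["--audio_path", audio_path]
    return cmd


cmd = build_demo_command("[vita/path]", "asset/vita_log2.png", audio_path="asset/q1.wav")
print(shlex.join(cmd))  # shell-quoted form, for inspection
```

The returned list can be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment, which avoids shell quoting issues in the question text.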

Demo

We have accelerated the model using vLLM. Since VITA has not yet been integrated into vLLM, you need to make some modifications to the vLLM code to adapt it for VITA.

conda create -n vita_demo python==3.10
conda activate vita_demo
pip install -r web_demo/web_demo_requirements.txt

# Back up the weights into a new directory
cp -r  VITA_ckpt/ demo_VITA_ckpt/

cd ./web_demo/vllm_tools
cp -rf model_weight_file/*  ../../demo_VITA_ckpt/
cp -rf vllm_file/*  your_anaconda/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/
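
The last `cp` command depends on where the `vita_demo` environment lives. As a hedged alternative to hard-coding `your_anaconda/...`, the sketch below locates the installed vLLM package from inside the active environment. `find_package_root` and `vllm_models_dir` are hypothetical helper names, and the `model_executor/models` layout is taken from the command above, so it may differ in other vLLM versions.

```python
import importlib.util
from pathlib import Path


def find_package_root(name):
    """Return the installed package's directory, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def vllm_models_dir(vllm_root):
    """Directory holding model definitions, per the cp command above."""
    return Path(vllm_root) / "model_executor" / "models"


root = find_package_root("vllm")
if root is None:
    print("vllm is not installed in this environment")
else:
    print(vllm_models_dir(root))
```

Run it with the `vita_demo` environment activated; the printed path is the destination for the files in `vllm_file/`.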

📍 Basic Demo

https://github.com/user-attachments/assets/bdc7e9d1-a7d3-432e-aae8-5de493a5c042

python -m web_demo.web_ability_demo  demo_VITA_ckpt/

📍 Real-Time Interactive Demo

To have a good interactive experience, please pay attention to the following three points:

https://github.com/user-attachments/assets/5f375464-a77c-4dce-b2b5-7897c230bb9b

To run the real-time interactive demo, you need to make the following preparations:

python -m web_demo.web_interactive_demo

✒️ Citation

If you find our work helpful for your research, please consider citing it.

@article{fu2024vita,
  title={Vita: Towards open-source interactive omni multimodal llm},
  author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Wang, Xiong and Yin, Di and Ma, Long and Zheng, Xiawu and others},
  journal={arXiv preprint arXiv:2408.05211},
  year={2024}
}

📣 Statement

VITA is trained on a large-scale open-source corpus, and its output is stochastic. Content generated by VITA does not represent the views of the model developers. We are not responsible for any problems arising from the use, misuse, or dissemination of VITA, including but not limited to public opinion risks and data security issues.

📜 Related Works

Explore our related research:

👍 Acknowledgement

VITA is built with reference to the following outstanding works: LLaVA-1.5, Bunny, Chat-UniVi, InternVL, InternViT, and Mixtral 8x7B. Thanks!