PathWeave

Code for the paper "LLMs Can Evolve Continually on Modality for X-Modal Reasoning" (NeurIPS 2024) 🎉

🔥 News

[2024.11] 🔥 Released code and checkpoints.

TODO:

💻 Table of Contents

- 📣 Abstract
- 🚩 Approach
- 🏃‍♂️ Getting Started
- 🌟 Citation
- 🤗 Acknowledgement

📣 Abstract

Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-path switching and expansion abilities that enables MLLMs to continually evolve on modalities for X-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.

🚩 Approach


![PathWeave framework](framework.png)

🏃‍♂️ Getting Started

Data Processing

Our depth data are generated following the instructions of OneLLM.

Model-ckpt

All checkpoints can be found in Google Drive.
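
If you prefer the command line, the shared folder can also be fetched with gdown. This is a minimal sketch, assuming gdown is installed and that `<google-drive-folder-url>` is replaced with the Google Drive link above; the `ckpt/` output directory is only an example:

```bash
# Minimal sketch: download the shared checkpoint folder from Google Drive.
# <google-drive-folder-url> is a placeholder for the link above; ckpt/ is an arbitrary name.
pip install gdown
gdown --folder "<google-drive-folder-url>" -O ckpt/
```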

Test

<details> <summary> Tips </summary>

Before testing, please change the checkpoint path in the following directory:

lavis/projects/xinstruct_blip/train/vicuna7b

Also change the paths in:

lavis/projects/xinstruct_blip/eval/vicuna7b

We marked all the paths with "path_to_your_data"; a sketch for locating and replacing them follows these tips.

</details>
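
Locating and replacing the placeholder can be done in bulk. The following is a minimal sketch, assuming a Unix shell and GNU sed, where `/absolute/path/to/your/checkpoints` stands in for your own location:

```bash
# Minimal sketch: list every occurrence of the placeholder in the train/eval configs.
grep -rn "path_to_your_data" \
  lavis/projects/xinstruct_blip/train/vicuna7b \
  lavis/projects/xinstruct_blip/eval/vicuna7b

# Optionally replace them in one pass (back up the configs first).
find lavis/projects/xinstruct_blip/train/vicuna7b \
     lavis/projects/xinstruct_blip/eval/vicuna7b -type f \
  -exec sed -i 's|path_to_your_data|/absolute/path/to/your/checkpoints|g' {} +
```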

Example:

Run the script: `bash run_scripts/ours/video/test_video_modality.sh`

Train

<details> <summary> Tips </summary>

Before training, please check the data paths in the following directory:

lavis/configs/datasets/depth

You also need to change the file paths in:

lavis/datasets/datasets/depth_vqa_dataset.py

lavis/tasks/captioning.py

We marked all the paths with "path_to_your_data"; see the sketch after these tips.

</details>
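
As with testing, the placeholder can be located and replaced in bulk. A minimal sketch, assuming a Unix shell and GNU sed, with `/absolute/path/to/your/data` as a stand-in for your dataset root:

```bash
# Minimal sketch: list the placeholders in the dataset configs and loaders named above.
grep -rn "path_to_your_data" \
  lavis/configs/datasets/depth \
  lavis/datasets/datasets/depth_vqa_dataset.py \
  lavis/tasks/captioning.py

# Optionally replace them in one pass (back up the files first).
find lavis/configs/datasets/depth \
     lavis/datasets/datasets/depth_vqa_dataset.py \
     lavis/tasks/captioning.py -type f \
  -exec sed -i 's|path_to_your_data|/absolute/path/to/your/data|g' {} +
```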

Example:

Run the script: `bash run_scripts/ours/video/train_video_modality.sh`

🌟 Citation

@article{yu2024llms,
  title={LLMs Can Evolve Continually on Modality for X-Modal Reasoning},
  author={Yu, Jiazuo and Xiong, Haomiao and Zhang, Lu and Diao, Haiwen and Zhuge, Yunzhi and Hong, Lanqing and Wang, Dong and Lu, Huchuan and He, You and Chen, Long},
  journal={arXiv preprint arXiv:2410.20178},
  year={2024}
}

🤗 Acknowledgement

Our repo is built on X-InstructBLIP and OneLLM. We thank the authors for sharing their code.