
:chestnut: SEED Multimodal


Powered by CV Center, Tencent AI Lab, and ARC Lab, Tencent PCG.


This repository provides the official implementation of SEED and SEED-LLaMA. For any inquiries, please email seed-x@googlegroups.com.

News

:beers: We are actively looking for self-motivated interns. Please feel free to reach out if you are interested. :beers:

Stay tuned for updates!

Brief Introduction

It is recommended to check out our papers for technical details.

:speech_balloon: What can SEED-LLaMA do?


SEED-LLaMA is capable of both multimodal comprehension and generation, exhibiting compositional emergent abilities such as multi-turn in-context multimodal generation, acting like your AI assistant. [Compare to SOTA] [More examples on X]


:bulb: How does SEED-LLaMA achieve it?


The core of SEED-LLaMA is the tailored SEED tokenizer, which quantizes visual signals into discrete visual tokens that capture the necessary semantics while being produced with a 1D causal dependency. [SEED-2 vs. SEED-1]
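For intuition, the sketch below illustrates the general recipe of quantizing continuous visual features into discrete codes via nearest-neighbour lookup in a codebook. The codebook size, feature dimension, and the code itself are illustrative placeholders, not the actual SEED tokenizer.

# Toy illustration only (NOT the SEED tokenizer or its API): continuous visual
# features are mapped to the indices of their nearest codebook entries, yielding
# a short 1D sequence of discrete "visual codes" that an LLM can treat like words.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 32))   # hypothetical codebook size and embedding dim
features = rng.normal(size=(32, 32))     # stand-in for causal features extracted from one image

# nearest-neighbour lookup: one discrete code per feature vector
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
codes = dists.argmin(axis=1)             # shape (32,), integers in [0, 8192)
print(codes.tolist())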


Highlights

  1. We use GPT-4 to rewrite the instructions in the InstructPix2Pix dataset, for example transforming "add a storm" into "Can you add a storm effect to the image?" with the response "Sure, I have successfully added a storm effect to the image." The instruction-tuned model can generate informative text and images in a single response, as shown in the figure below (this is also an emergent ability; an illustrative sample record is shown after this list).

  2. Given a starting image and story, the instruction-tuned model can generate the rest of the story along with multiple images in one go.

  3. We use GPT-4 to generate instructions based on the text content of MMC4. The instruction-tuned model can then generate interleaved image-text content. (Our released SFT model does not have this capability, since we instruction-tune the pre-trained model on MMC4 separately.)
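As an illustration of what such a rewritten training sample might look like, the snippet below builds one record using the image/question/answer layout described in the instruction-tuning steps further down; the path and the exact on-disk format are assumptions.

# Hypothetical example record for a rewritten InstructPix2Pix-style sample; the path
# is a placeholder and the repo's actual storage format may differ.
import json

sample = {
    "image": "data/instructpix2pix/000001.jpg",
    "question": "Can you add a storm effect to the image?",
    "answer": "Sure, I have successfully added a storm effect to the image.",
}
print(json.dumps(sample, indent=2))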

Usage

Dependencies and Installation

Clone the repo and install the required packages:

git clone https://github.com/AILab-CVC/SEED.git
cd SEED
pip install -r requirements.txt

Model Weights

We release the pretrained SEED tokenizer and de-tokenizer, as well as the pretrained and instruction-tuned SEED-LLaMA-8B and SEED-LLaMA-14B checkpoints, on Hugging Face.


The model weights of the unCLIP SD-UNet, which is used to reconstruct images, will be downloaded automatically.


Inference for visual tokenization and de-tokenization

To discretize an image into 1D visual codes with causal dependency and reconstruct the image from those codes using the off-the-shelf unCLIP SD-UNet:

cd ..   # SEED/ 
python scripts/seed_tokenizer_inference.py

Inference for SEED-LLaMA

Since SEED-LLaMA-8B is based on Vicuna-7B and SEED-LLaMA-14B on LLaMA2-Chat-13B, we use Vicuna-7B's ("USER:", "ASSISTANT:") prompt format and LLaMA2-Chat-13B's ([INST] [/INST]) prompt format for the respective instruction tuning (a minimal sketch of the two templates follows the inference commands below).

# Inference for SEED-LLaMA-8B
python scripts/seed_llama_inference_8B.py
# Inference for SEED-LLaMA-14B
python scripts/seed_llama_inference_14B.py
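
For reference, a minimal sketch of the two chat templates mentioned above; the actual inference scripts assemble the full prompts (including system text and image tokens) themselves.

# Minimal sketch of the two prompt formats; helper names are ours, not the repo's.
def vicuna_prompt(user_message: str) -> str:
    # Vicuna-style template used with SEED-LLaMA-8B
    return f"USER: {user_message} ASSISTANT:"

def llama2_chat_prompt(user_message: str) -> str:
    # LLaMA2-Chat-style template used with SEED-LLaMA-14B
    return f"[INST] {user_message} [/INST]"

print(vicuna_prompt("Describe the image."))
print(llama2_chat_prompt("Describe the image."))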

Launching Gradio Demo of SEED-LLaMA Locally

  1. Building the local demo of SEED-LLaMA-14B currently requires a single 24GB GPU.
# SEED/
# in first terminal
bash scripts/start_backend_14b.sh
# in second terminal
bash scripts/start_frontend_14b.sh
  2. Building the local demo of SEED-LLaMA-8B currently requires a single 16GB GPU.
# SEED/
# in first terminal
bash scripts/start_backend_8b.sh
# in second terminal
bash scripts/start_frontend_8b.sh

The demo can then be accessed at http://127.0.0.1:80

Training SEED-LLaMA

Training SEED Tokenizer based on LAVIS

  1. Installation
cd SEED/SEED_Tokenizer
sh install.sh
  2. Download the pre-trained Q-Former from BLIP-2 and put the checkpoint under the folder "pretrained".

  3. Training Causal Q-Former

sh train_scripts/causal_qformer.sh
  4. Download CLIP for unCLIP-SD and put the checkpoint under the folder "pretrained".

  5. Training SEED Tokenizer and De-Tokenizer

sh train_scripts/codebook.sh
  6. After training, you can tokenize an image into discrete tokens and decode them back into a realistic image via unCLIP SD:
# You need to load the pre-trained ckpt.
python3 eval/seed_inference.py

Multimodal LLM Pre-training and Instruction Tuning

  1. Installation
cd SEED
pip install -r requirements.txt
cd MultiModalLLM
  2. Download the pre-trained LLM (for example, vicuna-7b-v1.1) and the SEED tokenizer, and put them under the folder "pretrained".

  3. Pre-process the pre-training data by converting images into discrete tokens. For example,

python3 src/tools/extract_image_ids_to_torchdata_parallel.py \
  --tokenizer configs/tokenizer/seed_llama_tokenizer.yaml \
  --image_transform configs/processer/blip_transform.yaml \
  --data configs/data/caption_torchdata_preprocess.yaml \
  --save_dir dataset/seed_llama/caption/unsplash_cc3m/ \
  --batch_size 1024 --num_workers 8 --gpus 8
  4. Pre-training the Multimodal LLM with SEED tokens using LoRA.
sh scripts/train_a100_lora_multi_node_pretrain.sh
  5. Merge the LoRA checkpoint with the original LLM (a generic PEFT-based sketch is shown after this list).
python3 src/tools/merge_lora_weights.py \
  --model_cfg configs/model/vicuna_7b_lora_pretrained.yaml \
  --tokenizer_cfg configs/tokenizer/seed_llama_tokenizer.yaml \
  --base_model pretrained/vicuna-7b-v1.1 \
  --lora_model log/seed_vicuna-7b_lora_pretrain/checkpoint-10000 \
  --save_path log/seed_vicuna-7b_lora_pretrain/checkpoint-merged-10000 
  6. Pre-process the instruction-tuning data by converting images into discrete tokens. (You first need to convert the data into JSON format, with each line containing "image" (the image path), "question", and "answer".)
python3 src/tools/extract_image_ids_to_torchdata_parallel_qa.py \
  --tokenizer configs/tokenizer/seed_llama_tokenizer.yaml \
  --image_transform configs/processer/blip_transform.yaml \
  --data configs/data/question_answer_torchdata_eval.yaml \
  --save_dir  data/VQAv2 \
  --batch_size 512 --num_workers 8 --gpus 8
  7. Instruction tuning the Multimodal LLM with SEED tokens using LoRA.
sh scripts/train_a100_lora_multi_node_sft.sh
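
The repo's src/tools/merge_lora_weights.py (step 5 above) is the authoritative way to merge. For reference, below is a generic sketch of the same operation with Hugging Face PEFT, assuming the checkpoint is a standard LoRA adapter; the paths are the placeholders from the command above, and the SEED-specific tokenizer handling (--tokenizer_cfg) is not covered here.

# Generic LoRA-merge sketch using Hugging Face PEFT; the repo's merge script also
# takes a SEED tokenizer config, which this simplified version does not handle.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "pretrained/vicuna-7b-v1.1"
lora_path = "log/seed_vicuna-7b_lora_pretrain/checkpoint-10000"
save_path = "log/seed_vicuna-7b_lora_pretrain/checkpoint-merged-10000"

base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_path)

merged = PeftModel.from_pretrained(base, lora_path).merge_and_unload()
merged.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)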

Citation

If you find the work helpful, please consider citing:

@article{ge2023making,
  title={Making LLaMA SEE and Draw with SEED Tokenizer},
  author={Ge, Yuying and Zhao, Sijie and Zeng, Ziyun and Ge, Yixiao and Li, Chen and Wang, Xintao and Shan, Ying},
  journal={arXiv preprint arXiv:2310.01218},
  year={2023}
}

@article{ge2023planting,
  title={Planting a seed of vision in large language model},
  author={Ge, Yuying and Ge, Yixiao and Zeng, Ziyun and Wang, Xintao and Shan, Ying},
  journal={arXiv preprint arXiv:2307.08041},
  year={2023}
}

The project is still in progress.

License

SEED is released under Apache License Version 2.0.

SEED-LLaMA is released under the original license of LLaMA2.

Acknowledgement

We thank the authors of unCLIP SD and BLIP-2 for their great work.