
<div align="center"> <img src="docs/flamingo.png" alt="Logo" width="100">

Instruct-Flamingo

Codebase and Foundation Models for Visual Instruction Tuning

</div>

News

Clever Flamingo v2

Clever Flamingo v2 is an instruction-following multi-modal LLM fine-tuned from OpenFlamingo-v2 (MPT-7B). It accepts arbitrarily interleaved image-text instructions as input and generates text responses. It is an upgraded version of the Clever Flamingo v1 model. The key features of Clever Flamingo v2 include:

Large Instruction Corpus: The Flamingo model's "<Image(s) + Text $\rightarrow$ Text>" formulation allows us to unify a large number of datasets (100+) into a standard input-output format. These datasets include PF-1M, MIMIC-IT, SVIT, LAMM, FunQA, OpenORCA, WizardLM-Evol, CAMEL, Alpaca-CoT, and many more. To the best of our knowledge, this is the largest-scale experiment on visual instruction tuning to date. The training process takes one week on an 8 $\times$ A100 (80G) machine.

(Figure: Number of Samples)

Balanced Visual-Textual Instruction Tuning: Unlike previous approaches to visual instruction tuning, which sometimes pay less attention to the use of textual instruction data, Clever Flamingo v2 maintains a strict 1:1 ratio when sampling visual and textual instructions. This ensures both accurate visual understanding and strong instruction-following ability.

(Figure: Dataset Ratio)

Long Context and LangChain: Clever Flamingo v2 uses a context window of 2k tokens during training, enabling it to fully leverage high-quality instruction datasets. Despite being based on a 7B LLM, Clever Flamingo v2 demonstrates strong instruction following, long response generation, and chain-of-thought capabilities. It is integrated into the LangChain framework, allowing easy use of pre-built chains. Additionally, several chains for multi-modal scenarios, such as CoT-SC and a Caption-Self-Verification Chain, have been implemented. Please see langchain.ipynb for an example, and the sketch below for the general idea.

(Figure: LangChain)
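For a rough idea of how the model can be plugged into LangChain, the sketch below wraps the local Clever Flamingo API (see the Model Inference section) as a custom LangChain LLM. This is only a hedged illustration based on the API example later in this README, not the implementation in langchain.ipynb; the class name, endpoint, and payload fields are assumptions and may need adjustment for your LangChain version.

# Hedged sketch: wrapping the local Clever Flamingo API as a custom LangChain LLM.
# The endpoint and payload follow the API example in the Model Inference section.
import json
from typing import List, Optional

import requests
from langchain.llms.base import LLM

class CleverFlamingoLLM(LLM):
    # Assumed local endpoint; adjust to your own deployment.
    api_url: str = 'http://0.0.0.0:1234/clever_flamingo'
    # Images referenced by '<image>' placeholders in the prompt.
    imgpaths: List[str] = []

    @property
    def _llm_type(self) -> str:
        return 'clever_flamingo'

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        content_lst = {
            'prompt': f'### Human: {prompt}\n### Assistant: ',
            'imgpaths': self.imgpaths,
            'args': {'max_new_token': 512, 'do_sample': False},
        }
        payload = json.dumps({'content_lst': content_lst, 'typ': 'None'}).encode('utf8')
        response = requests.post(self.api_url, data=payload)
        return json.loads(response.text)['result']['response']

# Example usage (the wrapped LLM can then be composed into prompt templates and chains):
# llm = CleverFlamingoLLM(imgpaths=['path/to/image.jpg'])
# print(llm('Describe this image: <image>'))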

From Single Image to Multi-Images, Region Crops, and Videos: Given that Flamingo models accept arbitrarily interleaved image-text sequences as input, Clever Flamingo v2 has been trained on a diverse instruction dataset covering various image-text combinations. We integrate tasks involving multi-image comparison and reasoning, region understanding (based on cropped boxes), and video understanding (based on sampled frames). During training, it can process up to 16 images per context window, compared to 5 in OpenFlamingo pretraining.

(Figure: Image-Text Interleaving)

OCR Integrations: Clever Flamingo v2 incorporates OCR results (from PaddleOCR) into the instructions of OCR-related datasets (TextCaps, InfographicQA, VisualMRC, OCR-VQA, etc.) during training. OCR results with similar vertical pixel coordinates are merged into the same line and concatenated with '\t', preserving layout information. The OCR process is efficient, taking less than 0.2 seconds per image.

(Figure: OCR)
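To make the layout-preserving merge concrete, below is a hedged sketch of the kind of heuristic described above. It assumes OCR detections are given as (bounding box, text) pairs; the threshold value and the function name are illustrative, not the exact code used to build the training data.

# Illustrative sketch of merging OCR results by vertical position (not the exact
# implementation in this repo). Each detection is (bbox, text), where bbox is
# [x1, y1, x2, y2] in pixels.
def merge_ocr_lines(detections, y_threshold=10):
    # Sort detections top-to-bottom by the vertical center of their boxes.
    detections = sorted(detections, key=lambda d: (d[0][1] + d[0][3]) / 2)
    lines, current_line, current_y = [], [], None
    for bbox, text in detections:
        y_center = (bbox[1] + bbox[3]) / 2
        if current_y is not None and abs(y_center - current_y) > y_threshold:
            # Large vertical gap: start a new line.
            lines.append(current_line)
            current_line = []
        current_line.append((bbox[0], text))  # keep x for left-to-right ordering
        current_y = y_center
    if current_line:
        lines.append(current_line)
    # Within each line, order words left-to-right and join with '\t'.
    return '\n'.join('\t'.join(t for _, t in sorted(line)) for line in lines)

# Example: two words share the first line, one word lands on the second.
print(merge_ocr_lines([([10, 5, 60, 25], 'Total'), ([70, 6, 120, 26], '$42'),
                       ([10, 40, 80, 60], 'Thanks')]))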

<!-- - **Easy and Efficient Fine-tuning**. -->

About This Codebase

This repository aims to provide an easy-to-use codebase and foundation models for (instruction-)fine-tuning of multi-modal LLMs. It is built upon the OpenFlamingo🦩 codebase and OpenFlamingo-v2 models, which are powerful vision-language foundation models trained on massive interleaved image-text data.

This is an ongoing project. We are working on verifying the code and training better instruction foundation models.

Getting Started🚩

1. Install Dependencies

First, clone this repo:

git clone https://github.com/ChenDelong1999/instruct_flamingo.git

Our code is developed upon OpenFlamingo and therefore inherits its environment dependencies. One can use an existing OpenFlamingo environment to run our code, or create one with:

conda env create -f environment.yml

Note: please avoid using an environment that has the open-flamingo package installed via pip, in order to avoid import conflicts.

Additionally, since our method inserts LoRA adapters into the language model, PEFT is required. TensorBoard should also be installed for logging.

pip install peft tensorboard

The following packages are required for hosting the API and the gradio web demo:

pip install gradio uvicorn fastapi pydantic

2. Download Pretrained Weights

Model Inference🎈

Hosting Local API and Web Demo

We suggest hosting a local API first and then the local gradio web demo, so that the front-end and back-end are separated (easier to debug, since reloading the LLM is slow); the local API also makes model inference and evaluation much more convenient. You can start an API server with the following command. Please see api.py and make the necessary changes (e.g., the model checkpoint caching path).

CUDA_VISIBLE_DEVICES=0 uvicorn api:app --host=0.0.0.0 --port=1234 --log-level=info

This API can be called by the following code:

import json
import requests

url = 'http://0.0.0.0:1234/clever_flamingo'
content_lst = {
    # remember to add '<image>' to your instruction to indicate the location of image(s)
    'prompt': f'### Human: {<YOUR INSTRUCTION>}\n### Assistant: ',
    'imgpaths': ['<YOUR IMAGE 1>', '<YOUR IMAGE 2>'],
    'args': {
        'max_new_token': 1024,
        'num_beams': 1,
        'temperature': 1.0,
        'top_k': 20,
        'top_p': 1,
        'do_sample': True,
        'length_penalty': 1.0,
        'no_repeat_ngram_size': 3,
    }
}
d = {'content_lst': content_lst, 'typ': 'None'}
d = json.dumps(d).encode('utf8')
r = requests.post(url, data=d)
js = json.loads(r.text)

print(js['result']['response'])

Now you can start the gradio web demo; make sure you have checked the configurations in gradio_demo.py.

python gradio_demo.py

Prepare Fine-tuning Dataset(s)📜

Training samples are expected to be provided as .json files, where each file has the following structure:

[
  {
    "input": "An instruction or a question. Image path(s) (either absolute or relative) can be interleaved here as <img_path>path/to/the/image.png<img_path>, and there can be more than one image: <img_path>path/to/the/second/image.png<img_path>",
    "output": "Expected response or answer. The language modeling loss only operates on this part, and it contains text only."
  },
  {
    "input": "This input-output format can be applied to many kinds of datasets, such as captioning (the 'input' field can be left blank or set to 'Describe this image'), VQA, multi-image reasoning, and also text-only instruction datasets.",
    "output": "The output field must not be empty."
  }
]

In the instruction_dataset folder, we provide some scripts for converting existing datasets into this format.
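For reference, the following is a hedged sketch of how the <img_path> markers in the 'input' field can be split into interleaved text segments and image paths; the function name is illustrative, and the training dataloader may implement this differently.

# Illustrative parser for the '<img_path>...<img_path>' convention shown above
# (not necessarily the exact logic used by the training dataloader).
import json
import os

IMG_TOKEN = '<img_path>'

def parse_input(sample_input, img_dir=''):
    # Splitting on the marker yields alternating segments:
    # even indices are text, odd indices are image paths.
    parts = sample_input.split(IMG_TOKEN)
    texts = [p for i, p in enumerate(parts) if i % 2 == 0]
    images = [os.path.join(img_dir, p) for i, p in enumerate(parts) if i % 2 == 1]
    return texts, images

with open('path/to/dataset.json') as f:
    samples = json.load(f)

texts, images = parse_input(samples[0]['input'])
print(texts, images)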

The path of this .json dataset can be fed into training via --instruction_data='path/to/dataset.json'. Additionally, multiple datasets can be mixed by creating a dataset config file, which is structured as follows:

[
  {
    "dataset_name": "llava-complex-reasoning-77k",
    "json_path": "instruction_dataset/converted_datasets/llava/complex_reasoning_77k.json",
    "img_dir": "",
    "ratio": 77
  },
  {
    "dataset_name": "sharegpt",
    "json_path": "instruction_dataset/converted_datasets/sharegpt/sharegpt.json",
    "img_dir": "",
    "ratio": 45
  }
]

Here img_dir is the path to the image directory, used when image paths are provided as relative paths. The ratio field specifies the sampling ratio of each subset. Use --instruction_data='path/to/dataset_config.json' to feed the config for training.
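To illustrate what ratio means in practice, here is a hedged sketch that samples subsets in proportion to their ratios; the actual weighting and normalization in the training script may differ.

# Hedged sketch of ratio-based dataset mixing (the training script may weight
# subsets differently, e.g. also accounting for subset sizes).
import json
import random

with open('path/to/dataset_config.json') as f:
    config = json.load(f)

# Load every subset once; keep the sampling weights from the 'ratio' field.
datasets = {entry['dataset_name']: json.load(open(entry['json_path'])) for entry in config}
names = [entry['dataset_name'] for entry in config]
weights = [entry['ratio'] for entry in config]

def sample_one():
    # Pick a subset with probability proportional to its ratio, then a random sample from it.
    name = random.choices(names, weights=weights, k=1)[0]
    return random.choice(datasets[name])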

Note on dataset sampling: several arguments of instruction_tuning/train.py (see the training command below) control how the datasets are sampled during training.

Training🔥

First, a tuning_config should be specified. This config controls which groups of parameters receive LoRA adapters and which groups of parameters will be unfrozen. In the following example (open_flamingo/instruction_tuning/tuning_config/lora+perceiver.json), a LoRA adapter with a rank of 64 will be applied to the MPT language model (not including cross-attention layers), and the perceiver resampler will be unfrozen.

{
    "lora": true,
    "from_pretrained": false,
    "lora_target_modules": ["Wqkv", "out_proj", "up_proj", "down_proj"],
    "lora_r": 64,
    "lora_alpha": 64,
    "lora_dropout": 0.0,
    "unfrozen": ["perceiver"]
}

Set "lora": false to skip adding LoRA adapter to any model parameters. The "from_pretrained" field is only useful for Polite Flamingo and Clever Flamingo (v1) models, as they use Guanaco QLoRA on LLaMA-7B as initialization.

The following is an example of starting instruction tuning on OpenFlamingo-9B-v2; this setting consumes 62GB of memory on each GPU. One can lower --max_length and --batch_size, or set fewer parameters to be unfrozen in --tuning_config, to save memory.

export PYTHONPATH="$PYTHONPATH:open_flamingo"
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' torchrun --nnodes=1 --nproc_per_node=8 --master_port=29502 open_flamingo/instruction_tuning/train.py \
    --instruction_data 'instruction_dataset/configs/datasets.json' \
    --instruction_prompt_templete 'guanaco-no-prompt' \
    --run_name 'runs/0709-clever_flamingo_v2-8x80g-2k_context' \
    --seed 42 \
    --vision_encoder_path 'ViT-L-14-336' \
    --lm_path 'anas-awadalla/mpt-7b' \
    --tokenizer_path 'anas-awadalla/mpt-7b' \
    --freeze_lm_embeddings \
    --tuning_config 'open_flamingo/instruction_tuning/tuning_config/lora[lm+xqttn]+perceiver.json' \
    --resume_from_checkpoint '/path/to/cached/OpenFlamingo-9B-vitl-mpt7b.pt' \
    --max_length 2048 \
    --multiturn_augmentation 32 \
    --max_img 16 \
    --cross_attn_every_n_layers 4 \
    --batch_size 2 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps 4 \
    --precision 'bf16' \
    --train_num_samples 100000 \
    --workers 32 \
    --num_epochs 100 \
    --lr_scheduler constant \
    --warmup_steps 1000 \
    --logging_steps 500

--resume_from_checkpoint specifies the pretrained weights to load. Multiple checkpoints (e.g., when using a visual instruction foundation model) can be concatenated, separated by ',', and the model will load them one by one.
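As a hedged illustration of this sequential loading (the exact checkpoint format and key names used by train.py may differ), loading a comma-separated list could look like the following, where model is an OpenFlamingo model such as the one built in the tuning-config sketch above, and the second path is a hypothetical placeholder.

# Hedged sketch of loading ','-separated checkpoints one by one.
import torch

checkpoint_arg = '/path/to/cached/OpenFlamingo-9B-vitl-mpt7b.pt,/path/to/second_checkpoint.pt'
for path in checkpoint_arg.split(','):
    state_dict = torch.load(path, map_location='cpu')
    # Some checkpoints wrap the weights under a 'model_state_dict' key.
    if 'model_state_dict' in state_dict:
        state_dict = state_dict['model_state_dict']
    # strict=False lets later checkpoints overwrite only the weights they contain.
    model.load_state_dict(state_dict, strict=False)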

Acknowledgements🙏

This codebase is built upon OpenFlamingo. The implementation of the PEFT tuning config is inspired by Multimodal-GPT. Thanks for their wonderful work.

This project is under active development. Feel free to raise an issue if there are any bugs; we will try to fix them as soon as possible!

If you find this project useful, please consider citing the following paper:

@article{chen2023visual,
  title={Visual Instruction Tuning with Polite Flamingo},
  author={Chen, Delong and Liu, Jianfeng and Dai, Wenliang and Wang, Baoyuan},
  journal={arXiv preprint arXiv:2307.01003},
  year={2023}
}