<h2 align="center"> <a href="https://arxiv.org/pdf/2402.04236">CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations</a></h2> <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates.<br>

</h5> <details><summary>💡 We also have other vision-language projects that may interest you ✨. </summary><p> <!-- may -->

CogVLM: Visual Expert for Pretrained Language Models <br> CogAgent: A Visual Language Model for GUI Agents <br>

</p></details>

📣 News

😮 Highlights

CogCoM enables VLMs to solve various visual problems step-by-step with evidence, without involving external tools.

<p align="center"> <img src="assets/cases.png" width=100%> </p>

📖 Introduction to CogCoM

🤗 Demo

We support two interfaces for model inference: a web demo and a CLI. If you want to use the model in your own Python code, the CLI scripts are easy to adapt to your case.

Web Demo

A local GUI demo implemented with Gradio is included in the repository. Switch to the demo/ directory and run:

```bash
# Local gradio
python web_demo.py --from_pretrained cogcom-base-17b --local_tokenizer path/to/tokenizer --bf16 --english
```

CLI Demo

We also support interactive CLI inference using SAT.

```bash
# Launch an interactive environment
python cli_demo_sat.py --from_pretrained cogcom-base-17b --local_tokenizer path/to/tokenizer --bf16 --english
```

The program automatically downloads the SAT model and starts an interactive session in the command line (you can simply use the vicuna-7b-1.5 tokenizer). Generate replies by entering an instruction and pressing Enter. Enter `clear` to clear the conversation history and `stop` to exit the program.

We also support model-parallel inference, which splits the model across multiple (2/4/8) GPUs. The `--nproc-per-node=[n]` option in the launch command controls the number of GPUs used.
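The launch command itself is not reproduced in this section, so the following is only a minimal sketch of how such a command could be assembled from Python. It assumes `torchrun` is the launcher that receives `--nproc-per-node` (along with its standard `--standalone` and `--nnodes` options); the script name and remaining flags are copied from the single-GPU CLI command above. The helper name `build_parallel_command` is hypothetical.

```python
import shlex

# GPU counts listed in the text above.
ALLOWED_GPU_COUNTS = (2, 4, 8)


def build_parallel_command(n_gpus, tokenizer_path="path/to/tokenizer"):
    """Assemble an argument list for model-parallel CLI inference.

    Assumption: torchrun is the launcher; this helper is illustrative and
    not part of the repository.
    """
    if n_gpus not in ALLOWED_GPU_COUNTS:
        raise ValueError(f"n_gpus must be one of {ALLOWED_GPU_COUNTS}, got {n_gpus}")
    return [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc-per-node={n_gpus}",  # number of GPUs to split the model across
        "cli_demo_sat.py",
        "--from_pretrained", "cogcom-base-17b",
        "--local_tokenizer", tokenizer_path,
        "--bf16",
        "--english",
    ]


# Print a shell-ready command line; pass the list to subprocess.run to launch it.
print(shlex.join(build_parallel_command(2)))
```

Building the command as a list rather than a single string avoids shell-quoting problems if the tokenizer path contains spaces.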

Tips:

For example, to run with `--fp16` and `--quant 8`:

```bash
python cli_demo_sat.py --from_pretrained cogcom-base-17b --fp16 --quant 8
```

🐳 Model Zoo

If you run demo/cli_demo*.py from the code repository, the SAT or Hugging Face weights are downloaded automatically. Alternatively, you can download the necessary weights manually.

| Model name | Input resolution | Introduction | Huggingface model | SAT model |
| --- | --- | --- | --- | --- |
| cogcom-base-17b | 490 | Supports grounding, OCR, and CoM. | coming soon | link |
| cogcom-grounding-17b | 490 | Supports grounding, OCR, and CoM. | coming soon | link |
| cogcom-chat-17b | 490 | Supports chat, grounding, OCR, and CoM. | coming soon | link |

⚙️ Requirements and Installation

We recommend installing the requirements as follows.

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

[!Warning]

<div align="left"> <b> 🚨 Please install a proper version of `pydantic` for smooth inference, as mentioned in [issue #3](https://github.com/THUDM/CogCoM/issues/3). </b> </div>

🗝️ Training & Validating

Finetuning CogCoM

You may want to use CogCoM in your own task, which may need a different output style or domain knowledge. All finetuning code is located in the finetune.sh and finetune.py files.

Hardware requirement

Evaluation

<details> <summary>Click to view results on GQA, TallyVQA, TextVQA, ST-VQA. </summary> <table> <tr> <td>Method</td> <td>GQA</td> <td>TallyVQA-s</td> <td>TallyVQA-c</td> <td>TextVQA</td> <td>ST-VQA</td> </tr> <tr> <td>Flamingo</td> <td>-</td> <td>-</td> <td>-</td> <td>54.1</td> <td>-</td> </tr> <tr> <td>GIT</td> <td>-</td> <td>-</td> <td>-</td> <td>59.8</td> <td>-</td> </tr> <tr> <td>GIT2</td> <td>-</td> <td>-</td> <td>-</td> <td>67.3</td> <td>-</td> </tr> <tr> <td>BLIP-2</td> <td>44.7*</td> <td>-</td> <td>-</td> <td>-</td> <td>21.7</td> </tr> <tr> <td>InstructBLIP</td> <td>49.5*</td> <td>-</td> <td>-</td> <td>-</td> <td>50.7*</td> </tr> <tr> <td>Qwen-VL</td> <td>49.5*</td> <td>-</td> <td>-</td> <td>-</td> <td>50.7*</td> </tr> <tr> <td>CogCoM</td> <td>71.7</td> <td>84.0</td> <td>70.1</td> <td>71.1</td> <td>70.0</td> </tr> </table> </details> <details> <summary>Click to view results of grounding benchmarks. </summary> <table> <tr> <td></td> <td>RefCOCO</td> <td></td> <td></td> <td>RefCOCO+</td> <td></td> <td></td> <td>RefCOCOg</td> <td></td> </tr> <tr> <td></td> <td>val</td> <td>testA</td> <td>testB</td> <td>val</td> <td>testA</td> <td>testB</td> <td>val</td> <td>test</td> </tr> <tr> <td>CogCoM-grounding-generalist</td> <td>92.34</td> <td>94.57</td> <td>89.15</td> <td>88.19</td> <td>92.80</td> <td>82.08</td> <td>89.32</td> <td>90.45</td> </tr> </table> </details>

🍭 Examples

CogCoM adapts flexibly to different multimodal scenarios, including evidential visual reasoning, visual grounding, grounded captioning, image captioning, multiple choice, and detailed captioning.

<p align="center"> <img src=assets/app_case.jpg width=100% /> </p>

💡 Cookbook

Task Prompts

  1. General Multi-Round Dialogue: Say whatever you want.

  2. Chain of Manipulations: Explicitly launching CoM reasoning.

    • We randomly add launching prompts to the CoM chains used for solving detailed visual problems, so you can explicitly ask CogCoM to use the CoM mechanism by adding the following launching prompt (we randomly generate numerous launching prompts for flexibility; see com_dataset.py for all details):
        Please solve the problem gradually via a chain of manipulations, where in each step you can selectively adopt one of the following manipulations GROUNDING(a phrase)->boxes, OCR(an image or a region)->texts, CROP_AND_ZOOMIN(a region on given image)->new_image, CALCULATE(a computable target)->numbers, or invent a new manipulation, if that seems helpful. {QUESTION}
    
  3. Visual Grounding. Our model is compatible with the grounding instructions from MultiInstruct and CogVLM. We provide basic usage of three functionalities here:

    • Visual Grounding (VG): Returning grounding coordinates (bounding box) based on the description of objects. Use any template from instruction template. For example (replacing <expr> with the object's description):

      "Find the region in image that "<expr>" describes."

    • Grounded Captioning (GC): Providing a description based on bounding box coordinates. Use a template from instruction template. For example (replacing <objs> with the position coordinates):

      "Describe the content of [[086,540,400,760]] in the picture."

    • Image Description with Coordinates (IDC): Image description with grounding coordinates (bounding box). Use any template from caption_with_box template as model input. For example:

      Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?
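The prompts above can be assembled with plain string substitution. The sketch below is illustrative only: the CoM launching prompt and the VG template are copied from the text above, while the GC template with its `<objs>` placeholder is an assumption inferred from the example, and the helper names (`com_prompt`, `grounding_prompt`, `grounded_caption_prompt`) are hypothetical, not functions from the repository.

```python
# Launching prompt copied from the Chain of Manipulations item above.
COM_LAUNCH_PROMPT = (
    "Please solve the problem gradually via a chain of manipulations, where in "
    "each step you can selectively adopt one of the following manipulations "
    "GROUNDING(a phrase)->boxes, OCR(an image or a region)->texts, "
    "CROP_AND_ZOOMIN(a region on given image)->new_image, "
    "CALCULATE(a computable target)->numbers, or invent a new manipulation, "
    "if that seems helpful. {QUESTION}"
)

# VG template copied from the text; GC template is an assumed reconstruction.
VG_TEMPLATE = 'Find the region in image that "<expr>" describes.'
GC_TEMPLATE = "Describe the content of <objs> in the picture."


def com_prompt(question):
    """Attach the CoM launching prompt to a question."""
    return COM_LAUNCH_PROMPT.replace("{QUESTION}", question)


def grounding_prompt(expr):
    """Fill the visual grounding template with an object description."""
    return VG_TEMPLATE.replace("<expr>", expr)


def grounded_caption_prompt(box):
    """Fill the grounded captioning template with a [[x1,y1,x2,y2]] string."""
    return GC_TEMPLATE.replace("<objs>", box)


print(com_prompt("What is written on the sign?"))
print(grounding_prompt("the red car"))
print(grounded_caption_prompt("[[086,540,400,760]]"))
```

Using `str.replace` instead of `str.format` keeps the helpers safe if a question or description itself contains braces.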

Coordinate format: The bounding box coordinates in the model's input and output use the format [[x1, y1, x2, y2]], with the origin at the top-left corner, the x-axis pointing right, and the y-axis pointing down. (x1, y1) and (x2, y2) are the top-left and bottom-right corners, respectively, and each value is a relative coordinate multiplied by 1000 (zero-padded to three digits).
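As a minimal sketch of this coordinate format, the helpers below convert a pixel-space box to the `[[x1,y1,x2,y2]]` string and back. The function names are hypothetical, and two details are assumptions not stated above: values are floored rather than rounded, and they are clamped to 0..999 so that a coordinate on the right or bottom image edge still fits in three digits.

```python
import re


def _to_relative(value, size):
    # Relative coordinate multiplied by 1000, floored (assumption) and
    # clamped to 0..999 (assumption) so it always fits in three digits.
    return max(0, min(999, int(value * 1000 // size)))


def format_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to the [[x1,y1,x2,y2]] string."""
    x1, y1, x2, y2 = box
    parts = (
        _to_relative(x1, width),
        _to_relative(y1, height),
        _to_relative(x2, width),
        _to_relative(y2, height),
    )
    return "[[" + ",".join(f"{p:03d}" for p in parts) + "]]"


# Four three-digit numbers, optionally separated by spaces after the commas.
_BOX_RE = re.compile(r"\[\[(\d{3}),\s*(\d{3}),\s*(\d{3}),\s*(\d{3})\]\]")


def parse_boxes(text, width, height):
    """Find every [[x1,y1,x2,y2]] in text and map it back to pixel coordinates."""
    boxes = []
    for match in _BOX_RE.finditer(text):
        x1, y1, x2, y2 = (int(group) for group in match.groups())
        boxes.append(
            (x1 * width / 1000, y1 * height / 1000,
             x2 * width / 1000, y2 * height / 1000)
        )
    return boxes


print(format_box((86, 540, 400, 760), width=1000, height=1000))
print(parse_boxes("Describe the content of [[086,540,400,760]] in the picture.", 500, 1000))
```

Because the encoding quantizes to thousandths of the image size, a round trip through `format_box` and `parse_boxes` recovers pixel coordinates only approximately.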

FAQ

🔒 License

The code in this repository is open source under the Apache-2.0 license, while the use of the CogCoM model weights must comply with the Model License.

✒️ Citation & Acknowledgements

```
@article{qi2024cogcom,
  title={CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations},
  author={Qi, Ji and Ding, Ming and Wang, Weihan and Bai, Yushi and Lv, Qingsong and Hong, Wenyi and Xu, Bin and Hou, Lei and Li, Juanzi and Dong, Yuxiao and Tang, Jie},
  journal={arXiv preprint arXiv:2402.04236},
  year={2024}
}
```