Awesome

<div align=center> <img src="figures/logo.png" width="280px"> </div> <h2 align="center"> <a href="https://arxiv.org/abs/2411.10440v2">LLaVA-CoT: Let Vision Language Models Reason Step-by-Step</a></h2> <h5 align="center">This project was previously named LLaVA-o1. We have renamed it to LLaVA-CoT. </h5> <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update.</h5> <h5 align=center>

</h5> <h5 align="center">本项目希望构建开源慢思考大模型，由北大-兔展AIGC联合实验室共同发起。</h5>

📣 News

[2024/11/28] We've released the dataset: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
[2024/11/25] We've released the code for dataset generation: dataset_generation/generate.py
[2024/11/23] We've released the Gradio App: https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot
[2024/11/21] LLaVA-o1 is renamed to LLaVA-CoT https://arxiv.org/abs/2411.10440v2.
[2024/11/20] We've released the pretrained weights: https://huggingface.co/Xkev/Llama-3.2V-11B-cot
[2024/11/18] We've released our paper: https://arxiv.org/abs/2411.10440
[2024/11/18] Welcome to watch 👀 this repository for the latest updates.

🔥 Highlights

LLaVA-CoT is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

Our 11B model outperforms Gemini-1.5-pro,GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

🚀 Demos

LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.

Reasoning Problems

Question	<img src="figures/reasoning.png" width="400"> <br> Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6
Llama-3.2-11B-Vision-Instruct	To determine the correct answer, we need to follow the instructions provided. <br> The first instruction is to subtract all tiny shiny balls. <br> Since there are <span style="color: red;">three tiny shiny balls</span> in the image, we would be left with no shiny balls after performing this action. <br> The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are <span style="color: red;">no shiny purple objects</span> present in the image. Therefore, we cannot subtract any purple objects. <br> Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is <span style="color: red;">C. 2</span>, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.
LLaVA-CoT	*Summary stage: What's the problem? What should I do?* <br> <span style="color: gray;"><SUMMARY></span> I will examine the image to identify and count all the objects, focusing on the <span style="color: green;">tiny shiny balls and purple objects</span>. Then, I will subtract these objects from the total count and determine the number of remaining objects. <span style="color: gray;"></SUMMARY></span> <br>*Caption stage: What can I know from the image?* <br> <span style="color: gray;"><CAPTION></span> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are <span style="color: green;">small shiny green spheres</span> and <span style="color: green;">a purple cylinder</span>. <span style="color: gray;"></CAPTION></span> <br> *Reasoning stage: How to solve the problem step-by-step?* <br> <span style="color: gray;"><REASONING></span> First, I will count all the objects in the image: there are <span style="color: green;">10 objects in total</span>. Next, I will identify and subtract the tiny shiny balls. There is <span style="color: green;">one small shiny green sphere</span>. Then, I will identify and subtract the purple objects. There is <span style="color: green;">one purple cylinder</span>. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that <span style="color: green;">8 objects remain</span>. <span style="color: gray;"></REASONING></span> <br> *Conclusion stage: What is the final answer?* <br> <span style="color: gray;"><CONCLUSION></span> B <span style="color: gray;"></CONCLUSION></span>

Question

<img src="figures/reasoning.png" width="400"> Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6

Llama-3.2-11B-Vision-Instruct

To determine the correct answer, we need to follow the instructions provided. The first instruction is to subtract all tiny shiny balls. Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action. The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects. Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.

LLaVA-CoT

Summary stage: What's the problem? What should I do? <SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY> Caption stage: What can I know from the image? <CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION> Reasoning stage: How to solve the problem step-by-step? <REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING> Conclusion stage: What is the final answer? <CONCLUSION> B </CONCLUSION>

Science Problems

Question	<img src="figures/science.png" width="400"> <br> An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force ( P = 2 , \text{kN} ), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis. <br> Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa
Llama-3.2-11B-Vision-Instruct	To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis: <br> ( G = \frac{F}{A \gamma} ) <br> where ( G ) is the shear modulus, ( F ) is the applied force, ( A ) is the area of the block, and ( \gamma ) is the shear strain. <br> Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain ( \gamma ). Since the displacement is horizontal and <span style="color: red;">the original length of the block is 400 mm</span>, the shear strain ( \gamma ) is: <br> [ \gamma = \frac{\Delta x}{L} = \frac{2 , \text{mm}}{400 , \text{mm}} = 0.005 ] <br> Next, we calculate the area ( A ) of the block. The width of the block is given as 100 mm, and assuming <span style="color: red;">the height of the block is also 200 mm</span> (as given in the image analysis), the area ( A ) is: <br> [ A = 100 , \text{mm} \times 200 , \text{mm} = 20000 , \text{mm}^2 ] <br> Now, we can substitute these values into the formula for shear modulus: <br> [ G = \frac{F}{A \gamma} = \frac{2000 , \text{N}}{20000 , \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 , \text{MPa} ] <br> Therefore, the correct answer is <span style="color: red;">C: 20 MPa</span>.
LLaVA-CoT	*Summary stage: What's the problem? What should I do?* <br> <span style="color: gray;"><SUMMARY></span> To solve the problem, I will <span style="color: green;">determine the shear modulus</span> of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will <span style="color: green;">use the formula for shear modulus</span> and calculate the result. <span style="color: gray;"></SUMMARY></span> <br> *Caption stage: What can I know from the image?* <br> <span style="color: gray;"><CAPTION></span> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force ( P = 2 , \text{kN} ) is applied. The block's dimensions are given as <span style="color: green;">400 mm in length, 200 mm in height</span>, and 100 mm in width. <span style="color: gray;"></CAPTION></span> <br> *Reasoning stage: How to solve the problem step-by-step?* <br> <span style="color: gray;"><REASONING></span> The shear modulus ( G ) of the polymer can be calculated using the formula: <br> [ G = \frac{F}{A \cdot \frac{\Delta x}{L}} ] <br> where: <br> - ( F ) is the applied force (2 kN), <br> - ( A ) is the area of the block's top surface (<span style="color: green;">100 mm x 400 mm = 40,000 mm²</span>), <br> - ( \Delta x ) is the horizontal displacement (2 mm), <br> - ( L ) is the height of the block (<span style="color: green;">200 mm</span>). <br> First, convert the force to Newtons: ( F = 2 , \text{kN} = 2000 , \text{N} ). <br> Next, calculate the shear modulus: <br> [ G = \frac{2000}{40,000 \cdot \frac{2}{200}} ] <br> [ G = \frac{2000}{40,000 \cdot 0.01} ] <br> [ G = \frac{2000}{400} ] <br> [ G = 5 , \text{MPa} ] <br> <span style="color: gray;"></REASONING></span> <br> *Conclusion stage: What is the final answer?* <br> <span style="color: gray;"><CONCLUSION></span> <span style="color: green;">B: 5 MPa</span> <span style="color: gray;"></CONCLUSION></span>

Question

<img src="figures/science.png" width="400"> An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force ( P = 2 , \text{kN} ), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis. Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa

Llama-3.2-11B-Vision-Instruct

To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis: ( G = \frac{F}{A \gamma} ) where ( G ) is the shear modulus, ( F ) is the applied force, ( A ) is the area of the block, and ( \gamma ) is the shear strain. Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain ( \gamma ). Since the displacement is horizontal and the original length of the block is 400 mm, the shear strain ( \gamma ) is: [ \gamma = \frac{\Delta x}{L} = \frac{2 , \text{mm}}{400 , \text{mm}} = 0.005 ] Next, we calculate the area ( A ) of the block. The width of the block is given as 100 mm, and assuming the height of the block is also 200 mm (as given in the image analysis), the area ( A ) is: [ A = 100 , \text{mm} \times 200 , \text{mm} = 20000 , \text{mm}^2 ] Now, we can substitute these values into the formula for shear modulus: [ G = \frac{F}{A \gamma} = \frac{2000 , \text{N}}{20000 , \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 , \text{MPa} ] Therefore, the correct answer is C: 20 MPa.

LLaVA-CoT

Summary stage: What's the problem? What should I do? <SUMMARY> To solve the problem, I will determine the shear modulus of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will use the formula for shear modulus and calculate the result. </SUMMARY> Caption stage: What can I know from the image? <CAPTION> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force ( P = 2 , \text{kN} ) is applied. The block's dimensions are given as 400 mm in length, 200 mm in height, and 100 mm in width. </CAPTION> Reasoning stage: How to solve the problem step-by-step? <REASONING> The shear modulus ( G ) of the polymer can be calculated using the formula: [ G = \frac{F}{A \cdot \frac{\Delta x}{L}} ] where: - ( F ) is the applied force (2 kN), - ( A ) is the area of the block's top surface (100 mm x 400 mm = 40,000 mm²), - ( \Delta x ) is the horizontal displacement (2 mm), - ( L ) is the height of the block (200 mm). First, convert the force to Newtons: ( F = 2 , \text{kN} = 2000 , \text{N} ). Next, calculate the shear modulus: [ G = \frac{2000}{40,000 \cdot \frac{2}{200}} ] [ G = \frac{2000}{40,000 \cdot 0.01} ] [ G = \frac{2000}{400} ] [ G = 5 , \text{MPa} ] </REASONING> Conclusion stage: What is the final answer? <CONCLUSION> B: 5 MPa </CONCLUSION>

🛠️ Usage

Pretrained weights

You can download the pretrained weights from the Huggingface: Xkev/Llama-3.2V-11B-cot.

Dataset

You can download the dataset from the Huggingface: Xkev/LLaVA-CoT-100k.

Inference

You can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference.

If you want to use perform inference time scaling, you can use code provided in inference_demo/inference_demo.py.

Finetuning

You may use any repository that supports Llama-3.2-11B-Vision-Instruct for finetuning.

We recommend using llama-recipes.

📝 Citation

If you find this paper useful, please consider staring 🌟 this repo and citing 📑 our paper:

@misc{xu2024llavacot,
      title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
      author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440},
}

🙏 Acknowledgement

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to LLAMA 3.2 COMMUNITY LICENSE AGREEMENT, and Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations.
The template is modified from Chat-Univi and LLaVA.