Home

Awesome

<img src="figures/meteor_emoji.png" style="vertical-align: 0px;" :height="50px" width="30px">Meteor: Mamba-based traversal of rationale for Large Language and Vision Models [ArXiv]

πŸ“° News

ezgif-1-389577e9b3

Official PyTorch implementation code for realizing the technical part of Mamba-based traversal of rationale (Meteor) to improve numerous vision language performances with efficient model size. This code is developed from scratch. so I have been trying to improve the readibility and simplicity of the code, compared with LLaVA which has relatively complexly structured code.

The contributions of Meteor can be simply summarized as the following lists

πŸ’‘ Highlights

Open-source LLVMs with Standard Model Size

LLVMsSQA-IMGPOPEMMEMMBMathVistaSEED-IMGMM-VetLLaVA-W
Yi-VL-6B71.782.5191564.229.767.532.151.9
LLaVA-NeXT-7B70.186.5185169.634.670.243.972.3
MM1-7B72.686.6185872.335.970.942.1-
Meteor-7B88.388.7222982.953.475.057.387.1

Open-source LLVMs with Large Model Sizes

LLVMsAI2DChartQAMMEMMBMathVistaMM-VetLLaVA-W
InternVL1.5-40B79.068.0217582.247.748.9-
InternVL1.5-26B80.783.8218882.253.562.8-
MM1-30B--206975.139.448.7-
MiniGemini-34B--210579.638.953.0-
MiniGemini-HD-34B--214180.643.359.3-
LLaVA-NeXT-8B71.669.5197272.137.5-80.1
LLaVA-NeXT-34B74.968.7203079.346.057.488.8
LLaVA-NeXT-72B77.477.0215980.546.6-89.2
LLaVA-NeXT-110B80.480.4220180.549.0-90.4
Meteor-7B77.974.9222982.953.457.387.1

Closed-source LLVMs

LLVMsSQA-IMGAI2DChartQAMMEMMBMathVistaSEED-IMGMMStar
Qwen-VL-Plus71.675.978.1218367.043.372.739.7
Gemini-Pro80.173.974.1193373.645.270.741.6
GPT-4V84.678.278.5192777.049.969.146.1
Meteor-7B88.377.974.9222982.953.475.052.8

😎 How to run demo?

Run the following order.

bash install
pip install -r requirements.txt

and run the demo (Enjoy Meteor).

python demo.py

(Optional) If you want to make πŸ“» Gradio demo by yourself, then you should run the following file or change it to fit your style.

python app.py

(Optional) If you want to enjoy the curated question-ratinale-answer triples, then you should debug the following file.

python check_dataset.py

(Optional) If you want to conduct the vision language evaluation, then you should run the following file.

bash run

πŸ“‹ Gathered & Curated Dataset Description

Gathered Total: 2130830, 2.1M

------------------------------
* Real-World Image: 755k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
    - Math with Vision: 180k
    - Math with Text only: 566k
------------------------------

- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)

Curated Total: 1059382, 1.1M

--------------------------------------------
Real-World Image: 338K
Document & Chart & Diagram & Sign & Symbol: 379K
Math: 342K
     Math with Vision: 165K
     Math with Text only: 177K
--------------------------------------------


- ShareGPT4V-Caption (72507, 73K)
- ShareGPT4V-Instruction (266072, 266K)
- MiniGemini-Instruction (26885, 27K)
- DocDownstream (298748, 299K)
- DocReason (53065, 53K)
- GLLaVA (162378, 162K)
- MathVision (2992, 3K)
- MathInstruct (81496, 81K)
- MathPlus (95239, 95K)

πŸš€ Download Training Datasets

We collect the following eight datasets. For MiniGemini, we selectively use data samples only for DocVQA, ChartQA, DVQA, and AI2D. Therefore, it is no need for you to download all data samples for MiniGemini.

Gathered Dataset Layout

Meteor_Dataset_Path
β”œβ”€β”€ llava                                                       # ShareGPT4V
β”‚   └── llava_pretrain                  
β”‚       └── images                  
β”œβ”€β”€ coco                                                        # ShareGPT4V
β”‚   └── train2017                   
β”œβ”€β”€ sam                                                         # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ gqa                                                         # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ ocr_vqa                                                     # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ textvqa                                                     # ShareGPT4V
β”‚   └── train_images                    
β”œβ”€β”€ vg                                                          # ShareGPT4V
β”‚   β”œβ”€β”€ VG_100K                 
β”‚   └── VG_100K_2                   
β”œβ”€β”€ share_textvqa                                               # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ web-celebrity                                               # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ web-landmark                                                # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ wikiart                                                     # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ share_textvqa                                               # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ docvqa                                                      # MiniGemini
β”‚   └── images                  
β”œβ”€β”€ chartqa                                                     # MiniGemini
β”‚   └── train                   
β”‚       └── images                  
β”œβ”€β”€ dvqa                                                        # MiniGemini
β”‚   └── images                  
β”œβ”€β”€ ai2d                                                        # MiniGemini
β”‚   └── images                  
β”œβ”€β”€ imgs                                                        # DocDownstream & DocReason
β”‚   └── ChartQA
β”‚   └── DUE_Benchmark
β”‚       └── DeepForm
β”‚       └── DocVQA
β”‚       └── InfographicsVQA
β”‚       └── KleisterCharity
β”‚       └── TabFact
β”‚       └── WikiTableQuestions
β”‚   └── TextCaps
β”‚   └── TextVQA
β”‚   └── VisualMRC
β”œβ”€β”€ geo3k                                                       # GLLaVA
|   └── train
β”œβ”€β”€ geoqa_plus                                                  # GLLaVA
β”œβ”€β”€ images                                                      # MathVision
|
β”œβ”€β”€ sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
β”œβ”€β”€ sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
β”œβ”€β”€ train.jsonl                                                 # DocDownstream
β”œβ”€β”€ detailed_explanation.jsonl                                  # DocReason
β”œβ”€β”€ minigemini_instruction.json                                 # MiniGemini-Instruction
β”œβ”€β”€ gllava_align.parquet                                        # GLLaVA-Align
β”œβ”€β”€ gllava_qa.parquet                                           # GLLaVA-QA
β”œβ”€β”€ mathvision.parquet                                          # MathVision
β”œβ”€β”€ MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus

πŸ“‚ Evaluation Benchmarks

These are the list of evaluation datasets. If you completely download them, the dataset should be placed in the folder by the following below directory layout.

Evaluation Dataset Directory Layout

Evaluation_Dataset_Path
β”œβ”€β”€ LLVisionQA-QBench               # Q-Bench
β”œβ”€β”€ ScienceQA                       # SQA-IMG
β”œβ”€β”€ ai2d                            # AI2D
β”œβ”€β”€ chartqa                         # ChartQA
β”œβ”€β”€ SEED-Bench                      # SEED-IMG
β”œβ”€β”€ POPE                            # POPE
β”œβ”€β”€ HallusionBench                  # HallusionBench
β”œβ”€β”€ MME_Benchmark_release_version   # MME
β”œβ”€β”€ MathVista                       # MathVista
β”œβ”€β”€ MMBench                         # MMB
β”œβ”€β”€ mm-vet                          # MM-Vet
β”œβ”€β”€ llava-bench-in-the-wild         # LLaVA Bench in the Wild
β”œβ”€β”€ MMStar                          # MMStar
└── MathVerse                       # MathVerse