GLaMM <img src="images/logos/face.png" height="40">: Pixel Grounding Large Multimodal Model [CVPR 2024]

<p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

Hanoona Rasheed*, Muhammad Maaz*, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang and Fahad Khan

Mohamed bin Zayed University of AI, Australian National University, Aalto University, Carnegie Mellon University, University of California - Merced, Linköping University, Google Research

Paper | Dataset | Demo | Website | Video


πŸ“’ Latest Updates


<img src="images/logos/face.png" height="40"> GLaMM Overview

Grounding Large Multimodal Model (GLaMM) is an end-to-end trained LMM that provides visual grounding capabilities with the flexibility to process both image and region inputs. This enables the new unified task of Grounded Conversation Generation, which combines phrase grounding, referring expression segmentation, and vision-language conversation. Equipped with detailed region understanding, pixel-level grounding, and conversational abilities, GLaMM can interact with visual inputs provided by the user at multiple levels of granularity.


πŸ† Contributions


πŸš€ Dive Deeper: Inside GLaMM's Training and Evaluation

Delve into the core of GLaMM with our detailed guides on the model's Training and Evaluation methodologies.

πŸ‘οΈπŸ’¬ GLaMM: Grounding Large Multimodal Model

The components of GLaMM are cohesively designed to handle both textual and optional visual prompts (image-level and region of interest), allowing interaction at multiple levels of granularity and producing grounded text responses.

<p align="center"> <img src="images/glamm/model_arch.png" alt="GLaMM Architectural Overview"> </p>

πŸ” Grounding-anything Dataset (GranD)

The Grounding-anything Dataset (GranD) is a large-scale dataset built with an automated annotation pipeline for detailed region-level understanding and segmentation masks. GranD comprises 7.5M unique concepts anchored in a total of 810M regions, each with a segmentation mask.

<p align="center"> <img src="images/glamm/dataset_pipeline.png" alt="Dataset Annotation Pipeline"> </p>

Below we present some examples of the GranD dataset.

<p align="center"> <img src="images/glamm/grand_sample_2.png" alt="GranD Dataset Sample"> </p> <p align="center"> <img src="images/glamm/grand_sample_1.png" alt="GranD Dataset Sample"> </p>

πŸ“š Building GranD-f for Grounded Conversation Generation

The GranD-f dataset is designed for the GCG task and comprises about 214K image-grounded text pairs, providing higher-quality data for the fine-tuning stage.

<p align="center"> <img src="images/glamm/grand_f_samples.png" alt="GranD-f Dataset Sample"> </p>

πŸ€– Grounded Conversation Generation (GCG)

We introduce GCG, a task that generates image-level captions in which phrases are tied to segmentation masks, strengthening the model's visual grounding in natural-language captioning.

<p align="center"> <img src="images/glamm/results_7_gcg_combined.png" alt="Results_GCG"> </p> <p align="center"> <img src="images/tables/GCG_Table.png" alt="GCG_Table"> </p>

πŸš€ Downstream Applications

🎯 Referring Expression Segmentation

Our model excels in creating segmentation masks from text-based referring expressions.

<p align="center"> <img src="images/glamm/results_3_refseg.png" alt="Results_RefSeg"> </p> <p align="center"> <img src="images/tables/ReferSeg_Table.png" alt="Table_RefSeg"> </p>

πŸ–ΌοΈ Region-Level Captioning

GLaMM generates detailed region-specific captions and answers reasoning-based visual questions.

<p align="center"> <img src="images/glamm/results_4_regcap.png" alt="Results_RegionCap"> </p> <p align="center"> <img src="images/tables/Region_Cap_Table.png" alt="Table_RegionCap"> </p>

πŸ“· Image Captioning

GLaMM provides high-quality image captions that compare favorably to those of specialized captioning models.

<p align="center"> <img src="images/glamm/results_6_cap.png" alt="Results_Cap"> </p>

πŸ’¬ Conversational Style Question Answering

GLaMM demonstrates its prowess in engaging in detailed, region-specific, and grounded conversations. This highlights its adaptability in intricate visual-language interactions and its robust retention of the reasoning capabilities inherent to LLMs.

<p align="center"> <img src="images/glamm/results_4_conv.png" alt="Results_Conv"> </p>
<p align="center"> <img src="images/glamm/results_5_conv.png" alt="Results_Conv"> </p>

πŸ“œ Citation

  @article{hanoona2023GLaMM,
          title={GLaMM: Pixel Grounding Large Multimodal Model},
          author={Rasheed, Hanoona and Maaz, Muhammad and Shaji, Sahal and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Xing, Eric and Yang, Ming-Hsuan and Khan, Fahad S.},
          journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2024}
  }

πŸ™ Acknowledgement

We are thankful to LLaVA, GPT4ROI, and LISA for releasing their models and code as open-source contributions.


<img src="images/logos/IVAL_logo.png" width="200" height="100"> <img src="images/logos/Oryx_logo.png" width="100" height="100"> <img src="images/logos/MBZUAI_logo.png" width="360" height="85">