LLaVA-SpaceSGG

Paper | Dataset | Benchmark | Models

Overview

LLaVA-SpaceSGG is a multimodal large language model (MLLM) designed to tackle the challenges of Scene Graph Generation (SGG) by improving spatial relation modeling and enabling open-vocabulary generalization. SGG converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks.
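
As a rough illustration of the output structure (not the model's actual schema), a scene graph can be thought of as a set of (subject, relation, object) triples with spatial predicates, e.g.:

    # Minimal sketch: a scene graph as (subject, relation, object) triples.
    # Hypothetical example only; the actual SpaceSGG output format may differ.
    from dataclasses import dataclass

    @dataclass
    class Triple:
        subject: str   # e.g. "cup"
        relation: str  # e.g. "on top of" (a spatial relation)
        obj: str       # e.g. "table"

    scene_graph = [
        Triple("cup", "on top of", "table"),
        Triple("lamp", "behind", "laptop"),
    ]
    for t in scene_graph:
        print(f"{t.subject} --{t.relation}--> {t.obj}")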

Key Features

Achievements


Installation

Clone the repository and set up the environment:

git clone https://github.com/Endlinc/LLaVA-SpaceSGG.git
cd LLaVA-SpaceSGG
pip install -r requirements.txt
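
As a quick sanity check of the environment (assuming requirements.txt installs PyTorch, which is typical for LLaVA-based projects but not verified here):

    # Hedged check: confirms PyTorch imports and reports GPU availability.
    import torch
    print(torch.__version__, torch.cuda.is_available())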

Data Preparation

Stage 1: Generate Point Clouds and Layered Objects

The scene graph description generation process in Stage 1 is built upon the All-Seeing v2 project. Please refer to their repository for detailed instructions and implementation.

  1. Generate Point Cloud from RGB and Depth Images (a conceptual back-projection sketch follows this list):
    python d2p.py --dataset-path dataset/coco --scale-factor 5000 --world-coordinates
    
  2. Cluster Objects by Depth into Layers (see the depth-binning sketch after this list):
    python layers_aggregation.py \
        --input-file asv2_level.json \
        --depth-dir ./depth-output \
        --mask-dir ./mask-output \
        --output-file processed_annotations.json \
        --dataset-base /home/ming/Datasets/all-seeing-v2/materials/ \
        --data-prefix ../data/
    
  3. Generate Multiview Layered Objects:
    python multiview_layers.py \
        --input-file asv2_level.json \
        --point-cloud-dir ./point_clouds \
        --mask-dir ./mask-output \
        --output-file processed_annotations.json \
        --dataset-base /home/ming/Datasets/all-seeing-v2/materials/ \
        --data-prefix ../data/
    
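The conceptual operation behind step 1 above is a standard pinhole back-projection of each depth pixel into 3D. The sketch below is an illustration only, assuming hypothetical camera intrinsics (fx, fy, cx, cy) and raw depth values converted to metres by dividing by the --scale-factor; the actual implementation in d2p.py may differ.

    # Conceptual sketch of depth-to-point-cloud back-projection (not d2p.py itself).
    # Assumes a pinhole camera with known intrinsics and raw depth values that
    # become metres after division by `scale_factor`.
    import numpy as np

    def depth_to_point_cloud(depth_raw, fx, fy, cx, cy, scale_factor=5000.0):
        """Back-project an (H, W) raw depth map into an (N, 3) point cloud."""
        z = depth_raw.astype(np.float64) / scale_factor      # depth in metres
        h, w = z.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]                      # drop invalid depths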

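Step 2 then groups segmented objects into depth layers. One possible grouping rule, binning each object by its mean depth, is sketched below; the actual logic in layers_aggregation.py may differ.

    # Hypothetical sketch of assigning objects to depth layers (not the real script).
    def assign_depth_layers(object_depths, layer_width=1.0):
        """Map {object_id: mean_depth_in_metres} to {object_id: layer_index}."""
        return {obj_id: int(depth // layer_width)
                for obj_id, depth in object_depths.items()}

    # Example: three objects at increasing distance from the camera.
    layers = assign_depth_layers({"cup": 0.8, "table": 1.4, "sofa": 3.2})
    print(layers)  # {'cup': 0, 'table': 1, 'sofa': 3}
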
Stage 2: Generate Training Data Formats

  1. Generate Layered Descriptions:
    python llm_based_query.py \
        --anno-file annotations.json \
        --prompt-function create_layer_prompt \
        --output-file layer_description.json
    
  2. Generate Question-Answering (QA) Data:
    python llm_based_query.py \
        --anno-file annotations.json \
        --prompt-function create_between_prompt \
        --output-file between_qa.json
    
  3. Generate Conversation Data (a driver script combining all three runs is sketched after this list):
    python llm_based_query.py \
        --anno-file annotations.json \
        --prompt-function create_rotation_prompt \
        --output-file rotation_prompts.json
    
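The three commands above differ only in the prompt function and the output file, so they can be driven from one small script. The sketch below simply shells out to llm_based_query.py with the arguments listed above; adjust paths to where the script and annotations.json actually live.

    # Run all three Stage 2 generation passes with the arguments shown above.
    import subprocess

    JOBS = [
        ("create_layer_prompt", "layer_description.json"),
        ("create_between_prompt", "between_qa.json"),
        ("create_rotation_prompt", "rotation_prompts.json"),
    ]

    for prompt_fn, out_file in JOBS:
        subprocess.run(
            ["python", "llm_based_query.py",
             "--anno-file", "annotations.json",
             "--prompt-function", prompt_fn,
             "--output-file", out_file],
            check=True,
        )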

Usage

After preparing the dataset, train the LLaVA-SpaceSGG model using the training scripts provided by the LLaVA project and The All-Seeing Project V2.
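
If the three Stage 2 outputs are JSON lists of training samples (an assumption; check the actual schema they use), they can be merged into a single instruction-tuning file before launching training. The file name spacesgg_train.json below is purely illustrative.

    # Hedged sketch: merge the Stage 2 outputs into one training file.
    # Assumes each file contains a JSON list of samples.
    import json

    parts = ["layer_description.json", "between_qa.json", "rotation_prompts.json"]
    merged = []
    for path in parts:
        with open(path) as f:
            merged.extend(json.load(f))

    with open("spacesgg_train.json", "w") as f:
        json.dump(merged, f, indent=2)
    print(f"Wrote {len(merged)} samples to spacesgg_train.json")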


Citation

If you use LLaVA-SpaceSGG or the SpaceSGG dataset in your research, please cite our work:

@inproceedings{llava_spacesgg2025,
  title={LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations},
  author={Your Name and Co-authors},
  booktitle={Proceedings of WACV 2025},
  year={2025}
}

License

This project is licensed under the Apache License.


Contact

For questions or feedback, please contact parasolohalo@gmail.com.
