CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes

<p align="center"> <img src="pre-training.png" width="40%" /> </p>

This is the repository of the paper "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes" (CVPR Workshops '23).

Installation

Code

  1. Install CUDA-enabled PyTorch by following https://pytorch.org/get-started/locally/. Note that this code has been tested with PyTorch 1.9.0 and 1.10.2 with cudatoolkit 11.3; see the example command after this list.

  2. Install the remaining necessary dependencies with requirements.txt:

    pip install -r requirements.txt
    
  3. Compile the CUDA modules for the PointNet++ backbone by running setup.py inside lib/pointnet2/:

    cd lib/pointnet2
    python setup.py install
    

     (Note that this requires the full CUDA toolkit. If compilation fails, see the Troubleshooting section.)
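
As a concrete example for step 1, one way to install a tested configuration with conda (PyTorch 1.10.2 + cudatoolkit 11.3; check https://pytorch.org/get-started/previous-versions/ for the command that matches your system and CUDA driver):

    # install PyTorch 1.10.2 with CUDA 11.3 support (one of the tested configurations)
    conda install pytorch==1.10.2 torchvision==0.11.3 torchaudio==0.10.2 cudatoolkit=11.3 -c pytorch -c conda-forge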

Data

  1. Download the ScanQA dataset under data/qa/.

  2. Download the ScanRefer dataset and unzip it under data/. Access to ScanRefer requires filling out the request form linked in the daveredrum/ScanRefer repository.

  3. Download the ScanNetV2 dataset and put scans/ under data/scannet/. To download the ScanNetV2 dataset, follow https://github.com/daveredrum/ScanRefer/blob/master/data/scannet/README.md.

  4. Generate the top-down image views for all scenes with run_generate.py (generate_top_down.py renders the top-down view for a single scene):

     python run_generate.py
    
  5. Download the PointNet++(-1x) checkpoint from https://github.com/facebookresearch/DepthContrast and store it under the checkpoints/ directory.
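
For step 5, a minimal sketch of placing the downloaded checkpoint (the actual filename depends on the DepthContrast release; pointnet_1x.pth.tar below is only a placeholder):

    # create the checkpoint directory and move the downloaded file into it
    mkdir -p checkpoints
    mv ~/Downloads/pointnet_1x.pth.tar checkpoints/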

In the end, the data/ directory should have the following structure:

data/
├── qa/
├── scannet/
│   ├── batch_load_scannet_data.py
│   ├── load_scannet_data.py
│   ├── meta_data/
│   ├── model_util_scannet.py
│   ├── scannet_data
│   ├── scannet_utils.py
│   ├── scans/
│   └── visualize.py
├── ScanRefer_filtered.*
└── top_imgs/
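
To sanity-check this layout before training, a quick shell loop can help (a sketch assuming the directories above plus the checkpoints/ directory from step 5; adjust the list to your setup):

    # report which of the expected directories are present
    for d in data/qa data/scannet/scans data/scannet/meta_data data/top_imgs checkpoints; do
        [ -d "$d" ] && echo "OK      $d" || echo "MISSING $d"
    done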

Usage

Pretraining

Training

Inference

Troubleshooting

BibTeX

@inproceedings{Parelli_2023_CVPR, 
	author = {Maria Parelli and Alexandros Delitzas and Nikolas Hars and Georgios Vlassis and Sotirios Anagnostidis and Gregor Bachmann and Thomas Hofmann}, 
	title = {CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes}, 
	booktitle = {Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)}, 
	year = {2023}
}

Acknowledgements

This project builds upon ATR-DBI/ScanQA and daveredrum/ScanRefer. It also makes use of openai/CLIP.