# CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
<p align="middle"> <img src="pre-training.png" width="40%" /> </p>

This is the repository of the paper "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes" (CVPR Workshops '23).
## Installation
### Code
- Install CUDA-enabled PyTorch by following https://pytorch.org/get-started/locally/ (note that this code has been tested with PyTorch 1.9.0 and 1.10.2 + cudatoolkit 11.3). A quick way to verify the install is sketched after this list.
- Install the remaining necessary dependencies from `requirements.txt`:

  ```shell
  pip install -r requirements.txt
  ```
- Compile the CUDA modules for the PointNet++ backbone by running `setup.py` inside `lib/pointnet2/`:

  ```shell
  cd lib/pointnet2
  python setup.py install
  ```

  (Note that this requires the full CUDA toolkit. If it fails, see the Troubleshooting section below.)
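To sanity-check that a CUDA-enabled PyTorch build is active (an optional step, not part of the original instructions), you can print the installed version and whether a GPU is visible:

```shell
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

If this prints `False`, PyTorch was installed without CUDA support or cannot see a GPU.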
### Data
- Download the ScanQA dataset under `data/qa/`.
- Download the ScanRefer dataset and unzip it under `data/`. To download the ScanRefer dataset you need to fill out this form.
- Download the ScanNetV2 dataset and put `scans/` under `data/scannet/`. To download the ScanNetV2 dataset, follow https://github.com/daveredrum/ScanRefer/blob/master/data/scannet/README.md.
- Generate the top-down image views for all scenes with `run_generate.py` (`generate_top_down.py` renders the top-down image view for a single scene):

  ```shell
  python run_generate.py
  ```

- Download the PointNet++(-1x) checkpoint from https://github.com/facebookresearch/DepthContrast and store the checkpoint under the `checkpoints/` directory.
In the end, the `data/` directory should have the following structure:

```
data/
├── qa/
├── scannet/
│   ├── batch_load_scannet_data.py
│   ├── load_scannet_data.py
│   ├── meta_data/
│   ├── model_util_scannet.py
│   ├── scannet_data
│   ├── scannet_utils.py
│   ├── scans/
│   └── visualize.py
├── ScanRefer_filtered.*
└── top_imgs/
```
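If you prefer to create the empty folder skeleton before downloading (optional; this simply mirrors the structure above and the `checkpoints/` directory), something like the following works from the repository root:

```shell
mkdir -p data/qa data/scannet checkpoints
```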
## Usage
### Pretraining
- Execute `scripts/pretrain.py`:

  ```shell
  python scripts/pretrain.py --no_height
  ```
### Training
- Execute `scripts/train.py`:
  - Training with pre-trained weights:

    ```shell
    python scripts/train.py --no_height --tokenizer_name clip --pretrain_src <folder_name_of_ckpt_file>
    ```

    `<folder_name_of_ckpt_file>` corresponds to the folder under `outputs/` with the timestamp + (optional) `<tag_name>`.
  - Training from scratch:

    ```shell
    python scripts/train.py --no_height --tokenizer_name clip
    ```
### Inference
- Evaluation of trained models on the val dataset:

  ```shell
  python scripts/eval.py --folder <folder_name> --qa --force --tokenizer_name clip
  ```

  `<folder_name>` corresponds to the folder under `outputs/` with the timestamp + `<tag_name>`.
## Troubleshooting
- Installation of `open3d` fails:

  ```
  user@device:~/3D-VQA-dev$ pip install open3d
  ERROR: Could not find a version that satisfies the requirement open3d (from versions: none)
  ERROR: No matching distribution found for open3d
  ```
  - Make sure to generate the top-view images on a desktop computer; the device that you are running the training on might not have a prebuilt `open3d` package available.
  - Comment out `open3d` in `requirements.txt` to skip installing `open3d` on this device.
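    If you prefer not to edit `requirements.txt` by hand, one option (a sketch; the temporary file name is just an example) is to filter the line out before installing:

    ```shell
    grep -v open3d requirements.txt > requirements_no_open3d.txt
    pip install -r requirements_no_open3d.txt
    ```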
- Execution of `lib/pointnet2/setup.py` fails:

  ```
  user@device:~/3D-VQA-dev/lib/pointnet2$ python setup.py install
  OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
  ```
  - Make sure that `CUDA_HOME` is set.
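    For example (the path below is an assumption; point it at your actual CUDA install root):

    ```shell
    export CUDA_HOME=/usr/local/cuda
    ```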
  ```
  user@device:~/3D-VQA-dev$ python lib/pointnet2/setup.py install
  FileNotFoundError: [Errno 2] No such file or directory: '_version.py'
  ```
  - Make sure to execute the command inside of `lib/pointnet2/`, as described in the CUDA layers installation step above.
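    That is, run the build from inside the module directory (the same commands as in the installation section):

    ```shell
    cd lib/pointnet2
    python setup.py install
    ```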
## BibTeX
```bibtex
@inproceedings{Parelli_2023_CVPR,
    author    = {Maria Parelli and Alexandros Delitzas and Nikolas Hars and Georgios Vlassis and Sotirios Anagnostidis and Gregor Bachmann and Thomas Hofmann},
    title     = {CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes},
    booktitle = {Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
    year      = {2023}
}
```
## Acknowledgements
This project builds upon ATR-DBI/ScanQA and daveredrum/ScanRefer. It also makes use of openai/CLIP.