<div align="center"> <h2> 🍾 POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images<br> <p></p> <p></p><a href="https://vobecant.github.io/">Antonin Vobecky</a>  <a href="https://osimeoni.github.io/">Oriane Siméoni</a>  <a href="https://scholar.google.com/citations?hl=en&user=XY1PVwYAAAAJ">David Hurych</a>  <a href="https://scholar.google.fr/citations?user=7atfg7EAAAAJ&hl=en">Spyros Gidaris</a>  <a href="https://abursuc.github.io/">Andrei Bursuc</a>  <a href="https://ptrckprz.github.io/">Patrick Pérez</a>  <a href="https://people.ciirc.cvut.cz/~sivic/">Josef Sivic</a> 
<p></p> <a href="https://arxiv.org/abs/2401.09413"><img src="https://img.shields.io/badge/-Paper-blue.svg?colorA=333&logo=arxiv" height=35em></a> <a href="https://vobecant.github.io/POP3D/"><img src="https://img.shields.io/badge/-Webpage-blue.svg?colorA=333&logo=html5" height=35em></a> <a href="https://recorder-v3.slideslive.com/?share=89535&s=2be28040-6fb0-45b9-a6b4-d50731da0417"><img src="https://img.shields.io/badge/-video-blue.svg?colorA=333&logo=Youtube" height=35em></a> <p></p> </h2> </div>

Welcome to the official implementation of **POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images**.
If you find our work useful, please cite:

```bibtex
@inproceedings{vobecky2023POP3D,
  title     = {POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images},
  author    = {Antonin Vobecky and Oriane Siméoni and David Hurych and Spyros Gidaris and Andrei Bursuc and Patrick Pérez and Josef Sivic},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {37},
  year      = {2023}
}
```
## Environment setup

Please make sure you have GCC 5 or higher.
### POP-3D

Run the following command to create the `pop3d` conda environment:

```bash
conda env create -f conda_env.yaml
```
Download the weights from this link and put them into the `./ckpts` folder.
### MaskCLIP

**Step 0.** Create a conda environment, activate it, and install the requirements:

```bash
cd MaskCLIP
conda create -n maskclip python=3.9
conda activate maskclip
pip install --no-cache-dir -r requirements.txt
pip install --no-cache-dir opencv-python
```
**Step 1.** Install PyTorch and Torchvision following the official instructions, e.g., for PyTorch 1.10 with CUDA 11.1:

```bash
pip install --no-cache-dir torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
```
**Step 2.** Install MMCV:

```bash
pip install --no-cache-dir mmcv-full==1.5.0
```
**Step 3.** Install CLIP:

```bash
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
```
**Step 4.** Install MaskCLIP:

```bash
pip install --no-cache-dir -v -e .
```
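Optionally, you can sanity-check the environment before moving on. A minimal Python sketch, assuming the MaskCLIP fork installs the `mmseg` package the same way upstream mmsegmentation does:

```python
# Optional environment sanity check (illustrative, not part of the official setup).
import torch
import clip    # installed in Step 3
import mmcv    # installed in Step 2
import mmseg   # assumption: installed by `pip install -e .` in Step 4

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__, "| mmseg:", mmseg.__version__)
print("available CLIP models:", clip.available_models())
```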
## Data preparation

### Download nuScenes

Download and extract the nuScenes dataset (link) and place it in the `./data/nuscenes` folder. This means downloading all the nuScenes files, including both the trainval and test splits.
Download "info" files:
We provide files for simpler manipulation with the nuScenes dataset. We use these files in our dataloaders. Again, please put these files to the ./data
folder (in the POP3D
folder). To do this, please simply run:
bash scripts/download_info_files.sh
### Download retrieval benchmark files

To download the data for our open-vocabulary language-driven retrieval dataset, please run:

```bash
bash scripts/download_retrieval_benchmark.sh
```
### Prepare projection files

To activate the environment, please run:

```bash
conda activate pop3d
```

Run the following script to prepare the projection files. The default path to the directory with the nuScenes dataset is set to `./data/nuscenes`:

```bash
NUSC_ROOT=./data/nuscenes
PROJ_DIR=./data/nuscenes/features/projections
python3 generate_projections_nuscenes.py --nusc_root ${NUSC_ROOT} --proj-dir ${PROJ_DIR}
```
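For intuition, the projection files record which LiDAR points fall onto which camera pixels. Below is a minimal NumPy sketch of such a point-to-pixel projection; the function name, arguments, and conventions are illustrative and are not those of `generate_projections_nuscenes.py`.

```python
import numpy as np

def project_points_to_image(points_lidar, T_cam_from_lidar, K, img_w, img_h):
    """Project 3D LiDAR points (N, 3) into one camera image.

    T_cam_from_lidar: 4x4 rigid transform from the LiDAR to the camera frame.
    K: 3x3 camera intrinsics.
    Returns integer pixel coordinates and the indices of the points that are visible.
    """
    # Move the points into the camera frame (homogeneous coordinates).
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera before dividing by depth.
    in_front = np.where(pts_cam[:, 2] > 1e-3)[0]
    pts_cam = pts_cam[in_front]

    # Perspective projection with the intrinsics, then normalize by depth.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Keep only points that land inside the image.
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv[inside].astype(np.int64), in_front[inside]
```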
### Generate MaskCLIP features

Switch to the MaskCLIP directory in this project (`cd MaskCLIP`).
- Activate the MaskCLIP environment:

  ```bash
  conda activate maskclip
  ```

- Prepare the backbone weights:

  ```bash
  mkdir -p ./pretrain
  python tools/maskclip_utils/convert_clip_weights.py --model ViT16 --backbone
  python tools/maskclip_utils/convert_clip_weights.py --model ViT16
  ```
- Download the pre-trained weights from this link and save them as `ckpts/maskclip_plus_vit16_deeplabv2_r101-d8_512x512_8k_coco-stuff164k.pth`.
- Run feature extraction to generate the target MaskCLIP+ features used for training our method:

  ```bash
  CFG_PATH=configs/maskclip_plus/anno_free/maskclip_plus_vit16_deeplabv2_r101-d8_512x512_8k_coco-stuff164k__nuscenes_trainvaltest.py
  CKPT_PATH=ckpts/maskclip_plus_vit16_deeplabv2_r101-d8_512x512_8k_coco-stuff164k.pth
  PROJ_DIR=../data/nuscenes/features/projections/data/nuscenes
  OUT_DIR=../data/nuscenes/maskclip_features_projections
  python tools/extract_features.py ${CFG_PATH} --save-dir ${OUT_DIR} --checkpoint ${CKPT_PATH} --projections-dir ${PROJ_DIR} --complete
  ```
Note: preparing the targets from MaskCLIP+ can be slow, depending on the speed of your file system. If you want to parallelize, we provide the following skeleton for launching multiple jobs with SLURM:

```bash
NUM_GPUS=...     # fill in the number of nodes
ACCOUNT=...      # name of your account, if any
HOURS_TOTAL=...  # how long you expect the *WHOLE* feature extraction to take
MASKCLIP_DIR=/path/to/POP3D/MaskCLIP
bash generate_features_slurm.sh ${NUM_GPUS} ${HOURS_TOTAL} ${ACCOUNT} ${MASKCLIP_DIR}
```
Note 2: It is expected to get `size mismatch for decode_head.text_embeddings: copying a param with shape torch.Size([171, 512]) from checkpoint, the shape in current model is torch.Size([28, 512])`. We do not use these weights during feature extraction.
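For intuition, the per-point training targets are obtained by reading the dense 2D MaskCLIP+ feature map at the projected pixel locations. A minimal PyTorch sketch of that lookup (assumed for illustration; tensor layouts and shapes are not the repository's actual ones):

```python
import torch

def gather_point_features(feature_map, pixel_uv):
    """Pick one feature vector per projected LiDAR point.

    feature_map: (C, H, W) dense 2D feature map from MaskCLIP+ for one camera.
    pixel_uv:    (N, 2) integer pixel coordinates, u = column, v = row.
    Returns:     (N, C) per-point feature targets.
    """
    u, v = pixel_uv[:, 0].long(), pixel_uv[:, 1].long()
    return feature_map[:, v, u].T  # (C, N) -> (N, C)

# Illustrative shapes only (512-dim CLIP features, 900x1600 image).
feats = torch.randn(512, 900, 1600)
uv = torch.stack([torch.randint(0, 1600, (1000,)),   # u (column)
                  torch.randint(0, 900, (1000,))],   # v (row)
                 dim=1)
targets = gather_point_features(feats, uv)           # (1000, 512)
```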
## Training

Our model was trained on 8x NVIDIA A100 GPUs.
Please modify the following variables in the training script:

```bash
PARTITION="..."  # name of the partition on your cluster, e.g., "gpu"
ACCOUNT="..."    # name of your account, if it is set on your cluster
USERNAME="..."   # your username on the cluster, used only for listing your running jobs
```
Script to run the training using SLURM (NOT WORKING YET):

```bash
POP3D_DIR=/path/to/POP3D
bash scripts/train_slurm.sh ${POP3D_DIR}
```
## Pre-trained weights

The weights used for the results in the paper are available here, and the corresponding zero-shot weights are available here. Please put both files into the `${POP3D_DIR}/pretrained` folder for easier use.
## Evaluation

### Zero-shot open-vocabulary semantic segmentation

To obtain the results from our paper, please run:

**A) single-GPU (slow):**

```bash
CFG=...
CKPT=...
ZEROSHOT_PTH=...
python3 eval.py --py-config ${CFG} --resume-from ${CKPT} --maskclip --no-wandb --text-embeddings-path ${ZEROSHOT_PTH}
```

If you followed the instructions above, you can run:

```bash
python3 eval.py --py-config config/pop3d_maskclip_12ep.py --resume-from ./pretrained/pop3d_weights.pth --maskclip --no-wandb --text-embeddings-path ./pretrained/zeroshot_weights.pth
```
**B) multi-GPU using SLURM (faster), e.g.:**

```bash
POP3D_DIR=`pwd`
CKPT="./pretrained/pop3d_weights.pth"
NUM_GPUS=8
HOURS=1
CFG="config/pop3d_maskclip_12ep.py"
EXTRA="--text-embeddings-path ./pretrained/zeroshot_weights.pth"
bash scripts/eval_zeroshot_slurm.sh ${POP3D_DIR} ${CKPT} ${NUM_GPUS} ${HOURS} ${CFG} ${EXTRA}
```
Expected results:

```
val_miou_vox_clip_all (evaluated at the complete voxel space): 16.65827465887346
```
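The file passed via `--text-embeddings-path` contains CLIP text embeddings for the evaluated class names. Below is a minimal sketch of how such embeddings can be produced with the official CLIP package; the class list, prompt template, and output filename are illustrative and not the exact ones used in the paper.

```python
import torch
import clip

# Hypothetical class names and prompt template, for illustration only.
class_names = ["car", "truck", "pedestrian", "vegetation", "road"]
device = "cuda" if torch.cuda.is_available() else "cpu"

model, _ = clip.load("ViT-B/16", device=device)

with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)
    text_emb = model.encode_text(tokens).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # L2-normalize

torch.save(text_emb.cpu(), "my_zeroshot_embeddings.pth")  # illustrative output path
```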
### Open-vocabulary language-driven retrieval

To obtain the results from our paper, please run:

```bash
python retrieval.py
```
Expected results:

```
+-------------------------------+
|      train (42 samples)       |
+----------+------+-------------+
|  method  | mAP  | mAP visible |
+----------+------+-------------+
|  POP3D   | 15.3 |    15.6     |
| MaskCLIP | N/A  |    13.5     |
+----------+------+-------------+
+-------------------------------+
|       val (27 samples)        |
+----------+------+-------------+
|  method  | mAP  | mAP visible |
+----------+------+-------------+
|  POP3D   | 24.1 |    24.7     |
| MaskCLIP | N/A  |    18.7     |
+----------+------+-------------+
+-------------------------------+
|       test (36 samples)       |
+----------+------+-------------+
|  method  | mAP  | mAP visible |
+----------+------+-------------+
|  POP3D   | 12.6 |    13.6     |
| MaskCLIP | N/A  |    12.0     |
+----------+------+-------------+
+-------------------------------+
|     valtest (63 samples)      |
+----------+------+-------------+
|  method  | mAP  | mAP visible |
+----------+------+-------------+
|  POP3D   | 17.5 |    18.4     |
| MaskCLIP | N/A  |    14.9     |
+----------+------+-------------+
```
Results will be written to `./results/results_${TIMESTAMP}.txt` and to `./results/results_tables_${TIMESTAMP}.txt`.
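For intuition, language-driven retrieval ranks 3D points (or voxels) by the cosine similarity between their predicted CLIP-aligned features and the embedding of a free-form text query. A minimal sketch of that scoring step (assumed for illustration; this is not the code in `retrieval.py`):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical predicted features for N voxels, already in CLIP space (N, 512).
voxel_features = torch.randn(10000, 512, device=device)
voxel_features = voxel_features / voxel_features.norm(dim=-1, keepdim=True)

with torch.no_grad():
    query = clip.tokenize(["a trash bin"]).to(device)   # illustrative free-form query
    q = model.encode_text(query).float()
    q = q / q.norm(dim=-1, keepdim=True)

scores = (voxel_features @ q.T).squeeze(1)   # cosine similarity per voxel
ranking = scores.argsort(descending=True)    # most similar voxels first
```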
## Acknowledgements
Our code is based on TPVFormer and MaskCLIP. Many thanks to the authors!