Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling

Xingyuan Sun*, Jiajun Wu*, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, William T. Freeman.

Paper / Project Page

Teaser Image

Installation

Our current release has been tested on Ubuntu 16.04.4 LTS.

Cloning the repository and downloading Pix3D (3.6GB)

git clone git@github.com:xingyuansun/pix3d.git
cd pix3d
./download_dataset.sh

The Pix3D dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

Dataset

For each instance in Pix3D, we provide the following information (stored in pix3d.json):

You can load pix3d.json into Python with

import json
json.load(open('pix3d.json'))
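To illustrate the structure, here is a minimal, self-contained sketch using a single made-up annotation. The camera-related field names (focal_length, rot_mat, trans_mat, cam_position, inplane_rotation) are the ones discussed later in this README; the values are invented, and the real entries contain additional fields, so consult pix3d.json itself for the full schema.

```python
import json

# A single made-up annotation in the style of pix3d.json (the real file is a
# JSON list with one dict per instance and more fields than shown here).
sample = '''[{"focal_length": 35.0,
              "rot_mat": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
              "trans_mat": [0.0, 0.0, 1.5],
              "cam_position": [0.0, 0.0, -1.5],
              "inplane_rotation": 0.0}]'''

# With the real dataset, use: annotations = json.load(open('pix3d.json'))
annotations = json.loads(sample)
anno = annotations[0]
print(sorted(anno.keys()))
print(anno['focal_length'])  # -> 35.0
```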

Rendering Demo

Usage:

blender --background --python demo.py -- --anno_idx <i> --output_path <p>

Rendering of the i-th annotation (0-indexed) in pix3d.json will be saved to path p. Note that your Blender needs to be bundled with Python, which is usually the default. For rendering, a camera with a sensor width of 32 mm and a focal length of focal_length is placed at the origin; rot_mat and trans_mat are then applied to the object.
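As a side note on this camera model: for a pinhole camera, the 32 mm sensor width and focal_length together determine the horizontal field of view. A small sketch (the helper name is ours, not part of the released code):

```python
import math

def horizontal_fov_deg(focal_length_mm, sensor_width_mm=32.0):
    """Horizontal field of view (degrees) of a pinhole camera with the
    given focal length and the 32 mm sensor width used in the demo."""
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))

# A 32 mm focal length on a 32 mm sensor gives a ~53.13 degree FOV.
print(round(horizontal_fov_deg(32.0), 2))  # -> 53.13
```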

For example, by executing

blender --background --python demo.py -- --anno_idx 0  --output_path ./demo.png

you get the rendering on the left (the associated image is shown on the right).

<table> <tr> <td><img src="http://pix3d.csail.mit.edu/data/demo.png" width="400"></td> <td><img src="http://pix3d.csail.mit.edu/data/original.png" width="400"></td> </tr> </table>

Evaluation

Compiling the evaluation code

First, modify the first 4 lines of eval/Makefile according to your environment. Then, in the eval folder, execute

make

Usage

Note: Calculations of CD and EMD need to be run on a GPU.

You can evaluate your predictions on Pix3D with eval/eval.py. The file takes two file lists and calculates IoU, EMD, and CD between each pair of voxels or point clouds. The following options are available:

Evaluation details

As different voxelization methods may result in objects of different scales in the voxel grid, for a fair comparison, we preprocess all voxels and point clouds before calculating IoU, CD and EMD.

For IoU, we first find the bounding box of the object with a threshold of 0.1, pad the bounding box into a cube, and then use trilinear interpolation to resample it to the desired resolution (32 x 32 x 32). Some algorithms reconstruct shapes at a resolution of 128 x 128 x 128; in this case, we first apply a 4x max-pooling before the trilinear interpolation, because without the max-pooling the sampling grid can be too sparse and thin structures can be lost. After resampling both the output voxel and the ground-truth voxel, we search for the threshold that maximizes the average IoU score over all objects, from 0.01 to 0.50 with a step size of 0.01.
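This preprocessing can be sketched in NumPy/SciPy as follows. It is an illustrative version under the stated assumptions (occupancy threshold 0.1, 4x max-pooling for 128^3 inputs, trilinear resampling to 32^3, threshold search over 0.01-0.50), not the released evaluation code.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_voxel(v, out_res=32, bbox_threshold=0.1):
    """Crop to the occupied bounding box, pad to a cube, and trilinearly
    resample to out_res^3. An illustrative sketch of the steps above."""
    if v.shape[0] == 128:  # 4x max-pool first, as for 128^3 reconstructions
        v = v.reshape(32, 4, 32, 4, 32, 4).max(axis=(1, 3, 5))
    occ = np.argwhere(v > bbox_threshold)
    lo, hi = occ.min(axis=0), occ.max(axis=0) + 1
    crop = v[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    side = max(crop.shape)
    cube = np.zeros((side, side, side), dtype=v.dtype)
    start = [(side - s) // 2 for s in crop.shape]  # center the crop in the cube
    cube[start[0]:start[0] + crop.shape[0],
         start[1]:start[1] + crop.shape[1],
         start[2]:start[2] + crop.shape[2]] = crop
    return zoom(cube, out_res / side, order=1)  # order=1 -> trilinear

def best_threshold_iou(preds, gts):
    """Search thresholds 0.01..0.50 (step 0.01) for the one maximizing
    mean IoU over all (prediction, ground-truth) pairs."""
    best_score, best_t = 0.0, None
    for t in np.arange(0.01, 0.51, 0.01):
        ious = []
        for p, g in zip(preds, gts):
            pb, gb = p > t, g > t
            union = np.logical_or(pb, gb).sum()
            inter = np.logical_and(pb, gb).sum()
            ious.append(inter / union if union else 1.0)
        m = float(np.mean(ious))
        if m > best_score:
            best_score, best_t = m, float(t)
    return best_score, best_t
```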

For CD and EMD, we first sample a point cloud from the voxelized reconstructions. For each shape, we compute its isosurface with a threshold of 0.1, and then sample 1,024 points from the surface. All point clouds are then translated and scaled such that the bounding box of the point cloud is centered at the origin with its longest side being 1. We then compute CD and EMD for each pair of point clouds.
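A minimal NumPy sketch of this normalization and of a brute-force Chamfer distance follows. It is illustrative only; eval/eval.py (which runs on the GPU) is authoritative, and details such as squared versus unsquared distances should be checked against it. EMD requires solving an assignment problem and is omitted here.

```python
import numpy as np

def normalize(pts):
    """Translate/scale an N x 3 point cloud so its bounding box is centered
    at the origin with its longest side equal to 1, as described above."""
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (pts - (lo + hi) / 2.0) / (hi - lo).max()

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (N x 3, M x 3):
    average nearest-neighbor distance in both directions. Whether squared
    or unsquared distances are used should match eval/eval.py."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```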

Demo

eval/list shows an evaluation example on the 2,894 untruncated, unoccluded chair images from Pix3D. Slightly occluded images have also been excluded.

Download the baseline output (829MB) by executing

./download_baseline_output.sh

Then, you can evaluate the output by executing the following command in the eval folder:

CUDA_VISIBLE_DEVICES=0 python eval.py --list1_path ./list/baseline_output.txt --list1_max_value 255 --list2_path ./list/gt.txt --calc_cd --calc_emd --calc_iou --threshold 0.1 --output_path results.csv

Your results should be around 0.287 for IoU, 0.119 for CD, and 0.120 for EMD, corresponding to the last row of Table 3 in the paper. The numbers might be slightly different from those reported in the paper because

Acknowledgements

Code for calculating CD and EMD comes from PSGN.

Notice

For rendering masks, we used rot_mat, trans_mat, and focal_length, which are defined in camera coordinates and applied to objects. However, for viewer-centered algorithms, whose predictions need to be rotated back to the canonical view before evaluation against ground-truth shapes, those values are not very helpful. Because most algorithms assume the camera is looking at the object's center, the raw input images are usually cropped or transformed before being fed into the pipeline, which yields a rotation matrix slightly different from the one provided. To mitigate this, we provide cam_position and inplane_rotation. These values are defined in object coordinates and reproduce an image that is equivalent to the original image under a homography transformation. We use them to rotate viewer-centered predictions back to the canonical view when evaluating their performance, and also when evaluating viewpoint estimation algorithms, for a similar reason.
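One common way to turn a camera position plus an in-plane roll into a rotation matrix is a look-at construction. The sketch below is hypothetical: the up vector, axis conventions, and handedness are our assumptions and may not match those used by the Pix3D evaluation code, so treat it only as a starting point.

```python
import numpy as np

def lookat_rotation(cam_position, inplane_rotation, up=(0.0, 1.0, 0.0)):
    """Hypothetical sketch: a rotation for a camera at cam_position looking
    at the object's center (the origin), composed with an in-plane roll.
    Axis conventions here are assumptions, not Pix3D's definitions."""
    cam = np.asarray(cam_position, dtype=float)
    forward = -cam / np.linalg.norm(cam)  # camera looks toward the origin
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)        # degenerate if cam is parallel to up
    true_up = np.cross(right, forward)
    world_to_cam = np.stack([right, true_up, -forward])  # rows: camera axes
    c, s = np.cos(inplane_rotation), np.sin(inplane_rotation)
    roll = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return roll @ world_to_cam
```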

Reference

@inproceedings{pix3d,
  title={Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling},
  author={Sun, Xingyuan and Wu, Jiajun and Zhang, Xiuming and Zhang, Zhoutong and Zhang, Chengkai and Xue, Tianfan and Tenenbaum, Joshua B and Freeman, William T},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2018}
}

For any questions, please contact Xingyuan Sun (xingyuansun.cs@gmail.com) and Jiajun Wu (jiajunwu@mit.edu).