
JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes

Haimei Zhao, Jing Zhang, Sen Zhang and Dacheng Tao

Accepted to ECCV 2022

the teaser figure

Abstract

Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation present three critical tasks for driving scene perception, which is fundamental for motion planning and navigation in autonomous driving. Though they are complementary to each other, prior works usually focus on each individual task and rarely deal with all three tasks together. A naive way is to accomplish them independently in a sequential or parallel manner, but there are three drawbacks, i.e., 1) the depth and VO results suffer from the inherent scale ambiguity issue; 2) the BEV layout is usually estimated separately for roads and vehicles, while the explicit overlay-underlay relations between them are ignored; and 3) the BEV layout is directly predicted from the front-view image without using any depth-related information, although the depth map contains useful geometry clues for inferring scene layouts. In this paper, we address these issues by proposing a novel joint perception framework named JPerceiver, which can estimate scale-aware depth and VO as well as BEV layout simultaneously from a monocular video sequence. It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO based on a carefully-designed scale loss. Meanwhile, a cross-view and cross-modal transfer (CCT) module is devised to leverage the depth clues for reasoning road and vehicle layout through an attention mechanism. JPerceiver can be trained in an end-to-end multi-task learning way, where the CGT scale loss and CCT module promote inter-task knowledge transfer to benefit feature learning of each task. Experiments on Argoverse, Nuscenes and KITTI show the superiority of JPerceiver over existing methods on all the above three tasks in terms of accuracy, model size, and inference speed.

Contributions

Approach overview

the framework figure

More details can be found in the paper: JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes (ECCV 2022) by Haimei Zhao, Jing Zhang, Sen Zhang and Dacheng Tao.
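For intuition, the sketch below illustrates the kind of cross-view geometric reasoning behind the CGT scale loss described in the abstract: under a flat-ground assumption, the camera height and intrinsics give an absolute metric depth for road pixels, which can then supervise the scale of the predicted depth. This is only a conceptual sketch under simplifying assumptions of our own (a pinhole camera with a level, forward-facing optical axis; hypothetical tensor names), not the paper's actual implementation.

```python
import torch

def ground_plane_depth(height, width, fy, cy, cam_height):
    """Metric depth of a flat ground plane for pixels below the horizon.

    Assumes a pinhole camera whose optical axis is level with the ground
    (a simplification; the real method uses the full cross-view transform).
    """
    v = torch.arange(height, dtype=torch.float32).view(-1, 1).expand(height, width)
    below_horizon = v > cy + 1.0                      # rows that can see the ground
    depth = torch.zeros(height, width)
    depth[below_horizon] = fy * cam_height / (v[below_horizon] - cy)
    return depth, below_horizon

def cgt_style_scale_loss(pred_depth, road_mask, fy, cy, cam_height=1.65):
    """L1 penalty tying predicted depth on road pixels to the metric ground depth.

    cam_height ~1.65 m is the usual KITTI camera height; adjust per dataset.
    """
    h, w = pred_depth.shape[-2:]
    gt_depth, valid = ground_plane_depth(h, w, fy, cy, cam_height)
    gt_depth = gt_depth.to(pred_depth.device)
    mask = valid.to(pred_depth.device) & road_mask.bool()
    if mask.sum() == 0:
        return pred_depth.new_zeros(())
    return (pred_depth.squeeze()[mask] - gt_depth[mask]).abs().mean()
```

The key point is that the road layout lives in a metric BEV grid, so any depth consistent with it is forced to carry an absolute scale; the real CGT loss in the paper realizes this idea with the proper cross-view transformation rather than the simple ground-plane model above.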

Installation

We recommend setting up a virtual environment with Python 3.5+ and PyTorch 1.1+ and installing all the dependencies listed in the requirements file.

git clone https://github.com/sunnyHelen/JPerceiver.git

cd JPerceiver
pip install -r requirements.txt

Datasets

In the paper, we present results on the KITTI 3D Object, KITTI Odometry, KITTI RAW, and Argoverse 3D Tracking v1.0 datasets. For comparison with Schulter et al., we use the same training and test split sequences from the KITTI RAW dataset. For more details about the training/testing splits, see the splits directory. The ground-truth layouts can be downloaded from Monolayout.

# Download KITTI RAW
./data/download_datasets.sh raw

# Download KITTI 3D Object
./data/download_datasets.sh object

# Download KITTI Odometry
./data/download_datasets.sh odometry

# Download Argoverse Tracking v1.0
./data/download_datasets.sh argoverse

The above scripts will download, unzip and store the respective datasets in the datasets directory.

datasets/
├── argoverse                          # Argoverse dataset
│   └── argoverse-tracking
│       └── train1
│           └── 1d676737-4110-3f7e-bec0-0c90f74c248f
│               ├── car_bev_gt         # Vehicle GT
│               ├── road_gt            # Road GT
│               └── stereo_front_left  # RGB image
└── kitti                              # KITTI dataset
    ├── object                         # KITTI 3D Object dataset
    │   └── training
    │       ├── image_2                # RGB image
    │       └── vehicle_256            # Vehicle GT
    ├── odometry                       # KITTI Odometry dataset
    │   └── 00
    │       ├── image_2                # RGB image
    │       └── road_dense128          # Road GT
    └── raw                            # KITTI RAW dataset
        └── 2011_09_26
            └── 2011_09_26_drive_0001_sync
                ├── image_2            # RGB image
                └── road_dense128      # Road GT
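After downloading, you can quickly sanity-check that the layout matches the tree above, for example with a small Python helper. The sequence and drive names below are just the examples shown in the tree; adjust them to your own download.

```python
from pathlib import Path

# Folders taken from the directory tree above; edit to match your download.
EXPECTED = [
    "datasets/argoverse/argoverse-tracking/train1",
    "datasets/kitti/object/training/image_2",
    "datasets/kitti/object/training/vehicle_256",
    "datasets/kitti/odometry/00/image_2",
    "datasets/kitti/odometry/00/road_dense128",
    "datasets/kitti/raw/2011_09_26/2011_09_26_drive_0001_sync/image_2",
]

for rel in EXPECTED:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")
```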

Training

  1. Prepare the corresponding dataset
  2. Run training
# Training
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 25629  train.py --config config/cfg_kitti_baseline_odometry_boundary_ce_iou_1024_20.py --work_dir log/odometry/

  3. Choose a different config file and log directory for different datasets and training settings.
  4. The BEV layout evaluation is run during training; its results can be found in the corresponding "xxx.log.json" files.
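The snippet below shows one way to pull the BEV layout metrics out of such a log file. It assumes the log is written as one JSON record per line with metric keys such as "mIOU" and "mAP"; both the format and the key names are assumptions on our part, so inspect your own xxx.log.json to confirm them.

```python
import json
from pathlib import Path

def read_log_metrics(log_path, keys=("mIOU", "mAP")):
    """Collect the requested metric keys from a per-line JSON training log.

    The one-record-per-line format and the key names are assumptions;
    check them against the actual xxx.log.json produced by training.
    """
    records = []
    for line in Path(log_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip non-JSON lines
        picked = {k: entry[k] for k in keys if k in entry}
        if picked:
            records.append(picked)
    return records

# Example (hypothetical file name under the work_dir chosen above):
# print(read_log_metrics("log/odometry/latest.log.json"))
```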

Evaluation

  1. Prepare the corresponding dataset
  2. Download pre-trained models
  3. Run evaluation
# Evaluate depth results 
python scripts/eval_depth_eigen.py 

# Evaluate VO results
python scripts/draw_odometry.py 
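For reference, scripts/eval_depth_eigen.py reports the usual Eigen-split depth metrics. A generic sketch of how those metrics are typically computed from matched predicted and ground-truth depth arrays is shown below; this is the standard formulation, not necessarily a line-by-line copy of the script.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics (Eigen et al.) over valid pixels."""
    valid = gt > 0
    gt, pred = gt[valid], pred[valid]

    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, a1=a1, a2=a2, a3=a3)
```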

Pretrained Models

The following table provides links to the pre-trained models for each dataset mentioned in our paper. The table also shows the corresponding evaluation results for these models.

| Dataset | Segmentation Objects | mIOU (%) | mAP (%) | Pretrained Model |
| --- | --- | --- | --- | --- |
| KITTI 3D Object | Vehicle | 40.85 | 57.23 | link |
| KITTI Odometry | Road | 78.13 | 89.57 | link |
| KITTI Raw | Road | 66.39 | 86.17 | link |
| Argoverse Tracking | Vehicle | 49.45 | 65.84 | link |
| Argoverse Tracking | Road | 77.50 | 90.21 | link |
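The downloaded checkpoints can be inspected or loaded as ordinary PyTorch files. A minimal sketch follows, assuming they are standard state-dict checkpoints (verify this against the training code; the file name is hypothetical).

```python
import torch

# Hypothetical file name; use the checkpoint you downloaded from the table above.
ckpt = torch.load("kitti_odometry_road.pth", map_location="cpu")

# A checkpoint may be a bare state dict or a dict wrapping one; handle both.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} tensors in checkpoint")
# model.load_state_dict(state_dict)   # with the model built from the matching config
```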

Results

qualitative results

Visualize predictions

<img src="./images/video_argo_val_demo.gif">
### Draw trajectories
python scripts/plot_kitti.py 
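scripts/plot_kitti.py produces the trajectory plots. For reference, a minimal standalone sketch that plots a KITTI-odometry-format pose file (twelve numbers per line, a flattened 3x4 camera-to-world matrix) looks like this; it is an illustration of the file format, not the project's own script.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_kitti_trajectory(pose_file):
    """Plot the x-z trajectory from a KITTI-format pose file (12 values per line)."""
    poses = np.loadtxt(pose_file).reshape(-1, 3, 4)
    x, z = poses[:, 0, 3], poses[:, 2, 3]   # translation column of each 3x4 matrix
    plt.plot(x, z)
    plt.xlabel("x (m)")
    plt.ylabel("z (m)")
    plt.axis("equal")
    plt.show()

# plot_kitti_trajectory("00.txt")   # hypothetical pose file
```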

### Prediction video generation
# KITTI
python eval_kitti_video.py

# Argoverse
python eval_argo_both_video.py

# Nuscenes
python eval_nuscenes_both.py

Contact

If you encounter any problems, please describe them in the issues or contact:

License

This project is released under the MIT License (refer to the LICENSE file for details). We thank the authors of the related open-source works; this project partially builds on the code of Monolayout, PYVA and FeatDepth.