
<p align="right">English | <a href="docs/README_CN.md">简体中文</a></p> <p align="center"> <img src="docs/figs/logo.png" align="center" width="44%"> <h3 align="center"><strong>Segment Any Point Cloud Sequences by Distilling Vision Foundation Models</strong></h3> <p align="center"> <a href="https://github.com/youquanl">Youquan Liu</a><sup>1,*</sup>&nbsp;&nbsp;&nbsp; <a href="https://ldkong.com">Lingdong Kong</a><sup>1,2,*</sup>&nbsp;&nbsp;&nbsp; <a href="http://cen-jun.com">Jun Cen</a><sup>3</sup>&nbsp;&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=Uq2DuzkAAAAJ">Runnan Chen</a><sup>4</sup>&nbsp;&nbsp;&nbsp; <a href="http://zhangwenwei.cn">Wenwei Zhang</a><sup>1,5</sup><br> <a href="https://scholar.google.com/citations?user=lSDISOcAAAAJ">Liang Pan</a><sup>5</sup>&nbsp;&nbsp;&nbsp; <a href="http://chenkai.site">Kai Chen</a><sup>1</sup>&nbsp;&nbsp;&nbsp; <a href="https://liuziwei7.github.io">Ziwei Liu</a><sup>5</sup> <br> <sup>1</sup>Shanghai AI Laboratory&nbsp;&nbsp;&nbsp; <sup>2</sup>National University of Singapore&nbsp;&nbsp;&nbsp; <sup>3</sup>The Hong Kong University of Science and Technology&nbsp;&nbsp;&nbsp; <sup>4</sup>The University of Hong Kong&nbsp;&nbsp;&nbsp; <sup>5</sup>S-Lab, Nanyang Technological University </p> </p> <p align="center"> <a href="https://arxiv.org/abs/2306.09347" target='_blank'> <img src="https://img.shields.io/badge/Paper-%F0%9F%93%83-purple"> </a> <a href="https://ldkong.com/Seal" target='_blank'> <img src="https://img.shields.io/badge/Project-%F0%9F%94%97-violet"> </a> <a href="https://youtu.be/S0q2-nQdwSs" target='_blank'> <img src="https://img.shields.io/badge/Demo-%F0%9F%8E%AC-purple"> </a> <a href="" target='_blank'> <img src="https://img.shields.io/badge/%E4%B8%AD%E8%AF%91%E7%89%88-%F0%9F%90%BC-violet"> </a> <a href="" target='_blank'> <img src="https://visitor-badge.laobi.icu/badge?page_id=youquanl.Segment-Any-Point-Cloud&left_color=gray&right_color=purple"> </a> </p>

Seal :seal:

Seal is a versatile self-supervised learning framework capable of segmenting any automotive point cloud. It leverages off-the-shelf knowledge from vision foundation models (VFMs) and encourages spatial and temporal consistency of that knowledge during the representation learning stage.

<p align="center"> <img src="docs/figs/teaser.jpg" align="center" width="95%"> </p>

:sparkles: Highlight

:oncoming_automobile: 2D-3D Correspondence

<p align="center"> <img src="docs/figs/demo.gif" align="center" width="95%"> </p>

:movie_camera: Video Demo

| Demo 1 | Demo 2 | Demo 3 |
| :-: | :-: | :-: |
| <img width="100%" src="docs/figs/demo1.jpg"> | <img width="100%" src="docs/figs/demo2.jpg"> | <img width="100%" src="docs/figs/demo3.jpg"> |
| Link <sup>:arrow_heading_up:</sup> | Link <sup>:arrow_heading_up:</sup> | Link <sup>:arrow_heading_up:</sup> |

Updates

Outline

Installation

Please refer to INSTALL.md for the installation details.

Data Preparation

| nuScenes | SemanticKITTI | Waymo Open | ScribbleKITTI |
| :-: | :-: | :-: | :-: |
| <img width="115" src="docs/figs/dataset/nuscenes.png"> | <img width="115" src="docs/figs/dataset/semantickitti.png"> | <img width="115" src="docs/figs/dataset/waymo-open.png"> | <img width="115" src="docs/figs/dataset/scribblekitti.png"> |
| RELLIS-3D | SemanticPOSS | SemanticSTF | DAPS-3D |
| <img width="115" src="docs/figs/dataset/rellis-3d.png"> | <img width="115" src="docs/figs/dataset/semanticposs.png"> | <img width="115" src="docs/figs/dataset/semanticstf.png"> | <img width="115" src="docs/figs/dataset/daps-3d.png"> |
| SynLiDAR | Synth4D | nuScenes-C | |
| <img width="115" src="docs/figs/dataset/synlidar.png"> | <img width="115" src="docs/figs/dataset/synth4d.png"> | <img width="115" src="docs/figs/dataset/nuscenes-c.png"> | |

Please refer to DATA_PREPARE.md for details on preparing these datasets.

Superpoint Generation

| Raw Point Cloud | Semantic Superpoint | Groundtruth |
| :-: | :-: | :-: |
| <img src="docs/figs/rotate/rotate1.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate1_sp.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate1_gt.gif" align="center" width="240"> |
| <img src="docs/figs/rotate/rotate2.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate2_sp.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate2_gt.gif" align="center" width="240"> |
| <img src="docs/figs/rotate/rotate3.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate3_sp.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate3_gt.gif" align="center" width="240"> |
| <img src="docs/figs/rotate/rotate4.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate4_sp.gif" align="center" width="240"> | <img src="docs/figs/rotate/rotate4_gt.gif" align="center" width="240"> |

Kindly refer to SUPERPOINT.md for details on generating semantic superpixels and superpoints with vision foundation models; a simplified sketch of the idea is shown below.
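
The actual procedure lives in SUPERPOINT.md; purely as an illustration of the idea, the minimal sketch below assigns each LiDAR point the ID of the VFM superpixel it projects onto. The projection matrix convention and mask format are assumptions, not the codebase's actual interface.

```python
import numpy as np

def superpixels_to_superpoints(points_xyz, superpixel_mask, lidar2img):
    """Assign each LiDAR point the ID of the VFM superpixel it projects onto.

    points_xyz:      (N, 3) LiDAR points in the sensor frame.
    superpixel_mask: (H, W) integer mask of superpixel IDs from a VFM (assumed format).
    lidar2img:       (4, 4) LiDAR-to-image projection matrix (assumed convention).
    Returns (N,) superpoint IDs; -1 marks points falling outside the image.
    """
    H, W = superpixel_mask.shape
    # Homogeneous coordinates, then projection onto the image plane.
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = pts_h @ lidar2img.T                         # (N, 4)
    depth = cam[:, 2]
    uv = cam[:, :2] / np.clip(depth[:, None], 1e-6, None)
    u = uv[:, 0].round().astype(int)
    v = uv[:, 1].round().astype(int)

    # Keep only points in front of the camera and inside the image bounds.
    valid = (depth > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    superpoint_ids = np.full(len(points_xyz), -1, dtype=np.int64)
    superpoint_ids[valid] = superpixel_mask[v[valid], u[valid]]
    return superpoint_ids
```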

Getting Started

Kindly refer to GET_STARTED.md to learn more about using this codebase.

Main Results

:unicorn: Framework Overview

<img src="docs/figs/framework.jpg" align="center" width="99%">
Overview of the Seal :seal: framework. For each {LiDAR, camera} pair at timestamp t and another LiDAR frame at timestamp t + n, we generate semantic superpixels and superpoints with VFMs. Two pretraining objectives are then formed: spatial contrastive learning between paired LiDAR and camera features, and temporal consistency regularization between segments at different timestamps.
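
For intuition only, the two objectives can be sketched roughly as follows. This is a simplified approximation, not the exact losses from the paper: the superpoint/superpixel pooling, the pairing across modalities and timestamps, and the temperature value are all assumed here.

```python
import torch
import torch.nn.functional as F

def spatial_contrastive_loss(point_feats, pixel_feats, temperature=0.07):
    """InfoNCE-style loss between paired superpoint and superpixel features.

    point_feats, pixel_feats: (M, C) features pooled over M matched
    superpoint/superpixel pairs (the pairing is assumed to be given).
    """
    p = F.normalize(point_feats, dim=1)
    q = F.normalize(pixel_feats, dim=1)
    logits = p @ q.t() / temperature             # (M, M) similarity matrix
    targets = torch.arange(len(p), device=p.device)
    return F.cross_entropy(logits, targets)      # diagonal pairs are positives

def temporal_consistency_loss(feats_t, feats_tn):
    """Encourage features of matched segments at t and t + n to agree.

    feats_t, feats_tn: (K, C) features of K segments matched across time
    (the cross-time matching is assumed to be done elsewhere).
    """
    return 1.0 - F.cosine_similarity(feats_t, feats_tn, dim=1).mean()
```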

:car: Cosine Similarity

<img src="docs/figs/cosine.jpg" align="center" width="99%">
The cosine similarity between a query point (red dot) and the features learned with SLIC and different VFMs in our Seal :seal: framework. The queried semantic classes, from the top to the bottom example, are: “car”, “manmade”, and “truck”. Colors range from violet (low similarity) to yellow (high similarity).
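
A similarity map like the one above can be reproduced in spirit by comparing a query point's feature against all other point features; the feature tensor shape and normalization below are assumptions rather than the codebase's actual visualization code.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_map(point_feats, query_idx):
    """Cosine similarity between one query point and every other point.

    point_feats: (N, C) per-point features from a pretrained backbone.
    query_idx:   index of the query point (the red dot in the figure).
    Returns (N,) scores in [-1, 1]; high values render as yellow and
    low values as violet when colorized.
    """
    feats = F.normalize(point_feats, dim=1)      # unit-norm features
    return feats @ feats[query_idx]              # dot product == cosine similarity
```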

:blue_car: Benchmark

<table class="center"> <tr> <th rowspan="2">Method</th> <th colspan="6">nuScenes</th> <th colspan="1">KITTI</th> <th colspan="1">Waymo</th> <th colspan="1">Synth4D</th> </tr> <tr> <td>LP</td> <td>1%</td> <td>5%</td> <td>10%</td> <td>25%</td> <td>Full</td> <td>1%</td> <td>1%</td> <td>1%</td> </tr> <tr> <td>Random</td> <td>8.10</td> <td>30.30</td> <td>47.84</td> <td>56.15</td> <td>65.48</td> <td>74.66</td> <td>39.50</td> <td>39.41</td> <td>20.22</td> </tr> <tr> <td>PointContrast</td> <td>21.90</td> <td>32.50</td> <td >-</td> <td>-</td> <td>-</td> <td>-</td> <td>41.10</td> <td>-</td> <td>-</td> </tr> <tr> <td>DepthContrast</td> <td>22.10</td> <td>31.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>41.50</td> <td>-</td> <td>-</td> </tr> <tr> <td>PPKT</td> <td>35.90</td> <td>37.80</td> <td>53.74</td> <td>60.25</td> <td>67.14</td> <td>74.52</td> <td>44.00</td> <td>47.60</td> <td>61.10</td> </tr> <tr> <td>SLidR</td> <td>38.80</td> <td>38.30</td> <td>52.49</td> <td>59.84</td> <td>66.91</td> <td>74.79</td> <td>44.60</td> <td>47.12</td> <td>63.10</td> </tr> <tr> <td>ST-SLidR</td> <td>40.48</td> <td>40.75</td> <td>54.69</td> <td>60.75</td> <td>67.70</td> <td>75.14</td> <td>44.72</td> <td>44.93</td> <td>-</td> </tr> <tr> <td><strong>Seal :seal:</strong></td> <td>44.95</td> <td>45.84</td> <td>55.64</td> <td>62.97</td> <td>68.41</td> <td>75.60</td> <td>46.63</td> <td>49.34</td> <td>64.50</td> </tr> </table>

:bus: Linear Probing

<img src="docs/figs/linear.gif" align="center" width="99%">
Qualitative results of our Seal :seal: framework pretrained on nuScenes (without using groundtruth labels) and linearly probed with a frozen backbone and a linear classification head. To highlight the differences, correct and incorrect predictions are painted in gray and red, respectively.
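
As a reminder of what linear probing entails here, the sketch below freezes a pretrained backbone and trains only a pointwise linear classifier on top. The backbone interface and feature dimension are placeholders, not the actual API of this codebase.

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim, num_classes):
    """Freeze a pretrained 3D backbone and attach a linear classification head.

    backbone:    any module mapping a point cloud to (N, feat_dim) per-point
                 features (placeholder interface, not the codebase's actual one).
    feat_dim:    channel dimension of the backbone features.
    num_classes: number of semantic classes to predict per point.
    """
    for p in backbone.parameters():
        p.requires_grad = False                  # backbone stays frozen
    backbone.eval()

    head = nn.Linear(feat_dim, num_classes)      # the only trainable part

    def forward(points):
        with torch.no_grad():
            feats = backbone(points)             # (N, feat_dim)
        return head(feats)                       # (N, num_classes) logits

    return forward, head.parameters()            # pass the parameters to an optimizer
```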

:articulated_lorry: Downstream Generalization

<table class="center"> <tr> <th rowspan="2">Method</th> <th colspan="2">ScribbleKITTI</th> <th colspan="2">RELLIS-3D</th> <th colspan="2">SemanticPOSS</th> <th colspan="2">SemanticSTF</th> <th colspan="2">SynLiDAR</th> <th colspan="2">DAPS-3D</th> </tr> <tr> <td>1%</td> <td>10%</td> <td>1%</td> <td>10%</td> <td>Half</td> <td>Full</td> <td>Half</td> <td>Full</td> <td>1%</td> <td>10%</td> <td>Half</td> <td>Full</td> </tr> <tr> <td>Random</td> <td>23.81</td> <td>47.60</td> <td>38.46</td> <td>53.60</td> <td>46.26</td> <td>54.12</td> <td>48.03</td> <td>48.15</td> <td>19.89</td> <td>44.74</td> <td>74.32</td> <td>79.38</td> </tr> <tr> <td>PPKT</td> <td>36.50</td> <td>51.67</td> <td>49.71</td> <td>54.33</td> <td>50.18</td> <td>56.00</td> <td>50.92</td> <td>54.69</td> <td>37.57</td> <td>46.48</td> <td>78.90</td> <td>84.00</td> </tr> <tr> <td>SLidR</td> <td>39.60</td> <td>50.45</td> <td>49.75</td> <td>54.57</td> <td>51.56</td> <td>55.36</td> <td>52.01</td> <td>54.35</td> <td>42.05</td> <td>47.84</td> <td>81.00</td> <td>85.40</td> </tr> <tr> <td><strong>Seal :seal:</strong></td> <td>40.64</td> <td>52.77</td> <td>51.09</td> <td>55.03</td> <td>53.26</td> <td>56.89</td> <td>53.46</td> <td>55.36</td> <td>43.58</td> <td>49.26</td> <td>81.88</td> <td>85.90</td> </tr> </table>

:truck: Robustness Probing

| Init | Backbone | mCE | mRR | Fog | Wet | Snow | Motion | Beam | Cross | Echo | Sensor |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Random | PolarNet | 115.09 | 76.34 | 58.23 | 69.91 | 64.82 | 44.60 | 61.91 | 40.77 | 53.64 | 42.01 |
| Random | CENet | 112.79 | 76.04 | 67.01 | 69.87 | 61.64 | 58.31 | 49.97 | 60.89 | 53.31 | 24.78 |
| Random | WaffleIron | 106.73 | 72.78 | 56.07 | 73.93 | 49.59 | 59.46 | 65.19 | 33.12 | 61.51 | 44.01 |
| Random | Cylinder3D | 105.56 | 78.08 | 61.42 | 71.02 | 58.40 | 56.02 | 64.15 | 45.36 | 59.97 | 43.03 |
| Random | SPVCNN | 106.65 | 74.70 | 59.01 | 72.46 | 41.08 | 58.36 | 65.36 | 36.83 | 62.29 | 49.21 |
| Random | MinkUNet | 112.20 | 72.57 | 62.96 | 70.65 | 55.48 | 51.71 | 62.01 | 31.56 | 59.64 | 39.41 |
| PPKT | MinkUNet | 105.64 | 76.06 | 64.01 | 72.18 | 59.08 | 57.17 | 63.88 | 36.34 | 60.59 | 39.57 |
| SLidR | MinkUNet | 106.08 | 75.99 | 65.41 | 72.31 | 56.01 | 56.07 | 62.87 | 41.94 | 61.16 | 38.90 |
| Seal :seal: | MinkUNet | 92.63 | 83.08 | 72.66 | 74.31 | 66.22 | 66.14 | 65.96 | 57.44 | 59.87 | 39.85 |
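
For context, mCE (mean Corruption Error, lower is better) and mRR (mean Resilience Rate, higher is better) follow a Robo3D-style protocol: corruption error compares a model's degraded mIoU against a baseline model, while resilience rate compares it against the model's own clean-set mIoU. The sketch below only approximates that protocol and omits the per-severity-level averaging.

```python
def corruption_scores(miou_corrupt, miou_corrupt_baseline, miou_clean):
    """Approximate mCE / mRR over a set of corruption types (Robo3D-style).

    miou_corrupt:          {corruption: mIoU in %} of the evaluated model.
    miou_corrupt_baseline: {corruption: mIoU in %} of the reference baseline.
    miou_clean:            the evaluated model's mIoU (%) on the clean set.
    """
    ce = [(100.0 - miou_corrupt[c]) / (100.0 - miou_corrupt_baseline[c])
          for c in miou_corrupt]                  # error relative to the baseline
    rr = [miou_corrupt[c] / miou_clean for c in miou_corrupt]  # retention vs. clean
    mCE = 100.0 * sum(ce) / len(ce)
    mRR = 100.0 * sum(rr) / len(rr)
    return mCE, mRR
```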

:tractor: Qualitative Assessment

<img src="docs/figs/qualitative.jpg" align="center" width="99%">
Qualitative results of Seal :seal: and prior methods pretrained on nuScenes (without using groundtruth labels) and fine-tuned with 1% labeled data. To highlight the differences, correct and incorrect predictions are painted in gray and red, respectively.

TODO List

Citation

If you find this work helpful, please kindly consider citing our paper:

@inproceedings{liu2023segment,
  title = {Segment Any Point Cloud Sequences by Distilling Vision Foundation Models},
  author = {Liu, Youquan and Kong, Lingdong and Cen, Jun and Chen, Runnan and Zhang, Wenwei and Pan, Liang and Chen, Kai and Liu, Ziwei},
  booktitle = {Advances in Neural Information Processing Systems}, 
  year = {2023},
}
@misc{liu2023segment_any_point_cloud,
  title = {The Segment Any Point Cloud Codebase},
  author = {Liu, Youquan and Kong, Lingdong and Cen, Jun and Chen, Runnan and Zhang, Wenwei and Pan, Liang and Chen, Kai and Liu, Ziwei},
  howpublished = {\url{https://github.com/youquanl/Segment-Any-Point-Cloud}},
  year = {2023},
}

License

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a> <br /> This work is under the <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Acknowledgement

This work is developed based on the MMDetection3D codebase.

<img src="https://github.com/open-mmlab/mmdetection3d/blob/main/resources/mmdet3d-logo.png" width="30%"/><br> MMDetection3D is an open-source object detection toolbox based on PyTorch, towards the next-generation platform for general 3D detection. It is a part of the OpenMMLab project developed by MMLab.

Part of this codebase has been adapted from SLidR, Segment Anything, X-Decoder, OpenSeeD, Segment Everything Everywhere All at Once, LaserMix, and Robo3D.

:heart: We thank the exceptional contributions from the above open-source repositories!