Awesome-3D-Visual-Grounding
A continually updated collection of papers on Text-guided 3D Visual Grounding (T-3DVG).
Text-guided 3D visual grounding (T-3DVG) aims to locate a specific object that semantically corresponds to a language query in a complex 3D scene, and it has drawn increasing attention in the 3D research community over the past few years. T-3DVG presents great potential and challenges due to its close proximity to real-world applications and the complexity of collecting and processing 3D point cloud data.
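For readers new to the task, here is a minimal sketch of the input/output contract shared by most methods in this list. The array shapes, names, and the trivial whole-scene baseline are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

def ground(points: np.ndarray, query: str) -> np.ndarray:
    """Illustrative T-3DVG contract: a scene point cloud plus a referring
    expression go in, one 3D bounding box comes out.

    points: (N, 6) scan, xyz coordinates plus rgb colors.
    query:  e.g. "the brown chair closest to the whiteboard".
    Returns (cx, cy, cz, dx, dy, dz). This placeholder simply boxes the
    whole scene; the papers below replace it with learned models.
    """
    lo, hi = points[:, :3].min(axis=0), points[:, :3].max(axis=0)
    return np.r_[(lo + hi) / 2, hi - lo]
```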
We have summarized existing T-3DVG methods in our survey paper 👍:
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions.
If you find an important work missing, it would be super helpful to let me know (dzliu@stu.pku.edu.cn). Thanks!
If you find our survey useful for your research, please consider citing:
@article{liu2024survey,
  title={A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions},
  author={Liu, Daizong and Liu, Yang and Huang, Wencan and Hu, Wei},
  journal={arXiv preprint arXiv:2406.05785},
  year={2024}
}
Table of Contents
- Fully-supervised two-stage approaches
- Fully-supervised one-stage approaches
- Weakly-supervised approaches
- Semi-supervised approaches
- Approaches using no point cloud input
- Approaches using LLMs
- Approaches tackling outdoor scenes
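Most fully-supervised entries below are tagged with one of two pipelines: proposal-then-selection (two-stage) or unified detection-interaction / regression (one-stage). The toy sketch below contrasts the two; every component is a random stand-in for a learned model, and none of the names come from a specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(query: str) -> np.ndarray:
    # Stand-in text encoder: a deterministic 32-d vector per query.
    return np.random.default_rng(abs(hash(query)) % 2**32).normal(size=32)

def two_stage_ground(points: np.ndarray, query: str) -> np.ndarray:
    # Stage 1: a (fake) 3D detector proposes candidate objects.
    centers = points[rng.choice(len(points), size=8, replace=False), :3]
    feats = rng.normal(size=(8, 32))            # proposal features
    # Stage 2: select the proposal that best matches the sentence.
    best = int(np.argmax(feats @ encode_text(query)))
    return np.r_[centers[best], 0.5, 0.5, 0.5]  # (center, fixed size)

def one_stage_ground(points: np.ndarray, query: str) -> np.ndarray:
    # Language conditions the (fake) point backbone directly; the target
    # is picked/regressed in a single pass, with no frozen proposal stage.
    point_feats = rng.normal(size=(len(points), 32))
    best = int(np.argmax(point_feats @ encode_text(query)))
    return np.r_[points[best, :3], 0.5, 0.5, 0.5]

scene = rng.uniform(0.0, 5.0, size=(2048, 6))   # fake xyz+rgb scan
print(two_stage_ground(scene, "the chair next to the table"))
print(one_stage_ground(scene, "the chair next to the table"))
```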
Fully-Supervised-Two-Stage
- ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language | Github
- Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner
- Technical University of Munich, Simon Fraser University
- [ECCV2020] https://arxiv.org/abs/1912.08830
- A dataset, two-stage approach, proposal-then-selection
- ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes | Github
- Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, Leonidas Guibas
- Stanford University, King Abdullah University of Science and Technology
- [ECCV2020] https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123460409.pdf
- A dataset, two-stage approach, proposal-then-selection
- Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud | Github
- Mingtao Feng, Zhen Li, Qi Li, Liang Zhang, XiangDong Zhang, Guangming Zhu, Hui Zhang, Yaonan Wang, Ajmal Mian
- Xidian University, Hunan University, The University of Western Australia
- [ICCV2021] https://openaccess.thecvf.com/content/ICCV2021/papers/Feng_Free-Form_Description_Guided_3D_Visual_Graph_Network_for_Object_Grounding_ICCV_2021_paper.pdf
- Two-stage approach, proposal-then-selection, graph neural network
- InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring | Github
- Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Sheng Wang, Zhen Li, Shuguang Cui
- The Chinese University of Hong Kong (Shenzhen), Southern University of Science and Technology
- [ICCV2021] https://openaccess.thecvf.com/content/ICCV2021/papers/Yuan_InstanceRefer_Cooperative_Holistic_Understanding_for_Visual_Grounding_on_Point_Clouds_ICCV_2021_paper.pdf
- Two-stage approach, proposal-then-selection, segmentation
- SAT: 2D Semantics Assisted Training for 3D Visual Grounding | Github
- Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
- University of Rochester, The Chinese University of Hong Kong
- [ICCV2021] https://arxiv.org/pdf/2105.11450
- Two-stage approach, proposal-then-selection, additional multi-modal input
- Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation | Github
- Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, Tyng-Luh Liu
- Academia Sinica, National Tsing Hua University, Taiwan AI Labs, Aeolus Robotics
- [AAAI2021] https://ojs.aaai.org/index.php/AAAI/article/view/16253
- Two-stage approach, proposal-then-selection, segmentation
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding | Github
- Junha Roh, Karthik Desingh, Ali Farhadi, Dieter Fox
- University of Washington
- [CoRL2021] https://openreview.net/pdf?id=dgQdvPZnH-t
- Two-stage approach, proposal-then-selection, spatial embedding, positional encoding
- TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding | Github
- Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, Si Liu
- Beihang University, Chinese Academy of Sciences, Alibaba Group
- [ACMMM2021] https://arxiv.org/abs/2108.02388
- Two-stage approach, proposal-then-selection, entity-and-relation
- 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds | Github
- Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, Dong Xu
- Beihang University, The University of Sydney
- [CVPR2022] https://openaccess.thecvf.com/content/CVPR2022/papers/Cai_3DJCG_A_Unified_Framework_for_Joint_Dense_Captioning_and_Visual_CVPR_2022_paper.pdf
- Two-stage approach, proposal-then-selection, joint 3D captioning and grounding
- Multi-View Transformer for 3D Visual Grounding | Github
- Shijia Huang, Yilun Chen, Jiaya Jia, Liwei Wang
- The Chinese University of Hong Kong
- [CVPR2022] https://openaccess.thecvf.com/content/CVPR2022/papers/Huang_Multi-View_Transformer_for_3D_Visual_Grounding_CVPR_2022_paper.pdf
- Two-stage approach, proposal-then-selection, additional multi-view input
- 3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language |
- Ahmed Abdelreheem, Ujjwal Upadhyay, Ivan Skorokhodov, Rawan Al Yahya, Jun Chen, Mohamed Elhoseiny
- King Abdullah University of Science and Technology
- [WACV2022] https://openaccess.thecvf.com/content/WACV2022/papers/Abdelreheem_3DRefTransformer_Fine-Grained_Object_Identification_in_Real-World_Scenes_Using_Natural_Language_WACV_2022_paper.pdf
- Two-stage approach, proposal-then-selection
- D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding | Github
- Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang
- Technical University of Munich, Simon Fraser University
- [ECCV2022] https://arxiv.org/abs/2112.01551
- Two-stage approach, proposal-then-selection, joint 3D captioning and grounding
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding | Github
- Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
- PSL Research University, IIIT Hyderabad
- [NeurIPS2022] https://arxiv.org/abs/2211.09646
- Two-stage approach, proposal-then-selection, spatial relation
- Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding | Github
- Eslam Mohamed Bakr, Yasmeen Alsaedy, Mohamed Elhoseiny
- King Abdullah University of Science and Technology
- [NeurIPS2022] https://arxiv.org/abs/2211.14241
- Two-stage approach, proposal-then-selection, additional multi-modal input
- Context-aware Alignment and Mutual Masking for 3D-Language Pre-training | Github
- Zhao Jin, Munawar Hayat, Yuwei Yang, Yulan Guo, Yinjie Lei
- Sichuan University, Monash University, Sun Yat-sen University
- [CVPR2023] https://openaccess.thecvf.com/content/CVPR2023/papers/Jin_Context-Aware_Alignment_and_Mutual_Masking_for_3D-Language_Pre-Training_CVPR_2023_paper.pdf
- Two-stage approach, proposal-then-selection, pre-training
- NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Github
- Joy Hsu, Jiayuan Mao, Jiajun Wu
- Stanford University, Massachusetts Institute of Technology
- [CVPR2023] https://arxiv.org/abs/2303.13483
- Two-stage approach, proposal-then-selection, semantic learning
- 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Github
- Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
- Tsinghua University, BIGAI
- [ICCV2023] https://arxiv.org/pdf/2308.04352
- Two-stage approach, proposal-then-selection, pre-training
- Multi3DRefer: Grounding Text Description to Multiple 3D Objects | Github
- Yiming Zhang, ZeMing Gong, Angel X. Chang
- Simon Fraser University, Alberta Machine Intelligence Institute
- [ICCV2023] https://3dlg-hcvc.github.io/multi3drefer/
- A dataset, Two-stage approach, proposal-then-selection, multiple-object grounding
- UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding |
- Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang
- Technical University of Munich, Meta AI, Simon Fraser University
- [ICCV2023] https://arxiv.org/abs/2212.00836
- Two-stage approach, proposal-then-selection, joint 3D captioning and grounding
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance | Github
- Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
- Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Northwestern Polytechnical University
- [ICCV2023] https://arxiv.org/pdf/2303.16894
- Two-stage approach, proposal-then-selection, additional multi-view input
- ARKitSceneRefer: Text-based Localization of Small Objects in Diverse Real-World 3D Indoor Scenes | Github
- Shunya Kato, Shuhei Kurita, Chenhui Chu, Sadao Kurohashi
- Kyoto University, RIKEN
- [EMNLP2023] https://aclanthology.org/2023.findings-emnlp.56.pdf
- A dataset, Two-stage approach, proposal-then-selection, small object
- Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios |
- Zehan Tan, Weidong Yang, Zhiwei Wang
- Fudan University, GREE ELECTRIC APPLIANCES
- [ACMMMAsia2023] https://dl.acm.org/doi/10.1145/3595916.3626405
- Two-stage approach, proposal-then-selection, additional multi-modal input
- HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding |
- Jiaming Chen, Weixin Luo, Xiaolin Wei, Lin Ma, and Wei Zhang
- Shandong University, Meituan
- [Arxiv2022] https://arxiv.org/abs/2210.12513
- One-stage approach, proposal-then-selection, point detection
- Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding |
- Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool
- ETH Zurich
- [Arxiv2023] https://arxiv.org/abs/2309.04561
- Two-stage approach, proposal-then-selection, segmentation
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding |
- Ziyang Lu, Yunqiang Pei, Guoqing Wang, Yang Yang, Zheng Wang, Heng Tao Shen
- University of Electronic Science and Technology of China
- [Arxiv2023] https://arxiv.org/abs/2303.13186
- Two-stage approach, proposal-then-selection, additional multi-modal input
- ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes | Github
- Ahmed Abdelreheem, Kyle Olszewski, Hsin-Ying Lee, Peter Wonka, Panos Achlioptas
- King Abdullah University of Science and Technology, Snap Inc.
- [WACV2024] https://openaccess.thecvf.com/content/WACV2024/papers/Abdelreheem_ScanEnts3D_Exploiting_Phrase-to-3D-Object_Correspondences_for_Improved_Visio-Linguistic_Models_in_3D_WACV_2024_paper.pdf
- Two-stage approach, proposal-then-selection
- COT3DREF: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Github
- Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny
- King Abdullah University of Science and Technology
- [ICLR2024] https://arxiv.org/abs/2310.06214
- Two-stage approach, proposal-then-selection, Chain-of-Thoughts
- Exploiting Contextual Objects and Relations for 3D Visual Grounding | Github
- Li Yang, Chunfeng Yuan, Ziqi Zhang, Zhongang Qi, Yan Xu, Wei Liu, Ying Shan, Bing Li, Weiping Yang, Peng Li, Yan Wang, Weiming Hu
- CASIA, Tencent PCG, The Chinese University of Hong Kong, Ministry of Education, Alibaba Group, Zhejiang Linkheer Science And Technology Co., Ltd., University of Chinese Academy of Sciences, ShanghaiTech University
- [NeurIPS2023] https://papers.nips.cc/paper_files/paper/2023/hash/9b91ee0da3bcd61905fcd89e770168fc-Abstract-Conference.html
- Two-stage approach, proposal-then-selection
- Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans | Github
- Taiki Miyanishi, Daichi Azuma, Shuhei Kurita, Motoaki Kawanabe
- ATR, Kyoto University, RIKEN AIP
- [3DV2024] https://arxiv.org/abs/2305.13876
- Two-stage approach, proposal-then-selection, additional multi-modal input
- A Transformer-based Framework for Visual Grounding on 3D Point Clouds |
- Ali Solgi, Mehdi Ezoji
- Babol Noshirvani University of Technology
- [AISP2024] https://ieeexplore.ieee.org/abstract/document/10475280/
- Two-stage approach, proposal-then-selection
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding |
- Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Didier Stricker
- DFKI Augmented Vision
- [CVPR2024] https://arxiv.org/abs/2403.03077
- Two-stage approach, proposal-then-selection, spatial relation
- Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency | Github
- Yuqi Zhang, Han Luo, Yinjie Lei
- Sichuan University
- [CVPR2024] https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_Towards_CLIP-driven_Language-free_3D_Visual_Grounding_via_2D-3D_Relational_Enhancement_CVPR_2024_paper.pdf
- Two-stage approach, proposal-then-selection, additional multi-modal input
- Multi-Attribute Interactions Matter for 3D Visual Grounding | Github
- Can Xu, Yuehui Han, Rui Xu, Le Hui, Jin Xie, Jian Yang
- Nanjing University of Science and Technology, Northwestern Polytechnical University, Nanjing University
- [CVPR2024] https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Multi-Attribute_Interactions_Matter_for_3D_Visual_Grounding_CVPR_2024_paper.pdf
- Two-stage approach, proposal-then-selection, additional multi-modal input
- Viewpoint-Aware Visual Grounding in 3D Scenes |
- Xiangxi Shi, Zhonghua Wu, Stefan Lee
- Oregon State University, SenseTime Research
- [CVPR2024] https://openaccess.thecvf.com/content/CVPR2024/papers/Shi_Viewpoint-Aware_Visual_Grounding_in_3D_Scenes_CVPR_2024_paper.pdf
- Two-stage approach, proposal-then-selection, additional multi-modal input
- Advancing 3D Object Grounding Beyond a Single 3D Scene |
- Wencan Huang, Daizong Liu, Wei Hu
- Peking University
- [ACMMM2024] https://openreview.net/forum?id=KuHaYRftVN
- Two-stage approach, proposal-then-selection, flexible-object group-wise scene grounding
- Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention |
- Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh
- Purdue University
- [NeurIPS2024] https://arxiv.org/abs/2410.22306
- Two-stage approach, proposal-then-selection, multi-object grounding
- Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding |
- Sombit Dey, Ozan Unal, Christos Sakaridis, Luc Van Gool
- ETH Zurich, INSAIT, Huawei Technologies, KU Leuven
- [WACV2025] https://arxiv.org/abs/2411.03405
- Two-stage approach, novel losses
- SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | Github
- Baoxiong Jia, Yixin Chen, Huanyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
- Beijing Institute for General Artificial Intelligence
- [Arxiv2024] https://arxiv.org/abs/2401.09340
- A dataset, Two-stage approach, proposal-then-selection
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention | Github
- Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang
- South China University of Technology
- [Arxiv2024] https://arxiv.org/abs/2403.08182
- Two-stage approach, proposal-then-selection, additional multi-modal input
- DOrA: 3D Visual Grounding with Order-Aware Referring |
- Tung-Yu Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang
- National Taiwan University, NVIDIA
- [Arxiv2024] https://arxiv.org/abs/2403.16539
- Two-stage approach, proposal-then-selection, Chain-of-Thoughts
- Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding |
- Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen
- University of Science and Technology of China, University of Rochester
- [Arxiv2024] https://arxiv.org/abs/2406.08907
- Two-stage approach, proposal-then-selection, spatial embedding
- R2G: Reasoning to Ground in 3D Scenes |
- Yixuan Li, Zan Wang, Wei Liang
- n/a
- [Arxiv2024] https://arxiv.org/abs/2408.13499
- Two-stage approach, proposal-then-selection, scene graph reasoning
Fully-Supervised-One-Stage
- 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds | Github
- Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu
- Beihang University, The University of Sydney
- [ICCV2021] https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_3DVG-Transformer_Relation_Modeling_for_Visual_Grounding_on_Point_Clouds_ICCV_2021_paper.pdf
- One-stage approach, unified detection-interaction
- 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection | Github
- Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, Si Liu
- Beihang University, Meituan Inc.
- [CVPR2022] https://openaccess.thecvf.com/content/CVPR2022/papers/Luo_3D-SPS_Single-Stage_3D_Visual_Grounding_via_Referred_Point_Progressive_Selection_CVPR_2022_paper.pdf
- One-stage approach, regression-based
- Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Github
- Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki
- Carnegie Mellon University, Meta AI
- [ECCV2022] https://arxiv.org/abs/2112.08879
- One-stage approach, unified detection-interaction
- EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding | Github
- Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang
- Peking University, The Chinese University of Hong Kong, Peng Cheng Laboratory, Shanghai AI Laboratory
- [CVPR2023] https://arxiv.org/abs/2209.14941
- One-stage approach, unified detection-interaction, text-decoupling, dense
- Dense Object Grounding in 3D Scenes |
- Wencan Huang, Daizong Liu, Wei Hu
- Peking University
- [ACMMM2023] https://arxiv.org/abs/2309.02224
- One-stage approach, unified detection-interaction, transformer
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding |
- Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
- Zhejiang University, ByteDance
- [EMNLP2023] https://aclanthology.org/2023.emnlp-main.656/
- One-stage approach, unified detection-interaction, relative position
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Github
- Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
- Shanghai AI Lab, Beihang University, The Chinese University of Hong Kong (Shenzhen), Fudan University, Dalian University of Technology, The University of Sydney
- [NeurIPS2023] https://arxiv.org/abs/2306.06687
- A dataset, One-stage approach, regression-based, multi-task
- PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues |
- Md Mofijul Islam, Alexi Gladstone, Tariq Iqbal
- University of Virginia
- [AAAI2023] https://ojs.aaai.org/index.php/AAAI/article/view/25177
- A dataset, One-stage approach, multi-task, multi-modal
- Toward Fine-Grained 3D Visual Grounding through Referring Textual Phrases | Github
- Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li
- CUHK-Shenzhen, Shanghai Jiao Tong University
- [Arxiv2023] https://arxiv.org/abs/2207.01821
- A dataset, One-stage approach, unified detection-interaction
- A Unified Framework for 3D Point Cloud Visual Grounding | Github
- Haojia Lin, Yongdong Luo, Xiawu Zheng, Lijiang Li, Fei Chao, Taisong Jin, Donghao Luo, Yan Wang, Liujuan Cao, Rongrong Ji
- Xiamen University, Peng Cheng Laboratory
- [Arxiv2023] https://arxiv.org/abs/2308.11887
- One-stage approach, unified detection-interaction, superpoint
- Uni3DL: Unified Model for 3D and Language Understanding |
- Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny
- King Abdullah University of Science and Technology, Ecole Polytechnique
- [Arxiv2023] https://arxiv.org/abs/2312.03026
- One-stage approach, regression-based, multi-task
- 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Github
- Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun
- Xiamen University
- [AAAI2024] https://arxiv.org/abs/2308.16632
- One-stage approach, unified detection-interaction, superpoint
- Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding |
- Taolin Zhang, Sunan He, Tao Dai, Zhi Wang, Bin Chen, Shu-Tao Xia
- Tsinghua University, Hong Kong University of Science and Technology, Shenzhen University, Harbin Institute of Technology (Shenzhen), Peng Cheng Laboratory
- [AAAI2024] https://arxiv.org/abs/2305.10714
- One-stage approach, regression-based, pre-training
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Github
- Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
- The Chinese University of Hong Kong (Shenzhen), A*STAR, The University of Hong Kong
- [CVPR2024] https://arxiv.org/abs/2311.15383
- One-stage approach, zero-shot, data construction
- G3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding |
- Yuan Wang, Yali Li, Shengjin Wang
- Tsinghua University, Beijing National Research Center for Information Science and Technology
- [CVPR2024] https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_G3-LQ_Marrying_Hyperbolic_Alignment_with_Explicit_Semantic-Geometric_Modeling_for_3D_CVPR_2024_paper.pdf
- One-stage approach, regression-based
- PointCloud-Text Matching: Benchmark Datasets and a Baseline |
- Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu
- Sichuan University, A*STAR
- [Arxiv2024] https://arxiv.org/abs/2403.19386
- A dataset, One-stage approach, regression-based, pre-training
- PD-TPE: Parallel Decoder with Text-guided Position Encoding for 3D Visual Grounding |
- Chenshu Hou, Liang Peng, Xiaopei Wu, Wenxiao Wang, Xiaofei He
- Zhejiang University, FABU Inc.
- [Arxiv2024] https://arxiv.org/abs/2407.14491
- A dataset, One-stage approach
- Grounding 3D Scene Affordance From Egocentric Interactions |
- Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo, Sen Liang, Yang Cao, Zheng-Jun Zha
- University of Science and Technology of China, Northeastern University
- [Arxiv2024] https://arxiv.org/abs/2409.19650
- A dataset, One-stage approach, video
- Multi-branch Collaborative Learning Network for 3D Visual Grounding | Github
- Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
- Xiamen University
- [ECCV2024] https://arxiv.org/abs/2407.05363
- One-stage approach, regression-based
- Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding |
- Yang Liu, Daizong Liu, Wei Hu
- Peking University
- [ICPR2024] https://arxiv.org/abs/2410.15615
- Joint top-down and bottom-up, One-stage approach
Weakly-Supervised
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding | Github
- Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
- Zhejiang University
- [ICCV2023] https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Distilling_Coarse-to-Fine_Semantic_Matching_Knowledge_for_Weakly_Supervised_3D_Visual_ICCV_2023_paper.pdf
- weakly-supervised, two-stage matching, pseudo label
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment |
- Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
- Meituan
- [Arxiv2023] https://arxiv.org/abs/2312.09625
- weakly-supervised, two-stage matching, multi-modal
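Both entries above follow the same weak-supervision recipe: with no ground-truth boxes, the proposal that best matches the sentence is treated as a pseudo label for training. A minimal sketch under that assumption, with random features standing in for real encoders:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_label(proposal_feats: np.ndarray, query_feat: np.ndarray) -> int:
    # Coarse matching: score every detected proposal against the sentence
    # embedding and take the argmax as the pseudo ground truth, which then
    # supervises the grounding head in place of human box annotations.
    return int(np.argmax(proposal_feats @ query_feat))

print(pseudo_label(rng.normal(size=(8, 32)), rng.normal(size=32)))
```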
Semi-Supervised
- Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning |
- Yang Liu, Daizong Liu, Zongming Guo, Wei Hu
- Peking University
- [ACMMM2024] https://openreview.net/forum?id=OmymuhXWcn
- semi-supervised, cross-task teacher-student framework, joint 3D grounding and captioning
- Bayesian Self-Training for Semi-Supervised 3D Segmentation |
- Ozan Unal, Christos Sakaridis, Luc Van Gool
- ETH Zurich, Huawei Technologies, KU Leuven, INSAIT
- [ECCV2024] https://arxiv.org/abs/2409.08102
- semi-supervised, self-training
Other-Modality
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images | Github
- Haolin Liu, Anran Lin, Xiaoguang Han, Lei Yang, Yizhou Yu, Shuguang Cui
- CUHK-Shenzhen, Deepwise AI Lab, The University of Hong Kong
- [CVPR2021] https://arxiv.org/pdf/2103.07894
- No point cloud input, RGB-D image
- PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues |
- Md Mofijul Islam, Alexi Gladstone, Tariq Iqbal
- University of Virginia
- [AAAI2023] https://ojs.aaai.org/index.php/AAAI/article/view/25177
- No point cloud input, multi-view
- Mono3DVG: 3D Visual Grounding in Monocular Images | Github
- Yang Zhan, Yuan Yuan, Zhitong Xiong
- Northwestern Polytechnical University, Technical University of Munich
- [AAAI2024] https://arxiv.org/pdf/2312.08022
- No point cloud input, monocular image
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | Github
- Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang
- Shanghai AI Laboratory, Shanghai Jiao Tong University, The University of Hong Kong, The Chinese University of Hong Kong, Tsinghua University
- [CVPR2024] https://arxiv.org/abs/2312.16170
- A dataset, No point cloud input, RGB-D image
- WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language |
- Zhenxiang Lin, Xidong Peng, Peishan Cong, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma
- ShanghaiTech University, Shanghai AI Laboratory, The Chinese University of Hong Kong
- [Arxiv2023] https://arxiv.org/abs/2304.05645
- No point cloud input, wild point cloud, additional multi-modal input
- HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models |
- Vineet Bhat, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
- New York University
- [Arxiv2024] https://arxiv.org/abs/2409.10419
- No point cloud input, RGB image
LLMs-based
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance | Github
- Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
- Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Northwestern Polytechnical University
- [ICCV2023] https://arxiv.org/pdf/2303.16894
- LLMs-based, enriching text description
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Github
- Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
- Shanghai AI Lab, Beihang University, The Chinese University of Hong Kong (Shenzhen), Fudan University, Dalian University of Technology, The University of Sydney
- [NeurIPS2023] https://arxiv.org/abs/2306.06687
- LLMs-based, LLM architecture
- Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning |
- Jiading Fang, Xiangshan Tan, Shengjie Lin, Hongyuan Mei, Matthew R. Walter
- Toyota Technological Institute at Chicago
- [CoRL2023] https://openreview.net/forum?id=7j3sdUZMTF
- LLMs-based, enriching text description
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent |
- Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai
- University of Michigan, New York University
- [Arxiv2023] https://arxiv.org/abs/2309.12311
- LLMs-based, enriching text description
- Mono3DVG: 3D Visual Grounding in Monocular Images | Github
- Yang Zhan, Yuan Yuan, Zhitong Xiong
- Northwestern Polytechnical University, Technical University of Munich
- [AAAI2024] https://arxiv.org/pdf/2312.08022
- LLMs-based, enriching text description
- COT3DREF: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Github
- Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny
- King Abdullah University of Science and Technology
- [ICLR2024] https://arxiv.org/abs/2310.06214
- LLMs-based, Chain-of-Thoughts, reasoning
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Github
- Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
- The Chinese University of Hong Kong (Shenzhen), A*STAR, The University of Hong Kong
- [CVPR2024] https://arxiv.org/abs/2311.15383
- LLMs-based, construct text description
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | Github
- Chun Feng, Joy Hsu, Weiyu Liu, Jiajun Wu
- Stanford University
- [CVPR2024] https://arxiv.org/abs/2404.19696
- LLMs-based, language constraint
- 3DMIT: 3D MULTI-MODAL INSTRUCTION TUNING FOR SCENE UNDERSTANDING | Github
- Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu
- Beijing University of Posts and Telecommunications, Beijing Digital Native Digital City Research Center, Peking University, Beihang University, Beijing University of Science and Technology
- [Arxiv2024] https://arxiv.org/abs/2401.03201
- LLMs-based, LLM architecture
- DOrA: 3D Visual Grounding with Order-Aware Referring |
- Tung-Yu Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang
- National Taiwan University, NVIDIA
- [Arxiv2024] https://arxiv.org/abs/2403.16539
- LLMs-based, Chain-of-Thoughts
- SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | Github
- Baoxiong Jia, Yixin Chen, Huanyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
- Beijing Institute for General Artificial Intelligence
- [Arxiv2024] https://arxiv.org/abs/2401.09340
- A dataset, LLMs-based, LLM architecture
- Language-Image Models with 3D Understanding | Github
- Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone
- UT Austin, NVIDIA Research
- [Arxiv2024] https://arxiv.org/abs/2405.03685
- A dataset, LLMs-based
- Task-oriented Sequential Grounding in 3D Scenes | Github
- Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
- BIGAI, Tsinghua University, Beijing Institute of Technology
- [Arxiv2024] https://arxiv.org/abs/2408.04034
- A dataset, LLMs-based
- Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Github
- Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
- University of Illinois Urbana-Champaign, Carnegie Mellon University
- [Arxiv2024] https://arxiv.org/abs/2409.03757
- Foundation model
- Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning |
- Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
- Illinois Institute of Technology, Zhejiang University, University of Central Florida, University of Illinois at Chicago
- [Arxiv2024] https://arxiv.org/abs/2410.00255
- LLMs-based
- Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems | Github
- Qihao Yuan, Jiaming Zhang, Kailai Li, Rainer Stiefelhagen
- Karlsruhe Institute of Technology, University of Groningen
- [Arxiv2024] https://arxiv.org/abs/2411.14594
- LLMs-based, zero-shot
- SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | Github
- Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang
- HKUST, A*STAR, National University of Singapore
- [Arxiv2024] https://arxiv.org/abs/2412.04383
- LLMs-based, zero-shot
- Empowering 3D Visual Grounding with Reasoning Capabilities | Github
- Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
- The University of Hong Kong, Shanghai AI Laboratory
- [ECCV2024] https://arxiv.org/abs/2407.01525
- LLMs-based, LLM architecture, A dataset
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | Github
- Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin
- The Chinese University of Hong Kong, Zhejiang University, Shanghai AI Laboratory, Centre for Perceptual and Interactive Intelligence
- [CoRL2024] https://arxiv.org/abs/2410.13860
- LLMs-based, zero-shot
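Several zero-shot entries above share one pattern: detect objects with an off-the-shelf 3D pipeline, serialize them into text, and let an LLM pick the referent. The sketch below illustrates that agent loop; the object schema and the ask_llm callable are assumptions standing in for any detector and chat backend, not a specific paper's API.

```python
import numpy as np

def llm_ground(objects: list, query: str, ask_llm) -> np.ndarray:
    """Zero-shot grounding by prompting (illustrative).

    objects: [{"label": str, "box": (6,) ndarray}, ...] from any detector.
    ask_llm: any str -> str chat-completion backend.
    """
    listing = "\n".join(
        f"{i}: {o['label']} at {np.round(o['box'][:3], 2).tolist()}"
        for i, o in enumerate(objects)
    )
    prompt = (f"Objects in the scene:\n{listing}\n"
              f"Query: {query}\n"
              "Reply with only the index of the referred object.")
    return objects[int(ask_llm(prompt).strip())]["box"]

# Usage with a dummy backend that always answers "1":
objs = [{"label": "table", "box": np.r_[1.0, 2.0, 0.4, 1.2, 0.8, 0.7]},
        {"label": "chair", "box": np.r_[1.8, 2.1, 0.5, 0.5, 0.5, 0.9]}]
print(llm_ground(objs, "the chair next to the table", lambda p: "1"))
```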
Outdoor-Scenes
- Language Prompt for Autonomous Driving | Github
- Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen
- Beijing Institute of Technology, University of Macau, MEGVII Technology, Beijing Academy of Artificial Intelligence
- [Arxiv2023] https://arxiv.org/abs/2309.04379
- Outdoor scene, autonomous driving
- Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | Github
- Runwei Guan, Ruixiao Zhang, Ningwei Ouyang, Jianan Liu, Ka Lok Man, Xiaohao Cai, Ming Xu, Jeremy Smith, Eng Gee Lim, Yutao Yue, Hui Xiong
- JITRI, University of Liverpool, University of Southampton, Vitalent Consulting, Xi’an Jiaotong-Liverpool University, HKUST (GZ)
- [Arxiv2024] https://arxiv.org/abs/2405.12821
- Outdoor scene, autonomous driving
- Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding |
- Yuhang Liu, Boyi Sun, Guixu Zheng, Yishuo Wang, Jing Wang, Fei-Yue Wang
- Chinese Academy of Sciences, South China Agricultural University, Beijing Institute of Technology
- [Arxiv2024] https://arxiv.org/abs/2405.15274
- Outdoor scene, autonomous driving
- LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers |
- Yeong-Seung Baek, Heung-Seon Oh
- Korea University of Technology and Education
- [Arxiv2024] https://arxiv.org/abs/2411.04351
- Outdoor scene, autonomous driving