Awesome-LLM-3D

<div align="center"> <img src="assets/Figure1_v6.png" width="100%"> </div>

🏠 About

This is a curated list of papers on 3D-related tasks empowered by Large Language Models (LLMs). It covers a range of tasks, including 3D understanding, reasoning, generation, and embodied agents. To give a fuller picture of the area, we also include work built on other foundation models such as CLIP and SAM.

This is an active repository; watch it to follow the latest advances. If you find it useful, please star ⭐ this repo and cite the paper.

3D Understanding via LLM

| Date | Keywords | Institute (first) | Paper | Publication | Others |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 2024-10-12 | Situation3D | UIUC | Situational Awareness Matters in 3D Vision Language Reasoning | CVPR '24 | project |
| 2024-09-28 | LLaVA-3D | HKU | LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness | Arxiv | project |
| 2024-09-08 | MSR3D | BIGAI | Multi-modal Situated Reasoning in 3D Scenes | NeurIPS '24 | project |
| 2024-08-28 | GreenPLM | HUST | More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding | Arxiv | github |
| 2024-06-17 | LLaNA | UniBO | LLaNA: Large Language and NeRF Assistant | NeurIPS '24 | project |
| 2024-06-07 | SpatialPIN | Oxford | SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | NeurIPS '24 | project |
| 2024-06-03 | SpatialRGPT | UCSD | SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | NeurIPS '24 | github |
| 2024-05-02 | MiniGPT-3D | HUST | MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors | ACM MM '24 | project |
| 2024-02-27 | ShapeLLM | XJTU | ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Arxiv | project |
| 2024-01-22 | SpatialVLM | Google DeepMind | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | CVPR '24 | project |
| 2023-12-21 | LiDAR-LLM | PKU | LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding | Arxiv | project |
| 2023-12-15 | 3DAP | Shanghai AI Lab | 3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V | Arxiv | project |
| 2023-12-13 | Chat-Scene | ZJU | Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | NeurIPS '24 | github |
| 2023-12-5 | GPT4Point | HKU | GPT4Point: A Unified Framework for Point-Language Understanding and Generation | Arxiv | github |
| 2023-11-30 | LL3DA | Fudan University | LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | Arxiv | github |
| 2023-11-26 | ZSVG3D | CUHK(SZ) | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Arxiv | project |
| 2023-11-18 | LEO | BIGAI | An Embodied Generalist Agent in 3D World | Arxiv | github |
| 2023-10-14 | JM3D-LLM | Xiamen University | JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues | ACM MM '23 | github |
| 2023-10-10 | Uni3D | BAAI | Uni3D: Exploring Unified 3D Representation at Scale | ICLR '24 | project |
| 2023-9-27 | - | KAUST | Zero-Shot 3D Shape Correspondence | Siggraph Asia '23 | - |
| 2023-9-21 | LLM-Grounder | U-Mich | LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent | ICRA '24 | github |
| 2023-9-1 | Point-Bind | CUHK | Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following | Arxiv | github |
| 2023-8-31 | PointLLM | CUHK | PointLLM: Empowering Large Language Models to Understand Point Clouds | ECCV '24 | github |
| 2023-8-17 | Chat-3D | ZJU | Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes | Arxiv | github |
| 2023-8-8 | 3D-VisTA | BIGAI | 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | ICCV '23 | github |
| 2023-7-24 | 3D-LLM | UCLA | 3D-LLM: Injecting the 3D World into Large Language Models | NeurIPS '23 | github |
| 2023-3-29 | ViewRefer | CUHK | ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding | ICCV '23 | github |
| 2022-9-12 | - | MIT | Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding | Arxiv | github |

3D Understanding via other Foundation Models

| Date | Keywords | Institute (first) | Paper | Publication | Others |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 2024-10-12 | Lexicon3D | UIUC | Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | NeurIPS '24 | project |
| 2024-10-07 | Diff2Scene | CMU | Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models | ECCV 2024 | project |
| 2024-04-07 | Any2Point | Shanghai AI Lab | Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding | ECCV 2024 | github |
| 2024-03-16 | N2F2 | Oxford-VGG | N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields | Arxiv | - |
| 2023-12-17 | SAI3D | PKU | SAI3D: Segment Any Instance in 3D Scenes | Arxiv | project |
| 2023-12-17 | Open3DIS | VinAI | Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance | Arxiv | project |
| 2023-11-6 | OVIR-3D | Rutgers University | OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data | CoRL '23 | github |
| 2023-10-29 | OpenMask3D | ETH | OpenMask3D: Open-Vocabulary 3D Instance Segmentation | NeurIPS '23 | project |
| 2023-10-5 | Open-Fusion | - | Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation | Arxiv | github |
| 2023-9-22 | OV-3DDet | HKUST | CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | NeurIPS '23 | github |
| 2023-9-19 | LAMP | - | From Language to 3D Worlds: Adapting Language Model for Point Cloud Perception | OpenReview | - |
| 2023-9-15 | OpenNerf | - | OpenNerf: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views | OpenReview | github |
| 2023-9-1 | OpenIns3D | Cambridge | OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation | Arxiv | project |
| 2023-6-7 | Contrastive Lift | Oxford-VGG | Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion | NeurIPS '23 | github |
| 2023-6-4 | Multi-CLIP | ETH | Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | Arxiv | - |
| 2023-5-23 | 3D-OVS | NTU | Weakly Supervised 3D Open-vocabulary Segmentation | NeurIPS '23 | github |
| 2023-5-21 | VL-Fields | University of Edinburgh | VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations | ICRA '23 | project |
| 2023-5-8 | CLIP-FO3D | Tsinghua University | CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | ICCVW '23 | - |
| 2023-4-12 | 3D-VQA | ETH | CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | CVPRW '23 | github |
| 2023-4-3 | RegionPLC | HKU | RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding | Arxiv | project |
| 2023-3-20 | CG3D | JHU | CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition | Arxiv | github |
| 2023-3-16 | LERF | UC Berkeley | LERF: Language Embedded Radiance Fields | ICCV '23 | github |
| 2023-2-14 | ConceptFusion | MIT | ConceptFusion: Open-set Multimodal 3D Mapping | RSS '23 | project |
| 2023-1-12 | CLIP2Scene | HKU | CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP | CVPR '23 | github |
| 2022-12-1 | UniT3D | TUM | UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | ICCV '23 | github |
| 2022-11-29 | PLA | HKU | PLA: Language-Driven Open-Vocabulary 3D Scene Understanding | CVPR '23 | github |
| 2022-11-28 | OpenScene | ETHz | OpenScene: 3D Scene Understanding with Open Vocabularies | CVPR '23 | github |
| 2022-10-11 | CLIP-Fields | NYU | CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory | Arxiv | project |
| 2022-7-23 | Semantic Abstraction | Columbia | Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models | CoRL '22 | project |
| 2022-4-26 | ScanNet200 | TUM | Language-Grounded Indoor 3D Semantic Segmentation in the Wild | ECCV '22 | project |

3D Reasoning

| Date | Keywords | Institute (first) | Paper | Publication | Others |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 2023-5-20 | 3D-CLR | UCLA | 3D Concept Learning and Reasoning from Multi-View Images | CVPR '23 | github |
| - | Transcribe3D | TTI, Chicago | Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning | CoRL '23 | github |

3D Generation

| Date | Keywords | Institute | Paper | Publication | Others |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 2023-11-29 | ShapeGPT | Fudan University | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | Arxiv | github |
| 2023-11-27 | MeshGPT | TUM | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | Arxiv | project |
| 2023-10-19 | 3D-GPT | ANU | 3D-GPT: Procedural 3D Modeling with Large Language Models | Arxiv | github |
| 2023-9-21 | LLMR | MIT | LLMR: Real-time Prompting of Interactive Worlds using Large Language Models | Arxiv | - |
| 2023-9-20 | DreamLLM | MEGVII | DreamLLM: Synergistic Multimodal Comprehension and Creation | Arxiv | github |
| 2023-4-1 | ChatAvatar | Deemos Tech | DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance | ACM TOG | website |

3D Embodied Agent

| Date | Keywords | Institute | Paper | Publication | Others |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 2024-01-22 | SpatialVLM | Deepmind | SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | CVPR '24 | project |
| 2023-11-27 | Dobb-E | NYU | On Bringing Robots Home | Arxiv | github |
| 2023-11-26 | STEVE | ZJU | See and Think: Embodied Agent in Virtual Environment | Arxiv | github |
| 2023-11-18 | LEO | BIGAI | An Embodied Generalist Agent in 3D World | Arxiv | github |
| 2023-9-14 | UniHSI | Shanghai AI Lab | Unified Human-Scene Interaction via Prompted Chain-of-Contacts | Arxiv | github |
| 2023-7-28 | RT-2 | Google-DeepMind | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Arxiv | github |
| 2023-7-12 | SayPlan | QUT Centre for Robotics | SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning | CoRL '23 | github |
| 2023-7-12 | VoxPoser | Stanford | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | Arxiv | github |
| 2022-12-13 | RT-1 | Google | RT-1: Robotics Transformer for Real-World Control at Scale | Arxiv | github |
| 2022-12-8 | LLM-Planner | The Ohio State University | LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models | ICCV '23 | github |
| 2022-10-11 | CLIP-Fields | NYU, Meta | CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory | RSS '23 | github |
| 2022-09-20 | NLMap-SayCan | Google | Open-vocabulary Queryable Scene Representations for Real World Planning | ICRA '23 | github |

3D Benchmarks

| Date | Keywords | Institute | Paper | Publication | Others |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 2024-09-08 | MSQA / MSNN | BIGAI | Multi-modal Situated Reasoning in 3D Scenes | NeurIPS '24 | project |
| 2024-06-10 | 3D-GRAND / 3D-POPE | UMich | 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination | Arxiv | project |
| 2024-06-03 | SpatialRGPT-Bench | UCSD | SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | NeurIPS '24 | github |
| 2024-1-18 | SceneVerse | BIGAI | SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | Arxiv | github |
| 2023-12-26 | EmbodiedScan | Shanghai AI Lab | EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | Arxiv | github |
| 2023-12-17 | M3DBench | Fudan University | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Arxiv | github |
| 2023-11-29 | - | DeepMind | Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects | Arxiv | github |
| 2023-09-14 | CrossCoherence | UniBO | Looking at words and points with attention: a benchmark for text-to-shape coherence | ICCV '23 | github |
| 2022-10-14 | SQA3D | BIGAI | SQA3D: Situated Question Answering in 3D Scenes | ICLR '23 | github |
| 2021-12-20 | ScanQA | RIKEN AIP | ScanQA: 3D Question Answering for Spatial Scene Understanding | CVPR '23 | github |
| 2020-12-3 | Scan2Cap | TUM | Scan2Cap: Context-aware Dense Captioning in RGB-D Scans | CVPR '21 | github |
| 2020-8-23 | ReferIt3D | Stanford | ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes | ECCV '20 | github |
| 2019-12-18 | ScanRefer | TUM | ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language | ECCV '20 | github |

Contributing

Your contributions are always welcome!

I will keep some pull requests open when I'm not sure whether they are awesome for 3D LLMs; you can vote for them by adding 👍.

If you have any questions about this opinionated list, please get in touch at xianzheng@robots.ox.ac.uk or WeChat ID: mxz1997112.

Citation

If you find this repository useful, please consider citing this paper:

@misc{ma2024llmsstep3dworld,
      title={When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models}, 
      author={Xianzheng Ma and Yash Bhalgat and Brandon Smart and Shuai Chen and Xinghui Li and Jian Ding and Jindong Gu and Dave Zhenyu Chen and Songyou Peng and Jia-Wang Bian and Philip H Torr and Marc Pollefeys and Matthias Nießner and Ian D Reid and Angel X. Chang and Iro Laina and Victor Adrian Prisacariu},
      year={2024},
      journal={arXiv preprint arXiv:2405.10255},
}

Acknowledgement

This repo is inspired by Awesome-LLM.