
<p align="center"> <h1 align="center">Towards Open Vocabulary Learning: A Survey</h1> <p align="center"> <b> T-PAMI, 2024 </b> <br /> <a href="https://jianzongwu.github.io/"><strong>Jianzong Wu <sup>*</sup></strong></a> · <a href="https://lxtgh.github.io/"><strong> Xiangtai Li <sup>*</sup> </strong></a> · <a href="https://xushilin1.github.io/"><strong>Shilin Xu <sup>*</sup></strong></a> · <a href="https://yuanhaobo.me/"><strong>Haobo Yuan <sup>*</sup></strong></a> · <a href="https://henghuiding.github.io/"><strong>Henghui Ding</strong></a> · <a href="https://iboing.github.io/"><strong>Yibo Yang</strong></a> · <a href="https://xialipku.github.io/"><strong>Xia Li</strong></a> · <a href="https://zhangzjn.github.io/"><strong>Jiangning Zhang</strong></a> · <a href="https://scholar.google.com/citations?user=T4gqdPkAAAAJ&hl=zh-CN"><strong>Yunhai Tong</strong></a> · <a href="http://scholar.google.com/citations?user=IL3mSioAAAAJ&hl=zh-CN"><strong>Xudong Jiang</strong></a> · <a href="https://scholar.google.com/citations?user=rVsGTeEAAAAJ&hl=zh-CN"><strong>Bernard Ghanem</strong></a> · <a href="https://scholar.google.com/citations?user=RwlJNLcAAAAJ&hl=zh-CN"><strong>Dacheng Tao</strong></a> </p> <p align="center"> <a href='https://arxiv.org/abs/2306.15880'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <a href='https://ieeexplore.ieee.org/document/10420487'> <img src='https://img.shields.io/badge/TPAMI-PDF-blue?style=flat&logo=IEEE&logoColor=green' alt='TPAMI PDF'> </a> </p> <br />

This repo is used for recording, tracking, and benchmarking several recent open vocabulary methods to supplement our survey. If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

🔥Add Your Paper in our Repo and Survey!!!!!

[-] You are welcome to open an issue or PR for your open vocabulary learning work!!!!!

[-] Note: Due to the huge number of papers on arXiv, we are sorry that we cannot cover all of them in our survey. You can directly open a PR against this repo, and we will record it for the next version of our survey.

[-] Our survey will be updated in 2024.3.

🔥New

[-] Our work has been accepted by T-PAMI!!! 🔥🔥🔥

[-] We updated the GitHub repo to record the papers available as of 2024/1/10.

[-] We updated the GitHub repo to record the papers available as of 2023/7/20.

🔥Highlight!!

[1] The first survey for open vocabulary learning, including open vocabulary detection/segmentation/tracking.

[2] It also contains several related domains, including foundation model tuning and open-world detection.

[3] We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.

Introduction

This paper presents the first detailed survey of open vocabulary tasks, including open-vocabulary object detection, open-vocabulary segmentation, and 3D/video open-vocabulary tasks.


Summary of Contents

Methods: A Survey

Keywords

Open Vocabulary Object Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2021 | CVPR | cap. | Open-Vocabulary Object Detection Using Captions | Code |
| 2022 | ICLR | vlm. | Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | Code |
| 2022 | CVPR | cap., vlm., pre. | RegionCLIP: Region-based Language-Image Pretraining | Code |
| 2022 | CVPR | vlm. | Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | Code |
| 2022 | CVPR | vlm., cap. | Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation | Code |
| 2022 | CVPR | cap., vlm. | Grounded Language-Image Pre-training | Code |
| 2022 | NeurIPS | cap., vlm. | GLIPv2: Unifying Localization and VL Understanding | Code |
| 2022 | GCPR | cap. | Localized Vision-Language Matching for Open-vocabulary Object Detection | Code |
| 2022 | ECCV | vlm. | Open-Vocabulary DETR with Conditional Matching | Code |
| 2022 | ECCV | vlm., cap., pl. | Open Vocabulary Object Detection with Pseudo Bounding-Box Labels | Code |
| 2022 | ECCV | vlm. | Promptdet: Towards open-vocabulary detection using uncurated images | Code |
| 2022 | ECCV | vlm., pl., w/o ps. | Detecting Twenty-thousand Classes using Image-level Supervision | Code |
| 2022 | ECCV | vlm., pl. | Exploiting unlabeled data with vision and language models for object detection | Code |
| 2022 | ECCV | vlm., cap. | Simple Open-Vocabulary Object Detection with Vision Transformers | Code |
| 2022 | NeurIPS | vlm., pl. | Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection | Code |
| 2022 | NeurIPS | vlm., cap. | DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | N/A |
| 2022 | arXiv | vlm. | Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization | Code |
| 2022 | arXiv | vlm., pl. | P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection | N/A |
| 2023 | ICLR | vlm., pl. | Learning Object-Language Alignments for Open-Vocabulary Object Detection | Code |
| 2023 | ICLR | vlm. | F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | Code |
| 2023 | CVPR | other., vlm. | Learning to Detect and Segment for Open Vocabulary Object Detection | N/A |
| 2023 | CVPR | vlm., cap. | Aligning Bag of Regions for Open-Vocabulary Object Detection | Code |
| 2023 | CVPR | vlm. | Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | Code |
| 2023 | CVPR | vlm. | CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | N/A |
| 2023 | CVPR | vlm., pl. | DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | N/A |
| 2023 | CVPR | vlm. | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | N/A |
| 2023 | ICML | vlm. | Multi-Modal Classifiers for Open-Vocabulary Object Detection | Project |
| 2023 | arXiv | vlm. | GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning | N/A |
| 2023 | arXiv | vlm., cap. | Enhancing the Role of Context in Region-Word Alignment for Object Detection | N/A |
| 2023 | arXiv | cap., pl. | Open-Vocabulary Object Detection using Pseudo Caption Labels | N/A |
| 2023 | arXiv | vlm., pl. | Three ways to improve feature alignment for open vocabulary detection | N/A |
| 2023 | arXiv | vlm. | Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection | N/A |
| 2023 | TMLR | vlm., cap., pl. | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | N/A |
| 2023 | NeurIPS | vlm., cap., pl. | Scaling Open-Vocabulary Object Detection | N/A |
| 2023 | arXiv | vlm. | Open-Vocabulary Object Detection via Scene Graph Discovery | N/A |
| 2023 | ICCV | vlm. | Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection | Code |
| 2023 | ICCV | vlm. | EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment | Code |
| 2023 | KDD | vlm. | What Makes Good Open-Vocabulary Detector: A Disassembling Perspective | N/A |
| 2023 | NeurIPS | vlm. | CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection | Code |
| 2023 | arXiv | vlm. | DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection | Code |
| 2023 | arXiv | vlm. | Taming Self-Training for Open-Vocabulary Object Detection | Code |
| 2023 | arXiv | unify., vlm., pre. | CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | Code |
| 2023 | BMVC | vlm. | Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization | N/A |
| 2024 | AAAI | vlm. | Simple Image-level Classification Improves Open-vocabulary Object Detection | Code |
| 2024 | AAAI | vlm. | ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection | Code |
| 2024 | AAAI | unify., vlm., pre. | CLIM: Contrastive Language-Image Mosaic for Region Representation | Code |
| 2024 | WACV | vlm. | LP-OVOD: Open-Vocabulary Object Detection by Linear Probing | Code |
| 2024 | CVPR | vlm. | YOLO-World: Real-Time Open-Vocabulary Object Detection | Code |
| 2024 | CVPR | bench | The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding | Project |
| 2024 | ICLR | vlm. | LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors | N/A |
| 2024 | arXiv | vlm. | Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection | Code |

Open Vocabulary Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2023 | CVPR | unify., vlm. | Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation | Code |
| 2023 | CVPR | unify., vlm. | FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation | Code |
| 2023 | arXiv | unify., vlm. | OpenSD: Unified Open-Vocabulary Segmentation and Detection | Code |

Semantic Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2022 | ICLR | vlm. | Language-driven Semantic Segmentation | Code |
| 2022 | CVPR | cap., w/o ps. | GroupViT: Semantic Segmentation Emerges from Text Supervision | Code |
| 2022 | CVPR | vlm. | ZegFormer: Decoupling Zero-Shot Semantic Segmentation | Code |
| 2022 | ECCV | cap., vlm. | Scaling Open-Vocabulary Image Segmentation with Image-Level Labels | N/A |
| 2022 | ECCV | vlm, pl, w/o ps. | Extract Free Dense Labels from CLIP | Code |
| 2022 | ECCV | vlm. | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model | Code |
| 2022 | ECCV | vlm., cap., w/o ps. | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | N/A |
| 2022 | BMVC | vlm. | Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models | Code |
| 2022 | arXiv | vlm., cap., pl, w/o ps. | Perceptual Grouping in Contrastive Vision-Language Models | Code |
| 2022 | arXiv | vlm., cap., pl, w/o ps. | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | Code |
| 2022 | arXiv | vlm., cap., w/o ps. | Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning | N/A |
| 2023 | CVPR | vlm., pre. | Generalized Decoding for Pixel, Image, and Language | Code |
| 2023 | CVPR | vlm., pl. | Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | Code |
| 2023 | CVPR | cap., vlm., w/o ps. | Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision | Code |
| 2023 | CVPR | vlm. | Side Adapter Network for Open-Vocabulary Semantic Segmentation | Code |
| 2023 | arXiv | vlm., unify | A Simple Framework for Open-Vocabulary Segmentation and Detection | Code |
| 2023 | arXiv | vlm. | Global Knowledge Calibration for Fast Open-Vocabulary Segmentation | N/A |
| 2023 | arXiv | vlm. | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation | Code |
| 2023 | arXiv | vlm., unify | Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | Code |
| 2023 | arXiv | vlm., unify | Segment Everything Everywhere All at Once | Code |
| 2023 | arXiv | vlm. | MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation | N/A |
| 2023 | arXiv | vlm. | TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation | N/A |
| 2023 | arXiv | vlm., w/o ps., sam | Exploring Open-Vocabulary Semantic Segmentation without Human Labels | N/A |
| 2023 | arXiv | vlm., unify | DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model | N/A |
| 2023 | arXiv | diff. | Diffusion Models for Zero-Shot Open-Vocabulary Segmentation | Project |
| 2023 | ICCV | diff. | Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models | Project |
| 2023 | ICCV | diff. | Guiding Text-to-Image Diffusion Model Towards Grounded Generation | Project |
| 2023 | NeurIPS | cap., w/o ps. | Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation | Code |
| 2023 | arXiv | vlm. | SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation | Code |
| 2023 | arXiv | vlm., no-train | Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models | N/A |
| 2023 | arXiv | vlm., no-train | Grounding Everything: Emerging Localization Properties in Vision-Language Transformers | Code |
| 2023 | arXiv | vlm. | Open-Vocabulary Segmentation with Semantic-Assisted Calibration | N/A |
| 2023 | arXiv | vlm., no-train | Self-Guided Open-Vocabulary Semantic Segmentation | N/A |
| 2023 | arXiv | no-train., vlm., sam | CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor | Project |
| 2023 | arXiv | vlm. | CLIP-DINOiser: Teaching CLIP a few DINO tricks | Code |
| 2024 | arXiv | vlm., no-train | Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation | Code |
| 2024 | ECCV | vlm., no-train | In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Code |

Instance Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2023 | CVPR | vlm. | Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation | Code |
| 2022 | CVPR | cap., pl., vlm. | Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling | Code |
| 2023 | CVPR | vlm, cap, w/o ps. | Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations | Code |
| 2023 | arXiv | cap. | Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation | Code |
| 2023 | arXiv | cap. | Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation | N/A |

Panoptic Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2023 | CVPR | unify., vlm. | Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation | Code |
| 2022 | arXiv | vlm | Open-Vocabulary Panoptic Segmentation with MaskCLIP | N/A |
| 2023 | CVPR | diff, vlm | Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models | Code |
| 2023 | ICCV | vlm. | Open-vocabulary Panoptic Segmentation with Embedding Modulation | N/A |
| 2023 | NeurIPS | vlm., unify | Hierarchical Open-vocabulary Universal Image Segmentation | Code |
| 2024 | CVPR | vlm., unify, 'open' | OMG-Seg: Is One Model Good Enough For All Segmentation? | Code |

Open Vocabulary Video Understanding

Video Classification

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2021 | arXiv | vlm., open. | ActionCLIP: A New Paradigm for Video Action Recognition | Code |
| 2022 | ECCV | vlm., open. | Prompting Visual-Language Models for Efficient Video Understanding | Project |
| 2022 | ECCV | vlm. | Frozen CLIP Models are Efficient Video Learners | Code |
| 2022 | ECCV | vlm., open. | Expanding Language-Image Pretrained Models for General Video Recognition | Code |
| 2022 | arXiv | vlm., open., audio. | Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models | N/A |
| 2023 | AAAI | vlm., open. | Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | Code |
| 2023 | ICLR | vlm. | AIM: Adapting Image Models for Efficient Video Action Recognition | Project |
| 2023 | CVPR | vlm., open. | Fine-tuned CLIP Models are Efficient Video Learners | Code |
| 2023 | ICML | vlm., open. | Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization | Code |
| 2023 | ICCV | vlm., open. | Video Action Recognition with Attentive Semantic Units | N/A |
| 2023 | ICCV | vlm., open. | MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge | Code |
| 2023 | arXiv | vlm., open. | VicTR: Video-conditioned Text Representations for Activity Recognition | N/A |
| 2023 | arXiv | vlm., open. | Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition | N/A |
| 2024 | NeurIPS | vlm., open. | AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation | Code |

Tracking

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2023 | CVPR | vlm., open. | OVTrack: Open-Vocabulary Multiple Object Tracking | Project |

Video Instance Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2023 | ICCV | vlm., open. | Towards Open-Vocabulary Video Instance Segmentation | Code |
| 2023 | arXiv | vlm., open. | OpenVIS: Open-vocabulary Video Instance Segmentation | N/A |
| 2023 | arXiv | vlm., open. | DVIS++: Improved Decoupled Framework for Universal Video Segmentation | Code |

Open Vocabulary 3D Scene Understanding

3D Classification

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2022 | CVPR | vlm. | PointCLIP: Point Cloud Understanding by CLIP | Code |
| 2023 | CVPR | vlm. | ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding | Code |
| 2023 | ICCV | vlm. | PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | Code |
| 2023 | ICCV | vlm. | CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training | Code |
| 2023 | ICML | vlm. | Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining | Code |
| 2024 | WACV | vlm. | LidarCLIP or: How I Learned to Talk to Point Clouds | Code |

3D Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2022 | arXiv | vlm. | Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning | N/A |
| 2023 | CVPR | vlm. | Open-Vocabulary Point-Cloud Object Detection without 3D Annotation | Code |
| 2023 | NeurIPS | vlm. | CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | Project |
| 2023 | arXiv | vlm. | Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection | N/A |
| 2023 | arXiv | vlm. | FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection | N/A |
| 2023 | arXiv | vlm. | OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection | N/A |

3D Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2023 | CVPR | vlm. | PLA: Language-Driven Open-Vocabulary 3D Scene Understanding | Code |
| 2023 | CVPR | vlm. | CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP | Code |
| 2023 | CVPR | vlm. | OpenScene: 3D Scene Understanding with Open Vocabularies | Project |
| 2023 | ICCVW | vlm. | CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | N/A |
| 2023 | NeurIPS | vlm. | OpenMask3D: Open-Vocabulary 3D Instance Segmentation | Project |
| 2023 | arXiv | vlm. | OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation | Project |
| 2023 | arXiv | vlm. | Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance | Project |
| 2024 | arXiv | vlm. | UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation | Code |
| 2024 | arXiv | vlm. | OpenSU3D: Open World 3D Scene Understanding using Foundation Models | Project |

Related Domains and Beyond

Class-agnostic Detection and Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2022 | RA-L | - | Learning Open-World Object Proposals without Learning to Classify | Code |
| 2021 | ICCV | - | Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation | Project |
| 2022 | CVPR | - | Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity | Project |
| 2022 | ECCV | - | Class-agnostic object detection with multi-modal transformer | Code |
| 2022 | TPAMI | - | Open World Entity Segmentation | Project |
| 2023 | ICCV | - | Fine-Grained Entity Segmentation | Project |
| 2023 | ICCV | bench | SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning | Code |

Open-World Object Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2015 | CVPR | - | Towards Open World Recognition | N/A |
| 2021 | CVPR | - | Towards Open World Object Detection | Code |
| 2022 | CVPR | - | OW-DETR: Open-world Detection Transformer | Code |
| 2022 | ECCV | - | UC-OWOD: Unknown-Classified Open World Object Detection | Code |
| 2022 | arXiv | - | Revisiting Open World Object Detection | Code |
| 2022 | arXiv | - | Rectifying Open-set Object Detection: A Taxonomy, Practical Applications, and Proper Evaluation | N/A |
| 2022 | arXiv | - | Open World DETR: Transformer based Open World Object Detection | N/A |
| 2023 | CVPR | - | PROB: Probabilistic Objectness for Open World Object Detection | Code |
| 2023 | arXiv | - | Open World Object Detection in the Era of Foundation Models | Code |
| 2023 | arXiv | - | Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection | N/A |

Open-Set Panoptic Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
| --- | --- | --- | --- | --- |
| 2021 | CVPR | - | Exemplar-Based Open-Set Panoptic Segmentation Network | Project |
| 2022 | BMVC | - | Dual Decision Improves Open-Set Panoptic Segmentation | Code |

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{wu2023open,
      title={Towards Open Vocabulary Learning: A Survey},
      author={Jianzong Wu and Xiangtai Li and Shilin Xu and Haobo Yuan and Henghui Ding and Yibo Yang and Xia Li and Jiangning Zhang and Yunhai Tong and Xudong Jiang and Bernard Ghanem and Dacheng Tao},
      year={2024},
      journal={T-PAMI},
}

Contact

jzwu@stu.pku.edu.cn
lxtpku@pku.edu.cn or xiangtai94@gmail.com
