ICCV-2023-Papers
Official website: https://iccv2023.thecvf.com/
Workshops :bell: October 2-3, 2023<br>
Main conference :bell: October 4-6, 2023
Click here for survey papers from previous years, organized by category ↘️CV-Surveys (work in progress)
Click here for the 2024 papers organized by category
Click here for the 2023 papers organized by category
↘️CVPR-2023-Papers ↘️WACV-2023-Papers ↘️ICCV-2023-Papers
Click here for the 2022 papers organized by category
Click here for the 2021 papers organized by category
Click here for the 2020 papers organized by category
Table of Contents
💥💥💥ICCV 2023 Award Papers
Best Paper Award (Marr Prize)
- Adding Conditional Control to Text-to-Image Diffusion Models<br>:star:code
- Passive Ultra-Wideband Single-Photon Imaging<br>:star:code
Best Paper Honorable Mention
- Segment Anything<br>:house:project
Best Student Paper Award
<br>:thumbsup:ICCV 2023 dataset roundup (underwater images and video, shadow removal, object detection/tracking/segmentation, interaction, super-resolution, etc.) <br>:thumbsup:ICCV 2023 dataset roundup (human and animal pose, autonomous driving, remote sensing, snow removal, faces, VOS, etc.)
<a name="78"/>78.Sketch
<a name="73"/>73.Spiking Neural Networks
- RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks
- SSF: Accelerating Training of Spiking Neural Networks with Stabilized Spiking Flow
- Inherent Redundancy in Spiking Neural Networks<br>:star:code
- Membrane Potential Batch Normalization for Spiking Neural Networks<br>:star:code
- Unleashing the Potential of Spiking Neural Networks with Dynamic Confidence
- Temporal-Coded Spiking Neural Networks with Dynamic Firing Threshold: Learning with Event-Driven Backpropagation
- Efficient Converted Spiking Neural Network for 3D and 2D Classification
72.Dense Prediction
- Multi-Task Learning with Knowledge Distillation for Dense Prediction
- Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Camera
- EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction
71.Data Augmentation
- HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness
- MixBag: Bag-Level Data Augmentation for Learning from Label Proportions
- When to Learn What: Model-Adaptive Data Augmentation Curriculum
- Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning
70.Active Learning
- TiDAL: Learning Training Dynamics for Active Learning
- HAL3D: Hierarchical Active Learning for Fine-Grained 3D Part Labeling
- Knowledge-Aware Federated Active Learning with Non-IID Data
- Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning
69.Affordance Learning
<a name="68"/>68.Clustering
- Deep Multiview Clustering by Contrasting Cluster Assignments
- Cross-modal Scalable Hyperbolic Hierarchical Clustering
- Cross-view Topology Based Consistent and Complementary Information for Deep Multi-view Clustering
- MHCN: A Hyperbolic Neural Network Model for Multi-view Hierarchical Clustering
- Stable Cluster Discrimination for Deep Clustering
- Unsupervised Manifold Linearizing and Clustering
- Anchor Structure Regularization Induced Multi-view Subspace Clustering via Enhanced Tensor Rank Minimization
- Surface Normal Clustering for Implicit Representation of Manhattan Scenes
67.Open Set Recognition
<a name="66"/>66.Graph Neural Networks
- VertexSerum: Poisoning Graph Neural Networks for Link Inference
- Learning Adaptive Neighborhoods for Graph Neural Networks
- Vision HGNN: An Image is More than a Graph of Nodes
65.Deepfake Detection
- Towards Understanding the Generalization of Deepfake Detectors from a Game-Theoretical View
- Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning
- SeeABLE: Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes
- UCF: Uncovering Common Features for Generalizable Deepfake Detection
64.Scene Understanding
- Shape Anchor Guided Holistic Indoor Scene Understanding
- Efficient Computation Sharing for Multi-Task Visual Scene Understanding
- Ordered Atomic Activity for Fine-grained Interactive Traffic Scenario Understanding
- Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting
- Human-centric Scene Understanding for 3D Large-scale Scenarios
63.Industrial Defect Detection
- Industrial Defect Localization
- Industrial Anomaly Detection
- Crack Detection
62.Group Affect Recognition
<a name="61"/>61.Natural Language Processing(NLP)
- Improved Visual Fine-tuning with Natural Language Supervision
- Tracking by Natural Language Specification with Long Short-term Context Decoupling
60.Visual Localization
- EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization
- Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance
- Yes, we CANN: Constrained Approximate Nearest Neighbors for Local Feature-Based Visual Localization
- OFVL-MS: Once for Visual Localization across Multiple Indoor Scenes
59.Copyright Protection/Information Security
- Towards Robust Model Watermark via Reducing Parametric Vulnerability<br>:star:code
- CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields
58.Scene Flow Estimation
- EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity
- Fast Neural Scene Flow
- Multi-Scale Bidirectional Recurrent Network with Hybrid Correlation for Point Cloud Based Scene Flow Estimation
- IHNet: Iterative Hierarchical Network Guided by High-Resolution Estimated Information for Scene Flow Estimation
57.Semantic Scene Completion
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space<br>:star:code
- DDIT: Semantic Scene Completion via Deformable Deep Implicit Templates
- CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion
- Learning Long-Range Information with Dual-Scale Transformers for Indoor Scene Completion
56.NAS
- ShiftNAS: Improving One-shot NAS via Probability Shift
- ROME: Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation
- Extensible and Efficient Proxy for Neural Architecture Search
- MixPath: A Unified Approach for One-shot Neural Architecture Search
- Unleashing the Power of Gradient Signal-to-Noise Ratio for Zero-Shot NAS
55.Sound
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
- Be Everywhere - Hear Everything (BEE): Audio Scene Reconstruction by Sparse Audio-Visual Samples
- On the Audio-visual Synchronization for Lip-to-Speech Synthesis
- Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing
- DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
- Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation
- Sound Source Localization is All about Cross-Modal Alignment
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
- 去混响
- 唇语识别
- 音频-视频生成
- 音视觉导航
- 声音定位
- 视听分割
54.Gaze Estimation
- DVGaze: Dual-View Gaze Estimation<br>:star:code
53.Computational Imaging (optics, geometry, light-field imaging, etc.)
- Event Camera Data Pre-training
- Deep Geometrized Cartoon Line Inbetweening
- Aperture Diffraction for Compact Snapshot Spectral Imaging
- Examining Autoexposure for Challenging Scenes
- Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction<br>:star:code
- Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes<br>:house:project
- Exploring Positional Characteristics of Dual-Pixel Data for Camera Autofocus
- Enhancing Non-line-of-sight Imaging via Learnable Inverse Kernel and Attention Mechanisms
- On the Robustness of Normalizing Flows for Inverse Problems in Imaging
- Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation
- Light Field
- Camera Pose Estimation
52.View Synthesis
- Forward Flow for Novel View Synthesis of Dynamic Scenes
- iVS-Net: Learning Human View Synthesis from Internet Videos
- Multi-task View Synthesis with Neural Radiance Fields
- IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis
- Generative Novel View Synthesis with 3D-Aware Diffusion Models
- SparseNeRF: Distilling Depth Ranking for Few-shot Novel View Synthesis
- Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis
- Neural LiDAR Fields for Novel View Synthesis
- LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference
- Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis<br>:star:code
- Efficient View Synthesis with Neural Radiance Distribution Field
- NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes<br>:star:code
- PARF: Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis
- Urban Radiance Field Representation with Deformable Neural Mesh Primitives<br>:house:project
- FlipNeRF: Flipped Reflection Rays for Few-shot Novel View Synthesis
- SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image
- A Large-Scale Outdoor Multi-Modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction
- Cross-Ray Neural Radiance Fields for Novel-View Synthesis from Unconstrained Image Collections
- Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models
- NEMTO: Neural Environment Matting for Novel View and Relighting Synthesis of Transparent Objects
51.Visual Place Recognition
- CASSPR: Cross Attention Single Scan Place Recognition
- EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition<br>:star:code
- CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition
- BEVPlace: Learning LiDAR-based Place Recognition using Bird's Eye View Images
50.Image Matching
- Occ2Net: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions<br>:thumbsup:ICCV 2023 | Occ2Net, an effective and robust image matching method for occluded regions based on 3D occupancy estimation
- Scene-Aware Feature Matching
- Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints
- GlueStick: Robust Image Matching by Sticking Points and Lines Together
- Grounded Image Text Matching with Mismatched Relation Reasoning
- Graph Matching with Bi-level Noisy Correspondence
- Feature Matching
49.Image Fusion
- DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion
- MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion
- Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic and Multi-Spectral Image Fusion<br>:star:code
- Degradation-Resistant Unfolding Network for Heterogeneous Image Fusion
- UniFusion: Unified Multi-View Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View
- Multi-Modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion
48.Image Reconstruction
- Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction<br>:star:code
- RawHDR: High Dynamic Range Image Reconstruction from a Single Raw Image
- Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction<br>:star:code
47.Image-to-Image Translation
- General Image-to-Image Translation with One-Shot Image Guidance<br>:star:code
- Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation<br>:star:code
- UGC: Unified GAN Compression for Efficient Image-to-Image Translation
46.Edge Detection
<a name="45"/>45.Scene Graph Generation
- SGAligner: 3D Scene Alignment with Scene Graphs
- Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation
- Compositional Feature Augmentation for Unbiased Scene Graph Generation
- Vision Relation Transformer for Unbiased Scene Graph Generation
- TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
- HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation
- Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
- Visual Traffic Knowledge Graph Generation from Scene Images
44.Rendering
- LiveHand: Real-time and Photorealistic Neural Hand Rendering
- NeMF: Inverse Volume Rendering with Neural Microflake Field
- ActorsNeRF: Animatable Few-shot Human Rendering with Generalizable NeRFs
- HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation
- DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering<br>:house:project
- S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields<br>:star:code<br>:thumbsup:ICCV 2023 | S3IM stochastic structural similarity, a magic loss for boosting NeRF performance
- TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering<br>:star:code
- Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields<br>:star:code
- Rendering Humans from Object-Occluded Monocular Videos<br>:house:project
- ScatterNeRF: Seeing Through Fog with Physically-Based Inverse Neural Rendering
- Neural Microfacet Fields for Inverse Rendering
- 3D-aware Blending with Generative NeRFs
43.Neural Radiance Fields
- Instance Neural Radiance Field
- Adaptive Positional Encoding for Bundle-Adjusting Neural Radiance Fields
- FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models
- NerfAcc: Efficient Sampling Accelerates NeRFs
- MIMO-NeRF: Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields
- UHDNeRF: Ultra-High-Definition Neural Radiance Fields
- Deformable Neural Radiance Fields using RGB and Event Cameras
- Learning Neural Implicit Surfaces with Object-Aware Radiance Fields
- ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field
- HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
- ReNeRF: Relightable Neural Radiance Fields with Nearfield Lighting
- E2NeRF: Event Enhanced Neural Radiance Fields from Blurry Images
- Neural Fields for Structured Lighting
- NeRF-MS: Neural Radiance Fields with Multi-Sequence
- StegaNeRF: Embedding Invisible Information within Neural Radiance Fields
- SHERF: Generalizable Human NeRF from a Single Image
- Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs
- Tetra-NeRF: Representing Neural Radiance Fields Using Tetrahedra
- Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields
- NeRFrac: Neural Radiance Fields through Refractive Surface
- MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos
- Reference-guided Controllable Inpainting of Neural Radiance Fields
- DeLiRa: Self-Supervised Depth, Light, and Radiance Fields
- Neural Radiance Field with LiDAR maps
- Dynamic Mesh-Aware Radiance Fields
- Locally Stylized Neural Radiance Fields
- Generalizable Neural Fields as Partially Observed Neural Processes
- DeformToon3D: Deformable Neural Radiance Fields for 3D Toonification<br>:house:project<br>:star:code
- Robust e-NeRF: NeRF from Sparse & Noisy Events under Non-Uniform Motion<br>:star:code
- Pose-Free Neural Radiance Fields via Implicit Pose Regularization
- Canonical Factors for Hybrid Neural Fields<br>:star:code
- Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor<br>:star:code
- Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields
- Strata-NeRF : Neural Radiance Fields for Stratified Scenes<br>:star:code
- DReg-NeRF: Deep Registration for Neural Radiance Fields<br>:star:code
- Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields<br>:house:project<br>:star:code
- WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields
- UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields
- LERF: Language Embedded Radiance Fields
- Strivec: Sparse Tri-Vector Radiance Fields
- Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering
42.Dataset/Benchmark
- Datasets
- Building3D: An Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds<br>:sunflower:dataset<br>:thumbsup:ICCV 2023 | Building3D, the first city-scale dataset for building modeling from aerial point clouds
- LoTE-Animal: A Long Time-span Dataset for Endangered Animal Behavior Understanding
- Beyond the Pixel: a Photometrically Calibrated HDR Dataset for Luminance and Color Prediction
- Atmospheric Transmission and Thermal Inertia Induced Blind Road Segmentation with a Large-Scale Dataset TBRSD
- H3WB: Human3.6M 3D WholeBody Dataset and Benchmark
- V3Det: Vast Vocabulary Visual Detection Dataset
- HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World
- Zenseact Open Dataset: A Large-Scale and Diverse Multimodal Dataset for Autonomous Driving
- FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods
- Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos
- RealGraph: A Multiview Dataset for 4D Real-world Context Graph Generation
- Video Background Music Generation: Dataset, Method and Evaluation
- Thinking Image Color Aesthetics Assessment: Models, Datasets and Benchmarks
- Snow Removal in Video: A New Dataset and A Novel Method
- SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes
- EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes
- DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
- PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking
- SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling
- MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
- Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning
- 3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets
- MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond
- LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark<br>:star:code
- EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding<br>:star:code
- Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations<br>:house:project
- High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net
- ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes<br>:house:project<br>:star:code
- Learning Optical Flow from Event Camera with Rendered Dataset
- Efficient neural supersampling on a novel gaming dataset
- AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception<br>:star:code
- 360VOT: A New Benchmark Dataset for Omnidirectional Visual Object Tracking<br>:house:project<br>:star:code
- Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning<br>:house:project
- FishNet: A Large-scale Dataset and Benchmark for Fish Recognition, Detection, and Functional Trait Prediction
- Benchmarks
- Towards Real-world Burst Image Super-Resolution: Benchmark and Method<br>:star:code<br>:thumbsup:ICCV 2023 | FBANet: towards real-world burst (multi-frame) super-resolution
- SQAD: Automatic Smartphone Camera Quality Assessment and Benchmarking
- ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes
- Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
- From Sky to the Ground: A Large-scale Benchmark and Simple Baseline Towards Real Rain Removal
- ChildPlay: A New Benchmark for Understanding Children's Gaze Behaviour
- PlanarTrack: A Large-scale Challenging Benchmark for Planar Object Tracking
- OmniLabel: A Challenging Benchmark for Language-Based Object Detection
- OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception
- HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
- Beyond Object Recognition: A New Benchmark towards Object Concept Learning
- ClothPose: A Real-world Benchmark for Visual Analysis of Garment Pose via An Indirect Recording Solution
- REAP: A Large-Scale Realistic Adversarial Patch Benchmark
- Chaotic World: A Large and Challenging Benchmark for Human Behavior Understanding in Chaotic Events
- Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
- FACET: Fairness in Computer Vision Evaluation Benchmark
- A Benchmark for Chinese-English Scene Text Image Super-Resolution
- Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark
- COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts<br>:star:code
- Dancing in the Dark: A Benchmark towards General Low-light Video Enhancement
- DiLiGenT-Pi: Photometric Stereo for Planar Surfaces with Rich Details - Benchmark Dataset and Beyond
- Methods
- Prototype-based Dataset Comparison<br>:star:code
41.Vision Transformers
- Masked Spiking Transformer
- Scale-Aware Modulation Meet Transformer
- BiViT: Extremely Compressed Binary Vision Transformers
- Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
- FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization
- Multimodal High-order Relation Transformer for Scene Boundary Detection
- GET: Group Event Transformer for Event-Based Vision
- DiffRate : Differentiable Compression Rate for Efficient Vision Transformers
- Scratching Visual Transformer's Back with Uniform Attention
- Skill Transformer: A Monolithic Policy for Mobile Manipulation
- A Multidimensional Analysis of Social Biases in Vision Transformers
- Token-Label Alignment for Vision Transformers
- Building Vision Transformers with Hierarchy Aware Feature Aggregation
- TripLe: Revisiting Pretrained Model Reuse and Progressive Learning for Efficient Vision Transformer Scaling and Searching
- DarSwin: Distortion Aware Radial Swin Transformer
- Robustifying Token Attention for Vision Transformers
- FLatten Transformer: Vision Transformer using Focused Linear Attention
- Detection Transformer with Stable Matching
- LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization
- M2T: Masking Transformers Twice for Faster Decoding
- FDViT: Improve the Hierarchical Architecture of Vision Transformer
- Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers
- Rethinking Vision Transformers for MobileNet Size and Speed
- Structure Invariant Transformation for better Adversarial Transferability<br>:star:code
- SG-Former: Self-guided Transformer with Evolving Token Reallocation<br>:star:code
- Pre-training Vision Transformers with Very Limited Synthesized Images
- SMMix: Self-Motivated Image Mixing for Vision Transformers
- Revisiting Vision Transformer from the View of Path Ensemble
- SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM<br>:star:code
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
- Contrastive Feature Masking Open-Vocabulary Vision Transformer
- MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer
- SAL-ViT: Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
- MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention
40.Anomaly Detection
- Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection
- Anomaly Detection Under Distribution Shift
- Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model
- Anomaly Detection using Score-based Perturbation Resilience
- Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection
- Template-guided Hierarchical Feature Restoration for Anomaly Detection
- Inter-Realization Channels: Unsupervised Anomaly Detection Beyond One-Class Classification
- Image Anomaly Detection
- OOD
- CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No<br>:star:code
- Meta OOD Learning for Continuously Adaptive OOD Detection
- Simple and Effective Out-of-Distribution Detection via Cosine-based Softmax Loss
- Nearest Neighbor Guidance for Out-of-Distribution Detection<br>:star:code
- Understanding the Feature Norm for Out-of-Distribution Detection
- SAFE: Sensitivity-Aware Features for Out-of-Distribution Object Detection
- Unified Out-Of-Distribution Detection: A Model-Specific Perspective
- Revisit PCA-based Technique for Out-of-Distribution Detection
- Hierarchical Visual Categories Modeling: A Joint Representation Learning and Density Estimation Framework for Out-of-Distribution Detection
- WDiscOOD: Out-of-Distribution Detection via Whitened Linear Discriminant Analysis
39.Keypoint Detection
- Neural Interactive Keypoint Detection<br>:star:code
- 3D Implicit Transporter for Temporally Consistent Keypoint Discovery<br>:star:code
38.Vision-Language
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
- ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
- Gloss-Free Sign Language Translation: Improving from Visual-Language Pretraining
- SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- Bayesian Prompt Learning for Image-Language Model Generalization
- eP-ALM: Efficient Perceptual Augmentation of Language Models
- Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
- SLAN: Self-Locator Aided Network for Vision-Language Understanding
- Borrowing Knowledge From Pre-trained Language Model: A New Data-efficient Visual Learning Paradigm
- VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching
- A Retrospect to Multi-prompt Learning across Vision and Language
- CiT: Curation in Training for Effective Vision-Language Data
- EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
- Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models
- Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts
- Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models
- Perceptual Grouping in Contrastive Vision-Language Models
- Black Box Few-Shot Adaptation for Vision-Language Models
- GrowCLIP: Data-Aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-Training
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
- Too Large; Data Reduction for Vision-Language Pre-Training
- Equivariant Similarity for Vision-Language Foundation Models
- Going Beyond Nouns With Vision & Language Models Using Synthetic Data
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
- Unified Visual Relationship Detection with Vision and Language Models
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models
- Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
- Distribution-Aware Prompt Tuning for Vision-Language Models<br>:star:code
- LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models<br>:star:code
- CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation
- RLIPv2: Fast Scaling of Relational Language-Image Pre-training<br>:star:code
- Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models<br>:star:code
- Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?<br>:star:code
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models<br>:thumbsup:ICCV 2023 Oral | SUSTech VIP Lab | a set-level guidance attack against VLP models
- CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation<br>:star:code
- VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control<br>:star:code
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
- VLSlice: Interactive Vision-and-Language Slice Discovery<br>:house:project
- What does CLIP know about a red circle? Visual prompt engineering for VLMs
- BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization
- Visual Representation Learning
- VLN
- Learning Vision-and-Language Navigation from YouTube Videos<br>:star:code
- Learning Navigational Visual Representations with Semantic Map Supervision
- Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation
- GridMM: Grid Memory Map for Vision-and-Language Navigation
- Scaling Data Generation in Vision-and-Language Navigation
- Bird's-Eye-View Scene Graph for Vision-Language Navigation<br>:star:code
- AerialVLN: Vision-and-Language Navigation for UAVs<br>:star:code
- DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation<br>:star:code
- VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
- March in Chat: Interactive Prompting for Remote Embodied Referring Expression
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation<br>:star:code
- Visual Grounding
- Video-Language
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
- Learning Trajectory-Word Alignments for Video-Language Tasks
- HiVLP: Hierarchical Interactive Video-Language Pre-Training
- Verbs in Action: Improving Verb Understanding in Video-Language Models
- Exploring Temporal Concurrency for Video-Language Representation Learning
- EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
- SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training
- Visual Reasoning
- LLM
37.Object Pose Estimation
- IST-Net: Prior-Free Category-Level Pose Estimation with Implicit Space Transformation
- LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs
- PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
- Nonrigid Object Contact Estimation With Regional Unwrapping Transformer
- 6D
- Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation
- SOCS: Semantically-Aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations
- Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation
- Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation
- Center-Based Decoupled Point-cloud Registration for 6D Object Pose Estimation
- Query6DoF: Learning Sparse Queries as Implicit Shape Prior for Category-Level 6DoF Pose Estimation
- VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations<br>:star:code
- Learning Symmetry-Aware Geometry Correspondences for 6D Object Pose Estimation
- 3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation
- Object Counting
- Animal Pose Estimation
36.Visual Question Answering
- Toward Unsupervised Realistic Visual Question Answering
- Variational Causal Inference Network for Explanatory Visual Question Answering
- VQA Therapy: Exploring Answer Differences by Visually Grounding Answers
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
- Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories
- PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3
- Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering
- Video-QA
- Discovering Spatio-Temporal Rationales for Video Question Answering<br>:star:code
- Knowledge Proxy Intervention for Deconfounded Video Question Answering
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models<br>:star:code
- Video Intent Reasoning
35.Human Motion Prediction
- Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction<br>:star:code
- Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders<br>:star:code
- Priority-Centric Human Motion Generation in Discrete Latent Space
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling
- HumanMAC: Masked Motion Completion for Human Motion Prediction
- Joint-Relation Transformer for Multi-Person Motion Prediction
- Bootstrap Motion Forecasting With Self-Consistent Constraints
- PhysDiff: Physics-Guided Human Motion Diffusion Model
- AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism
- BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction
- Social Diffusion: Long-term Multiple Human Motion Anticipation
- SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation
34.Action Detection/Recognition
- Multimodal Distillation for Egocentric Action Recognition
- Memory-and-Anticipation Transformer for Online Action Understanding<br>:star:code
- Masked Motion Predictors are Strong 3D Action Representation Learners<br>:star:code
- Efficient Video Action Detection with Token Dropout and Context Refinement
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
- E2E-LOAD: End-to-End Long-form Online Action Detection
- Ego-Only: Egocentric Action Detection without Exocentric Transferring
- Cross-Modal Learning with 3D Deformable Attention for Action Recognition
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
- STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition
- MiniROAD: Minimal RNN Framework for Online Action Detection
- Video Action Recognition with Attentive Semantic Units
- A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition
- What Can a Cook in Italy Teach a Mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
- Skeleton-Based Action Recognition
- LAC -- Latent Action Composition for Skeleton-based Action Segmentation
- Generative Action Description Prompts for Skeleton-based Action Recognition
- Parallel Attention Interaction Network for Few-Shot Skeleton-Based Action Recognition
- Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition
- Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition
- Modeling the Relative Visual Tempo for Self-supervised Skeleton-based Action Recognition
- SkeleTR: Towards Skeleton-based Action Recognition in the Wild
- Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with Skeleton-Motion-Informed Gradient
- FSAR: Federated Skeleton-based Action Recognition with Adaptive Topology Structure and Knowledge Distillation
- Open-Set Action Recognition
- Zero-Shot Action Recognition
- Few-Shot Action Recognition
- Temporal Action Localization
- DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization<br>:star:code
- Movement Enhancement toward Multi-Scale Video Feature Representation for Temporal Action Detection
- Self-Feedback DETR for Temporal Action Detection
- Action Sensitivity Learning for Temporal Action Localization
- Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization
- Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization
- Weakly Supervised Action Localization
- Few-Shot Action Localization
- Action Understanding
33.Video
- Neural Video Depth Stabilizer
- NPC: Neural Point Characters from Video
- Localizing Moments in Long Video Via Multimodal Guidance
- Order-Prompted Tag Sequence Generation for Video Tagging
- Moment Detection in Long Tutorial Videos
- MMVP: Motion-Matrix-based Video Prediction<br>:star:code
- D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation<br>:star:code
- LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction
- TALL: Thumbnail Layout for Deepfake Video Detection
- Spatio-temporal Prompting Network for Robust Video Feature Extraction
- Neural Reconstruction of Relightable Human Model from Monocular Video
- Video Understanding
- Video Classification
- Video Synthesis
- StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation<br>:house:project<br>:star:code
- Mixed Neural Voxels for Fast Multi-view Video Synthesis
- WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
- StyleLipSync: Style-based Personalized Lip-sync Video Generation
- Text2Performer: Text-Driven Human Video Generation
- DreamPose: Fashion Video Synthesis with Stable Diffusion
- Structure and Content-Guided Video Synthesis with Diffusion Models
- Video Stabilization
- Video Grounding
- G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
- UniVTG: Towards Unified Video-Language Temporal Grounding<br>:star:code
- Knowing Where to Focus: Event-aware Transformer for Video Grounding<br>:star:code
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos
- Video Segmentation
- XMem++: Production-level Video Segmentation From Few Annotated Frames
- Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation<br>:star:code
- GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation<br>:star:code
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions<br>:star:code<br>:thumbsup:ICCV 2023 | MeViS, a new dataset for video segmentation guided by motion expressions
- MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
- Tracking Anything with Decoupled Video Segmentation<br>:star:code
- The Making and Breaking of Camouflage
- Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation
- Video Correspondence
- Video Perception
- Video Recognition
- Audio-Visual Glance Network for Efficient Video Recognition
- Efficient Decision-based Black-box Patch Attacks on Video Recognition
- Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition<br>:star:code
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition
- Video Inpainting
- Video Representation Learning
- VAD
- Video Localization
- UnLoc: A Unified Framework for Video Localization Tasks<br>:star:code
- Video OWL-ViT: Temporally-consistent open-world localization in video
- Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection
- TeD-SPAD: Temporal Distinctiveness for Self-Supervised Privacy-Preservation for Video Anomaly Detection
- Video Prediction
- Video Glass Segmentation
- Video Frame Interpolation
- Video Semantic Compression
- Video-to-Video Translation
32.Sign Language Recognition
- Human Part-wise 3D Motion Context Learning for Sign Language Recognition
- CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition
- Improving Continuous Sign Language Recognition with Cross-Lingual Signs
- C2ST: Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition
- Sign Language Translation
31.Human-Object Interaction
- Full-Body Articulated Human-Object Interaction
- Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory
- Learning Human-Human Interactions in Images from Weak Textual Supervision
- Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction
- Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection
- Agglomerative Transformer for Human-Object Interaction Detection
- InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion<br>:star:code
- Exploring Predicate Visual Context in Detecting of Human-Object Interactions
- Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning
- Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting
- Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models
- Hand-Object Interaction
- EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding<br>:house:project
- Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips<br>:star:code
- AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose<br>:star:code
- Novel-View Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views
30.SLAM/Augmented Reality/Virtual Reality/Robotics
- Virtual Human/Avatar Generation
- MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions
- GETAvatar: Generative Textured Meshes for Animatable Human Avatars
- NSF: Neural Surface Fields for Human Modeling from Monocular Depth<br>:house:project
- AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control
- DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars
- Robotics
- AR/VR
- SLAM
- GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction<br>:star:code
- Point-SLAM: Dense Neural Point Cloud-based SLAM
- NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping
- MV-Map: Offboard HD-Map Generation with Multi-view Consistency
- Virtual Try-On
29.Autonomous Vehicles
- HM-ViT: Hetero-Modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer<br>:star:code (https://github.com/XHwind/HM-ViT)
- TransIFF: An Instance-Level Feature Fusion Framework for Vehicle-Infrastructure Cooperative 3D Detection with Transformers
- Learning Human Dynamics in Autonomous Driving Scenarios
- Autonomous Driving
- Improving Online Lane Graph Extraction by Object-Lane Clustering
- SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
- Hidden Biases of End-to-End Driving Models
- Optimizing the Placement of Roadside LiDARs for Autonomous Driving
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving
- Domain Generalization of 3D Semantic Segmentation in Autonomous Driving
- Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
- DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving<br>:star:code
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving<br>:house:project
- Does Physical Adversarial Example Really Matter to Autonomous Driving? Towards System-Level Effect of Adversarial Object Evasion Attack
- Towards Viewpoint Robustness in Bird's Eye View Segmentation<br>:star:code
- GameFormer: Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction and Planning for Autonomous Driving
- MV-DeepSDF: Implicit Modeling with Multi-Sweep Point Clouds for 3D Vehicle Reconstruction in Autonomous Driving
- Trajectory Prediction
- ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation<br>:star:code
- INT2: Interactive Trajectory Prediction at Intersections
- Semi-supervised Semantics-guided Adversarial Training for Robust Trajectory Prediction
- EigenTrajectory: Low-Rank Descriptors for Multi-Modal Trajectory Forecasting
- Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction<br>:star:code
- Learn TAROT with MENTOR: A Meta-Learned Self-Supervised Approach for Trajectory Prediction
- Sparse Instance Conditioned Multimodal Trajectory Prediction
- Trajectory Unified Transformer for Pedestrian Trajectory Prediction
- BiFF: Bi-level Future Fusion with Polyline-based Coordinate for Interactive Trajectory Prediction
- Joint Metrics Matter: A Better Standard for Trajectory Forecasting
- Traj-MAE: Masked Autoencoders for Trajectory Prediction
- TrajPAC: Towards Robustness Verification of Pedestrian Trajectory Prediction Models
- Lane Detection
- PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images<br>:star:code
- ADNet: Lane Shape Prediction via Anchor Decomposition<br>:star:code
- LATR: 3D Lane Detection from Monocular Images with Transformer<br>:star:code
- Recursive Video Lane Detection<br>:star:code
- Sparse Point Guided 3D Lane Detection
- Generating Dynamic Kernels via Transformers for Lane Detection
28.Style Transfer
- All-to-Key Attention for Arbitrary Style Transfer
- AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks<br>:star:code
- Creative Birds: Self-Supervised Single-View 3D Style Transfer<br>:star:code
- StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
- Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer
- Two Birds, One Stone: A Unified Framework for Joint Learning of Image and Video Style Transfers
- StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model
- Hair Style Transfer
- Text-Driven 3D Stylization
27.Self/Semi-Supervised Learning
- Fully Supervised Learning
- Unsupervised Learning
- Weakly Supervised Learning
- Self-Supervised Learning
- Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations
- Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization
- Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
- Geometrized Transformer for Self-Supervised Homography Estimation
- LightDepth: Single-View Depth Self-Supervision from Illumination Decline
- Contactless Pulse Estimation Leveraging Pseudo Labels and Self-Supervision
- CROSSFIRE: Camera Relocalization On Self-Supervised Features from an Implicit Representation
- An Embarrassingly Simple Backdoor Attack on Self-supervised Learning
- Learning by Sorting: Self-supervised Learning with Group Ordering Constraints
- Representation Uncertainty in Self-Supervised Learning as Variational Inference
- L-DAWA: Layer-wise Divergence Aware Weight Aggregation in Federated Self-Supervised Visual Representation Learning
- Randomized Quantization: A Generic Augmentation for Data Agnostic Self-supervised Learning
- Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos<br>:star:code
- Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive
- Multi-Label Self-Supervised Learning with Scene Images
- Weakly Supervised Learning of Semantic Correspondence through Cascaded Online Correspondence Refinement
- Contrastive Continuity on Augmentation Stability Rehearsal for Continual Self-Supervised Learning
- Semi-Supervised Learning
- Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning<br>:star:code
- SimMatchV2: Semi-Supervised Learning with Graph Consistency
- Rethinking Safe Semi-supervised Learning: Transferring the Open-set Problem to A Close-set One
- Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning
- Semi-Supervised Learning via Weight-Aware Distillation under Class Distribution Mismatch
- Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery
- SSB: Simple but Strong Baseline for Boosting Performance of Open-Set Semi-Supervised Learning
- The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning
- A Soft Nearest-Neighbor Framework for Continual Semi-Supervised Learning
- Towards Semi-supervised Learning with Non-random Missing Labels<br>:star:code
- Diverse Cotraining Makes Strong Semi-Supervised Segmentor<br>:star:code
- Semi-Supervised Learning via Weight-aware Distillation under Class Distribution Mismatch<br>:star:code
- IOMatch: Simplifying Open-Set Semi-Supervised Learning with Joint Inliers and Outliers Utilization<br>:star:code
26.Machine Learning
- SAFE: Machine Unlearning With Shard Graphs
- Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning
- Adversarial Learning
- ACTIVE: Towards Highly Transferable 3D Physical Camouflage for Universal and Robust Vehicle Evasion<br>:star:code
- Towards Building More Robust Models with Frequency Bias
- Backpropagation Path Search On Adversarial Transferability
- Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation
- AdaptGuard: Defending Against Universal Attacks for Model Adaptation
- Explaining Adversarial Robustness of Neural Networks from Clustering Effect Perspective
- Black-Box Attacks
- Adversarial Examples
- Adversarial Attacks
- An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability
- Unified Adversarial Patch for Cross-Modal Attacks in the Physical World
- Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence
- How to Choose your Best Allies for a Transferable Attack?
- LEA2: A Lightweight Ensemble Adversarial Attack via Non-overlapping Vulnerable Frequency Regions
- RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World
- Distracting Downpour: Adversarial Weather Attacks for Motion Estimation
- Boosting Adversarial Transferability via Gradient Relevance Attack
- SAGA: Spectral Adversarial Geometric Attack on 3D Meshes
- F&F Attack: Adversarial Attack against Multiple Object Trackers by Inducing False Negatives and False Positives
- Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients
- Adversarial Training
- Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning<br>:star:code
- Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff
- Advancing Example Exploitation Can Alleviate Critical Challenges in Adversarial Training
- Fast Adversarial Training with Smooth Convergence
- TRM-UAP: Enhancing the Transferability of Data-Free Universal Adversarial Perturbation via Truncated Ratio Maximization
- Backdoor Attacks
- Class Incremental Learning
- Proxy Anchor-based Unsupervised Learning for Continuous Generalized Category Discovery
- Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning
- Prototype Reminiscence and Augmented Asymmetric Knowledge Aggregation for Non-Exemplar Class-Incremental Learning
- Dynamic Residual Classifier for Class Incremental Learning
- First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning
- Self-Organizing Pathway Expansion for Non-Exemplar Class-Incremental Learning
- Knowledge Restore and Transfer for Multi-Label Class-Incremental Learning
- Audio-Visual Class-Incremental Learning<br>:star:code
- Masked Autoencoders are Efficient Class Incremental Learners<br>:star:code
- Heterogeneous Forgetting Compensation for Class-Incremental Learning<br>:star:code
- Class-Incremental Grouping Network for Continual Audio-Visual Learning<br>:star:code
- Space-time Prompting for Video Class-incremental Learning
- Multi-Task Learning
- Efficient Controllable Multi-Task Architectures
- Deep Multitask Learning with Progressive Parameter Sharing
- AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts
- MAS: Towards Resource-Efficient Federated Multiple-Task Learning<br>:thumbsup:ICCV 2023 | how to perform federated multi-task learning under resource constraints
- FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration
- Vision Transformer Adapters for Generalizable Multitask Learning<br>:star:code
- TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts<br>:star:code
- Achievement-Based Training Progress Balancing for Multi-Task Learning
- Continual Learning
- CLR: Channel-wise Lightweight Reprogramming for Continual Learning<br>:star:code
- Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right?
- Self-Evolved Dynamic Expansion Model for Task-Free Continual Learning
- Wasserstein Expansible Variational Autoencoder for Discriminative and Generative Continual Learning
- Measuring Asymmetric Gradient Discrepancy in Parallel Continual Learning
- Data Augmented Flatness-aware Gradient Projection for Continual Learning
- A Unified Continual Learning Framework with General Parameter-Efficient Tuning
- Growing a Brain with Sparsity-Inducing Generation for Continual Learning
- Towards Realistic Evaluation of Industrial Continual Learning Scenarios with an Emphasis on Energy Consumption and Computational Footprint
- ICICLE: Interpretable Class Incremental Continual Learning
- CLNeRF: Continual Learning Meets NeRF<br>:star:code<br>:house:project
- SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model
- TARGET: Federated Class-Continual Learning via Exemplar-Free Distillation
- Instance and Category Supervision are Alternate Learners for Continual Learning
- Exemplar-Free Continual Transformer with Convolutions
- Online Prototype Learning for Online Continual Learning<br>:star:code
- CBA: Improving Online Continual Learning via Continual Bias Adaptor
- Introducing Language Guidance in Prompt-based Continual Learning
- NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization for Continual Learning<br>:star:code
- Generating Instance-level Prompts for Rehearsal-free Continual Learning
- Online Continual Learning on Hierarchical Label Expansion
- Few-shot Continual Infomax Learning
- Incremental Learning
- Federated Learning
- When Do Curricula Work in Federated Learning?
- Reducing Training Time in Cross-Silo Federated Learning Using Multigraph Topology
- Robust Heterogeneous Federated Learning under Data Corruption
- No Fear of Classifier Biases: Neural Collapse Inspired Federated Learning with Synthetic and Fixed Classifier
- Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat
- Enhancing Privacy Preservation in Federated Learning via Learning Rate Perturbation
- Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels
- Holistic Geometric Feature Learning for Structured Reconstruction
- ProtoFL: Unsupervised Federated Learning via Prototypical Distillation
- Multi-Metrics Adaptively Identifies Backdoors in Federated Learning
- zPROBE: Zero Peek Robustness Checks for Federated Learning
- PGFed: Personalize Each Client's Global Objective for Federated Learning
- Communication-efficient Federated Learning with Single-Step Synthetic Features Compressor for Faster Convergence
- Towards Instance-adaptive Inference for Federated Learning
- Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation
- FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning
- Workie-Talkie: Accelerating Federated Learning by Overlapping Computing and Communications via Contrastive Regularization
- Generative Gradient Inversion via Over-Parameterized Networks in Federated Learning
- FRAug: Tackling Federated Learning with Non-IID Features via Representation Augmentation
- Bold but Cautious: Unlocking the Potential of Personalized Federated Learning through Cautiously Aggressive Collaboration
- Reinforcement Learning
- Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation
- ReLeaPS : Reinforcement Learning-based Illumination Planning for Generalized Photometric Stereo
- Environment Agnostic Representation for Visual Reinforcement Learning
- PolicyCleanse: Backdoor Detection and Mitigation for Competitive Reinforcement Learning
- GAIT: Generating Aesthetic Indoor Tours with Deep Reinforcement Learning
- RLSAC: Reinforcement Learning Enhanced Sample Consensus for End-to-End Robust Estimation
- Simoun: Synergizing Interactive Motion-appearance Understanding for Vision-based Reinforcement Learning
- DISeR: Designing Imaging Systems with Reinforcement Learning<br>:star:code
- Learning to Identify Critical States for Reinforcement Learning from Videos<br>:star:code
- Towards Attack-tolerant Federated Learning via Critical Parameter Analysis
- Stabilizing Visual Reinforcement Learning via Asymmetric Interactive Cooperation
- Transfer Learning
- Disposable Transfer Learning for Selective Source Task Unlearning
- Towards Inadequately Pre-trained Models in Transfer Learning
- Distilling from Similar Tasks for Transfer Learning on a Budget
- Building a Winning Team: Selecting Source Model Ensembles using a Submodular Transferability Estimation Approach
- Exploring Model Transferability through the Lens of Potential Energy<br>:star:code
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning<br>:star:code
- 元学习
- Meta-ZSDETR: Zero-shot DETR with Meta-learning
- Enhanced Meta Label Correction for Coping with Label Corruption
- A Simple Recipe to Meta-Learn Forward and Backward Transfer
- 度量学习
- Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning
- Learning with Diversity: Self-Expanded Equalization for Better Generalized Deep Metric Learning
- 多模态学习
- Preserving Modality Structure Improves Multi-Modal Learning<br>:star:code
- Distribution-Consistent Modal Recovering for Incomplete Multimodal Learning
- DG3D: Generating High Quality 3D Textured Shapes by Learning to Discriminate Multi-Modal Diffusion-Renderings
- Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study
- 对比学习
- Contrastive Learning Relies More on Spatial Inductive Bias Than Supervised Learning: An Empirical Study
- CL-MVSNet: Unsupervised Multi-View Stereo with Dual-Level Contrastive Learning
- Semantic Information in Contrastive Learning
- Pre-Training-Free Image Manipulation Localization through Non-Mutually Exclusive Contrastive Learning
- CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
- One-Shot Recognition of Any Material Anywhere Using Contrastive Learning with Physics-Based Rendering
- All4One: Symbiotic Neighbour Contrastive Learning via Self-Attention and Redundancy Reduction
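
The contrastive learning entries above largely build on InfoNCE-style objectives. A minimal sketch of that loss in PyTorch, assuming a generic two-view setup and an arbitrary temperature of 0.1:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # (N, N) cosine-similarity logits
    targets = torch.arange(z1.size(0))       # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```
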
- 机器遗忘
25.Model Compression/Knowledge Distillation/Pruning(模型压缩/知识蒸馏/剪枝)
- ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices
- Tangent Model Composition for Ensembling and Continual Fine-tuning
- 量化
- EMQ: Evolving Training-free Proxies for Automated Mixed Precision Quantization
- I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers
- Causal-DFQ: Causality Guided Data-free Network Quantization<br>:star:code
- DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization
- EQ-Net: Elastic Quantization Neural Networks
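
The quantization entries above mostly extend post-training or quantization-aware schemes whose common core is uniform affine quantization. A minimal simulate-quantize-dequantize sketch in NumPy, with an 8-bit width and per-tensor scale chosen purely for illustration:

```python
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Simulate uniform affine (asymmetric) quantization of a weight tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)
print("max abs error:", np.abs(w - quantize_dequantize(w)).max())
```
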
- 剪枝
- Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning
- Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information Bottleneck Principle
- Structural Alignment for Network Pruning through Partial Regularization
- Differentiable Transportation Pruning
- Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks
- Towards Fairness-aware Adversarial Network Pruning
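
As a baseline for the pruning entries above, here is a minimal global magnitude-pruning sketch in PyTorch; the 50% sparsity target and the toy MLP are arbitrary assumptions.

```python
import torch

def magnitude_prune(model, sparsity=0.5):
    """Zero out the globally smallest-magnitude weights of all >1-D parameters."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    for p in model.parameters():
        if p.dim() > 1:
            p.data.mul_((p.detach().abs() > threshold).float())

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
magnitude_prune(model, sparsity=0.5)
```
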
- 轻量级网络
- 知识蒸馏
- DOT: A Distillation-Oriented Trainer
- ORC: Network Group-based Knowledge Distillation using Online Role Change
- Class-relation Knowledge Distillation for Novel Class Discovery
- Beyond the Limitation of Monocular 3D Detector via Knowledge Distillation
- FerKD: Surgical Label Adaptation for Efficient Distillation
- Masked Autoencoders Are Stronger Knowledge Distillers
- DETRDistill: A Universal Knowledge Distillation Framework for DETR-families
- Distribution Shift Matters for Knowledge Distillation with Webly Collected Images
- Automated Knowledge Distillation via Monte Carlo Tree Search
- Cumulative Spatial Knowledge Distillation for Vision Transformers
- Data-free Knowledge Distillation for Fine-grained Visual Categorization
- Multi-Label Knowledge Distillation<br>:star:code
- DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation
- Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class and Between-Class Knowledge Distillation
- From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels
- Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation
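
Most of the distillation entries above start from the classic response-based KD objective. A minimal Hinton-style sketch in PyTorch, where the temperature `T=4.0` and weight `alpha=0.7` are generic assumptions:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a softened KL term against the teacher with the usual CE term."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

loss = kd_loss(torch.randn(8, 100), torch.randn(8, 100),
               torch.randint(0, 100, (8,)))
```
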
- 模型压缩
24.Few/Zero-Shot Learning/Domain Generalization/Adaptation(小/零样本/域泛化/域适应)
- 域适应
- Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial Perturbation without Source Samples
- Rethinking the Role of Pre-Trained Networks in Source-Free Domain Adaptation
- Energy-based Self-Training and Normalization for Unsupervised Domain Adaptation
- SSDA: Secure Source-Free Domain Adaptation
- Black-Box Unsupervised Domain Adaptation with Bi-Directional Atkinson-Shiffrin Memory
- PODA: Prompt-driven Zero-shot Domain Adaptation
- Local Context-Aware Active Domain Adaptation
- Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation
- Augmenting and Aligning Snippets for Few-Shot Video Domain Adaptation
- Improved Knowledge Transfer for Semi-Supervised Domain Adaptation via Trico Training Strategy
- Universal Domain Adaptation via Compressive Attention Matching
- Unsupervised Domain Adaptation for Training Event-Based Networks Using Contrastive Learning and Uncorrelated Conditioning
- PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation
- Homeomorphism Alignment for Unsupervised Domain Adaptation
- Towards Effective Instance Discrimination Contrastive Loss for Unsupervised Domain Adaptation
- Order-preserving Consistency Regularization for Domain Adaptation and Generalization
- Complementary Domain Adaptation and Generalization for Unsupervised Continual Domain Shift Learning
- One-Shot Generative Domain Adaptation
- StyleDomain: Efficient and Lightweight Parameterizations of StyleGAN for One-shot and Few-shot Domain Adaptation
- SFHarmony: Source Free Domain Adaptation for Distributed Neuroimaging Analysis
- Domain Adaptive Few-Shot Open-Set Learning
- Confidence-based Visual Dispersal for Few-shot Unsupervised Domain Adaptation<br>:star:code
- LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation<br>:star:code
- Unsupervised Domain Adaptive Detection with Network Stability Analysis<br>:star:code
- The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation<br>:star:code
- Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation<br>:house:project
- DomainAdaptor: A Novel Approach to Test-time Adaptation<br>:star:code
- GeT: Generative Target Structure Debiasing for Domain Adaptation<br>:star:code
- Towards Better Robustness against Common Corruptions for Unsupervised Domain Adaptation
- Bidirectional Alignment for Domain Adaptive Detection with Transformers
- Text-Driven Generative Domain Adaptation with Spectral Consistency Regularization
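
Many of the unsupervised and source-free domain adaptation entries above rely on confidence-thresholded pseudo-labelling as a building block. A minimal sketch in PyTorch, assuming a generic classifier interface and an arbitrary 0.9 threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, target_images, threshold=0.9):
    """Keep only target samples whose predicted class is sufficiently confident."""
    probs = F.softmax(model(target_images), dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf >= threshold
    return target_images[keep], labels[keep]

# Toy usage with a stand-in classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images, labels = pseudo_label(model, torch.randn(16, 3, 32, 32))
```
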
- 域泛化
- Flatness-Aware Minimization for Domain Generalization
- Cross Contrasting Feature Perturbation for Domain Generalization
- Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters
- DomainDrop: Suppressing Domain-Sensitive Channels for Domain Generalization<br>:star:code
- Domain Generalization via Balancing Training Difficulty and Model Capability
- Activate and Reject: Towards Safe Domain Generalization under Category Shift
- Adversarial Bayesian Augmentation for Single-Source Domain Generalization
- PASTA: Proportional Amplitude Spectrum Training Augmentation for Syn-to-Real Domain Generalization
- DandelionNet: Domain Composition with Instance Adaptive Classification for Domain Generalization
- iDAG: Invariant DAG Searching for Domain Generalization
- A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance
- Understanding Hessian Alignment for Domain Generalization<br>:star:code
- Domain Generalization via Rationale Invariance<br>:star:code
- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization<br>:house:project
- Generalizable Decision Boundaries: Dualistic Meta-Learning for Open Set Domain Generalization<br>:star:code
- 零样本学习
- Hyperbolic Audio-visual Zero-shot Learning
- Continual Zero-Shot Learning through Semantically Guided Generative Random Walks<br>:star:code
- Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning<br>:star:code
- Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning
- 小样本学习
- Prototypes-oriented Transductive Few-shot Learning with Conditional Transport
- Frequency Guidance Matters in Few-Shot Learning
- CDFSL-V: Cross-Domain Few-Shot Learning for Videos<br>:star:code
- Read-only Prompt Optimization for Vision-Language Few-shot Learning<br>:star:code
- Task-aware Adaptive Learning for Cross-domain Few-shot Learning
- DETA: Denoised Task Adaptation for Few-Shot Learning
23.Optical Flow Estimation(光流估计)
- GAFlow: Incorporating Gaussian Attention into Optical Flow
- AccFlow: Backward Accumulation for Long-Range Optical Flow
- SemARFlow: Injecting Semantics into Unsupervised Optical Flow Estimation for Autonomous Driving
- Explicit Motion Disentangling for Efficient Optical Flow Estimation
- VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation<br>:star:code<br>:thumbsup:ICCV2023|港中文MMLab提出多帧光流估计模型VideoFlow,充分挖掘时序线索,Sintel与KITTI榜单排名第一
- MPI-Flow: Learning Realistic Optical Flow with Multiplane Images<br>:star:code<br>:thumbsup:从多平面图像中学习更真实的光流
- RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation<br>:star:code
- Event-based Temporally Dense Optical Flow Estimation with Sequential Learning
- TMA: Temporal Motion Aggregation for Event-based Optical Flow
- Taming Contrast Maximization for Learning Sequential, Low-latency, Event-based Optical Flow
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
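
The Sintel/KITTI numbers referenced above are reported as average end-point error (AEPE). A minimal NumPy computation of that metric, with random data standing in for real flow fields:

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """flow_pred, flow_gt: (H, W, 2) per-pixel (u, v) displacement fields."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

pred = np.random.randn(240, 320, 2).astype(np.float32)
print("AEPE:", average_epe(pred, np.zeros_like(pred)))
```
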
22.OCR
- ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
- Self-Supervised Character-to-Character Distillation for Text Recognition
- Vision Grid Transformer for Document Layout Analysis<br>:star:code
- CNN based Cuneiform Sign Detection Learned from Annotated 3D Renderings and Mapped Photographs with Illumination Augmentation<br>:house:project
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer<br>:star:code
- DocTr: Document Transformer for Structured Information Extraction in Documents
- 场景文本识别
- 中文文本识别
- 文档图像校正
- 文档理解
- 字体生成
- 实体识别
- 文本识别
- 手写打印文本分割
- 场景-文本理解
21.Point Cloud(点云)
- Self-Ordering Point Clouds
- Efficient LiDAR Point Cloud Oversegmentation Network
- Attention Discriminant Sampling for Point Clouds
- Dynamic Mesh Recovery from Partial Point Cloud Sequence
- Implicit Autoencoder for Point-Cloud Self-Supervised Representation Learning
- CO-PILOT: Dynamic Top-Down Point Cloud with Conditional Neighborhood Aggregation for Multi-Gigapixel Histopathology Image Representation
- Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models
- Ponder: Point Cloud Pre-training via Neural Rendering
- CO-Net: Learning Multiple Point Cloud Tasks at Once with A Cohesive Network
- SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator
- EPiC: Ensemble of Partial Point Clouds for Robust Classification
- Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos
- Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos
- SC3K: Self-supervised and Coherent 3D Keypoints Estimation from Rotated, Noisy, and Decimated Point Cloud Data<br>:star:code
- 2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds<br>:star:code
- DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds<br>:star:code
- Sketch and Text Guided Diffusion Model for Colored Point Cloud Generation
- Clustering based Point Cloud Representation Learning for 3D Analysis<br>:star:code
- Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models<br>:house:project<br>:star:code
- PC-Adapter: Topology-Aware Adapter for Efficient Domain Adaption on Point Clouds with Rectified Pseudo-label
- Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition
- PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-Modal Distillation and Super-Voxel Clustering
- 点云配准
- Density-invariant Features for Distant Point Cloud Registration
- Rethinking Point Cloud Registration as Masking and Reconstruction
- Point-TTA: Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary Learning
- SIRA-PCR: Sim-to-Real Adaptation for 3D Point Cloud Registration
- PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised RGB-D Point Cloud Registration<br>:star:code
- Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions<br>:star:code
- AutoSynth: Learning to Generate 3D Training Data for Object Point Cloud Registration
- RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration
- Chasing Clouds: Differentiable Volumetric Rasterisation of Point Clouds as a Highly Efficient and Accurate Loss for Large-Scale Deformable 3D Registration
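
The registration entries above estimate correspondences and transforms with learned networks; the closed-form step underneath many of them is Kabsch/Procrustes alignment given known correspondences. A minimal NumPy sketch, with a synthetic rotation used only for the self-check:

```python
import numpy as np

def kabsch(src, dst):
    """Return R (3x3), t (3,) minimizing ||R @ src_i + t - dst_i||."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

src = np.random.rand(100, 3)
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.3])
R, t = kabsch(src, dst)
print("rotation error:", np.abs(R - R_true).max())
```
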
- 点云分割
- See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data
- 2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision
- ProtoTransfer: Cross-Modal Prototype Transfer for Point Cloud Segmentation
- Zero-Shot Point Cloud Segmentation by Semantic-Visual Aware Synthesis
- CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation
- GaPro: Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes as Pseudo Labelers<br>:star:code
- Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation<br>:star:code
- Generalized Few-Shot Point Cloud Segmentation Via Geometric Words<br>:star:code
- 点云补全
- 点云分类
- 3D点云
- 3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack
- GridPull: Towards Scalability in Learning Implicit Representations from 3D Point Clouds<br>:star:code
- DiffFacto: Controllable Part-Based 3D Point Cloud Generation with Cross Diffusion
- GeoUDF: Surface Reconstruction from 3D Point Clouds via Geometry-guided Distance Representation
- 4D 点云
20.Reid(人员重识别/步态识别/行人检测)
- 人员搜索
- Reid
- Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-identification<br>:star:code
- Person Re-Identification without Identification via Event anonymization
- Part-Aware Transformer for Generalizable Person Re-identification
- Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
- Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification
- Discrepant and Multi-Instance Proxies for Unsupervised Person Re-Identification
- Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term Person Re-Identification
- 换衣重识别
- 可见光红外重识别
- Modality Unifying Network for Visible-Infrared Person Re-Identification
- Dual Pseudo-Labels Interactive Self-Training for Semi-Supervised Visible-Infrared Person Re-Identification
- Towards Grand Unified Representation Learning for Unsupervised Visible-Infrared Person Re-Identification
- Visible-Infrared Person Re-Identification via Semantic Alignment and Affinity Inference
- Learning Concordant Attention via Target-aware Alignment for Visible-Infrared Person Re-identification
- 文本-图像重识别
- 步态识别
- 人群计数
19.UAV/Remote Sensing/Satellite Image(无人机/遥感/卫星图像)
- View Consistent Purification for Accurate Cross-View Localization
- Class Prior-Free Positive-Unlabeled Learning with Taylor Variational Loss for Hyperspectral Remote Sensing Imagery<br>:star:code
- Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer
- Scalable Multi-Temporal Remote Sensing Change Data Generation via Simulating Stochastic Change Process
- 遥感目标检测
- 遥感图像分割
- 遥感图像理解
- 无人机跟踪
- 变化检测
18.Human Pose Estimation
- Plausible Uncertainties for Human Pose Regression
- BaRe-ESA: A Riemannian Framework for Unregistered Human Body Shapes
- Algebraically Rigorous Quaternion Framework for the Neural Network Pose Estimation Problem
- 人体姿态估计
- DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation
- Rethinking Pose Estimation in Crowds: Overcoming the Detection Information Bottleneck and Ambiguity
- SEFD: Learning to Distill Complex Pose and Occlusion
- DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion Models
- Source-free Domain Adaptive Human Pose Estimation
- Prior-guided Source-free Domain Adaptation for Human Pose Estimation
- TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting
- MHEntropy: Entropy Meets Multiple Hypotheses for Pose and Shape Recovery
- MixSynthFormer: A Transformer Encoder-like Structure with Mixed Synthetic Self-attention for Efficient Human Pose Estimation
- Human from Blur: Human Pose Tracking from Blurry Images
- 多人姿态估计
- 3D人体姿态估计
- 3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation<br>:star:code
- Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation
- Global Adaptation Meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
- PhaseMP: Robust 3D Pose Estimation via Phase-conditioned Human Motion Prior
- Test-time Personalizable Forecasting of 3D Human Poses<br>:thumbsup:面向分布外角色的个性化人体运动预测
- Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation
- HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation
- Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video<br>:star:code
- EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild<br>:house:project
- GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video
- PoseFix: Correcting 3D Human Poses with Natural Language
- Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild<br>:house:project
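
The 3D pose papers above are typically compared via MPJPE (mean per-joint position error). A minimal NumPy computation, assuming a 17-joint skeleton in millimetres purely for illustration:

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions; returns the mean per-joint L2 error."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.random.rand(17, 3) * 1000          # e.g. a 17-joint skeleton, in mm
pred = gt + np.random.randn(17, 3) * 10
print("MPJPE (mm):", mpjpe(pred, gt))
```
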
- 人体姿态预测
- 人体网格恢复
- JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery
- Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views
- Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction
- Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement
- TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
- Distribution-Aligned Diffusion for Human Mesh Recovery<br>:star:code
- Cloth2Body: Generating 3D Human Body Mesh from 2D Clothing
- 3D Human Mesh Recovery with Sequentially Global Rotation Estimation
- 多人网格恢复
- 3D人体恢复
- 姿势迁移
- Weakly-supervised 3D Pose Transfer with Keypoints
- WaveIPT: Joint Attention and Flow Alignment in the Wavelet domain for Pose Transfer
- Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures
- Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer
- MAPConNet: Self-supervised 3D Pose Transfer with Mesh and Point Contrastive Learning
- 姿势生成
- 三维人体重建
- 三维人体网格重建
- 手部姿势估计
- 3D手部姿态估计
- OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision
- HandR2N2: Iterative 3D Hand Pose Estimation Using a Residual Recurrent Neural Network
- RenderIH: A Large-scale Synthetic Dataset for 3D Interacting Hand Pose Estimation<br>:star:code
- 手-物建模
- 手部重建
- Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images
- Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image
- 手势生成
- 手势识别
- 人体合成
- 3D 人体运动生成
- Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation
- ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation
- Synthesizing Diverse Human Motions in 3D Indoor Scenes
- TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
- Guided Motion Diffusion for Controllable Human Motion Synthesis
- MotionBERT: A Unified Perspective on Learning Human Motion Representations
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model
- 人体姿态异常检测
- 舞蹈生成
17.Generative Adversarial Network
- LFS-GAN: Lifelong Few-Shot Image Generation
- Mimic3D: Thriving 3D-Aware GANs via 3D-to-2D Imitation
- What can Discriminator do? Towards Box-free Ownership Verification of Generative Adversarial Network<br>:star:code
- OrthoPlanes: A Novel Representation for Better 3D-Awareness of GANs
- GlowGAN: Unsupervised Learning of HDR Images from LDR Images in the Wild
- Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations
- LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis
- Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2
- Smoothness Similarity Regularization for Few-Shot GAN Adaptation
- SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning
- Frequency-aware GAN for Adversarial Manipulation Generation
- GAN 逆映射
16.Super-Resolution(超分辨率)
- Self-Supervised Burst Super-Resolution
- Who Are You Referring To? Coreference Resolution In Image Narrations
- Spherical Space Feature Decomposition for Guided Depth Map Super-Resolution
- SRFormer: Permuted Self-Attention for Single Image Super-Resolution
- ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution
- Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution
- CuNeRF: Cube-Based Neural Radiance Field for Zero-Shot Medical Image Arbitrary-Scale Super Resolution
- Content-Aware Local GAN for Photo-Realistic Super-Resolution
- Boosting Single Image Super-Resolution via Partial Channel Shifting
- Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution
- Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution
- On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement<br>:star:code
- Dual Aggregation Transformer for Image Super-Resolution<br>:star:code
- Feature Modulation Transformer: Cross-Refinement of Global Representation via High-Frequency Prior for Image Super-Resolution<br>:star:code
- MetaF2N: Blind Image Super-Resolution by Learning Efficient Model Adaptation from Faces<br>:star:code
- HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models
- Lightweight Image Super-Resolution with Superpixel Token Interaction
- MSRA-SR: Image Super-resolution Transformer with Multi-scale Shared Representation Acquisition
- DLGSANet: Lightweight Dynamic Local and Global Self-Attention Networks for Image Super-Resolution
- Learning Correction Filter via Degradation-Adaptive Regression for Blind Single Image Super-Resolution
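
Super-resolution results in the entries above are usually reported in PSNR (often alongside SSIM). A minimal NumPy computation, assuming images normalised to [0, 1]:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

hr = np.clip(np.random.rand(64, 64, 3), 0, 1)
sr = np.clip(hr + np.random.randn(64, 64, 3) * 0.01, 0, 1)
print("PSNR (dB):", psnr(sr, hr))
```
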
- 视频超分辨率
- MoTIF: Learning Motion Trajectories with Local Implicit Neural Functions for Continuous Space-Time Video Super-Resolution
- Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution
- Multi-Frequency Representation Enhancement with Privilege Information for Video Super-Resolution
- 基于参考的超分辨率
- 图像重缩放
15.Image/Video Retrieval(图像/视频检索)
- Zero-Shot Composed Image Retrieval with Textual Inversion
- Democratising 2D Sketch to 3D Shape Retrieval Through Pivoting
- Learning Spatial-context-aware Global Visual Feature Representation for Instance Image Retrieval
- Prototypical Mixing and Retrieval-Based Refinement for Label Noise-Resistant Image Retrieval
- U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds
- DeDrift: Robust Similarity Search under Content Drift
- FashionNTM: Multi-turn Fashion Image Retrieval via Cascaded Memory<br>:house:project
- Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval<br>:star:code
- Global Features are All You Need for Image Retrieval and Reranking<br>:star:code
- Unsupervised Feature Representation Learning for Domain-generalized Cross-domain Image Retrieval
- Towards Content-based Pixel Retrieval in Revisited Oxford and Paris
- Fan-Beam Binarization Difference Projection (FB-BDP): A Novel Local Object Descriptor for Fine-Grained Leaf Image Retrieval
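
Image retrieval entries above are commonly evaluated with Recall@K over cosine similarity. A minimal NumPy sketch; the embedding dimensions and label layout are illustrative assumptions:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=5):
    """Fraction of queries whose top-k cosine neighbours contain a correct match."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return hits.mean()

q, g = np.random.randn(20, 128), np.random.randn(200, 128)
ql, gl = np.random.randint(0, 10, 20), np.random.randint(0, 10, 200)
print("Recall@5:", recall_at_k(q, g, ql, gl))
```
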
- 图像-文本检索
- 文本-视频检索
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
- Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
- PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval<br>:star:code
- UATVR: Uncertainty-Adaptive Text-Video Retrieval
- Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval
- 视频-文本检索
- 视频检索
- 视频时刻检索
14.Image/Video Compression(图像/视频压缩)
- 图像压缩
- AdaNIC: Towards Practical Neural Image Compression via Dynamic Transform Routing
- Semantically Structured Image Compression via Irregular Group-Based Decoupling
- COMPASS: High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability
- Computationally-Efficient Neural Image Compression with Shallow Decoders
- Dec-Adapter: Exploring Efficient Decoder-Side Adapter for Bridging Screen Content and Natural Image Compression
- TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception
- RFD-ECNet: Extreme Underwater Image Compression with Reference to Feature Dictionary
- 视频压缩
13.Image Captions(图像字幕)
- Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning
- UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
- RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
- Guiding Image Captioning Models Toward More Specific Captions
- OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning<br>:star:code
- Explore and Tell: Embodied Visual Captioning in 3D Environments<br>:star:code
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning<br>:star:code
- 更改字幕
- 视频字幕
12.Medical Image(医学影像)
- LIMITR: Leveraging Local Information for Medical Image-Text Representation
- SKiT: a Fast Key Information Video Transformer for Online Surgical Phase Recognition
- LNPL-MIL: Learning from Noisy Pseudo Labels for Promoting Multiple Instance Learning in Whole Slide Image
- A skeletonization algorithm for gradient-based optimization
- Learning to Distill Global Representation for Sparse-View CT<br>:star:code
- Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation
- Dual Meta-Learning with Longitudinally Generalized Regularization for One-Shot Brain Tissue Segmentation Across the Human Lifespan<br>:star:code
- MRM: Masked Relation Modeling for Medical Image Pre-Training with Genetics
- Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction and Modeling from 2D Sparse Cardiac Magnetic Resonance Imaging
- 医学影像配准
- 医学报告生成
- 医学影像分割
- CauSSL: Causality-inspired Semi-supervised Learning for Medical Image Segmentation
- Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubular Structure Segmentation
- UniverSeg: Universal Medical Image Segmentation
- Probabilistic Modeling of Inter- and Intra-observer Variability in Medical Image Segmentation
- 医学影像分析
- 切片分类
- 细胞核检测
- X 射线
- MRI
- 脑肿瘤分割
- Scratch Each Other's Back: Incomplete Multi-Modal Brain Tumor Segmentation via Category Aware Group Self-Support Learning
- Enhancing Modality-Agnostic Representations via Meta-Learning for Brain Tumor Segmentation
- Dual Meta-Learning with Longitudinally Consistent Regularization for One-Shot Brain Tissue Segmentation Across the Human Lifespan
- 器官分割和肿瘤检测
- CT
- Continual Segment: Towards a Single, Unified and Non-forgetting Continual Segmentation Model of 143 Whole-body Organs in CT Scans
- DOLCE: A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction
- CancerUniT: Towards a Single Unified Model for Effective Detection, Segmentation, and Diagnosis of Eight Major Cancers Using a Large Collection of CT Scans
- 线粒体分割
11.Image/Video Editing(图像/视频编辑)
- 图像编辑
- Effective Real Image Editing with Accelerated Iterative Diffusion Inversion
- Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing
- SKED: Sketch-guided Text-based 3D Editing
- Prompt Tuning Inversion for Text-driven Image Editing Using Diffusion Models
- A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance
- 视频编辑
- 头发编辑
10.Image Synthesis(图像合成)
- Foreground Object Search by Distilling Composite Image Feature<br>:star:code
- Beyond Image Borders: Learning Feature Extrapolation for Unbounded Image Composition
- TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition<br>:star:code
- 图像生成
- MosaiQ: Quantum Generative Adversarial Networks for Image Generation on NISQ Computers
- Neural Characteristic Function Learning for Conditional Image Generation
- Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers
- HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation
- GRAM-HD: 3D-Consistent Image Generation at High Resolution with Generative Radiance Manifolds
- The Euclidean Space is Evil: Hyperbolic Attribute Editing for Few-shot Image Generation
- Generative Multiplane Neural Radiance for 3D-Aware Image Generation
- EGC: Image Generation and Classification via a Diffusion Energy-Based Model
- Personalized Image Generation for Color Vision Deficiency Population
- 3D-aware Image Generation using 2D Diffusion Models
- Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation
- Both Diverse and Realism Matter: Physical Attribute and Style Alignment for Rainy Image Generation
- 图像合成
- Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration
- Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
- BallGAN: 3D-aware Image Synthesis with a Spherical Background
- Perceptual Artifacts Localization for Image Synthesis Tasks
- Masked Diffusion Transformer is a Strong Image Synthesizer
- SideGAN: 3D-Aware Generative Model for Improved Side-View Image Synthesis
- VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis
- MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
- 文本-图像合成
- Dense Text-to-Image Generation with Attention Modulation<br>:star:code
- Unleashing Text-to-Image Diffusion Models for Visual Perception
- Anti-DreamBooth: Protecting Users from Personalized Text-to-image Synthesis
- Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models
- ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
- ITI-GEN: Inclusive Text-to-Image Generation<br>:star:code
- MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
- Localizing Object-Level Shape Variations with Text-to-Image Diffusion Models
- Expressive Text-to-Image Generation with Rich Text
- Evaluating Data Attribution for Text-to-Image Models
- Affective Image Filter: Reflecting Emotions from Text to Images
- Editing Implicit Assumptions in Text-to-Image Diffusion Models
- A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis
- PODIA-3D: Domain Adaptation of 3D Generative Model Across Large Domain Gap Using Pose-Preserved Text-to-Image Diffusion
- BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion<br>:star:code
- Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis<br>:star:code
- Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models
- Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis
- Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models
- Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis
- 图像-视频生成
- 文本-视频生成
- 音频驱动的图像生成
- X-图像生成
- 扩散
- DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport<br>:star:code
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- Improving 3D Imaging with Pre-Trained Perpendicular 2D Diffusion Models
- DDP: Diffusion Model for Dense Visual Prediction
- GECCO: Geometrically-Conditioned Point Diffusion Models
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
- Scalable Diffusion Models with Transformers
- Diffuse3D: Wide-Angle 3D Photography via Bilateral Diffusion
- Diffusion Models as Masked Autoencoders
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
- DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion
- HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion
- The Stable Signature: Rooting Watermarks in Latent Diffusion Models
- Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
- Ablating Concepts in Text-to-Image Diffusion Models
- Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
- Score-Based Diffusion Models as Principled Priors for Inverse Imaging
- Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions
- Benchmarking Low-Shot Robustness to Natural Distribution Shifts
- Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
- AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration
- Text2Tex: Text-driven Texture Synthesis via Diffusion Models
- AdvDiffuser: Natural Adversarial Example Synthesis with Diffusion Models
- Q-Diffusion: Quantizing Diffusion Models
- Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption<br>:star:code
- DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-trained Diffusion Models
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
- Texture Generation on 3D Meshes with Point-UV Diffusion<br>:star:code
- End-to-End Diffusion Latent Optimization Improves Classifier Guidance
- Stochastic Segmentation with Conditional Categorical Diffusion Models
- Erasing Concepts from Diffusion Models
- A Complete Recipe for Diffusion Generative Models
- SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
- TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models
- DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models
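
The diffusion entries above share the DDPM forward process and epsilon-prediction objective as their common core. A minimal PyTorch sketch with a linear schedule and a toy linear noise predictor on 8-D data, both arbitrary assumptions:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Diffuse clean samples x0 to timestep t in closed form."""
    a_bar = alphas_cumprod[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

eps_model = torch.nn.Linear(8 + 1, 8)                 # toy noise predictor on 8-D data

x0 = torch.randn(16, 8)
t = torch.randint(0, T, (16,))
noise = torch.randn_like(x0)
x_t = q_sample(x0, t, noise)
eps_pred = eps_model(torch.cat([x_t, t.float().view(-1, 1) / T], dim=1))
loss = F.mse_loss(eps_pred, noise)                    # epsilon-prediction objective
```
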
- 3D形状生成
- 故事可视化
- AIGC
- 布局生成
- LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models
- CC3D: Layout-Conditioned Generation of Compositional 3D Scenes
- A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions
- GlobalMapper: Arbitrary-Shaped Urban Layout Generation
- DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer
- MapPrior: Bird's-Eye View Map Layout Estimation with Generative Models
- 文本-3D
9.Image Classification(图像分类)
- Adaptive Image Anonymization in the Context of Image Classification with Neural Networks
- Personalized Semantics Excitation for Federated Image Classification
- Learning Support and Trivial Prototypes for Interpretable Image Classification
- Boosting Novel Category Discovery Over Domains with Soft Contrastive Learning and All in One Classifier
- Dynamic Perceiver for Efficient Visual Recognition
- Agile Modeling: From Concept to Classifier in Minutes
- Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts
- A step towards understanding why classification helps regression
- Image-free Classifier Injection for Zero-Shot Classification<br>:star:code
- Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events<br>:star:code
- ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition<br>:star:code
- Learning Concise and Descriptive Attributes for Visual Recognition
- Get the Best of Both Worlds: Improving Accuracy and Transferability by Grassmann Class Representation<br>:star:code
- What do neural networks learn in image classification? A frequency shortcut perspective
- Identification of Systematic Errors of Image Classifiers on Rare Subgroups
- Better May Not Be Fairer: A Study on Subgroup Discrepancy in Image Classification
- Neglected Free Lunch - Learning Image Classifiers Using Annotation Byproducts
- 零样本图像分类
- 小样本图像分类
- 多标签图像分类
- CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
- Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels
- PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification
- Scene-Aware Label Graph Learning for Multi-Label Image Classification
- Holistic Label Correction for Noisy Multi-Label Classification
- 细粒度识别/分类
- 长尾识别
- 长尾分类
- 长尾学习
8.Image Segmentation(图像分割)
- Segment Anything<br>:house:project
- High Quality Entity Segmentation
- InterFormer: Real-time Interactive Image Segmentation
- Zero-guidance Segmentation Using Zero Segment Labels
- Video State-Changing Object Segmentation
- RbA: Segmenting Unknown Regions Rejected by All
- Locating Noise is Halfway Denoising for Semi-Supervised Segmentation
- SEMPART: Self-supervised Multi-resolution Partitioning of Image Semantics
- Texture Learning Domain Randomization for Domain Generalized Segmentation
- Open-vocabulary Object Segmentation with Diffusion Models
- SegGPT: Towards Segmenting Everything in Context
- Coarse-to-Fine Amodal Segmentation with Shape Prior<br>:star:code
- Homography Guided Temporal Fusion for Road Line and Marking Segmentation
- SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning<br>:star:code
- Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation<br>:star:code
- CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation
- Unmasking Anomalies in Road-Scene Segmentation<br>:star:code
- DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer
- LogicSeg: Parsing Visual Semantics with Neural Logic Learning and Reasoning
- SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
- UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase<br>:star:code
- 3D Segmentation of Humans in Point Clouds with Synthetic Data
- MasQCLIP for Open-Vocabulary Universal Image Segmentation
- FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation
- Rethinking Range View Representation for LiDAR Segmentation
- 引用表达式分割
- 指代图像分割
- Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation<br>:star:code
- Beyond One-to-One: Rethinking the Referring Image Segmentation<br>:star:code
- Referring Image Segmentation Using Text Supervision<br>:star:code
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
- Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency
- 小样本分割
- 语义分割
- A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation
- Walking Your LiDOG: A Journey Through Multiple Domains for LiDAR Semantic Segmentation
- SegRCDB: Semantic Segmentation via Formula-Driven Supervised Learning
- Contrastive Model Adaptation for Cross-Condition Robustness in Semantic Segmentation
- Learning Pseudo-Relations for Cross-domain Semantic Segmentation
- Adaptive Superpixel for Active Learning in Semantic Segmentation
- Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
- MemorySeg: Online LiDAR Semantic Segmentation with a Latent Memory
- Disentangle then Parse: Night-time Semantic Segmentation with Illumination Disentanglement
- DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
- Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings
- Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation<br>:star:code
- CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation<br>:star:code
- To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation<br>:star:code
- Residual Pattern Learning for Pixel-Wise Out-of-Distribution Detection in Semantic Segmentation
- Look at the Neighbor: Distortion-aware Unsupervised Domain Adaptation for Panoramic Semantic Segmentation
- SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation
- Preparing the Future for Continual Semantic Segmentation
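
Semantic segmentation entries above are compared by mIoU. A minimal confusion-matrix-based NumPy computation, with 19 classes (a Cityscapes-style setting) assumed for the toy example:

```python
import numpy as np

def miou(pred, gt, num_classes):
    """pred, gt: integer label maps of identical shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)    # confusion matrix
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - inter
    ious = inter / np.maximum(union, 1)
    return ious[union > 0].mean()

pred = np.random.randint(0, 19, (256, 256))
gt = np.random.randint(0, 19, (256, 256))
print("mIoU:", miou(pred, gt, num_classes=19))
```
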
- 点云语义分割
- 小样本语义分割
- 零样本语义分割
- 无监督语义分割
- 弱监督语义分割
- MARS: Model-agnostic Biased Object Removal without Additional Supervision for Weakly-Supervised Semantic Segmentation
- USAGE: A Unified Seed Area Generation Paradigm for Weakly Supervised Semantic Segmentation
- FPR: False Positive Rectification for Weakly Supervised Semantic Segmentation
- Treating Pseudo-labels Generation as Image Matting for Weakly Supervised Semantic Segmentation
- 半监督语义分割
- Space Engage: Collaborative Space Supervision for Contrastive-Based Semi-Supervised Semantic Segmentation
- Large-Scale Land Cover Mapping with Fine-Grained Classes via Class-Aware Semi-Supervised Semantic Segmentation
- Semi-Supervised Semantic Segmentation under Label Noise via Diverse Learning Groups
- Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation<br>:star:code
- Enhanced Soft Label for Semi-Supervised Semantic Segmentation
- CFCG: Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision
- XNet: Wavelet-Based Low and High Frequency Fusion Networks for Fully- and Semi-Supervised Semantic Segmentation of Biomedical Images
- 域适应语义分割
- Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
- CrossMatch: Source-Free Domain Adaptive Semantic Segmentation via Cross-Modal Consistency Training
- Focus on Your Target: A Dual Teacher-Student Framework for Domain-Adaptive Semantic Segmentation
- CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation
- 开放词汇语义分割
- 开放世界语义分割
- 3D 语义分割
- BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation
- Multi-Modal Continual Test-Time Adaptation for 3D Semantic Segmentation
- Efficient 3D Semantic Segmentation with Superpoint Transformer
- You Never Get a Second Chance To Make a Good First Impression: Seeding Active Learning for 3D Semantic Segmentation
- 实例分割
- BoxSnake: Polygonal Instance Segmentation with Box Supervision
- TopoSeg: Topology-Aware Nuclear Instance Segmentation
- MUVA: A New Large-Scale Benchmark for Multi-View Amodal Instance Segmentation in the Shopping Scenario
- WaterMask: Instance Segmentation for Underwater Imagery
- Learning Cross-Representation Affinity Consistency for Sparsely Supervised Biomedical Instance Segmentation
- Class-incremental Continual Learning for Instance Segmentation with Image-level Weak Supervision
- 3D 实例分割
- Mask-Attention-Free Transformer for 3D Instance Segmentation<br>:star:code
- 3D Instance Segmentation via Enhanced Spatial and Semantic Supervision
- Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation with Single-point Supervision
- Query Refinement Transformer for 3D Instance Segmentation
- 点云实例分割
- 细胞实例分割
- 半监督实例分割
- 开放世界实例分割
- 开放词汇实例分割
- 全景分割
- Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning
- A Generalist Framework for Panoptic Segmentation of Images and Videos
- Open-vocabulary Panoptic Segmentation with Embedding Modulation
- 4D Panoptic Segmentation as Invariant and Equivariant Field Prediction
- EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation
- Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport<br>:star:code
- LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment<br>:star:code
- 开放词汇部分分割
- VIS
- CTVIS: Consistent Training for Online Video Instance Segmentation<br>:star:code
- DVIS: Decoupled Video Instance Segmentation Framework<br>:star:code<br>:thumbsup:ICCV 2023 | 发挥offline方法的潜力,武大&快手提出解耦合的视频实例分割框架DVIS
- TCOVIS: Temporally Consistent Online Video Instance Segmentation<br>:star:code
- Towards Open-Vocabulary Video Instance Segmentation<br>:thumbsup:面向开放任务的视频实例分割( Oral )
- VOS
- Spectrum-guided Multi-granularity Referring Video Object Segmentation<br>:star:code
- HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
- Video Object Segmentation-aware Video Frame Interpolation
- LVOS: A Benchmark for Long-term Video Object Segmentation
- Robust Referring Video Object Segmentation with Cyclic Structural Consensus
- Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning
- Alignment Before Aggregation: Trajectory Memory Retrieval Network for Video Object Segmentation
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation<br>:star:code
- Temporal Collection and Distribution for Referring Video Object Segmentation<br>:star:code
- OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation<br>:star:code
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation<br>:star:code
- Scalable Video Object Segmentation with Simplified Framework
- Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples<br>:star:code
- 动作分割
- Diffusion Action Segmentation
- How Much Temporal Long-Term Context is Needed for Action Segmentation?
- Video Action Segmentation via Contextually Refined Temporal Keypoints
- Markov Game Video Augmentation for Action Segmentation
- Weakly-Supervised Action Segmentation and Unseen Error Detection in Anomalous Instructional Videos
- LAC - Latent Action Composition for Skeleton-based Action Segmentation
- 交互分割
- 基于文本的图像分割
- 场景解析
7.Image Processing(低层图像处理、质量评价)
- DRAW: Defending Camera-shooted RAW against Image Manipulation
- SAFL-Net: Semantic-Agnostic Feature Learning Network with Auxiliary Plugins for Image Manipulation Detection
- Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning<br>:star:code
- Improving Lens Flare Removal with General Purpose Pipeline and Multiple Light Sources Recovery
- Deep Homography Mixture for Single Image Rolling Shutter Correction
- 图像修饰
- 图像恢复
- Physics-Driven Turbulence Image Restoration with Stochastic Refinement<br>:star:code
- Multi-weather Image Restoration via Domain Translation
- PIRNet: Privacy-Preserving Image Restoration Network via Wavelet Lifting
- DiffIR: Efficient Diffusion Model for Image Restoration
- DDS2M: Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration
- Under-Display Camera Image Restoration with Scattering Effect<br>:star:code
- Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration
- Fingerprinting Deep Image Restoration Models
- Focal Network for Image Restoration
- FSI: Frequency and Spatial Interactive Learning for Image Restoration in Under-Display Cameras
- Self-supervised Monocular Underwater Depth Recovery, Image Restoration, and a Real-sea Video Dataset
- 图像/视频修复
- Rethinking Fast Fourier Convolution in Image Inpainting
- Continuously Masked Transformer for Image Inpainting
- Semantic-Aware Dynamic Parameter for Video Inpainting Transformer
- MI-GAN: A Simple Baseline for Image Inpainting on Mobile Devices
- CIRI: Curricular Inactivation for Residue-aware One-shot Video Inpainting
- 图像/视频增强
- Coherent Event Guided Low-Light Video Enhancement
- Lighting up NeRF via Unsupervised Decomposition and Enhancement<br>:house:project
- Implicit Neural Representation for Cooperative Low-light Image Enhancement<br>:star:code<br>:thumbsup:ICCV2023 | 将隐式神经表征用于“低光增强”,北大张健团队提出NeRCo
- Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement<br>:star:code<br>:thumbsup:ICCV 2023 清华ETH提出 Retinexformer 刷新十三大暗光增强榜单
- Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network
- Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model
- Empowering Low-Light Image Enhancer through Customized Learnable Priors<br>:star:code
- ExposureDiffusion: Learning to Expose for Low-light Image Enhancement
- NIR-assisted Video Enhancement via Unpaired 24-hour Data
- Low-Light Image Enhancement with Multi-Stage Residue Quantization and Brightness-Aware Attention
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement
- 图像/视频去雨
- 图像去噪
- Delta Denoising Score
- Random Sub-Samples Generation for Self-Supervised Real Image Denoising<br>:star:code
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners
- Noise2Info: Noisy Image to Information of Noise for Self-Supervised Image Denoising
- Iterative Denoiser and Noise Estimator for Self-Supervised Image Denoising
- Self-supervised Image Denoising with Downsampled Invariance Loss and Conditional Blind-Spot Network
- The Devil is in the Upsampling: Architectural Decisions Made Simpler for Denoising with Deep Image Prior
- Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising<br>:star:code
- Score Priors Guided Deep Variational Inference for Unsupervised Real-World Single Image Denoising
- Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches
- Multi-view Self-supervised Disentanglement for General Image Denoising
- RED-PSM: Regularization by Denoising of Partially Separable Models for Dynamic Imaging
- Hybrid Spectral Denoising Transformer with Guided Attention
- 图像去雾
- 图像去除阴影
- 图像/视频去模糊
- Exploring Temporal Frequency Spectrum in Deep Video Deblurring
- Single Image Deblurring with Row-dependent Blur Magnitude
- Multi-Scale Residual Low-Pass Filter Network for Image Deblurring
- Multiscale Structure Guided Diffusion for Image Deblurring
- Single Image Defocus Deblurring via Implicit Neural Inverse Kernels
- Non-Coaxial Event-Guided Motion Deblurring with Spatial Alignment
- Deep Feature Deblurring Diffusion for Detecting Out-of-Distribution Objects
- 图像/视频去摩尔纹
- 去马赛克/去鬼影
- 图像/视频质量评估
- 图像和谐化
- 图像校正
- 图像拼接
- 图像着色
- 图像/视频分解
- 运动(去)模糊
- 水印去除
6.Face(人脸)
- StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces
- Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
- UniFace: Unified Cross-Entropy Loss for Deep Face Recognition
- Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation<br>:star:code
- Can Language Models Learn to Listen?<br>:house:project
- Out-of-Domain GAN Inversion via Invertibility Decomposition for Photo-Realistic Human Face Manipulation
- Knowledge-Spreader: Learning Semi-Supervised Facial Action Dynamics by Consistifying Knowledge Granularity
- 去识别
- 人脸活体检测
- 说话头合成
- Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation<br>:star:code
- Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation<br>:star:code
- Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
- Emotional Listener Portrait: Neural Listener Head Generation with Emotion
- 说话人脸合成
- 人脸交换
- 假脸检测
- 人脸再现
- 文本驱动的人脸处理
- 人脸表情
- 人脸识别
- IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Models
- Invariant Feature Regularization for Fair Face Recognition
- Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation
- TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective<br>:star:code
- How to Boost Face Recognition with StyleGAN?
- Privacy-Preserving Face Recognition Using Random Frequency Components<br>:star:code
- ICD-Face: Intra-class Compactness Distillation for Face Recognition
- 人脸聚类
- 人脸合成
- 3D 人脸
- Unpaired Multi-domain Attribute Translation of 3D Facial Shapes with a Square and Symmetric Geometric Map
- ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling
- Relightify: Relightable 3D Faces from a Single Image via Diffusion Models
- 3D人脸合成
- 3D 人脸动画
- EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation
- Semi-supervised Speech-driven 3D Facial Animation via Cross-modal Encoding
- Imitator: Personalized Speech-driven 3D Facial Animation
- Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation
- 3D 人脸重建
- 人脸质量评估
- 人脸恢复
- 人脸编辑
- 语音驱动的人像动画
- 人脸重照明
- 面部行为理解
5.Biometric Recognition(生物特征识别)
<a name="4"/>4.Object Tracking(目标跟踪)
- Tracking Everything Everywhere All at Once
- Deep Active Contours for Real-time 6-DoF Object Tracking
- Humans in 4D: Reconstructing and Tracking Humans with Transformers
- Cross-Modal Orthogonal High-Rank Augmentation for RGB-Event Transformer-Trackers
- Multiple Planar Object Tracking
- End-to-end 3D Tracking with Decoupled Queries
- TAPIR: Tracking Any Point with Per-Frame Initialization and Temporal Refinement
- 多目标跟踪
- MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking<br>:star:code
- Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning
- Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking
- DARTH: Holistic Test-time Adaptation for Multiple Object Tracking
- Uncertainty-aware Unsupervised Multi-Object Tracking
- TrackFlow: Multi-Object Tracking with Normalizing Flows
- 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking<br>:star:code
- ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking
- Object-Centric Multiple Object Tracking
- Heterogeneous Diversity Driven Active Learning for Multi-Object Tracking
- 视觉跟踪
- Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking
- CiteTracker: Correlating Image and Text for Visual Tracking
- Robust Object Modeling for Visual Tracking<br>:star:code
- PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework
- Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation<br>:star:code
- Foreground-Background Distribution Modeling Transformer for Visual Object Tracking
- 3D目标跟踪
- Delving into Motion-Aware Matching for Monocular 3D Object Tracking<br>:star:code
- Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking
- MixCycle: Mixup Assisted Semi-Supervised 3D Single Object Tracking with Cycle Consistency
- TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses
- MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors
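
As referenced under the multi-object tracking entries above, here is a minimal, generic sketch of greedy IoU-based track-to-detection association, the frame-to-frame matching baseline that many of the listed trackers extend or replace. It is not taken from any specific paper; the function names and threshold are illustrative.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_associate(tracks, detections, iou_thresh=0.3):
    """Greedily match existing track boxes to new detection boxes by IoU.

    Returns (matches, unmatched_tracks, unmatched_detections), where
    matches is a list of (track_index, detection_index) pairs.
    """
    pairs = [(iou(t, d), ti, di)
             for ti, t in enumerate(tracks)
             for di, d in enumerate(detections)]
    pairs.sort(reverse=True)                     # best overlaps first
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < iou_thresh:
            break
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti); used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d

# Toy usage: one track, two detections in the next frame.
tracks = [np.array([10, 10, 50, 50], float)]
dets = [np.array([12, 11, 52, 49], float), np.array([200, 200, 240, 240], float)]
print(greedy_associate(tracks, dets))  # -> ([(0, 0)], [], [1])
```

Hungarian matching is the usual drop-in upgrade over the greedy loop when a globally optimal assignment is needed.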
<a name="3"/>3.Object Detection(目标检测)(a minimal NMS sketch follows this section's paper list)
- Deep Equilibrium Object Detection
- DIRE for Diffusion-Generated Image Detection
- DiffusionDet: Diffusion Model for Object Detection
- Shift from Texture-bias to Shape-bias: Edge Deformation-based Augmentation for Robust Object Recognition
- Detecting Objects with Context-Likelihood Graphs and Graph Refinement
- Multi-Object Discovery by Low-Dimensional Object Motion
- MOST: Multiple Object Localization with Self-Supervised Transformers for Object Discovery
- Label-Efficient Online Continual Object Detection in Streaming Video
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet
- SparseDet: Improving Sparsely Annotated Object Detection with Pseudo-positive Mining
- StageInteractor: Query-based Object Detector with Cross-stage Interaction
- Anchor-Intermediate Detector: Decoupling and Coupling Bounding Boxes for Accurate Object Detection
- Uncertainty-guided Learning for Improving Image Manipulation Detection
- From Chaos Comes Order: Ordering Event Representations for Object Recognition and Detection
- Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection
- Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
- FemtoDet: An Object Detection Baseline for Energy Versus Performance Tradeoffs
- Periodically Exchange Teacher-Student for Source-Free Object Detection
- Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's-Eye View
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
- A Dynamic Dual-Processing Object Detection Framework Inspired by the Brain's Recognition Mechanism
- ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation<br>:star:code
- FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision
- FB-BEV: BEV Representation from Forward-Backward View Transformations<br>:star:code
- DETR Doesn't Need Multi-Scale or Locality Design<br>:star:code
- RecursiveDet: End-to-End Region-based Recursive Object Detection<br>:star:code
- Less is More: Focus Attention for Efficient DETR<br>:star:code<br>:house:project
- Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes<br>:star:code
- AlignDet: Aligning Pre-training and Fine-tuning in Object Detection<br>:star:code
- Augmented Box Replay: Overcoming Foreground Shift for Incremental Object Detection
- Deep Directly-Trained Spiking Neural Networks for Object Detection
- Cascade-DETR: Delving into High-Quality Universal Object Detection<br>:star:code
- Object-aware Gaze Target Detection<br>:star:code
- FS-DETR: Few-Shot DEtection TRansformer with Prompting and without Re-Training
- Adaptive Rotated Convolution for Rotated Object Detection
- UniKD: Universal Knowledge Distillation for Mimicking Homogeneous or Heterogeneous Object Detectors
- 3D Object Detection(3D OD)
- PG-RCNN: Semantic Surface Point Generation for 3D Object Detection<br>:star:code
- Efficient Transformer-based 3D Object Detection with Dynamic Token Halting
- Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
- DQS3D: Densely-matched Quantization-aware Semi-supervised 3D Detection
- Towards Universal LiDAR-Based 3D Object Detection by Multi-Domain Knowledge Transfer
- Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection
- DPF-Net: Combining Explicit Shape Priors in Deformable Primitive Field for Unsupervised Structural Reconstruction of 3D Objects
- GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data
- Not Every Side Is Equal: Localization Uncertainty Estimation for Semi-Supervised 3D Object Detection
- Once Detected, Never Lost: Surpassing Human Performance in Offline LiDAR based 3D Object Detection
- Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection
- Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling
- Clusterformer: Cluster-based Transformer for 3D Object Detection in Point Clouds
- Object as Query: Lifting Any 2D Object Detector to 3D Detection
- MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation
- SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection
- UpCycling: Semi-supervised 3D Object Detection without Sharing Raw-level Unlabeled Scenes
- Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection
- Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
- Learning from Noisy Data for Semi-Supervised 3D Object Detection
- DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds
- SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection
- SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection
- KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection
- ObjectFusion: Multi-modal 3D Object Detection with Object-Centric Fusion
- Predict to Detect: Prediction-guided 3D Object Detection using Sequential Images
- A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection
- A Fast Unified System for 3D Object Detection and Tracking
- 3DPPE: 3D Point Positional Encoding for Transformer-based Multi-Camera 3D Object Detection
- PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection
- SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors
- ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection<br>:star:code
- MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection<br>:star:code
- SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos<br>:star:code
- QD-BEV : Quantization-aware View-guided Distillation for Multi-view 3D Object Detection
- Representation Disparity-aware Distillation for 3D Object Detection
- GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds<br>:star:code
- FocalFormer3D : Focusing on Hard Instance for 3D Object Detection<br>:star:code
- NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection<br>:star:code
- Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection<br>:star:code
- Cross Modal Transformer: Towards Fast and Robust 3D Object Detection<br>:star:code
- CoIn: Contrastive Instance Feature Mining for Outdoor 3D Object Detection with Very Limited Annotations
- GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection
- Text-Driven Object Detection(文本驱动的目标检测)
- Open-Vocabulary Object Detection(开放词汇目标检测)
- Open-World Object Detection(开放世界目标检测)
- End-to-End Object Detection(端到端目标检测)
- Domain-Adaptive Object Detection(域适应目标检测)
- Self-Supervised Object Detection(自监督目标检测)
- Weakly Supervised Object Detection(弱监督目标检测)
- Semi-Supervised Object Detection(半监督目标检测)
- Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection
- Few-Shot Object Detection(小样本目标检测)
- Dense Object Detection(密集目标检测)
- Video Object Detection(视频目标检测)
- Long-Tailed Object Detection(长尾目标检测)
- Open-Set Object Detection(开集目标检测)
- Small Object Detection(小目标检测)
- Object Localization(目标定位)
- 3D Object Localization(3D OL)
- Unsupervised Object Localization(无监督目标定位)
- Weakly Supervised Object Localization(弱监督目标定位)
- Open-Vocabulary Object Localization(开放词汇目标定位)
- Shadow Detection(影子检测)
- Mirror Detection(镜子检测)
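
As noted at the top of this section, here is a minimal single-class non-maximum suppression (NMS) sketch for reference: the standard detector post-processing step that several listed works, such as the DETR-style end-to-end detectors, aim to make unnecessary. It is not taken from any specific paper above; the IoU threshold is illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Plain single-class NMS.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # candidates sorted by confidence
    keep = []
    while order.size > 0:
        i = order[0]                          # keep the current best box...
        keep.append(int(i))
        # ...and compute its IoU with all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

# Toy usage: two overlapping boxes and one separate box.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]
```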
<a name="2"/>2.3D(三维重建、三维视觉)
- ATT3D: Amortized Text-to-3D Object Synthesis
- Zero-1-to-3: Zero-shot One Image to 3D Object
- Robo3D: Towards Robust and Reliable 3D Perception against Corruptions
- AG3D: Learning to Generate 3D Avatars from 2D Image Collections
- Semantify: Simplifying the Control of 3D Morphable Models Using CLIP
- Vox-E: Text-Guided Voxel Editing of 3D Objects
- PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View
- Tiled Multiplane Images for Practical 3D Photography
- Learned Compressive Representations for Single-Photon 3D Imaging
- CVRecon: Rethinking 3D Geometric Feature Learning For Neural Reconstruction
- DeFormer: Integrating Transformers with Deformable Models for 3D Shape Abstraction from a Single Image
- HoloFusion: Towards Photo-realistic 3D Generative Modeling<br>:star:code
- PlaneRecTR: Unified Query learning for 3D Plane Recovery from a Single View<br>:house:project<br>:star:code
- OmnimatteRF: Robust Omnimatte with 3D Background Modeling<br>:star:code
- Learning Versatile 3D Shape Generation with Improved Auto-regressive Models
- GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
- 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
- 3D Reconstruction(三维重建)
- Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image<br>:star:code
- Batch-based Model Registration for Fast 3D Sherd Reconstruction
- UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction
- Theoretical and Numerical Analysis of 3D Reconstruction Using Point and Line Incidences
- FineRecon: Depth-aware Feed-forward Network for Detailed 3D Reconstruction
- LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction
- Neural-PBIR Reconstruction of Shape, Material, and Illumination
- Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction
- LivePose: Online 3D Reconstruction from Monocular Video with Dynamic Camera Poses
- Long-Range Grouping Transformer for Multi-View 3D Reconstruction<br>:star:code
- Doppelgangers: Learning to Disambiguate Images of Similar Structures<br>:star:code
- Deformable Model-Driven Neural Rendering for High-Fidelity 3D Reconstruction of Human Heads Under Low-View Settings
- Iterative Superquadric Recomposition of 3D Objects from Multiple Views<br>:star:code
- CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images<br>:star:code
- PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs<br>:star:code
- Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction<br>:star:code
- Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular Videos<br>:star:code
- R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras<br>:house:project
- 3D Scene Reconstruction(三维场景重建)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models<br>:star:code
- Spacetime Surface Regularization for Neural Dynamic Scene Reconstruction
- Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction
- SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields
- DG-Recon: Depth-Guided Neural 3D Scene Reconstruction
- Uni-3D: A Universal Model for Panoptic 3D Scene Reconstruction
- Depth Estimation(深度估计)(a minimal depth back-projection sketch follows this 3D section)
- MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation
- Out-of-Distribution Detection for Monocular Depth Estimation
- Improving Equivariance in State-of-the-Art Supervised Depth and Normal Predictors
- Single Depth-image 3D Reflection Symmetry and Shape Prediction
- SlaBins: Fisheye Depth Estimation using Slanted Bins on Road Environments
- 3D Distillation: Improving Self-Supervised Monocular Depth Estimation on Reflective Surfaces
- EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation
- DPS-Net: Deep Polarimetric Stereo Depth Estimation
- Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering
- Towards Zero-Shot Scale-Aware Monocular Depth Estimation
- Indoor Depth Recovery Based on Deep Unfolding with Non-Local Prior
- GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes
- NDDepth: Normal-Distance Assisted Monocular Depth Estimation
- V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints
- Robust Monocular Depth Estimation under Challenging Conditions<br>:star:code
- Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network
- Self-supervised Monocular Depth Estimation: Let's Talk About The Weather
- Calibrating Panoramic Depth Estimation for Practical Localization and Mapping
- Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-supervised Depth Estimation<br>:star:code
- GEDepth: Ground Embedding for Monocular Depth Estimation
- Depth Completion(深度补全)
- Stereo Matching
- 3D Generation(三维生成)
- MVS
- Hierarchical Prior Mining for Non-local Multi-View Stereo
- MVPSNet: Fast Generalizable Multi-view Photometric Stereo
- S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces
- When Epipolar Constraint Meets Non-Local Operators in Multi-View Stereo
- Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells
- Surface Reconstruction(表面重建)
- Structure-Aware Surface Reconstruction via Primitive Assembly
- C2F2NeUS: Cascade Cost Frustum Fusion for High Fidelity and Generalizable Neural Surface Reconstruction
- NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction
- Ref-NeuS: Ambiguity-Reduced Neural Implicit Surface Learning for Multi-View Reconstruction with Reflection
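
As referenced under the depth-estimation entries, here is a minimal pinhole back-projection sketch that connects the depth-estimation and reconstruction entries above: it lifts a per-pixel depth map plus camera intrinsics to a point cloud in the camera frame. It is not from any listed paper, and the intrinsics in the usage example are made up.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in metres, to an (H*W, 3) point cloud.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    Points are expressed in the camera coordinate frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # per-pixel column/row indices
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy usage with made-up intrinsics and a flat 2 m depth plane.
depth = np.full((4, 4), 2.0)
points = depth_to_pointcloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)   # (16, 3); every point has Z = 2.0
```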
<a name="1"/>1.Others(其它)
- Poincare ResNet
- Muscles in Action
- Scene as Occupancy
- Diffusion in Style
- Attentive Mask CLIP
- Dataset Quantization
- Dynamic Point Fields
- Teaching CLIP to Count to Ten
- SiLK: Simple Learned Keypoints
- Audiovisual Masked Autoencoders
- Generalized Differentiable RANSAC
- Neural Implicit Surface Evolution
- Gender Artifacts in Visual Datasets
- Towards Models that Can See and Read
- Convex Decomposition of Indoor Scenes
- Viewing Graph Solvability in Practice
- Navigating to Objects Specified by Images
- InfiniCity: Infinite-Scale City Synthesis
- Quality Diversity for Visual Pre-Training
- Incremental Generalized Category Discovery
- Towards Multi-Layered 3D Garments Animation
- XiNet: Efficient Neural Networks for tinyML
- Simulating Fluids in Real-World Still Images
- Bayesian Optimization Meets Self-Distillation
- Sentence Attention Blocks for Answer Grounding
- Do DALL-E and Flamingo Understand Each Other?
- Computational 3D Imaging with Position Sensors
- Landscape Learning for Neural Network Inversion
- ClusT3: Information Invariant Test-Time Training
- DETR Does Not Need Multi-Scale or Locality Design
- Multi-Directional Subspace Editing in Style-Space
- Re-ReND: Real-Time Rendering of NeRFs across Devices
- Curvature-Aware Training for Coordinate Networks
- Document Understanding Dataset and Evaluation (DUDE)
- SparseMAE: Sparse Training Meets Masked Autoencoders
- Mesh2Tex: Generating Mesh Textures from Image Queries
- Rosetta Neurons: Mining the Common Units in a Model Zoo
- General Planar Motion from a Pair of 3D Correspondences
- Adaptive Reordering Sampler with Neurally Guided MAGSAC
- Transparent Shape from a Single View Polarization Image
- Leaping Into Memories: Space-Time Deep Feature Synthesis
- Label-Noise Learning with Intrinsically Long-Tailed Data
- 3D VR Sketch Guided 3D Shape Prototyping and Exploration
- Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning
- CAD-Estate: Large-scale CAD Model Annotation in RGB Videos
- Role-Aware Interaction Generation from Textual Description
- DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting
- BT^2: Backward-compatible Training with Basis Transformation
- Grounding 3D Object Affordance from 2D Interactions in Images
- FBLNet: FeedBack Loop Network for Driver Attention Prediction
- INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold
- Towards Geospatial Foundation Models via Continual Pretraining
- Deep Incubation: Training Large Models by Divide-and-Conquering
- Learning to Learn: How to Continuously Teach Humans and Machines
- Efficient Neural Supersampling on a Novel Gaming Dataset
- PPR: Physically Plausible Reconstruction from Monocular Videos
- SpinCam: High-Speed Imaging via a Rotating Point-Spread Function
- Lens Parameter Estimation for Realistic Depth of Field Modeling
- BANSAC: A Dynamic BAyesian Network for Adaptive SAmple Consensus
- What Can Simple Arithmetic Operations Do for Temporal Modeling?
- Sample4Geo: Hard Negative Sampling For Cross-View Geo-Localisation
- Cross-view Semantic Alignment for Livestreaming Product Recognition
- Geometric Viewpoint Learning with Hyper-Rays and Harmonics Encoding
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning
- Segmenting Known Objects and Unseen Unknowns without Prior Knowledge
- Learning Fine-Grained Features for Pixel-Wise Video Correspondences
- Graphics2RAW: Mapping Computer Graphics Images to Sensor RAW Images
- LoCUS: Learning Multiscale 3D-consistent Features from Posed Images
- DreamTeacher: Pretraining Image Backbones with Deep Generative Models
- Parametric Information Maximization for Generalized Category Discovery
- Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs
- Neural Collage Transfer: Artistic Reconstruction via Material Manipulation
- Adaptive Spiral Layers for Efficient 3D Representation Learning on Meshes
- Tiny Updater: Towards Efficient Neural Network-Driven Software Updating
- Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors
- Multi-Object Navigation with Dynamically Learned Neural Implicit Representations
- Is Imitation All You Need? Generalized Decision-Making with Dual-Phase Training
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
- Communication-Efficient Vertical Federated Learning with Limited Overlapping Samples
- Improving Representation Learning for Histopathologic Images with Cluster Constraints
- UMC: A Unified Bandwidth-efficient and Multi-resolution based Collaborative Perception Framework
- Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance
- Rapid Network Adaptation: Learning to Adapt Neural Networks Using Test-Time Feedback
- A Skeletonization Algorithm for Gradient-Based Optimization
- Joint Implicit Neural Representation for High-fidelity and Compact Vector Fonts
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones
- Mitigating and Evaluating Static Bias of Action Representations in the Background and the Foreground
- Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models
- S-TREK: Sequential Translation and Rotation Equivariant Keypoints for Local Feature Extraction
- Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations
- SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability
- Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks
- Combating Noisy Labels with Sample Selection by Mining High-Discrepancy Examples
- Among Us: Adversarially Robust Collaborative Perception by Consensus
- Strip-MLP: Efficient Token Interaction for Vision MLP
- MAGI: Multi-Annotated Explanation-Guided Learning
- NeTO: Neural Reconstruction of Transparent Objects with Self-Occlusion Aware Refraction-Tracing
- Video Adverse-Weather-Component Suppression Network via Weather Messenger and Adversarial Backpropagation
- Learning Foresightful Dense Visual Affordance for Deformable Object Manipulation
- A 5-Point Minimal Solver for Event Camera Relative Motion Estimation
- Sample-wise Label Confidence Incorporation for Learning with Noisy Labels
- Inducing Neural Collapse to a Fixed Hierarchy-Aware Frame for Reducing Mistake Severity
- Global Perception Based Autoregressive Neural Processes
- Counterfactual-based Saliency Map: Towards Visual Contrastive Explanations for Neural Networks
- Prompt-aligned Gradient for Prompt Tuning
- Recovering a Molecule's 3D Dynamics from Liquid-phase Electron Microscopy Movies
- Overwriting Pretrained Bias with Finetuning Data
- DataDAM: Efficient Dataset Distillation with Attention Matching
- MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception
- UMFuse: Unified Multi View Fusion for Human Editing Applications
- Semantic-Aware Implicit Template Learning via Part Deformation Consistency
- Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
- When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method
- FCCNs: Fully Complex-valued Convolutional Networks using Complex-valued Color Model and Loss Function
- Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer
- NeSS-ST: Detecting Good and Stable Keypoints with a Neural Stability Score and the Shi-Tomasi detector
- Search for or Navigate to? Dual Adaptive Thinking for Object Navigation
- Real-Time Neural Rasterization for Large Scenes
- A Game of Bundle Adjustment - Learning Efficient Convergence
- Fast Globally Optimal Surface Normal Estimation from an Affine Correspondence
- Fast and Accurate Transferability Measurement by Evaluating Intra-class Feature Variance
- Model Calibration in Dense Classification with Adaptive Label Perturbation
- Improving CLIP Fine-tuning Performance
- TiDy-PSFs: Computational Imaging with Time-Averaged Dynamic Point-Spread-Functions
- Ordinal Label Distribution Learning
- Why do networks have inhibitory/negative connections?
- Pyramid Dual Domain Injection Network for Pan-sharpening
- MULLER: Multilayer Laplacian Resizer for Vision
- Privacy Preserving Localization via Coordinate Permutations
- All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
- Pretrained Language Models as Visual Planners for Human Assistance
- Concept-wise Fine-tuning Matters in Preventing Negative Transfer
- Essential Matrix Estimation using Convex Relaxations in Orthogonal Space
- IIEU: Rethinking Neural Feature Activation from Decision-Making
- A Unified Framework for Robustness on Diverse Sampling Errors
- VQ3D: Learning a 3D-Aware Generative Model on ImageNet
- Learning to Ground Instructional Articles in Videos through Narrations
- Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models
- Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction
- PADDLES: Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy Labels
- ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
- Translating Images to Road Network: A Non-Autoregressive Sequence-to-Sequence Approach
- CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
- AssetField: Assets Mining and Reconfiguration in Ground Feature Plane Representation
- NLOS-NeuS: Non-line-of-sight Neural Implicit Surface
- XVO: Generalized Visual Odometry via Cross-Modal Self-Training
- ASIC: Aligning Sparse in-the-wild Image Collections
- PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
- DEDRIFT: Robust Similarity Search under Content Drift
- MATE: Masked Autoencoders are Online 3D Test-Time Learners
- SYENet: A Simple Yet Effective Network for Multiple Low-Level Vision Tasks with Real-Time Performance on Mobile Device
- Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-training
- Domain Specified Optimization for Deployment Authorization
- PRANC: Pseudo RAndom Networks for Compacting Deep Models
- Neural Haircut: Prior-Guided Strand-Based Hair Reconstruction
- NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation
- Building Bridge Across the Time: Disruption and Restoration of Murals In the Wild
- MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope
- One-shot Implicit Animatable Avatars with Model-based Priors<br>:thumbsup:One-shot learning of implicit animatable avatars based on model priors
- RecRecNet: Rectangling Rectified Wide-Angle Images by Thin-Plate Spline Model and DoF-based Curriculum Learning
- Luminance-aware Color Transform for Multiple Exposure Correction
- Segment Every Reference Object in Spatial and Temporal Spaces
- RankMatch: Fostering Confidence and Consistency in Learning with Noisy Labels
- Hiding Visual Information via Obfuscating Adversarial Perturbations
- SHACIRA: Scalable HAsh-grid Compression for Implicit Neural Representations
- VoroMesh: Learning Watertight Surface Meshes with Voronoi Diagrams
- UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-Aware Curriculum and Iterative Generalist-Specialist Learning
- Will Large-scale Generative Models Corrupt Future Datasets?
- DREAM: Efficient Dataset Distillation by Representative Matching
- Inverse Compositional Learning for Weakly-supervised Relation Grounding
- Adaptive Similarity Bootstrapping for Self-Distillation Based Representation Learning
- ParCNetV2: Oversized Kernel with Enhanced Attention
- Efficiently Robustify Pre-Trained Models
- DIME-FM : DIstilling Multimodal and Efficient Foundation Models
- Overcoming Forgetting Catastrophe in Quantization-Aware Training
- Adverse Weather Removal with Codebook Priors
- MAP: Towards Balanced Generalization of IID and OOD through Model-Agnostic Adapters
- Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions
- COOL-CHIC: Coordinate-based Low Complexity Hierarchical Image Codec
- Chop & Learn: Recognizing and Generating Object-State Compositions
- CORE: Cooperative Reconstruction for Multi-Agent Perception
- Gramian Attention Heads are Strong yet Efficient Vision Learners
- Improving Zero-Shot Generalization for CLIP with Synthesized Prompts
- E3Sym: Leveraging E(3) Invariance for Unsupervised 3D Planar Reflective Symmetry Detection
- Towards Improved Input Masking for Convolutional Neural Networks
- SIGMA: Scale-Invariant Global Sparse Shape Matching
- Pluralistic Aging Diffusion Autoencoder
- PEANUT: Predicting and Navigating to Unseen Targets
- ModelGiF: Gradient Fields for Model Functional Distance
- Robust Evaluation of Diffusion-Based Adversarial Purification
- R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement
- DynamicISP: Dynamically Controlled Image Signal Processor for Image Recognition
- AutoReP: Automatic ReLU Replacement for Fast Private Network Inference
- Candidate-aware Selective Disambiguation Based On Normalized Entropy for Instance-dependent Partial-label Learning
- Distributed Bundle Adjustment with Block-Based Sparse Matrix Compression for Super Large Scale Datasets
- COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos
- Convolutional Networks with Oriented 1D Kernels
- SurfsUP: Learning Fluid Simulation for Novel Surfaces
- How Far Pre-trained Models Are from Neural Collapse on the Target Dataset Informs their Transferability
- SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference
- Designing Phase Masks for Under-Display Cameras
- GePSAn: Generative Procedure Step Anticipation in Cooking Videos
- Online Clustered Codebook
- Source-free Depth for Object Pop-out
- Cross-Modal Translation and Alignment for Survival Analysis
- Partition Speeds Up Learning Implicit Neural Representations Based on Exponential-Increase Hypothesis
- Towards High-Quality Specular Highlight Removal by Leveraging Large-Scale Synthetic Data
- Corrupting Neuron Explanations of Deep Visual Features
- Editable Image Geometric Abstraction via Neural Primitive Assembly
- DETRs with Collaborative Hybrid Assignments Training
- D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
- OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
- Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment
- LightGlue: Local Feature Matching at Light Speed
- Minimal Solutions to Generalized Three-View Relative Pose Problem
- RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction
- Tangent Sampson Error: Fast Approximate Two-view Reprojection Error for Central Camera Models
- DCPB: Deformable Convolution Based on the Poincare Ball for Top-view Fisheye Cameras
- Improving Lens Flare Removal with General-Purpose Pipeline and Multiple Light Sources Recovery
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation
- EverLight: Indoor-Outdoor Editable HDR Lighting Estimation
- Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning
- Self-regulating Prompts: Foundational Model Adaptation without Forgetting
- MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions
- CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception
- Efficient Diffusion Training via Min-SNR Weighting Strategy
- Shape Analysis of Euclidean Curves under Frenet-Serret Framework
- Learning Shape Primitives via Implicit Convexity Regularization
- Exploring the Sim2Real Gap Using Digital Twins
- Compatibility of Fundamental Matrices for Complete Viewing Graphs
- Chordal Averaging on Flag Manifolds and Its Applications
- ContactGen: Generative Contact Modeling for Grasp Generation
- Parametric Classification for Generalized Category Discovery: A Baseline Study
- SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage
- Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation
- ENVIDR: Implicit Differentiable Renderer with Neural Environment Lighting
- Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models
- Beyond Single Path Integrated Gradients for Reliable Input Attribution via Randomized Path Sampling
- CaPhy: Capturing Physical Properties for Animatable Human Avatars
- RLIPv2: Fast Scaling of Relational Language-Image Pre-Training
- ICE-NeRF: Interactive Color Editing of NeRFs via Decomposition-Aware Weight Optimization
- Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge
- Breaking The Limits of Text-conditioned 3D Motion Synthesis with Elaborative Descriptions
- AutoAD II: The Sequel - Who, When, and What in Movie Audio Description
- Bring Clipart to Life
- Few-Shot Dataset Distillation via Translative Pre-Training
- Controllable Visual-Tactile Synthesis
- MolGrapher: Graph-based Visual Recognition of Chemical Structures
- The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
- Towards Zero Domain Gap: A Comprehensive Study of Realistic LiDAR Simulation for Autonomy Testing
- Learning Hierarchical Features with Joint Latent Space Energy-Based Prior
- Regularized Primitive Graph Learning for Unified Vector Mapping
- Saliency Regularization for Self-Training with Partial Annotations
- Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction
- ViM: Vision Middleware for Unified Downstream Transferring
- RANA: Relightable Articulated Neural Avatars
- Surface Extraction from Neural Unsigned Distance Fields
- Towards Nonlinear-Motion-Aware and Occlusion-Robust Rolling Shutter Correction
- Perpetual Humanoid Control for Real-time Simulated Avatars
- Efficient Deep Space Filling Curve
- Pixel-Wise Contrastive Distillation
- FACTS: First Amplify Correlations and Then Slice to Discover Bias
- Pairwise Similarity Learning is SimPLE
- PanFlowNet: A Flow-Based Deep Network for Pan-Sharpening
- Visual Explanations via Iterated Integrated Attributions
- Make-It-3D: High-fidelity 3D Creation from A Single Image with Diffusion Prior
- Fully Attentional Networks with Self-emerging Token Labeling
- DMNet: Delaunay Meshing Network for 3D Shape Representation
- Eulerian Single-Photon Vision
- Adaptive Calibrator Ensemble: Navigating Test Set Difficulty in Out-of-Distribution Scenarios
- Sequential Texts Driven Cohesive Motions Synthesis with Natural Transitions
- Locomotion-Action-Manipulation: Synthesizing Human-Scene Interactions in Complex 3D Environments
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference
- D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field
- P1AC: Revisiting Absolute Pose From a Single Affine Correspondence
- Mining bias-target Alignment from Voronoi Cells
- Exploring the Benefits of Visual Prompting in Differential Privacy
- The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data
- End2End Multi-View Feature Matching with Differentiable Pose Optimization
- Task Agnostic Restoration of Natural Video Dynamics
- GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning
- Scene Graph Contrastive Learning for Embodied Navigation
- Mastering Spatial Graph Prediction of Road Networks
- ETran: Energy-Based Transferability Estimation
- Inverse Problem Regularization with Hierarchical Variational Autoencoders
- A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance
- Learning a Room with the Occ-SDF Hybrid: Signed Distance Function Mingled with Occupancy Aids Scene Representation
- Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents
- Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles
- Deep Geometry-Aware Camera Self-Calibration from Video
- Feature Proliferation -- the "Cancer" in StyleGAN and its Treatments
- Adaptive Testing of Computer Vision Models
- Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models
- Time-to-Contact Map by Joint Estimation of Up-to-Scale Inverse Depth and Global Motion using a Single Event Camera
- Cross-modal Latent Space Alignment for Image to Avatar Translation
- Learning to Transform for Generalizable Instance-wise Invariance
- DeePoint: Visual Pointing Recognition and Direction Estimation
- Sigmoid Loss for Language Image Pre-Training
- Tracking by 3D Model Estimation of Unknown Objects in Videos
- Physically-Plausible Illumination Distribution Estimation
- Exploiting Proximity-Aware Tasks for Embodied Social Navigation
- SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training
- Studying How to Efficiently and Effectively Guide Models with Explanations
- NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions<br>:star:code
- DECO: Dense Estimation of 3D Human-Scene Contact In The Wild<br>:house:project
- PHRIT: Parametric Hand Representation with Implicit Template
- Improving Unsupervised Visual Program Inference with Code Rewriting Families<br>:star:code
- Generating Visual Scenes from Touch<br>:star:code
- LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning<br>:star:code
- Automatic Animation of Hair Blowing in Still Portrait Photos<br>:star:code
- Contrastive Pseudo Learning for Open-World DeepFake Attribution
- Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
- Segmentation of Tubular Structures Using Iterative Training with Tailored Samples
- Active Stereo Without Pattern Projector<br>:star:code<br>:star:code
- TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance<br>:house:project
- Tree-Structured Shading Decomposition<br>:house:project
- Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?<br>:star:code
- Beyond Skin Tone: A Multidimensional Measure of Apparent Skin Color
- Multi3DRefer: Grounding Text Description to Multiple 3D Objects
- Panoramas from Photons<br>:house:project
- SimNP: Learning Self-Similarity Priors Between Neural Points
- Multi-label affordance mapping from egocentric vision
- SoDaCam: Software-defined Cameras via Single-Photon Imaging<br>:house:project
- PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction
- Active Neural Mapping<br>:star:code
- Reconstructing Groups of People with Hypergraph Relational Reasoning<br>:star:code
- Learning to Upsample by Learning to Sample<br>:star:code
- Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond
- Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers
- 4D Myocardium Reconstruction with Decoupled Motion and Shape Model
- Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection<br>:star:code
- LDL: Line Distance Functions for Panoramic Localization<br>:star:code
- Generalized Lightness Adaptation with Channel Selective Normalization<br>:star:code<br>:star:code
- MST-compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree
- Late Stopping: Avoiding Confidently Learning from Mislabeled Examples
- Motion-Guided Masking for Spatiotemporal Representation Learning
- Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects<br>:star:code
- SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets<br>:star:code
- DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration<br>:star:code
- RankMixup: Ranking-Based Mixup Training for Network Calibration
- ACLS: Adaptive and Conditional Label Smoothing for Network Calibration
- SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation<br>:star:code
- Learning a More Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection<br>:star:code
- CAME: Contrastive Automated Model Evaluation
- LDP-Feat: Image Features with Local Differential Privacy
- DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment
- Diffusion Model as Representation Learner<br>:star:code
- MetaGCD: Learning to Continually Learn in Generalized Category Discovery
- Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation<br>:star:code
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting<br>:star:code
- Robust Mixture-of-Expert Training for Convolutional Neural Networks<br>:star:code
- Single Image Reflection Separation via Component Synergy<br>:star:code
- Partition-and-Debias: Agnostic Biases Mitigation via A Mixture of Biases-Specific Experts<br>:star:code
- On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion<br>:star:code
- Understanding Self-attention Mechanism via Dynamical System Perspective
- A Theory of Topological Derivatives for Inverse Rendering of Geometry<br>:star:code
- X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events<br>:house:project
- ClothesNet: An Information-Rich 3D Garment Model Repository with Simulated Clothes Environment
- Leveraging Intrinsic Properties for Non-Rigid Garment Alignment<br>:star:code
- Event-Guided Procedure Planning from Instructional Videos with Text Supervision
- Spatially and Spectrally Consistent Deep Functional Maps<br>:star:code
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling<br>:star:code
- Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption<br>:star:code
- OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution<br>:star:code
- One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training<br>:star:code
- ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces<br>:star:code<br>:star:code
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation<br>:star:code
- PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects<br>:star:code<br>:house:project
- Boosting Multi-modal Model Performance with Adaptive Gradient Modulation<br>:star:code
- Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in Entropy Minimization
- Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training
- GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization
- Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation<br>:star:code
- TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models<br>:star:code
- 3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields<br>:star:code
- Improving Pixel-based MIM by Reducing Wasted Modeling Capability<br>:star:code
- Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy<br>:star:code
- Towards General Low-Light Raw Noise Synthesis and Modeling<br>:star:code
- Supervised Homography Learning with Realistic Dataset Generation<br>:star:code
- Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF<br>:house:project
- Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception
- Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network
- Rethinking Data Distillation: Do Not Overlook Calibration
- Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV<br>:star:code
- Towards Viewpoint-Invariant Visual Recognition via Adversarial Training
- Tuning Pre-trained Model via Moment Probing<br>:star:code
- Replay: Multi-modal Multi-view Acted Videos for Casual Holography
- TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
- Rethinking Mobile Block for Efficient Attention-based Models
- RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation<br>:star:code
- Cross-Domain Product Representation Learning for Rich-Content E-Commerce
- Interaction-aware Joint Attention Estimation Using People Attributes<br>:house:project
Click here for the 2020 paper classification summary
↘️CVPR-2020-Papers ↘️ECCV-2020-Papers
<a name="00"/>2021 年论文分类汇总戳这里
↘️ICCV-2021-Papers ↘️CVPR-2021-Papers
<a name="000"/>2022 年论文分类汇总戳这里
↘️CVPR-2022-Papers ↘️WACV-2022-Papers ↘️ECCV-2022-Papers