Awesome
ECCV-2024-Papers
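Official website: https://eccv.ecva.net/
Main conference :bell:: September 29 (Sunday) to October 4
For a categorized collection of survey papers from past years, see ↘️CV-Surveys (under construction)
Click here for the categorized 2025 paper collections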
↘️WACV-2025-Papers ↘️CVPR-2025-Papers
Click here for the categorized 2024 paper collections
↘️WACV-2024-Papers ↘️CVPR-2024-Papers ↘️ECCV-2024-Papers
Click here for the categorized 2022 paper collection
Click here for the categorized 2021 paper collection
Click here for the categorized 2020 paper collection
💥💥💥All papers have been categorized
<br>:thumbsup:ECCV 2024 awards announced: Columbia University wins the Best Paper Award
🏆Best Paper Award
🏅Best Paper Honorable Mention
- Rasterized Edge Gradients: Handling Discontinuities Differentiably
- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models<br>:house:project
Contents
58.All-in-One
- X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning<br>:star:code
57.Visual Relationship Detection
- Visual Relationship Transformation
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
56.Dense Prediction
- Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild<br>:star:[code](https://github.com/GitGyun/chameleon) (dense visual prediction)
- Unsupervised Dense Prediction using Differentiable Normalized Cuts
- Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks
- Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining<br>:star:code
55.Information Security
- Copyright Protection
- Image Watermarking
- Certifiably Robust Image Watermark<br>:star:code
- A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks (image watermarking)
- Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data<br>:star:code
- A Watermark-Conditioned Diffusion Model for IP Protection<br>:star:code
- A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
- LaWa: Using Latent Space for In-Generation Image Watermarking
54.Deepfake Detection
- Real Appearance Modeling for More General Deepfake Detection
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities<br>:star:code
- Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection
- Common Sense Reasoning for Deep Fake Detection<br>:star:code
- Image Forgery Detection and Localization
- Document Image Tampering Detection
- Synthetic Image Detection
53.Keypoint Detection
- OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection<br>:star:code
- KeypointDETR: An End-to-End 3D Keypoint Detector<br>:star:code
52.Visual Entity Recognition
<a name="51"/>51.Feature Matching
<a name="50"/>50.Sketches
<a name="49"/>49.Light-Field
<a name="48"/>48.Computer Graphics
- High Dynamic Range Imaging
47.Animal
- Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos<br>:house:project
- Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos<br>:house:project (3D animal motion)
- Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification<br>:star:code
46.Rendering
- City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web<br>:star:code<br>:house:project
- A Probability-guided Sampler for Neural Implicit Surface Rendering<br>:house:project (rendering)
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering<br>:house:project
- AnyLens: A Generative Diffusion Model with Any Rendering Lens<br>:house:[project](https://anylens-diffusion.github.io/)
- CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians<br>:star:code<br>:house:project
- METACAP: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering<br>:house:project
- GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views
- MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References<br>:star:code
- Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors<br>:star:code
- CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering<br>:house:project
- IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination<br>:star:code (rendering)
- Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering<br>:house:project
- VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting (neural rendering)
- UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation<br>:star:code
- GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering<br>:star:code (scene rendering)
- GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer<br>:star:code
- Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering
45.Neural Radiance Fields
- Invertible Neural Warp for NeRF<br>:star:code
- VF-NeRF: Viewshed Fields for Rigid NeRF Registration
- NeRF-XL: NeRF at Any Scale with Multi-GPU<br>:house:project
- Regularizing Dynamic Radiance Fields with Kinematic Fields
- KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter<br>:star:code
- Dynamic Neural Radiance Field From Defocused Monocular Video
- Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering<br>:house:project
- Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model<br>:house:project
- GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields<br>:star:code<br>:house:project
- Efficient NeRF Optimization - Not All Samples Remain Equally Hard
- MeshFeat: Multi-Resolution Features for Neural Fields on Meshes<br>:house:project
- DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images<br>:house:project
- TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks<br>:star:code
- BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream<br>:star:code
- TriNeRFLet: A Wavelet Based Multiscale Triplane NeRF Representation<br>:house:project
- RS-NeRF: Neural Radiance Fields from Rolling Shutter Images<br>:star:code
- Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling<br>:star:code<br>:house:project
- RaFE: Generative Radiance Fields Restoration<br>:house:project
- Few-shot NeRF by Adaptive Rendering Loss Regularization<br>:star:code
- Depth-guided NeRF Training via Earth Mover’s Distance
- DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields<br>:star:code
- Flowed Time of Flight Radiance Fields
- Volumetric Rendering with Baked Quadrature Fields
- Taming Latent Diffusion Model for Neural Radiance Field Inpainting<br>:house:project
- Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation<br>:house:project<br>🤗huggingface
- SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields<br>:house:project
- FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information<br>:star:code
- DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes (NeRF)
- Single-Mask Inpainting for Voxel-based Neural Radiance Fields
- Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization<br>:star:code
- Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering<br>:house:project
- Physically Plausible Color Correction for Neural Radiance Fields
- Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions (NeRF)
- PointNeRF++: A multi-scale, point-based Neural Radiance Field<br>:house:project
- Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
- High-Fidelity and Transferable NeRF Editing by Frequency Decomposition<br>:house:project
- Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction<br>:house:project
- G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields<br>:house:project
- Novel View Synthesis
- PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis<br>:house:project
- RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields
- Structured-NeRF: Hierarchical Scene Graph with Neural Representation
- URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
- A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis<br>:star:code<br>:house:project
- High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs<br>:star:code
- Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization<br>:star:code
- NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image<br>:star:code
- FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting<br>:star:code
- Fast View Synthesis of Casual Videos with Soup-of-Planes<br>:house:project
- CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians<br>:house:project
- MegaScenes: Scene-Level View Synthesis at Scale<br>:star:code
- Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis<br>:star:code (view synthesis)
- NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis
- Efficient Depth-Guided Urban View Synthesis<br>:star:code
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis<br>:star:code
- Generalizable Human Gaussians for Sparse View Synthesis<br>:house:project
- Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis<br>:star:code
44.Dataset/Benchmark
- FYI: Flip Your Images for Dataset Distillation
- Neural Spectral Decomposition for Dataset Distillation<br>:star:code
- Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching<br>:star:code
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation<br>:star:code
- COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
- Benchmarks
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models<br>:star:code
- DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition<br>:star:code
- Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter<br>:star:code
- MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes
- BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events<br>:house:project
- SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks<br>:star:code
- A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis<br>:star:code
- BAFFLE: A Baseline of Backpropagation-Free Federated Learning<br>:star:code
- Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
- Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations<br>:house:project
- UniIR: Training and Benchmarking Universal Multimodal Information Retrievers<br>:house:project
- HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis
- OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding<br>:house:project
- PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines<br>:star:code
- Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach<br>:star:code
- R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations<br>:star:code
- m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks<br>:star:code<br>🤗huggingface
- PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology<br>🤗huggingface
- LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow<br>:house:project
- HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects<br>:star:code
- When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset<br>:star:code
- Datasets
- VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models<br>:star:code
- HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning<br>:star:code
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark<br>:sunflower:dataset
- Seeing Faces in Things: A Model and Dataset for Pareidolia<br>:sunflower:dataset
- Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice<br>:sunflower:dataset
- GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns<br>:house:project
- SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild<br>:sunflower:dataset
- WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing<br>:star:code
- BugNIST - a Large Volumetric Dataset for Detection under Domain Shift
- Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics<br>:star:code<br>:house:project (large-scale defect dataset)
- Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal<br>:star:code
- PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition<br>:star:code
- WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding<br>:star:code
- MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception
- SkyScenes: A Synthetic Dataset for Aerial Scene Understanding<br>:house:project
- Caltech Aerial RGB-Thermal Dataset in the Wild<br>:star:code
- V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception
- H-V2X: A Large Scale Highway Dataset for BEV Perception
- PetFace: A Large-Scale Dataset and Benchmark for Animal Identification<br>:star:code
- Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework
- OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects<br>:star:code
- SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark<br>:star:code<br>:house:project
- Insect Identification in the Wild: The AMI Dataset<br>:star:code
- RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception<br>:sunflower:dataset
- Data Augmentation
- SUMix: Mixup with Semantic and Uncertain Information<br>:star:code
- Data Augmentation via Latent Diffusion for Saliency Prediction
- FreeAugment: Data Augmentation Search Across All Degrees of Freedom<br>:star:code
- Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective<br>:star:code
43.Sound
- Audio-Synchronized Visual Animation<br>:star:code<br>:house:project
- Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation<br>:house:project
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity<br>:star:code
- Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
- Self-Supervised Audio-Visual Soundscape Stylization<br>:house:project
- CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios<br>:star:code (audio-visual scenarios)
- Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores
- Siamese Vision Transformers are Scalable Audio-visual Learners<br>:star:code (audio-visual learners)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos<br>:house:project (ambient-aware action sound generation)
- Audio-visual Generalized Zero-shot Learning the Easy Way
- Audio-Visual Segmentation
42.Optical Flow Estimation
<a name="41"/>41.Biomedical(Biometric Recognition)
<a name="40"/>40.Object Pose Estimation
- SCAPE: A Simple and Strong Category-Agnostic Pose Estimator<br>:star:code
- SRPose: Two-view Relative Pose Estimation with Sparse Keypoints<br>:house:project
- FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation<br>:star:code
- A Graph-Based Approach for Category-Agnostic Pose Estimation<br>:house:project
- GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence
- OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation<br>:star:code
- FoundPose: Unseen Object Pose Estimation with Foundation Features<br>:house:project
- LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation<br>:star:code
- U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation
- PACE: Pose Annotations in Cluttered Environments<br>:star:code
- 6-DoF
- An Economic Framework for 6-DoF Grasp Detection<br>:star:code
- Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation
- Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance<br>:star:code
- Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation<br>:star:code
- 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model<br>:star:code
- FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models<br>:house:project
- Camera Pose Estimation
- Counting
- AFreeCA: Annotation-Free Counting for All (counting)
- Zero-shot Object Counting with Good Exemplars
- ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting<br>:star:code<br>:house:project (counting)
- Class-Agnostic Object Counting with Text-to-Image Diffusion Model
- Shifted Autoencoders for Point Annotation Restoration in Object Counting
39.Robots
- See and Think: Embodied Agent in Virtual Environment<br>:house:project
- SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
- V-IRL: Grounding Virtual Intelligence in Real Life<br>:star:code
- Robots
- Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation<br>:house:project
- Learning Cross-hand Policies of High-DOF Reaching and Grasping (robots)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control<br>:star:code
- Real-time Holistic Robot Pose Estimation with Unknown States<br>:star:code
- ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation<br>:star:code<br>:house:project
- Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
- GraspXL: Generating Grasping Motions for Diverse Objects at Scale<br>:star:code<br>:house:project
- UGG: Unified Generative Grasping<br>:house:project (robots)
- Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation<br>:star:code
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation<br>:house:project (robots)
- Navigation
- VPR
- Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition<br>:star:code
- Navigation Instruction Generation with BEV Perception and Large Language Models<br>:star:code
- Revisit Anything: Visual Place Recognition via Image Segment Retrieval<br>:star:code
- VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition<br>:star:code
- MeshVPR: Citywide Visual Place Recognition Using 3D Meshes<br>:star:code
- SLAM
- Deep Patch Visual SLAM<br>:star:code
- RGBD GS-ICP SLAM<br>:star:code
- I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM
- Hyperion - A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM<br>:star:code
- SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
- LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System
- Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM
- Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM
- CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field<br>:star:code
- Try-On
- Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models
- Improving Virtual Try-On with Garment-focused Diffusion Models<br>:star:code
- Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment<br>:star:code<br>:house:project
- Improving Diffusion Models for Authentic Virtual Try-on in the Wild<br>:star:code
- D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On<br>:star:code
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models<br>:star:code
- Cross-view Geo-localization
- GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers<br>:star:code
- Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network<br>:star:code
- ConGeo: Robust Cross-view Geo-localization across Ground View Variations<br>:star:code<br>:house:project (cross-view geo-localization)
- Benchmarking the Robustness of Cross-view Geo-localization Models
- CityGuessr: City-Level Video Geo-Localization on a Global Scale
- Geo-localization
- Avatars
- CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images<br>:star:code
- RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models<br>:star:code
- MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos<br>:star:code
- PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations<br>:house:project
- iHuman: Instant Animatable Digital Humans From Monocular Videos
- PAV: Personalized Head Avatar from Unstructured Video Collection<br>:house:project
- Disentangled Clothed Avatar Generation from Text Descriptions<br>:house:project (clothed avatar generation)
- MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space<br>:house:project
- 3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views
- FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis<br>:star:code (3D human digitization)
- Instant 3D Human Avatar Generation using Image Diffusion Models<br>:house:project
- Let the Avatar Talk using Texts without Paired Training Data
- VR
38.Human-Object Interaction
- Controllable Human-Object Interaction Synthesis<br>:house:project
- F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
- Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition<br>:star:code
- Look Hear: Gaze Prediction for Speech-directed Human Attention<br>:star:code
- Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model<br>:star:code
- Revisit Human-Scene Interaction via Space Occupancy<br>:house:project (human-scene interaction)
- Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection<br>:star:code
- AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
- Hand-Object
- NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
- Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics<br>:star:code
- Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?<br>:star:code
- Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image
37.Style Transfer
<a name="36"/>36.Gaze Estimation
- De-confounded Gaze Estimation
- 3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views<br>:star:code
- LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation
- Gaze Target Detection Based on Head-Local-Global Coordination
35.Action Detection
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning<br>:star:code<br>:house:project
- ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
- Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition
- Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation (motion keyframe interpolation)
- Skeleton-Based Action Recognition
- SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders<br>:star:code
- Towards Physical World Backdoor Attacks against Skeleton Action Recognition<br>:house:project
- S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition<br>:house:project
- Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition<br>:star:code
- CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
- Few-Shot Action Recognition
- Temporal Action Detection
- Temporal Action Localization
- HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization<br>:star:code
- Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization
- Online Temporal Action Localization with Memory-Augmented Transformer<br>:house:project
- Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization
- Temporal Action Segmentation
- Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment<br>:star:code
- Two-Stage Active Learning for Efficient Temporal Action Segmentation
- Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation<br>:star:code
- Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs<br>:star:code
- Action Quality Assessment
- Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment<br>:star:code
- RICA^2: Rubric-Informed, Calibrated Assessment of Actions<br>:house:project
- Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment (action quality assessment)
- MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment<br>:star:code
- Action Prediction
- Action Recognition
- Referring Atomic Video Action Recognition<br>:star:code
- DEAR: Depth-Enhanced Action Recognition
- Bayesian Evidential Deep Learning for Online Action Detection
- C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition<br>:star:code
- Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
- Classification Matters: Improving Video Action Detection with Class-Specific Attention
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition<br>:house:project
- Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast
- Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition<br>:house:project
- On the Utility of 3D Hand Poses for Action Recognition<br>:house:project
- POET: Prompt Offset Tuning for Continual Human Action Adaptation<br>:star:code
- Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective<br>:star:code
- Leveraging temporal contextualization for video action recognition<br>:star:code
- Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition
- SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition<br>:house:project
- Action Understanding
- Group Activity Recognition
- Epileptic Seizure Detection
34.Visual Question Answering
- DriveLM: Driving with Graph Visual Question Answering<br>:star:code
- Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
- WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering<br>:star:code
- GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge<br>:star:code
- Compositional Substitutivity of Visual Reasoning for Visual Question Answering<br>:star:code
- Fully Authentic Visual Question Answering Dataset from Online Communities<br>:house:project
- An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought
- Audio-Video Question Answering
- Video Question Answering
- Video Question Answering with Procedural Programs<br>:house:project
- ViLA: Efficient Video-Language Alignment for Video Question Answering<br>:star:code
- TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning (VQA)
- AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering<br>:star:code
- Audio-Visual Question Answering
33.Motion Generation(Human Motion Generation)
- Event-Based Motion Magnification<br>:star:code
- Learning-based Axial Video Motion Magnification<br>:house:project
- SMooDi: Stylized Motion Diffusion Model<br>:star:code
- Length-Aware Motion Synthesis via Latent Diffusion<br>:star:code
- HUMOS: Human Motion Model Conditioned on Body Shape<br>:star:code
- HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance<br>:star:code
- Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs<br>:house:project
- Generating Human Interaction Motions in Scenes with Text Control<br>:house:project (motion generation)
- Motion Mamba: Efficient and Long Sequence Motion Generation<br>:star:code<br>:house:project
- Large Motion Model for Unified Multi-Modal Motion Generation<br>:house:project
- EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation<br>:star:code<br>:house:project
- Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases<br>:house:project (human motion)
- TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos<br>:house:project (human motion)
- Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild (human motion)
- FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
- MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model<br>:star:code
- Realistic Human Motion Generation with Cross-Diffusion Models<br>:house:project (human motion)
- CoMo: Controllable Motion Generation through Language Guided Pose Code Editing<br>:house:project (controllable motion generation)
- TLControl: Trajectory and Language Control for Human Motion Synthesis<br>:house:project (human motion synthesis)
- Retrieval Robust to Object Motion Blur<br>:star:[code](https://github.com/Rong-Zou/Retrieval-Robust-to-Object-Motion-Blur)
- 3D Human Motion Synthesis
- Text-to-Motion Synthesis
- FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis
- Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation<br>:star:code
- Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation<br>:house:project
- ParCo: Part-Coordinating Text-to-Motion Synthesis<br>:star:code
- Human Motion Prediction
- Human Motion Estimation
- Motion Estimation
- Dance Generation
- Behavior Generation
- Motion Transfer
- Motion Prediction
32.Person Re-Identification
- Human-in-the-Loop Visual Re-ID for Population Size Estimation<br>:star:code
- Pedestrian Re-Identification
- Keypoint Promptable Re-Identification<br>:star:code
- Privacy-Preserving Adaptive Re-Identification without Image Transfer
- Rethinking Normalization Layers for Domain Generalizable Person Re-identification<br>:star:code
- Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification
- VI-ReID
- Person Search
- Gait Recognition
- Counting
31.Point Clouds
- SEED: A Simple and Effective 3D DETR in Point Clouds<br>:star:code
- PointLLM: Empowering Large Language Models to Understand Point Clouds<br>:star:code<br>:house:project
- TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds
- Learning to Adapt SAM for Segmenting Cross-domain Point Clouds
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
- milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing<br>:star:code
- Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement
- Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes<br>:star:code
- T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning<br>:star:code
- Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds<br>:star:code
- PFGS: High Fidelity Point Cloud Rendering via Feature Splatting<br>:star:code
- Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning<br>:star:code
- To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning
- Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing<br>:star:code
- FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation<br>:star:code
- Point Cloud Generation
- RangeLDM: Fast Realistic LiDAR Point Cloud Generation<br>:star:code
- Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer<br>:star:code
- Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation<br>:house:project
- FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation<br>:house:project
- Point Cloud Completion
- Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion<br>:star:code
- T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy<br>:star:code
- AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion
- EINet: Point Cloud Completion via Extrapolation and Interpolation<br>:star:code
- Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach<br>:star:code
- ProtoComp: Diverse Point Cloud Completion with Controllable Prototype<br>:star:code
- Point Cloud Reconstruction
- Point Cloud Understanding
- Point Cloud Registration
- ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency<br>:star:code
- PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training<br>:star:code
- SemReg: Semantics Constrained Point Cloud Registration<br>:star:code
- Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning<br>:house:project
- UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration<br>:star:code
- PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration<br>:star:code
- Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration (point cloud registration)
- Point Cloud Segmentation
- Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation
- HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation<br>:star:code
- SegPoint: Segment Any Point Cloud via Large Language Model<br>:star:code
- Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation
- Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation<br>:star:code
- Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation<br>:star:code
- 3D Point Clouds
- Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds<br>:star:code
- CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation<br>:star:code
- FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds
- RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
- P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising<br>:star:code
- Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
- Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack (3D point cloud attack)
- Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis<br>:star:code
- Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation
30.Anomaly Detection
- Continuous Memory Representation for Anomaly Detection<br>:star:code
- Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection<br>:star:code
- Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation<br>:star:code
- GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features<br>:star:code
- Learning Diffusion Models for Multi-View Anomaly Detection
- Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection<br>:star:code
- TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection<br>:star:code
- Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions<br>:star:code
- MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection<br>:star:code
- Defect Detection
- Fault Detection
- 3D Anomaly Detection
- Industrial Anomaly Detection
- Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
- A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization<br>:star:code
- GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection<br>:star:code
- AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset
- Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt
- Zero-Shot Anomaly Detection
- Multi-Class Anomaly Detection
- OOD
- Gradient-Regularized Out-of-Distribution Detection
- SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
- PixOOD: Pixel-Level Out-of-Distribution Detection<br>:star:code
- An Information Theoretical View for Out-Of-Distribution Detection
- Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection
- LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models<br>:star:code
- ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection<br>:star:code
- Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond<br>:star:code
- Can Your Generative Model Detect Out-of-Distribution Covariate Shift?
- Gradient-based Out-of-Distribution Detection
- Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection
- TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection<br>:star:code
- Outlier Detection
- Zero-Shot Anomaly Segmentation
29.Semi/self-supervised learning
- SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers<br>:house:project
- Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning<br>:star:code
- Self-Supervised
- CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning<br>:star:code
- HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion<br>:star:code
- SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning<br>:star:code
- Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing
- OmniSat: Self-Supervised Modality Fusion for Earth Observation<br>:star:code<br>:house:project<br>:sunflower:dataset
- FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning
- Self-supervised visual learning from interactions with objects
- Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense
- GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning<br>:star:code
- On Pretraining Data Diversity for Self-Supervised Learning<br>:star:code
- Decoupling Common and Unique Representations for Multimodal Self-supervised Learning<br>:star:code
- POA: Pre-training Once for Models of All Sizes<br>:star:code
- ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders (self-supervised representation learning)
- Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization<br>:house:project (self-supervised learning)
- SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning<br>:star:code
- Semi-Supervised
- Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning
- Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data<br>:star:code
- SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning<br>:star:code
- ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples
- Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch (semi-supervised learning)
- Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning<br>:star:code
- Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration<br>:star:code
28.Novel Class Discovery
<a name="27"/>27.GNN/GCN
- GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition<br>:star:code (GNN)
- Graph Neural Network Causal Explanation via Neural Causal Models<br>:star:code
- On the Topology Awareness and Generalization Performance of Graph Neural Networks
- Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks
26.NAS
- Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search<br>:star:code
- Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search<br>:star:code
- SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
- Dependency-aware Differentiable Neural Architecture Search
25.MC/KD/Pruning(Model Compression/Knowledge Distillation/Pruning)
- DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
- Model Compression
- Pruning
- Non-transferable Pruning
- Straightforward Layer-wise Pruning for More Efficient Visual Adaptation
- Isomorphic Pruning for Vision Models<br>:star:code
- LPViT: Low-Power Semi-structured Pruning for Vision Transformers
- PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference<br>:star:code
- Enhanced Sparsification via Stimulative Training<br>:star:code
- SNP: Structured Neuron-level Pruning to Preserve Attention Scores<br>:star:code
- Quantization
- GenQ: Quantization in Low Data Regimes with Generative Synthetic Data<br>:star:code
- MetaAug: Meta-Data Augmentation for Post-Training Quantization
- Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients
- CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs<br>:star:code
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer<br>:star:code
- POCA: Post-training Quantization with Temporal Alignment for Codec Avatars<br>:house:project (quantization)
- KD
- Simple Unsupervised Knowledge Distillation With Space Similarity (knowledge distillation)
- Direct Distillation between Different Domains (KD)
- Harmonizing knowledge Transfer in Neural Network with Unified Distillation
- Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
- The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
- Improving Knowledge Distillation via Regularizing Feature Direction and Norm
- Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap (distillation)
- Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation<br>:star:code
- UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation<br>:star:code
- BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation
- Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation
- How to Train the Teacher Model for Effective Knowledge Distillation
- Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable
24.Vision Transformer
- Spline-based Transformers
- Denoising Vision Transformers
- FairViT: Fair Vision Transformer via Adaptive Masking
- Rotary Position Embedding for Vision Transformer<br>:star:code
- Bidirectional Progressive Transformer for Interaction Intention Anticipation
- Robustness Tokens: Towards Adversarial Robustness of Transformers
- SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization<br>:star:code
- PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
- OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction<br>:star:code
- AugDETR: Improving Multi-scale Learning for Detection Transformer (Transformer)
- AttnZero: Efficient Attention Discovery for Vision Transformers<br>:star:code
- SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding<br>:star:code
- Efficient Vision Transformers with Partial Attention
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers<br>:star:code
- Stitched ViTs are Flexible Vision Backbones<br>:star:code
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
- Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer<br>:star:code
- GiT: Towards Generalist Vision Transformer through Universal Language Interface<br>:star:code
- An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers
- Fairness-aware Vision Transformer via Debiased Self-Attention<br>:star:code
- ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention<br>:star:code
- LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors<br>:house:project
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach<br>:house:project
- LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer<br>:star:code
- Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators<br>:star:code
- BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos<br>:star:code
- An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding<br>:star:code
23.Machine Learning
- Learning to Unlearn for Robust Machine Unlearning
- Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images<br>:star:code (machine learning)
- Machine Unlearning
- Adversarial
- Improving Adversarial Transferability via Model Alignment<br>:star:code
- Event Trojan: Asynchronous Event-based Backdoor Attacks<br>:star:code
- Data Poisoning Quantization Backdoor Attack
- Flatness-aware Sequential Learning Generates Resilient Backdoors
- WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning<br>:star:code
- Cocktail Universal Adversarial Attack on Deep Neural Networks
- TrojVLM: Backdoor Attack Against Vision Language Models
- CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing
- Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks<br>:star:code
- Self-Supervised Representation Learning for Adversarial Attack Detection
- Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment
- CLIP-Guided Networks for Transferable Targeted Attacks
- CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks
- Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data
- UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening<br>:star:code
- Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks (black-box)
- Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection<br>:star:code
- AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models<br>:star:code
- Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks
- DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks<br>:star:code
- Continual Learning
- CLEO: Continual Learning of Evolving Ontologies
- One-stage Prompt-based Continual Learning
- Exemplar-free Continual Representation Learning via Learnable Drift Compensation<br>:star:code
- Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning<br>:star:code
- Semantic Residual Prompts for Continual Learning<br>:star:code
- Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning<br>:star:code
- RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning<br>:star:code
- PromptFusion: Decoupling Stability and Plasticity for Continual Learning<br>:star:code
- Information Bottleneck Based Data Correction in Continual Learning
- Revisiting Supervision for Continual Representation Learning<br>:star:code (continual)
- Anytime Continual Learning for Open Vocabulary Classification<br>:star:code
- MagMax: Leveraging Model Merging for Seamless Continual Learning
- Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning
- Transfer Learning
- Active Learning
- Dataset Quantization with Active Learning based Adaptive Sampling
- Generalized Coverage for More Robust Low-Budget Active Learning
- Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding (active learning)
- Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation (active learning)
- Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling
- Reinforcement Learning
- Reinforcement Learning Meets Visual Odometry
- Large-scale Reinforcement Learning for Diffusion Models
- Reinforcement Learning via Auxiliary Task Distillation
- Reinforcement Learning Friendly Vision-Language Model for Minecraft<br>:star:code
- Multimodal Label Relevance Ranking via Reinforcement Learning<br>:star:code
- Enhancing Diffusion Models with Text-Encoder Reinforcement Learning<br>:star:code
- Diffusion Models as Optimizers for Efficient Planning in Offline RL<br>:star:code
- Unified Local-Cloud Decision-Making via Reinforcement Learning<br>:house:project (reinforcement learning)
- Federated Learning
- Towards Multi-modal Transformers in Federated Learning<br>:star:code
- FedHide: Federated Learning by Hiding in the Neighbors
- FedHARM: Harmonizing Model Architectural Diversity in Federated Learning<br>:star:code
- FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning
- Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents<br>:star:code
- PFedEdit: Personalized Federated Learning via Automated Model Editing<br>:star:code
- Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection
- Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning<br>:star:code
- Federated Learning with Local Openset Noisy Labels<br>:star:code
- SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks<br>:star:code
- Contrastive Learning
- FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning<br>:star:code
- Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- Contrastive Learning with Synthetic Positives (contrastive learning)
- Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
- Adaptive Multi-head Contrastive Learning
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts<br>:star:code (contrastive learning)
- Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning<br>:star:code
- Class-Incremental
- Rethinking Few-shot Class-incremental Learning: Learning from Yourself<br>:star:code
- Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt
- Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion<br>:star:code
- Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning<br>:star:code
- Confidence Self-Calibration for Multi-Label Class-Incremental Learning<br>:star:[code](https://github.com/Kaile-Du/CSC)
- Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning<br>:star:code
- Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching
- PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning<br>:star:code
- CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning<br>:star:code
- Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration<br>:star:code
- On the Approximation Risk of Few-Shot Class-Incremental Learning<br>:star:code
- iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning<br>:star:code
- DiffClass: Diffusion-Based Class Incremental Learning
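- In-Context Learning
- Multi-Task Learning
- Multiple Instance Learning
- Multimodal Learning
22.Few/Zero-Shot Learning/DG/A(Few/Zero-Shot Learning/Domain Generalization/Domain Adaptation)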
- Source-Free Domain-Invariant Performance Prediction
- The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning<br>:star:code
- DG
- Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision<br>:star:code
- Feature Diversification and Adaptation for Federated Domain Generalization
- Soft Prompt Generation for Domain Generalization<br>:star:code
- Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization
- Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains<br>:star:code
- Improving Zero-Shot Generalization for CLIP with Variational Adapter
- Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization<br>:star:code
- Local and Global Flatness for Federated Domain Generalization<br>:star:code
- Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization
- Disentangling Masked Autoencoders for Unsupervised Domain Generalization<br>:star:code
- DA
- Training-Free Model Merging for Multi-target Domain Adaptation<br>:star:code
- MC-PanDA: Mask Confidence for Panoptic Domain Adaptation<br>:star:code
- Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data<br>:house:project
- De-Confusing Pseudo-Labels in Source-Free Domain Adaptation
- Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning
- Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift<br>:star:code
- HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation
- Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation
- Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence<br>:star:code
- Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach
- UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework<br>:star:code
- Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation
- Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation
- Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring<br>:house:project
- CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning<br>:star:code
- COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
- Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception<br>:star:code
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception
- Zero-Shot
21.Vision-Language
- Sapiens: Foundation for Human Vision Models
- Conceptual Codebook Learning for Vision-Language Models
- DEAL: Disentangle and Localize Concept-level Explanations for VLMs
- FlexAttention for Efficient High-Resolution Vision-Language Models<br>:house:project
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models<br>:house:project
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback<br>:house:project
- GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
- Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning<br>:star:code
- Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory<br>:star:code
- Cascade Prompt Learning for Vision-Language Model Adaptation<br>:star:code
- The Hard Positive Truth about Vision-Language Compositionality
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning<br>:star:code
- Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models<br>:star:code
- Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference<br>:star:code
- FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance<br>:house:project
- Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models<br>:star:code
- GalLoP: Learning Global and Local Prompts for Vision-Language Models
- Quantized Prompt for Efficient Generalization of Vision-Language Models<br>:star:code
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization<br>:star:code<br>:thumbsup:AddressCLIP: street-level localization from a single image, an end-to-end large model for image geo-localization
- SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training<br>:house:project
- Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models<br>:house:project
- Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
- Take A Step Back: Rethinking the Two Stages in Visual Reasoning<br>:star:code
- HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning<br>:star:code (visual reasoning)
- Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models<br>:star:code
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models<br>:star:code
- Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples<br>:star:code
- Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models<br>:star:code
- SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models<br>:star:code
- Robust Calibration of Large Vision-Language Adapters<br>:star:code
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models<br>:star:code
- CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
- MyVLM: Personalizing VLMs for User-Specific Queries<br>:star:code
- BRAVE: Broadening the visual encoding of vision-language models<br>:house:project
- IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions<br>:star:code<br>:house:project
- The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?<br>:star:code
- Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models<br>:star:code
- Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models<br>:star:code
- Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
- uCAP: An Unsupervised Prompting Method for Vision-Language Models
- Training A Small Emotional Vision Language Model for Visual Art Comprehension<br>:star:code
- Understanding Multi-compositional learning in Vision and Language models via Category Theory<br>:star:code
- Adversarial Prompt Tuning for Vision-Language Models<br>:star:code
- Language-Image Pre-training with Long Captions<br>:star:code
- CoReS: Orchestrating the Dance of Reasoning and Segmentation<br>:star:code<br>:house:project
- Attention Prompting on Image for Large Vision-Language Models<br>:star:code
- SILC: Improving Vision Language Pretraining with Self-Distillation
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference<br>:star:code
- AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale<br>:star:code
- Video-Language
- VLN
- LLM
- BLINK: Multimodal Large Language Models Can See but Not Perceive<br>:house:project
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning<br>:star:code<br>:house:project
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs
- Merlin: Empowering Multimodal LLMs with Foresight Minds<br>:house:project
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models<br>:house:project
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?<br>:star:code
- UniCode: Learning a Unified Codebook for Multimodal Large Language Models
- When Do We Not Need Larger Vision Models?<br>:star:code
- ControlLLM: Augment Language Models with Tools by Searching on Graphs<br>:star:code
- Towards Open-Ended Visual Recognition with Large Language Models<br>:star:code
- SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models<br>:star:code
- ST-LLM: Large Language Models Are Effective Temporal Learners<br>:star:code
- Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions<br>:house:project
- How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs<br>:star:code
- BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models<br>:star:code
- MoAI: Mixture of All Intelligence for Large Language and Vision Models<br>:star:code<br>🤗huggingface
- Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs<br>:house:project
- LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images<br>:star:code
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models<br>:star:code
- LLMGA: Multimodal Large Language Model based Generation Assistant<br>:star:code<br>:house:project
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models<br>:star:code
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models<br>:star:code
- LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model<br>:house:project
- Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models<br>:star:code
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction<br>:star:code
- Making Large Language Models Better Planners with Reasoning-Decision Alignment
- Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos<br>:house:project
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting<br>:star:code
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM<br>:star:code
- GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator<br>:star:code
- Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs<br>:star:code
- Visual Localization
- Visual Grounding
- Visual Intention Understanding
- Referring Expression Comprehension
- Visual-Language Understanding
20.Scene
- LatentEditor: Text Driven Local Editing of 3D Scenes<br>:house:project
- RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting<br>:star:code<br>indoor scenes
- Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation<br>:star:code
- Compact 3D Scene Representation via Self-Organizing Gaussian Grids<br>:star:code
- CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting<br>:house:project
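- Scene Synthesis
- Pyramid Diffusion for Fine 3D Large Scene Generation<br>:star:code<br>:house:project<br>:thumbsup:Southwest Jiaotong University, the University of Leeds, and collaborators jointly propose the Pyramid Discrete Diffusion model (PDD), a coarse-to-fine strategy for 3D outdoor scene generation
- External Knowledge Enhanced 3D Scene Generation from Sketch<br>3D scene generation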
- SceneTeller: Language-to-3D Scene Generation<br>:star:code
- Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
- Gaussian Grouping: Segment and Edit Anything in 3D Scenes<br>:star:code
- EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion<br>:star:code
- AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration<br>indoor scene generation
- BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion<br>:house:project
- The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation<br>:star:code
- Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting<br>:house:project<br>scene synthesis and editing
- WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation<br>:star:code<br>driving scene generation
- Scene Understanding
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding<br>:house:project
- Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data<br>:star:code
- nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding
- R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding<br>:house:project
- Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation<br>:star:code
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding<br>:house:project
- MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders<br>:star:code<br>dense scene understanding
- Semantic Scene Completion
- Scene Graph Generation
- OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models<br>:star:code
- Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation<br>:star:code
- Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction<br>:star:code
- Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention<br>:star:code<br>:thumbsup:Pushing the boundaries of scene graph generation: OvSGTR achieves fully open-vocabulary scene graph generation
- A Fair Ranking and New Model for Panoptic Scene Graph Generation<br>:house:project
- Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation<br>:star:code
- Towards Scene Graph Anticipation<br>:star:code
19.UAV/Remote Sensing/Satellite Image
- Masked Angle-Aware Autoencoder for Remote Sensing Images<br>:star:code
- Radiance Field Learners As UAV First-Person Viewers
- Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views<br>:house:project<br>satellite views
- Probabilistic Image-Driven Traffic Modeling via Remote Sensing
- UAV First-Person Viewers Are Radiance Field Learners<br>:house:project
- MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection<br>:star:code
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
- Free-Viewpoint Video of Outdoor Sports Using a Drone
- Learning Representations of Satellite Images From Metadata Supervision(https://github.com/preligens-lab/satmip)<br>satellite images
- Multi-scale Cross Distillation for Object Detection in Aerial Images
- LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model<br>:star:code
- PDT: UAV Target Detection Dataset for Pests and Diseases Tree<br>:star:code
- Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery<br>🤗huggingface<br>remote sensing
- Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning<br>:star:code
18.Automated Driving
- Online Vectorized HD Map Construction using Geometry<br>:star:code
- MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty<br>:star:code
- HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras<br>:star:code
- Continuity Preserving Online CenterLine Graph Learning
- Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention<br>:star:code
- RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception<br>:star:code
- MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation
- Generative End-to-End Autonomous Driving<br>:star:code
- CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection<br>:star:code<br>driving
- FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving<br>:star:code
- Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks<br>:star:code<br>driving
- Gated Temporal Diffusion for Stochastic Long-Term Dense Anticipation<br>:star:code
- CarFormer: Self-Driving with Learned Object-Centric Representations<br>:star:code
- Image-to-Lidar Relational Distillation for Autonomous Driving Data
- Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model<br>:star:code
- LingoQA: Video Question Answering for Autonomous Driving<br>:star:code
- PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors<br>:star:code
- VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving
- TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving
- Learning to Drive via Asymmetric Self-Play<br>:house:project
- Embodied Understanding of Driving Scenarios<br>:star:code
- Early Anticipation of Driving Maneuvers<br>:house:project
- RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios<br>:house:project
- Event-Aided Time-To-Collision Estimation for Autonomous Driving<br>:house:project
- Dolphins: Multimodal Language Model for Driving<br>:house:project
- PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving<br>:star:code
- Asynchronous Large Language Model Enhanced Planner for Autonomous Driving<br>:star:code
- Neural Volumetric World Models for Autonomous Driving
- SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving<br>:star:code
- Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes<br>:star:code<br>autonomous driving
- SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic
- I Can't Believe It's Not Scene Flow!<br>:star:code<br>scene flow
- Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries<br>:house:project<br>traffic
- UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving<br>:star:code
- DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving<br>:house:project
- Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)<br>:house:project
- Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving<br>:star:code
- Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
- Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction<br>:star:code
- Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction<br>:star:code
- Trajectory Prediction
- Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction<br>:star:code
- NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
- CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion<br>human motion prediction
- Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving
- Progressive Pretext Task Learning for Human Trajectory Prediction<br>:star:code
- DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction
- VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions<br>:star:code
- Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation<br>:star:code
- Adaptive Human Trajectory Prediction via Latent Corridors<br>:house:project
- NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving<br>:star:code
- MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction<br>:star:code
- Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection
- Vehicle Trajectory Prediction
- Occupancy Prediction
- VEON: Vocabulary-Enhanced Occupancy Prediction
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving<br>:star:code
- OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving<br>:house:project
- Fully Sparse 3D Occupancy Prediction<br>:star:code
- Monocular Occupancy Prediction for Scalable Indoor Scenes<br>:star:code
- ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers<br>:star:code
- CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction<br>:star:code
- GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction<br>:star:code<br>3D semantic occupancy prediction
- Lane Detection
- Vehicle Surveillance
17.Video
- Stable Video Portraits<br>:house:project
- Text-Guided Video Masked Autoencoder
- Multi-Modal Video Dialog State Tracking in the Wild
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models<br>:star:code
- Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment
- E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation<br>:star:code
- Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
- Fast Encoding and Decoding for Implicit Video Representation<br>:star:code<br>:house:project
- DEVIAS: Learning Disentangled Video Representations of Action and Scene<br>:star:code
- VideoStudio: Generating Consistent-Content and Multi-Scene Videos<br>:house:project
- VAD
- Cross-Domain Learning for Video Anomaly Detection with Limited Supervision
- Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection<br>:star:code
- Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models<br>:star:code
- FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation<br>:star:code
- Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection<br>:star:code<br>video anomaly detection
- Video Summarization
- Video Understanding
- VideoMamba: Spatio-Temporal Selective State Space Model<br>:star:code
- VideoMamba: State Space Model for Efficient Video Understanding<br>:star:code
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos<br>:star:code
- Learning Video Context as Interleaved Multimodal Sequences<br>:star:code
- FunQA: Towards Surprising Video Comprehension<br>:house:project
- Vamos: Versatile Action Models for Video Understanding<br>:star:code<br>:house:project
- Towards Neuro-Symbolic Video Understanding<br>:star:code
- LongVLM: Efficient Long Video Understanding via Large Language Models<br>:star:code
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding<br>:star:code<br>:house:project
- Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent<br>🤗huggingface
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding<br>:star:code
- Text-Conditioned Resampler For Long Form Video Understanding
- Video Classification
- Video Parsing
- Video Frame Interpolation
- Video Class-Incremental Learning
- Video Copy Segment Localization
16.Medical Image Processing
- Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling<br>:star:code
- Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model
- Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging
- Multistain Pretraining for Slide Representation Learning in Pathology<br>:star:code
- Energy-induced Explicit quantification for Multi-modality MRI fusion<br>:star:code
- Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging<br>:star:code
- CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos<br>:star:code<br>cardiac disease assessment
- Knowledge-enhanced Visual-Language Pretraining for Computational Pathology<br>:star:code
- Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer<br>CT
- Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data<br>pathology image analysis
- Unified Medical Image Pre-training in Language-Guided Common Semantic Space
- Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction<br>:star:code
- Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation<br>:star:code<br>semi-supervised histopathology segmentation
- Histopathology Image Classification
- Whole Slide Image Classification
- DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification<br>:star:code
- Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification
- Snuffy: Efficient Whole Slide Image Classifier
- Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification<br>:star:code
- Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification<br>:star:code
- Medical Image Segmentation
- FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification<br>:house:project<br>:star:code
- PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation<br>:star:code
- The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation<br>:star:code
- Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction<br>:star:code
- Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation<br>:star:code
- AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking<br>:star:code
- Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation<br>:star:code
- I-MedSAM: Implicit Medical Image Segmentation with Segment Anything<br>:star:code
- VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network<br>:star:code<br>polyp segmentation
- Medical Image Registration
- Medical Report Generation
- X-ray
- Medical Robotics
- Biomedical Imaging
- CT
15.GAN/Image Synthesis
- Diffusion Models as Data Mining Tools<br>:star:code
- ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation<br>:house:project
- Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation<br>:house:project
- HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models<br>:star:code
- Score Distillation Sampling with Learned Manifold Corrective
- CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models<br>:house:project
- EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
- Inf-DiT: Upsampling any-resolution image with memory-efficient diffusion transformer<br>:star:code
- The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization<br>:house:project
- MONTAGE: Monitoring Training for Attribution of Generative Diffusion Models
- TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling
- Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
- Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation<br>:house:project
- OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models<br>:star:code
- DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment
- V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation
- GAN
- CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction<br>:star:code
- A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks<br>:star:code
- Distilling Diffusion Models into Conditional GANs
- Exploring Guided Sampling of Conditional GANs<br>:star:code
- Learning 3D-aware GANs from Unposed Images with Template Feature Field<br>:house:project
- Diffusion
- Measuring Style Similarity in Diffusion Models<br>:star:code
- Do text-free diffusion models learn discriminative visual representations?
- Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models<br>:star:code
- ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
- HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models<br>:star:code
- Chains of Diffusion Models
- To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now<br>:star:code
- FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation<br>:house:project
- Beta-Tuned Timestep Diffusion Model
- SMooDi: Stylized Motion Diffusion Model<br>:house:project
- Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation<br>:star:code
- Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models<br>:star:code<br>:house:project
- Implicit Concept Removal of Diffusion Models<br>:house:project
- ZigMa: A DiT-style Zigzag Mamba Diffusion Model<br>:star:code
- ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement<br>:star:code<br>:house:project
- Timestep-Aware Correction for Quantized Diffusion Models
- Shapefusion: 3D localized human diffusion models<br>:house:project
- MVDD: Multi-View Depth Diffusion Models
- SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher<br>:star:code<br>:house:project<br>:tv:video
- Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model<br>:star:code
- Compensation Sampling for Improved Convergence in Diffusion Models<br>:star:code
- ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction<br>:star:code
- Self-Guided Generation of Minority Samples Using Diffusion Models<br>:star:code
- Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems<br>:star:code
- LogoSticker: Inserting Logos into Diffusion Models for Customized Generation<br>:star:code
- Texture Synthesis
- Image Synthesis
- Editable Image Elements for Controllable Synthesis<br>:house:project
- Assessing Sample Quality via the Latent Space of Generative Models<br>:star:code
- SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior<br>:house:project
- Zero-shot Text-guided Infinite Image Synthesis with LLM guidance
- $\infty$-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions<br>:star:code
- EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control
- LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
- Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis
- Label-free Neural Semantic Image Synthesis
- Improving image synthesis with diffusion-negative sampling
- SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis<br>:house:project
- 2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction<br>:star:code
- Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
- FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis<br>:star:code
- Image Generation
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation<br>:star:code<br>:house:project
- Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion
- Context Diffusion: In-Context Aware Image Generation<br>:house:project
- Few-shot Defect Image Generation based on Consistency Modeling<br>:star:code
- Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation
- PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance
- AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation<br>:star:code
- AccDiffusion: An Accurate Method for Higher-Resolution Image Generation<br>:house:project<br>:thumbsup:Successfully generates high-resolution images without object repetition
- Towards Reliable Advertising Image Generation Using Human Feedback<br>:star:code
- StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion<br>:star:code
- Model-agnostic Origin Attribution of Generated Images with Few-shot Examples
- Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance<br>:star:code
- Tuning-Free Image Customization with Image and Text Guidance<br>:house:project
- Collaborative Control for Geometry-Conditioned PBR Image Generation<br>:house:project
- DiffiT: Diffusion Vision Transformers for Image Generation<br>:star:code
- MultiGen: Zero-shot Image Generation from Multi-modal Prompts
- Accelerating Image Generation with Sub-path Linear Approximation Model<br>:star:code
- Video Generation
- FreeInit: Bridging Initialization Gap in Video Diffusion Models<br>:star:code<br>:house:project
- HARIVO: Harnessing Text-to-Image Models for Video Generation<br>:house:project
- SignGen: End-to-End Sign Language Video Generation with Latent Diffusion<br>:star:code
- DragAnything: Motion Control for Anything using Entity Representation<br>:star:code<br>:house:project
- Physics-Based Interaction with 3D Objects via Video Generation<br>:star:code<br>:house:project
- DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model<br>:house:project
- Photorealistic Video Generation with Diffusion Models<br>:house:project
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors<br>:star:code
- PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control<br>:house:project
- MoVideo: Motion-Aware Video Generation with Diffusion Models<br>:house:project
- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation<br>:star:code
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
- Text-to-Video Quality Assessment
- Video Editing
- DragVideo: Interactive Drag-style Video Editing<br>:house:project
- Video Editing via Factorized Diffusion Distillation
- SAVE: Protagonist Diversification with Structure Agnostic Video Editing<br>:house:project
- DNI: Dilutional Noise Initialization for Diffusion Video Editing
- DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
- DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing<br>:house:project
- Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion<br>:star:code
- Object-Centric Diffusion for Efficient Video Editing<br>:house:project
- Image Editing
- ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
- Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing<br>🤗huggingface
- COMPOSE: Comprehensive Portrait Shadow Editing
- Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation<br>editing
- FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models<br>:star:code
- ByteEdit: Boost, Comply and Accelerate Generative Image Editing
- RegionDrag: Fast Region-Based Image Editing with Diffusion Models<br>:star:code
- 3DEgo: 3D Editing on the Go!<br>:star:code
- View-Consistent 3D Editing with Gaussian Splatting<br>:house:project
- Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors<br>:house:project
- Watch Your Steps: Local Image and Scene Editing by Text Instructions<br>:house:project
- Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts<br>:house:project
- Free-Editor: Zero-shot Text-driven 3D Scene Editing<br>:house:project
- InstructGIE: Towards Generalizable Image Editing
- Lazy Diffusion Transformer for Interactive Image Editing<br>:house:project
- DATENeRF: Depth-Aware Text-based Editing of NeRFs<br>:star:code<br>:house:project
- TurboEdit: Real-time text-based disentangled real image editing
- DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing<br>:star:code
- StableDrag: Stable Dragging for Point-based Image Editing
- ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images<br>image editing
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing<br>:house:project
- Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models<br>:star:code
- Robust-Wide: Robust Watermarking against Instruction-driven Image Editing<br>:star:code
- FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
- Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing<br>:star:code
- RadEdit: stress-testing biomedical vision models via diffusion image editing
- Responsible Visual Editing<br>:star:code
- 3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing<br>:house:project
- Thinking Outside the BBox: Unconstrained Generative Object Compositing<br>object compositing
- EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models
- Image-to-Video
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective
- R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding<br>:star:code
- PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation<br>:star:code
- ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video
- Text-to-Video
- E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness<br>:house:project
- WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing<br>:house:project
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models<br>:star:code<br>:house:project
- Factorizing Text-to-Video Generation by Explicit Image Conditioning<br>🤗huggingface
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models<br>:house:project
- MEVG: Multi-event Video Generation with Text-to-Video Models<br>:house:project
- Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models<br>:house:project
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
- Text-to-3D
- Diverse Text-to-3D Synthesis with Augmented Text Embedding<br>:star:code<br>:house:project
- LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis<br>:house:project
- DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation<br>:star:code
- DreamReward: Aligning Human Preference in Text-to-3D Generation<br>:house:project
- DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors<br>:star:code
- DreamReward: Text-to-3D Generation with Human Preference<br>:house:project
- GVGEN: Text-to-3D Generation with Volumetric Representation<br>:star:code
- WordRobe: Text-Guided Generation of Textured 3D Garments<br>:house:project
- UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation<br>:star:code<br>:house:project
- ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation<br>:star:code
- CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model<br>:house:project
- Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting<br>:house:project
- VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation
- DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling<br>:star:code
- HiFi-123: Towards High-fidelity One Image to 3D Content Generation<br>:house:project
- JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation
- Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation<br>:star:code
- TPA3D: Triplane Attention for Fast Text-to-3D Generation<br>:house:project
- DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation<br>:star:code
- Text-to-Image
- Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation<br>:star:code<br>:house:project
- MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
- PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control<br>:house:project
- PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation<br>:star:code
- Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
- IMMA: Immunizing text-to-image Models against Malicious Adaptation<br>:star:code
- Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers<br>:house:project
- ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
- Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation<br>:star:code
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models<br>:star:code
- Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models
- Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation<br>:star:code
- Navigating Text-to-Image Generative Bias across Indic Languages<br>:house:project
- Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion<br>:star:code
- Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers
- DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators<br>:house:project
- Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models<br>:star:code<br>:house:project
- Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
- R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model<br>:star:code
- MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization<br>:star:code<br>:house:project
- ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories<br>:house:project
- LCM-Lookahead for Encoder-based Text-to-Image Personalization<br>:star:code<br>:house:project
- Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models<br>:star:code
- Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model<br>:star:code<br>:thumbsup:DiffPNG achieves the best performance, showing that T2I diffusion models can understand visual content at the phrase level
- T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models<br>:star:code
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation<br>:house:project
- Latent Guard: a Safety Framework for Text-to-image Generation<br>:star:code
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models<br>:star:code<br>:house:project
- Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention<br>:star:code
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning<br>:star:code
- PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion<br>:house:project
- Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models<br>:star:code
- MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation<br>:star:code
- Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
- ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation
- PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation<br>:star:code<br>:house:project
- CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion<br>:star:code
- Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
- SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models<br>:house:project
- TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models<br>:house:project
- An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
- Adversarial Robustification via Text-to-Image Diffusion Models<br>:star:code
- Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis
- Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models<br>:house:project
- Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
- Text-Video Alignment
- Image-Text Alignment
- Image-Text
- Emergent Visual-Semantic Hierarchies in Image-Text Representations<br>:star:code<br>:house:project
- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment<br>:star:code<br>:house:project
- 3D (Content) Generation
- Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation<br>:house:project
- LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation<br>:house:project
- LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation<br>:house:project
- SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion<br>:house:project
- VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models<br>:house:project
- Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
- AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation<br>:house:project
- Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation<br>:house:project
- Visual Text Rendering
- GIF Generation
- Layout Generation
- Layout-to-Image
- Image-to-Image Translation
- Image Translation
- Text-to-4D
- TC4D: Trajectory-Conditioned Text-to-4D Generation<br>:star:code<br>:house:project
- Video-to-4D
- SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer<br>:star:code<br>:house:project
- Web Design
- Text-to-Garment
- Image Stylization
- StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models<br>:star:code
- ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs<br>:star:code<br>:house:project<br>Style
- InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser<br>:star:code<br>Stylization
- StyleCity: Large-Scale 3D Urban Scenes Stylization<br>:house:project<br>Urban scene stylization
- Scene-Conditional 3D Object Stylization and Composition
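- Image Vectorization
- Video Stitching
- Text-to-Camera-Trajectory Generation
- Text-to-3D Scene
- Identity-Preserving Personalization
- Subject-Driven Generation
- Style-Content Separation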
- Implicit Style-Content Separation using B-LoRA<br>:star:code<br>:house:project
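- Text-to-Multi-Motion Generation
- Text-Driven 3D Editing
- GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing<br>:star:code<br>Text-driven 3D Gaussian Splatting editing
- Image Interpolation
- Image Synthesis
- Image Animation
- LivePhoto: Real Image Animation with Text-guided Motion Control<br>:star:code<br>Real image animation with text-guided motion control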
- MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model<br>:star:code<br>:house:project
- ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model<br>:house:project
- Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance<br>:star:code<br>Human image animation
- TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models<br>:house:project
- Group Photo Synthesis
- Image Cropping
14.Image/Video Captioning
- DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
- ControlCap: Controllable Region-level Captioning<br>:star:code<br>Captioning
- MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description<br>:star:code<br>Visual description
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning<br>:star:code
- CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation<br>:star:code
- Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights<br>:star:code
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues<br>:star:code
- View Selection for 3D Captioning via Diffusion Ranking<br>:star:code
- Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning<br>:house:project
- HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs<br>Fine-grained image description
- Video Captioning
- Dense Captioning
13.Image/Video Compression
- SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging
- Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing
- Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation<br>:star:code
- Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model
- Rate-Distortion-Cognition Controllable Versatile Neural Image Compression
- BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression<br>:star:code
- Region-Adaptive Transform with Segmentation Prior for Image Compression<br>:star:code
- EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation
- Lagrangian Hashing for Compressed Neural Field Representations<br>:house:project
- Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging<br>:star:code<br>Snapshot spectral compression
- Learned HDR Image Compression for Perceptually Optimal Storage and Display<br>:star:code
- WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model
- Lossy Image Compression with Foundation Diffusion Models
- A Unified Image Compression Method for Human Perception and Multiple Vision Tasks
- Video Compression
- Hierarchical Separable Video Transformer for Snapshot Compressive Imaging<br>:star:code
- A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging<br>:star:code
- Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
- Long-term Temporal Context Gathering for Neural Video Compression
- Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
- Video Decoding
- Snapshot Spectral Imaging
- Motion Estimation
12.Image Retrieval
- RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos<br>:star:code<br>:house:project
- AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval<br>:star:code
- IRGen: Generative Modeling for Image Retrieval<br>:star:code
- FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval<br>:star:code
- FreestyleRet: Retrieving Images from Style-Diversified Queries<br>:star:code
- Sketch-Based Image Retrieval
- Video-Text Retrieval
- Image-Text Retrieval
- Video Retrieval
- Nearest Neighbor Search
11.Image Segmentation
- Occlusion-Aware Seamless Segmentation<br>:star:code
- SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding<br>:star:code<br>:thumbsup:New SOTA for visual grounding: SegVG converts the object bounding box used in visual grounding into segmentation signals (open-sourced)
- Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts
- Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively<br>:star:code<br>:house:project
- Segment and Recognize Anything at Any Granularity<br>:star:code
- Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets<br>:star:code
- Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency<br>:star:code
- CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings<br>:star:code
- From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation
- CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation
- Unsupervised Moving Object Segmentation with Atmospheric Turbulence
- Lite-SAM Is Actually What You Need for Segment Everything
- Textual Query-Driven Mask Transformer for Domain Generalized Segmentation<br>:star:code
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference?<br>:star:code
- RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation<br>:star:code
- SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images<br>:star:code
- FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally<br>:star:code
- SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis<br>:house:project
- PQ-SAM: Post-training Quantization for Segment Anything Model
- Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images<br>:house:project
- SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation<br>:star:code
- A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties<br>:star:code
- Placing Objects in Context via Inpainting for Out-of-distribution Segmentation<br>:star:code
- Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework<br>:star:code
- Better Call SAL: Towards Learning to Segment Anything in Lidar<br>:star:code
- Matting
- 3D Segmentation
- Bayesian Self-Training for Semi-Supervised 3D Segmentation
- Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels<br>:house:project
- EgoLifter: Open-world 3D Segmentation for Egocentric Perception<br>:house:project<br>🤗huggingface
- View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields<br>:star:code
- Video Segmentation
- General and Task-Oriented Video Segmentation<br>:star:code
- DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries<br>:star:code<br>:house:project
- Instance Segmentation
- Unleashing the Power of Prompt-driven Nucleus Instance Segmentation<br>:star:code
- 3D Instance Segmentation
- Part2Object: Hierarchical Unsupervised 3D Instance Segmentation<br>:star:code
- SAM-guided Graph Cut for 3D Instance Segmentation<br>:star:code<br>:house:project
- Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation<br>:star:code<br>Instance segmentation
- OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation<br>:star:code
- Unsupervised Instance Segmentation
- Open-World Instance Segmentation
- Panoptic Segmentation
- Open Panoramic Segmentation<br>:star:code
- A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting<br>:star:code
- Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance
- Strike a Balance in Continual Panoptic Segmentation<br>:star:code
- 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation
- Semantic Segmentation
- Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather<br>:star:code
- Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation
- MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
- Sparse Refinement for Efficient High-Resolution Semantic Segmentation<br>:house:project
- Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation<br>:star:code
- Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation<br>:star:code
- Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation<br>:star:code
- On-the-fly Category Discovery for LiDAR Semantic Segmentation<br>:star:code
- On the Viability of Monocular Depth Pre-training for Semantic Segmentation
- Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels<br>:star:code
- Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding
- FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions
- Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation<br>:star:code
- MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis<br>:star:code
- Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models<br>:star:code
- ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation<br>:star:code
- Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off<br>:star:code
- Open-Vocabulary RGB-Thermal Semantic Segmentation<br>:star:code
- Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
- SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds<br>:house:project<br>:star:code
- Reliability in Semantic Segmentation: Can We Use Synthetic Data?<br>:star:code
- Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation
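- 3D Semantic Segmentation
- Cross-Domain Semantic Segmentation
- Unsupervised Semantic Segmentation
- Semi-Supervised Semantic Segmentation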
- Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier<br>:star:code
- Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation<br>:star:code
- SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance<br>:star:code
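- Weakly Supervised Semantic Segmentation
- Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation<br>:star:code<br>Knowledge transfer via simulated inter-image erasing mitigates the over-expansion problem in weakly supervised semantic segmentation and improves object localization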
- Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
- Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras<br>:star:code
- Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation
- Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation<br>:star:code
- DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation<br>:star:code
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance<br>:star:code
- Diffusion-Guided Weakly Supervised Semantic Segmentation<br>:star:code
- Domain-Adaptive Semantic Segmentation
- Domain-Generalized Semantic Segmentation
- Class-Incremental Semantic Segmentation
- Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation<br>:star:code
- Mitigating Background Shift in Class-Incremental Semantic Segmentation<br>:star:code
- Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation<br>:star:code
- Zero-Shot Semantic Segmentation
- Open-Vocabulary Semantic Segmentation
- CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation<br>:star:code<br>:house:project
- Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation<br>:star:code
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
- ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation<br>:star:code
- Part Segmentation
- 3x2: 3D Object Part Segmentation by 2D Semantic Correspondences<br>:star:code
- WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
- PartSTAD: 2D-to-3D Part Segmentation Task Adaptation<br>:star:code
- PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model<br>:star:code
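- Motion Segmentation
- Smoke Segmentation
- Line Segment Segmentation
- Scene Parsing
- Interactive Segmentation
- Few-Shot Segmentation
- Camouflaged Object Segmentation
- Reference Image Segmentation
- Referring Image Segmentation
- Scene Text Segmentation
- Open-Vocabulary Segmentation
- Referring Expression Segmentation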
- VIS
- VOS
- ActionVOS: Actions as Prompts for Video Object Segmentation<br>:star:code
- VISA: Reasoning Video Object Segmentation via Large Language Model<br>:star:code
- PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation<br>:star:code
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation<br>:star:code
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation<br>:star:code
- OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
- Spatial-Temporal Multi-level Association for Video Object Segmentation
10.Image Classification
- Labeled Data Selection for Category Discovery
- Active Generation for Image Classification<br>:star:code
- Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs<br>:house:project
- Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
- Wavelet Convolutions for Large Receptive Fields<br>:star:code
- Momentum Auxiliary Network for Supervised Local Learning<br>:star:code
- An accurate detection is not all you need to combat label noise in web-noisy datasets<br>:star:code
- Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken<br>:star:code
- DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks
- NOVUM: Neural Object Volumes for Robust Object Classification<br>:star:code
- EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification<br>:star:code
- Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels<br>:star:code
- Discovering Unwritten Visual Classifiers with Large Language Models
- Generalized Category Discovery
- SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery<br>:star:code
- Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery<br>:star:code<br>Generalized Category Discovery (GCD)
- Learning to Distinguish Samples for Generalized Category Discovery<br>:star:code
- PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery<br>:star:code
- Online Continuous Generalized Category Discovery<br>:star:code<br>Generalized category discovery
- Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery<br>:star:code
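- Multi-Label Image Classification
- Few-Shot Classification
- Zero-Shot Classification
- Multi-Label Recognition
- Long-Tailed Recognition
- Fine-Grained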
- On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition
- A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images<br>:star:code
- Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth
9.Image/Video Processing
- ReNoise: Real Image Inversion Through Iterative Noising<br>:star:code<br>:house:project
- UniProcessor: A Text-induced Unified Low-level Image Processor
- Restoration
- MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration<br>:star:code
- Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration<br>:star:code
- Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks
- A Comparative Study of Image Restoration Networks for General Backbone Network Design
- GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity
- Restoring Images in Adverse Weather Conditions via Histogram Transformer
- InstructIR: High-Quality Image Restoration Following Human Instructions<br>:star:code
- Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems<br>:star:code
- Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
- Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint
- Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration<br>:star:code
- MambaIR: A Simple Baseline for Image Restoration with State-Space Model<br>:star:code<br>:thumbsup:MambaIR: a Mamba-based baseline model for image restoration
- AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion<br>:star:code
- SPIRE: Semantic Prompt-Driven Image Restoration<br>:house:project
- Efficient Cascaded Multiscale Adaptive Network for Image Restoration
- Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration<br>:star:code
- When Fast Fourier Transform Meets Transformer for Image Restoration<br>:star:code
- Osmosis: RGBD Diffusion Prior for Underwater Image Restoration<br>:house:project
- Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration<br>:house:project
- DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior<br>:star:code
- MetaWeather: Few-Shot Weather-Degraded Image Restoration<br>:star:code
- Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery
- Inpainting
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting<br>:star:code<br>:house:project
- Improving Text-guided Object Inpainting with Semantic Pre-inpainting<br>:star:code
- BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion<br>:house:project
- Deraining
- Denoising
- TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts<br>:star:code
- Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising<br>:star:code
- DualDn: Dual-domain Denoising via Differentiable ISP<br>:star:code
- Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising<br>:star:code
- EDformer: Transformer-Based Event Denoising Across Varied Noise Levels
- denoiSplit: a method for joint microscopy image splitting and unsupervised denoising<br>Denoising
- Asymmetric Mask Scheme for Self-Supervised Real Image Denoising<br>:star:code
- Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts<br>:house:project
- Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder<br>:star:code
- Dehazing
- Deblurring
- Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions<br>:star:code
- UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation<br>:star:code
- Blind image deblurring with noise-robust kernel estimation<br>:star:code
- Motion Aware Event Representation-driven Image Deblurring<br>:star:code(https://github.com/ZhijingS/DA_event_deblur)
- BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting<br>:star:code
- Deconvolution
- Reflection Removal
- Artifact Removal
- Demoiréing
- Demosaicing
- Object Removal
- Outpainting
- Image Retouching
- Image Enhancement
- LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models<br>:star:code
- LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement<br>:star:code
- RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement<br>:star:code
- Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids<br>:star:code
- NamedCurves: Learned Image Enhancement via Color Naming
- Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography<br>:star:code
- GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval<br>:star:code<br>:thumbsup:GLARE uses an external normal-light prior to achieve realistic low-light enhancement
- Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations<br>:star:code
- Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement<br>:star:code
- Image Quality Assessment
- DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment<br>:star:code
- A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
- Towards Open-ended Visual Quality Comparison<br>:star:code<br>:thumbsup:Co-Instruct: teaching general-purpose large multimodal models to compare visual quality
- PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts<br>No-reference image quality assessment
- Assessing UHD Image Quality from Aesthetics, Distortions, and Saliency<br>:star:code
- Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models<br>:house:project
- Image Aesthetic Quality Assessment
- Video Restoration
- Quanta Video Restoration<br>:star:code
- Video Colorization
- Video Enhancement
- Video Deraining
- Video Denoising
- Video Desnowing
- Video Deblurring
- Domain-adaptive Video Deblurring via Test-time Blurring<br>:star:code
- CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring<br>:star:code
- Towards Real-world Event-guided Low-light Video Enhancement and Deblurring<br>:star:code
- Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model<br>:star:code
- Video Deflickering
- Video Demosaicing
- Video Quality Enhancement
- Relighting
8.Super-Resolution
- Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design<br>:star:code
- SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution<br>:star:code
- Towards Robust Full Low-bit Quantization of Super Resolution Networks
- BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow<br>:star:code
- HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
- Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution<br>:star:code
- UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt<br>:star:code
- Accelerating Image Super-Resolution Networks with Pixel-Level Classification<br>:star:code
- Spatially-Variant Degradation Model for Dataset-free Super-resolution<br>:star:code
- Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence
- Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution<br>:star:code
- Confidence-Based Iterative Generation for Real-World Image Super-Resolution<br>:star:code
- Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution
- Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization<br>:star:code
- XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution<br>:star:code
- AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution
- Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks<br>:star:code
- Rethinking Image Super Resolution from Training Data Perspectives<br>:star:code
- MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution<br>:star:code
- OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
- Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution<br>:star:code
- You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation<br>:star:code
- A New Dataset and Framework for Real-World Blurred Images Super-Resolution<br>:star:code
- Scene Text Image Super-Resolution
- VSR
- Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors<br>:star:code
- RealViformer: Investigating Attention for Real-World Video Super-Resolution<br>:star:code
- Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models<br>:star:code
- SuperGaussian: Repurposing Video Models for 3D Super Resolution<br>:house:project
- Event-Adapted Video Super-Resolution
- Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution<br>:star:code
7.Object Detection
- Can OOD Object Detectors Learn from Foundation Models?<br>:star:code
- Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes<br>:star:code
- Distilling Knowledge from Large-Scale Image Models for Object Detection
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting
- OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection
- LEROjD: Lidar Extended Radar-Only Object Detection<br>:star:code
- Bucketed Ranking-based Losses for Efficient Training of Object Detectors<br>:star:code
- Plain-Det: A Plain Multi-Dataset Object Detector<br>:star:code
- On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines<br>:star:code
- Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis<br>:star:code
- Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
- PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects<br>:star:code
- Relation DETR: Exploring Explicit Position Relation Prior for Object Detection<br>:star:code
- Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection<br>:star:code
- Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge<br>:star:code
- T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy<br>:star:code
- Fine-grained Dynamic Network for Generic Event Boundary Detection<br>:star:code
- CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection
- Bayesian Detector Combination for Object Detection with Crowdsourced Annotations<br>:star:code
- Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights<br>:star:code
- Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors<br>:star:code
- A Simple Background Augmentation Method for Object Detection with Diffusion Model
- Look Around and Learn: Self-Training Object Detection by Exploration<br>:star:code
- Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection<br>:star:code
- Benchmarking Object Detectors with COCO: A New Path Forward<br>:sunflower:dataset
- DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion<br>:star:code<br>Object detection
- Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection<br>:star:code
- YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information<br>:star:code
- Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection
- Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation
- GRA: Detecting Oriented Objects through Group-wise Rotating and Attention
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection<br>:star:code
- Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection<br>:star:code
- Zero-Shot Detection of AI-Generated Images<br>:star:code<br>:house:project
- MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos<br>:star:code
- Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images<br>:house:project
- Rethinking Features-Fused-Pyramid-Neck for Object Detection<br>:star:code
- 3D Object Detection
- Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression<br>:star:code
- Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene
- SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection
- MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection<br>:star:code
- Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection<br>:star:code
- Domain Generalization of 3D Object Detection by Density-Resampling<br>:star:code
- Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection<br>:star:code
- LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar
- Towards Stable 3D Object Detection
- RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection<br>:star:code
- LISO: Lidar-only Self-Supervised 3D Object Detection<br>:star:code
- Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation<br>:star:code
- SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation<br>:star:code
- Better Regression Makes Better Test-time Adaptive 3D Object Detection
- Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments<br>:star:code
- Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection
- Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance<br>:star:code
- SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras<br>:star:code
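- CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection<br>:star:code<br>:thumbsup:DIG mitigates the point-cloud differences introduced by different sensor mechanisms in terms of density, intensity, and geometry, significantly improving the performance of domain adaptation algorithms.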
- Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection<br>:star:code
- OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection<br>:star:code
- LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection<br>:star:code
- MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection<br>:star:code
- FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection<br>:star:code
- General Geometry-aware Weakly Supervised 3D Object Detection<br>:star:code
- Interactive 3D Object Detection with Prompts
- Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation
- Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection<br>:star:code
- TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection<br>:star:code
- GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection<br>:star:code
- Small Object Detection
- IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
- Visible and Clear: Finding Tiny Objects in Difference Map
- DQ-DETR: DETR with Dynamic Query for Tiny Object Detection<br>:star:code
- 3D Small Object Detection with Dynamic Spatial Pruning<br>:star:code<br>:thumbsup:DSPDet3D: efficient 3D small object detection based on dynamic spatial pruning
- Camouflaged Object Detection
- CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection<br>:thumbsup:Effectively reduces pixel-level and instance-level noise
- Learning Camouflaged Object Detection from Noisy Pseudo Label<br>:star:code
- Just a Hint: Point-Supervised Camouflaged Object Detection
- SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection
- Frequency-Spatial Entanglement Learning for Camouflaged Object Detection<br>:star:code
- FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection<br>:star:code
- Long-Tailed Object Detection
- Salient Object Detection
- Domain-Adaptive Object Detection
- Few-Shot Object Detection
- SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection<br>:house:project
- Adaptive Multi-task Learning for Few-shot Object Detection<br>:star:code
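- Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector<br>:house:project<br>:thumbsup:New CD-FSOD dataset and new CD-ViTO method for cross-domain few-shot object detection (data and code are both open-sourced)
- Co-Salient Object Detection
- Open-Vocabulary Object Detection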
- Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection<br>:star:code
- LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection<br>:star:code
- Region-centric Image-Language Pretraining for Open-Vocabulary Detection<br>:star:code
- CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection<br>:star:code
- Watermark Detection
- Shadow Detection
- Open-Set Recognition
- Object Localization
6.Object Tracking
- Local All-Pair Correspondence for Point Tracking<br>:star:code
- Track Everything Everywhere Fast and Robustly<br>:star:code<br>:house:project
- CoTracker: It is Better to Track Together<br>:house:project
- DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video<br>:star:code<br>:house:project
- Decomposition Betters Tracking Everything Everywhere<br>:star:code
- Self-Supervised Any-Point Tracking by Contrastive Random Walks<br>:house:project<br>:star:code
- TAPTR: Tracking Any Point with Transformers as Detection<br>:star:code
- MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping<br>:house:project
- SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
- SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking<br>:star:code
- OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers
- Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking<br>:star:code
- Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL
- 3D Object Tracking
- Multi-Object Tracking
- Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking<br>:star:code
- Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs
- Beyond MOT: Semantic Multi-Object Tracking<br>:star:code
- PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking
- VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking<br>:house:project
- Cell Tracking
5.OCR
- Parrot Captions Teach CLIP to Spot Text<br>:star:code
- WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting
- FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction<br>:house:project
- Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors<br>:star:code
- Handwritten Text Detection
- Align, Minimize and Diversify: A Source-Free Unsupervised Domain Adaptation Method for Handwritten Text Recognition
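- PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer<br>:star:code<br>:thumbsup:SJTU introduces PosFormer, which optimizes a position recognition task to assist expression recognition and sets a new SOTA for complex formula recognition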
- Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting
- NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition
- Handwritten Text Synthesis
- Scene Text Removal
- Document Understanding
- Text Segmentation
- Text Synthesis
- Visual Text Generation in the Wild<br>:star:code
- Text Restoration
4.Pose (Pose Estimation)
- X-Pose: Detecting Any Keypoints<br>:star:code
- VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space<br>:house:project
- Expressive Whole-Body 3D Gaussian Avatar<br>:star:code
- GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
- PoseSOR: Human Pose Can Guide Our Attention<br>:star:code
- COSMU: Complete 3D human shape from monocular unconstrained images
- Modeling and Driving Human Body Soundfields through Acoustic Primitives<br>:house:project
- Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions<br>:star:code
- SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
- PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture<br>:star:code
- HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation
- EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere<br>:house:project
- You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception<br>:star:code
- Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
- Human Pose Recognition via Occlusion-Preserving Abstract Images
- Text-Driven Human Generation
- Multi-Person Pose Prediction
- 3D Human Pose Estimation
- MPL: Lifting 3D Human Pose from Multi-view 2D Poses<br>:star:code
- RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark<br>:house:project<br>🤗huggingface
- AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos<br>:star:code<br>:house:project
- Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation<br>:star:code
- 3D Human Pose Estimation via Non-Causal Retentive Networks<br>:star:code
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
- Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding
- RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency<br>:star:code
- EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation<br>:star:code
- 3DSA: Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms
- WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation
- Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation
- NICP: Neural ICP for 3D Human Registration at Scale<br>:house:project
- Human Mesh Recovery
- Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot<br>:star:code
- Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images
- Global-to-Pixel Regression for Human Mesh Recovery
- WindPoly: Polygonal Mesh Reconstruction via Winding Numbers<br>:house:project
- Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses<br>:star:code
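- 3D Human Texture Generation
- 3D Human Generation
- StructLDM: Structured Latent Diffusion for 3D Human Generation<br>:star:code<br>:house:project<br>:thumbsup:A new paradigm for 3D digital human generation from Nanyang Technological University: a structured diffusion model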
- Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions
- Text to Layer-wise 3D Clothed Human Generation<br>:house:project
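- SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation<br>:house:project<br>3D human generation
- Human Reconstruction
- Motion Capture
- Sign Language Recognition
- Hand Mesh
- 3D Hand Sequence Recovery
- 3D Hand Reconstruction
- Hand Pose Estimation
- Hand Motion Prediction
- Head Pose Estimation
- Hand-Held Object Reconstruction
- 4D Head Capture
- Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture<br>:star:code<br>:house:project<br>4D head capture
- Sign Language Video Generation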
3.Face
- Task-adaptive Q-Face
- Faceptor: A Generalist Model for Face Perception<br>:star:code
- A Light Stage on Every Desk<br>:house:project
- Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control<br>:star:code<br>:house:project
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer<br>:star:code
- Facial Affective Behavior Analysis with Instruction Tuning<br>:star:code<br>:house:project
- Arc2Face: A Foundation Model for ID-Consistent Human Faces<br>:star:code<br>:house:project
- GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images
- GRAPE: Generalizable and Robust Multi-view Facial Capture
- High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering<br>:star:code
- Face Swapping
- Face Blurring
- Face Recognition
- Towards Certifiably Robust Face Recognition
- AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition<br>:star:code
- ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition<br>:star:code
- Personalized Privacy Protection Mask Against Unauthorized Facial Recognition
- MST-KD: Multiple Specialized Teachers Knowledge Distillation for Fair Face Recognition
- AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems
- Face Clustering
- Face Reconstruction
- Facial Expression
- Face Editing
- Face Animation
- Talking Head Synthesis
- ScanTalk: 3D Talking Heads from Unregistered Scans<br>:star:code
- EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis<br>:house:project<br>head synthesis
- EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head<br>:star:code
- All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation<br>:star:code
- Audio-driven Talking Face Generation with Stabilized Synchronization Loss
- Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°<br>:star:code
- S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis
- Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing<br>:house:project<br>head synthesis
- Tri2-plane: Thinking Head Avatar via Feature Pyramid<br>:house:project
- Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos<br>:house:project
- TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting<br>:star:code<br>3D talking head synthesis
- Animatable Head Avatars
- Face Super-Resolution
- Face Anti-Spoofing
- TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing<br>:star:code<br>:thumbsup:Improves generalization through twofold-element fine-grained semantic guidance
- DiffFAS: Face Anti-Spoofing via Generative Diffusion Models<br>:star:code
- Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing
- Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing
- Head Synthesis
- Emotion Recognition
- Facial Action Unit Detection
- Fake Face Detection
2.3D Visual
- GroundUp: Rapid Sketch-Based 3D City Massing<br>:house:project
- Ray-Distance Volume Rendering for Neural Scene Reconstruction
- HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos<br>:house:project
- Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping<br>:star:code
- BlenderAlchemy: Editing 3D Graphics with Vision-Language Models<br>:house:project
- Temporal Event Stereo via Joint Learning with Stereoscopic Flow<br>:star:code
- GenRC: Generative 3D Room Completion from Sparse Image Collections<br>:star:code
- Single-Photon 3D Imaging with Equi-Depth Photon Histograms<br>:house:project
- Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models<br>:star:code
- SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization<br>:star:code
- 3D Congealing: 3D-Aware Image Alignment in the Wild<br>:house:project
- ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition<br>:house:project
- BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling<br>:star:code
- Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy
- DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly
- An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes<br>:house:project
- Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models<br>:house:project
- Diffusion Model is a Good Pose Estimator from 3D RF-Vision<br>:house:project
- Nuvo: Neural UV Mapping for Unruly 3D Representations<br>:house:project
- MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps<br>:house:project
- MinD-3D: Reconstruct High-quality 3D objects in Human Brain<br>:house:project<br>3D
- UpFusion: Novel View Diffusion from Unposed Sparse View Observations<br>:star:code<br>3D
- MVS
- 3D Visual Grounding
- ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities<br>:star:code<br>:house:project
- Multi-branch Collaborative Learning Network for 3D Visual Grounding<br>:star:code<br>:thumbsup:Improves Acc@0.5 on 3DREC by 3.27% and mIOU on 3DRES by 5.22%
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding<br>:house:project
- Stereo Matching
- 3DGS
- GaussReg: Fast 3D Registration with Gaussian Splatting
- 3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting
- Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing<br>:star:code
- Compact3D: Smaller and Faster Gaussian Splatting with Vector Quantization<br>:star:code<br>:house:project
- CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization<br>:star:code<br>:house:project<br>3DGS
- End-to-End Rate-Distortion Optimized 3D Gaussian Representation
- Deblurring 3D Gaussian Splatting<br>:star:code<br>:house:project
- Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting<br>:house:project
- HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression<br>:house:project
- On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy<br>:star:code
- Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration<br>:star:code
- MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo<br>:star:code<br>:house:project
- Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting<br>:star:code
- DGD: Dynamic 3D Gaussians Distillation<br>:house:project
- EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS<br>:star:code
- Revising Densification in Gaussian Splatting
- HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes
- RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes
- SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting
- VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors<br>:star:code
- MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation
- MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition<br>:star:code
- SAGS: Structure-Aware 3D Gaussian Splatting<br>:house:project
- GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time<br>:house:project
- Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections<br>:star:code
- Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting<br>:star:code
- WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians<br>:house:project
- MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images<br>:star:code
- GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting<br>:house:project
- DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting<br>:star:code
- Depth Estimation
- Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos<br>:star:code<br>:house:project
- PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation<br>:star:code
- Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry<br>depth map artifacts
- Revisit Self-supervision with Local Structure-from-Motion
- DoubleTake: Geometry Guided Depth Estimation<br>:star:code
- Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation
- FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
- ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion<br>:star:code
- Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation<br>:star:code
- Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions<br>:star:code
- High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior
- DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation<br>:star:code
- SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
- Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation<br>:house:project
- GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth<br>:star:code<br>depth estimation
- M2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation<br>:house:project
- Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training
- Depth Completion
- Deep Cost Ray Fusion for Sparse Depth Video Completion
- OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations<br>:star:code
- AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation<br>:star:code
- Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion<br>:star:code
- Surface Reconstruction
- SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization<br>:star:code
- Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images
- DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose
- Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction
- PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects<br>:star:code
- Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints<br>:house:project
- EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding
- GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views<br>:house:project
- Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images<br>:star:code
- DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction<br>:star:code
- Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction<br>:star:code
- Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering<br>:house:project
- 3D Reconstruction
- GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction<br>:house:project
- InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction<br>:star:code
- fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction<br>:star:code<br>:house:project
- Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians<br>:star:code
- GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation<br>:star:code<br>:house:project
- latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction<br>:star:code
- MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections<br>:house:project
- Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors<br>:star:code
- 3D Reconstruction of Objects in Hands without Real World 3D Supervision
- Human Hair Reconstruction with Strand-Aligned 3D Gaussians<br>:house:project
- SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction<br>:star:code
- MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction(https://github.com/Tangshitao/MVDiffusion_plusplus)
- NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation<br>:house:project
- Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image<br>:sunflower:dataset
- Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction<br>:star:code
- 3D Shape
- Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching
- Transferable 3D Adversarial Shape Completion using Diffusion Models
- Self-supervised Shape Completion via Involution and Implicit Correspondences
- TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation<br>:star:code<br>:house:project
- Learning Neural Deformation Representation for 4D Dynamic Shape Generation
- AWOL: Analysis WithOut synthesis using Language<br>3D shape
- DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching
- Video Reconstruction
- 4D Reconstruction
- 3D Textured Shapes
1.Other
- Dataset Growth<br>:star:code
- Adaptive Parametric Activation<br>:star:code
- Nonverbal Interaction Detection<br>:star:code
- Situated Instruction Following<br>:house:project
- Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging
- Unsupervised Exposure Correction<br>:star:code
- Global Structure-from-Motion Revisited<br>:star:code
- Fast Sprite Decomposition from Animated Graphics<br>:house:project
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets<br>:house:project
- MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo
- Enhancing Vectorized Map Perception with Historical Rasterized Maps<br>:star:code
- Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
- DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation<br>:star:code
- Weight Conditioning for Smooth Optimization of Neural Networks
- Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision<br>:star:code
- SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation<br>:house:project
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks<br>:star:code
- Global Counterfactual Directions
- Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution
- Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements
- Pseudo-Labelling Should Be Aware of Disguising Channel Activations
- FMBoost: Boosting Latent Diffusion with Flow Matching<br>:star:code
- Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography<br>:house:project
- Adversarial Diffusion Distillation
- When and How do negative prompts take effect
- Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
- SCOD: From Heuristics to Theory
- Unsupervised Representation Learning by Balanced Self Attention Matching<br>:star:code
- DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences<br>:star:code
- Linking in Style: Understanding learned features in deep learning models<br>:star:code
- CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks
- Synthesizing Environment-Specific People in Photographs<br>:house:project
- Implicit Steganography Beyond the Constraints of Modality
- Energy-Calibrated VAE with Test Time Free Lunch
- Debiasing surgeon: fantastic weights and how to find them
- SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data
- Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
- Using My Artistic Style? You Must Obtain My Authorization<br>:star:code
- IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception<br>:star:code
- Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
- Adapting to Shifting Correlations with Unlabeled Data Calibration
- On Spectral Properties of Gradient-based Explanation Methods
- O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
- A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
- Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers
- Image Manipulation Detection With Implicit Neural Representation and Limited Supervision
- Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
- GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning<br>:house:project
- AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale<br>:house:project
- Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers
- Learning Multimodal Latent Generative Models with Energy-Based Prior
- Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution
- Learning to Build by Building Your Own Instructions<br>:star:code
- LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration<br>:star:code
- Deep Online Probability Aggregation Clustering
- Camera Calibration using a Collimator System<br>:star:code
- Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision
- LITA: Language Instructed Temporal-Localization Assistant<br>:star:code
- INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding<br>:house:project
- Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders<br>:star:code
- MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks
- Generalizable Symbolic Optimizer Learning<br>:star:code
- Training A Secure Model against Data-Free Model Extraction
- EraseDraw: Learning to Insert Objects by Erasing Them from Images
- AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation
- Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images<br>:star:code
- Learning to Make Keypoints Sub-Pixel Accurate<br>:star:code
- Explorative Inbetweening of Time and Space<br>:house:project
- Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training<br>:star:code
- Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator<br>:star:code
- Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures
- Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
- Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending
- Augmented Neural Fine-tuning for Efficient Backdoor Purification
- REDIR: Refocus-free Event-based De-occlusion Image Reconstruction
- Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector<br>:star:code
- Pre-trained Visual Dynamics Representations for Efficient Policy Learning
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning<br>:house:project
- MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain<br>:house:project
- Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem<br>:star:code
- Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception
- FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN<br>:star:code
- Unmasking Bias in Diffusion Model Training<br>:star:code
- Cross-Input Certified Training for Universal Perturbations
- Investigating Style Similarity in Diffusion Models
- Delving into Adversarial Robustness on Document Tampering Localization<br>:star:code
- AMD: Automatic Multi-step Distillation of Large-scale Vision Models
- Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy<br>:star:code
- JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention<br>:star:code
- SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
- Adaptive Annealing for Robust Averaging
- Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations
- Generalizing to Unseen Domains via Text-guided Augmentation
- MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets
- Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
- MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
- CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches
- Towards Image Ambient Lighting Normalization
- Synthesizing Time-varying BRDFs via Latent Space
- HoloADMM: High-Quality Holographic Complex Field Recovery
- Fundamental Matrix Estimation Using Relative Depths
- MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise<br>:star:code
- CipherDM: Secure Three-Party Inference for Diffusion Model Sampling<br>:star:code
- Weighted Ensemble Models Are Strong Continual Learners<br>:star:code
- Learning Equilibrium Transformation for Gamut Expansion and Color Restoration<br>:star:code
- Implicit Neural Models to Extract Heart Rate from Video<br>:house:project
- Learning Quantized Adaptive Conditions for Diffusion Models
- High-Fidelity Modeling of Generalizable Wrinkle Deformation
- Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update
- SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow<br>:star:code
- DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation<br>:star:code
- PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation
- Integration of Global and Local Representations for Fine-grained Cross-modal Alignment
- Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs
- A high-quality robust diffusion framework for corrupted dataset<br>:star:code
- FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models<br>:star:code
- Leveraging Imperfect Restoration for Data Availability Attack<br>:star:code
- Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)<br>:star:code
- Spiking Wavelet Transformer<br>:star:code
- Hypernetworks for Generalizable BRDF Representation<br>:house:project
- Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network
- Photon Inhibition for Energy-Efficient Single-Photon Imaging<br>:house:project
- RANRAC: Robust Neural Scene Representations via Random Ray Consensus<br>:house:project
- Characterizing Model Robustness via Natural Input Gradients
- Emerging Property of Masked Token for Effective Pre-training
- SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians
- Curved Diffusion: A Generative Model With Optical Geometry Control<br>:house:project
- Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians<br>:star:code
- Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures<br>:star:code
- RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
- Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting
- Optimization-based Uncertainty Attribution Via Learning Informative Perturbations
- CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance
- Think before Placement: Common Sense Enhanced Transformer for Object Placement
- Efficient Bias Mitigation Without Privileged Information
- Region-Native Visual Tokenization<br>:star:code
- DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting<br>:star:code
- Efficient Neural Video Representation with Temporally Coherent Modulation
- Made to Order: Discovering monotonic temporal changes via self-supervised video ordering<br>:star:code
- Concise Plane Arrangements for Low-Poly Surface and Volume Modelling<br>:star:code
- ViPer: Visual Personalization of Generative Models via Individual Preference Learning<br>:star:code
- How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology
- Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination<br>:star:code
- 3R-INN: How to be climate friendly while consuming/delivering videos
- Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks<br>:star:code
- Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge<br>:star:code
- Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images
- Tokenize Anything via Prompting<br>:star:code<br>🤗huggingface
- Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation<br>:star:code
- Long-CLIP: Unlocking the Long-Text Capability of CLIP<br>:star:code
- Dolfin: Diffusion Layout Transformers without Autoencoder
- Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models<br>:star:code
- Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation<br>:star:code
- Zero-Shot Image Feature Consensus with Deep Functional Maps
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents<br>:house:project
- Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks<br>:star:code
- FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors<br>:star:code
- SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions<br>:house:project
- CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems<br>:star:code
- FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients<br>:star:code
- Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset<br>:star:code
- Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging<br>:star:code
- UniFS: Universal Few-shot Instance Perception with Point Representations<br>:star:code
- Combining Generative and Geometry Priors for Wide-Angle Portrait Correction<br>:star:code
- FlashTex: Fast Relightable Mesh Texturing with LightControlNet<br>:house:project (relighting)
- Consistent 3D Line Mapping<br>:star:code
- RSL-BA: Rolling Shutter Line Bundle Adjustment
- Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
- EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks<br>:star:code
- PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments
- Distributed Active Client Selection With Noisy Clients Using Model Association Scores
- Towards a Density Preserving Objective Function for Learning on Point Sets
- Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction<br>:star:code
- SIGMA: Sinkhorn-Guided Masked Video Modeling<br>:house:project
- LiDAR-Event Stereo Fusion with Hallucinations<br>:star:code<br>:house:project
- Dual-Camera Smooth Zoom on Mobile Phones<br>:star:code<br>:house:project
- Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion
- Agent Attention: On the Integration of Softmax and Linear Attention<br>:star:code
- Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks<br>:star:code
- Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning<br>:star:code
- Customized Generation Reimagined: Fidelity and Editability Harmonized<br>:star:code
- Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction (video reconstruction)
- Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views<br>:house:project
- Controlling the World by Sleight of Hand
- Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model<br>:star:code (probabilistic weather forecasting)
- Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures<br>:star:code
- Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery
- G3R: Gradient Guided Generalizable Reconstruction<br>:house:project
- SAIR: Learning Semantic-aware Implicit Representation
- Spectral Subsurface Scattering for Material Classification
- Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation
- Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance<br>:star:code
- A Direct Approach to Viewing Graph Solvability
- Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training<br>:star:code
- Towards Multimodal Sentiment Analysis Debiasing via Bias Purification
- Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context
- From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition<br>:star:code
- Quantization-Friendly Winograd Transformations for Convolutional Neural Networks
- LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping
- M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions
- Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction<br>:star:code
- StereoGlue: Joint Feature Matching and Robust Estimation<br>:star:code
- Factorized Diffusion: Perceptual Illusions by Noise Decomposition<br>:house:project
- GIVT: Generative Infinite-Vocabulary Transformers<br>:star:code
- Tiny Models are the Computational Saver for Large Models
- Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy<br>:star:code
- SNeRV: Spectra-preserving Neural Representation for Video<br>:star:code
- COMO: Compact Mapping and Odometry<br>:star:code
- Multi-Sentence Grounding for Long-term Instructional Video
- Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer<br>:house:project
- Exact Diffusion Inversion via Bidirectional Integration Approximation<br>:star:code
- McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
- Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density<br>:star:code
- Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement
- ZeST: Zero-Shot Material Transfer from a Single Image<br>:star:code
- PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion
- SemGrasp: Semantic Grasp Generation via Language Aligned Discretization<br>:house:project
- DragAPart: Learning a Part-Level Motion Prior for Articulated Objects<br>:house:project
- Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data
- Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition
- Robust Fitting on a Gate Quantum Computer
- Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo
- Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization<br>:house:project
- On the Vulnerability of Skip Connections to Model Inversion Attacks<br>:star:code
- Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits<br>:star:code
- GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring
- Does Data-Efficient Generalization Exacerbate Bias in Foundation Models?
- InfMAE: A Foundation Model in The Infrared Modality (infrared)
- Teach CLIP to Develop a Number Sense for Ordinal Regression<br>:star:code
- GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation<br>:star:code
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders
- Scalar Function Topology Divergence: Comparing Topology of 3D Objects
- OneRestore: A Universal Restoration Framework for Composite Degradation<br>:star:code
- RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion
- Binomial Self-compensation for Motion Error in Dynamic 3D Scanning
- Encapsulating Knowledge in One Prompt<br>:star:code
- iMatching: Imperative Correspondence Learning
- An Adaptive Screen-Space Meshing Approach for Normal Integration
- Efficient Pre-training for Localized Instruction Generation of Procedural Videos<br>:star:code
- Shape from Heat Conduction
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos
- Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data
- Finding Visual Task Vectors<br>:star:code
- Occupancy as Set of Points<br>:star:code
- Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams
- AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling<br>:star:code
- Retargeting Visual Data with Deformation Fields
- Delving Deep into Engagement Prediction of Short Videos<br>:star:code
- Temporal-Mapping Photography for Event Cameras<br>:star:code
- Six-Point Method for Multi-Camera Systems with Reduced Solution Space<br>:star:code
- BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion<br>:star:code
- Physical-Based Event Camera Simulator<br>:star:code
- REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices<br>:star:code
- Self-Training Room Layout via Geometry-aware Ray-casting
- Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding<br>:star:code
- EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding<br>:star:code
- Where am I? Scene Retrieval with Language
- Event Camera Data Dense Pre-training
- Unsqueeze [CLS] Bottleneck to Learn Rich Representations<br>:star:code
- VeCLIP: Improving CLIP Training via Visual-enriched Captions<br>:star:code
- Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction
- Catastrophic Overfitting: A Potential Blessing in Disguise
- Diffusion Reward: Learning Rewards via Conditional Video Diffusion<br>:house:project
- Data-to-Model Distillation: Data-Efficient Learning Framework
- Neural graphics texture compression supporting random access
- ReMatching: Low-Resolution Representations for Scalable Shape Correspondence
- EgoPet: Egomotion and Interaction Data from an Animal's Perspective<br>:house:project
- This Probably Looks Exactly Like That: An Invertible Prototypical Network<br>:star:code
- Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling<br>:star:code (attribute recognition)
- Stream Query Denoising for Vectorized HD-Map Construction
- PartCraft: Crafting Creative Objects by Parts<br>:star:code
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback<br>:star:code
- Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning
- UNIC: Universal Classification Models via Multi-teacher Distillation
- Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture<br>:star:code (spiking neural networks)
- Visual Prompting via Partial Optimal Transport
- E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation<br>:star:code
- Understanding Physical Dynamics with Counterfactual World Modeling<br>:house:project
- 4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation<br>:house:project
- Revisiting Calibration of Wide-Angle Radially Symmetric Cameras<br>:star:code (camera calibration)
- STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians<br>:star:code
- Synchronization of Projective Transformations
- UniCal: Unified Neural Sensor Calibration<br>:house:project
- Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs<br>:star:code
- Robust Incremental Structure-from-Motion with Hybrid Features
- Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding<br>:star:code
- CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization<br>:star:code
- Multiscale Graph Texture Network<br>:star:code
- Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent<br>:star:code
- Domain Reduction Strategy for Non-Line-of-Sight Imaging<br>:star:code
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation<br>:star:code
- Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data<br>:star:code
- Model Stock: All we need is just a few fine-tuned models<br>:star:code
- DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
- DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration<br>:star:code
- SLIM: Spuriousness Mitigation with Minimal Human Annotations<br>:star:code
- Scaling Backwards: Minimal Synthetic Pre-training?<br>:star:code
- On the Evaluation Consistency of Attribution-based Explanations<br>:star:code
- GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation<br>:star:code
- OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks<br>:star:code
- SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding<br>:star:code
- ReGround: Improving Textual and Spatial Grounding at No Cost<br>:house:project
- ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance<br>:house:project
- WHAC: World-grounded Humans and Cameras<br>:house:project
- Neural Metamorphosis<br>:house:project
- Light-in-Flight for a World-in-Motion
- Learning with Unmasked Tokens Drives Stronger Vision Learners<br>:star:code
- PSALM: Pixelwise Segmentation with Large Multi-modal Model<br>:star:code
- InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping<br>:star:code
- The All-Seeing Project V2: Towards General Relation Comprehension of the Open World<br>:star:code
- Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction
- Multi-Task Domain Adaptation for Language Grounding with 3D Objects<br>:house:project
- QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images<br>:star:code (fisheye images)
- BAMM: Bidirectional Autoregressive Motion Model<br>:house:project
- Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework
- Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
- RPBG: Towards Robust Neural Point-based Graphics in the Wild<br>:star:code
- Memory-Efficient Fine-Tuning for Quantized Diffusion Model<br>:star:code
- Towards Architecture-Agnostic Untrained Networks Priors for Image Reconstruction with Frequency Regularization<br>:star:code
- Similarity of Neural Architectures using Adversarial Attack Transferability
- NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation<br>:house:project
- Robustness Preserving Fine-tuning using Neuron Importance
- A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
- Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation<br>:star:code
- FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion
- Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers (image reconstruction)
- Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients (image reconstruction)
- CrossScore: A Multi-View Approach to Image Evaluation and Scoring
- ADMap: Anti-disturbance Framework for Vectorized HD Map Construction
- GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting<br>:star:code
- PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation<br>:star:code
- ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments<br>:star:code
- DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment<br>:star:code
- UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation (visual-inertial odometry)
- Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes<br>🤗huggingface
- RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF<br>:house:project
- LaRa: Efficient Large-Baseline Radiance Fields<br>:star:code
- Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement<br>:house:project
- ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration
- Open-World Dynamic Prompt and Continual Visual Representation Learning
- GeoCalib: Learning Single-image Calibration with Geometric Optimization<br>:star:code
- LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation<br>:star:code
- Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences
- Weakly-supervised Camera Localization by Ground-to-satellite Image Registration
- Learning Neural Volumetric Pose Features for Camera Localization<br>:house:project
- SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments<br>:star:code
- DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement<br>:star:code
- Event-based Mosaicing Bundle Adjustment<br>:star:code
- Reprojection Errors as Prompts for Efficient Scene Coordinate Regression
- Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
- AMEGO: Active Memory from long EGOcentric videos<br>:star:code
- Vista3D: Unravel the 3D Darkside of a Single Image<br>:star:code
- Agglomerative Token Clustering<br>:house:project
- Formula-Supervised Visual-Geometric Pre-training<br>:star:code
- Interpretability-Guided Test-Time Adversarial Defense<br>:star:code
- Towards Model-Agnostic Dataset Condensation by Heterogeneous Models<br>:star:code
- MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views<br>:star:code
- Intrinsic Single-Image HDR Reconstruction
- Disentangled Generation and Aggregation for Robust Radiance Fields<br>:star:code
- Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
- Commonly Interesting Images
- Sequential Representation Learning via Static-Dynamic Conditional Disentanglement
- QuasiSim: Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer<br>:star:code<br>:house:project
- Dataset Distillation by Automatic Training Trajectories<br>:star:code
- LookupViT: Compressing visual information to a limited number of tokens
- Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture<br>:house:project
- Generating 3D House Wireframes with Semantics<br>:star:code<br>:house:project
- Flying with Photons: Rendering Novel Views of Propagating Light<br>:star:code
- Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time<br>:star:code
- MobileNetV4: Universal Models for the Mobile Ecosystem
- Gravity-aligned Rotation Averaging with Circular Regression<br>:star:code
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models<br>:house:project
- HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
- DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects<br>:house:project
- TrajPrompt: Aligning Color Trajectory with Vision-Language Representations<br>:star:code
- DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models
- Toward Tiny and High-quality Facial Makeup with Data Amplify Learning<br>:star:code
- Multi-Label Cluster Discrimination for Visual Representation Learning
- Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective
- MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory
- SeiT++: Masked Token Modeling Improves Storage-efficient Training<br>:star:code
- MagicEraser: Erasing Any Objects via Semantics-Aware Control
- Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation<br>:house:project
- A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image<br>:star:code
- Resilience of Entropy Model in Distributed Neural Networks<br>:star:code
- GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image<br>:house:project
- MotionChain: Conversational Motion Controllers via Multimodal Prompts
- MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion<br>:house:project
- Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models<br>:star:code
- Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals
- Tensorial template matching for fast cross-correlation with rotations and its application for tomography
- SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes<br>:star:code
- Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts
- Motion and Structure from Event-based Normal Flow<br>:house:project
- SENC: Handling Self-collision in Neural Cloth Simulation
- Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams
- Animate Your Motion: Turning Still Images into Dynamic Videos<br>:house:project
- Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion<br>:star:code<br>:house:project
- Relightable Neural Actor with Intrinsic Decomposition and Pose Control<br>:house:project
- Layer-Wise Relevance Propagation with Conservation Property for ResNet<br>:house:project
- Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance<br>:house:project
- SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images
- ViG-Bias: Visually Grounded Bias Discovery and Mitigation
- DOCCI: Descriptions of Connected and Contrasting Images<br>:house:project
- Geometry Fidelity for Spherical Images
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache<br>:star:code
- Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network<br>:star:code
- Topology-Preserving Downsampling of Binary Images
- Quality Assured: Rethinking Annotation Strategies in Imaging AI
- Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models<br>:house:project
- Data Collection-free Masked Video Modeling
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders<br>:star:code
- Möbius Transform for Mitigating Perspective Distortions in Representation Learning<br>:house:project
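- Foster Adaptivity and Balance in Learning with Noisy Labels<br>:star:code<br>Efficiently tackles the noisy-label problem in deep learning without prior knowledge, substantially improving model performance and robustness.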
- Solving Motion Planning Tasks with a Scalable Generative Model<br>:star:code
- 4D Contrastive Superflows are Dense 3D Representation Learners<br>:star:code
- Learning to Complement and to Defer to Multiple Users<br>:star:code
- Shedding More Light on Robust Classifiers under the lens of Energy-based Models<br>:star:code
- TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data<br>:star:code
- UMBRAE: Unified Multimodal Brain Decoding<br>:star:code<br>:house:project
- Trainable Highly-expressive Activation Functions<br>:star:code
- Controllable Navigation Instruction Generation with Chain of Thought Prompting
- Recursive Visual Programming<br>:star:code
- Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation<br>:star:code
- Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems<br>:star:code
- The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations<br>:star:code<br>:house:project
- HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions
- DataDream: Few-shot Guided Dataset Generation<br>:star:code
- Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models<br>:star:code<br>:house:project
- Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation<br>:star:code
- FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation
- Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos<br>:star:code
- Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems<br>:star:code
- Pathformer3D: A 3D Scanpath Transformer for 360° Images<br>:star:code
- Kinetic Typography Diffusion Model<br>:star:code
- PolyRoom: Room-aware Transformer for Floorplan Reconstruction<br>:star:code
- Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
- Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures<br>:star:code
- Improving Hyperbolic Representations via Gromov-Wasserstein Regularization
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion<br>:star:code
- Efficient Training with Denoised Neural Weights<br>:star:code
- SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images<br>:star:code
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery<br>:star:code
- SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model<br>:house:project
- Multi-modal Relation Distillation for Unified 3D Representation Learning
- Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference<br>:star:code
- TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly
- SIGMA: Sinkhorn-Guided Masked Video Modeling<br>:star:code
- Attention Beats Linear for Fast Implicit Neural Representation Generation<br>:star:code
- Text2Place: Affordance-aware Text Guided Human Placement<br>:star:code
- RoadPainter: Points Are Ideal Navigators for Topology transformER
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay<br>:star:code
- Differentiable Convex Polyhedra Optimization from Multi-view Images<br>:star:code
- A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control<br>:star:code
- Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment
- SINDER: Repairing the Singular Defects of DINOv2<br>:star:code
- SHIC: Shape-Image Correspondences with no Keypoint Supervision<br>:house:project
- Semicalibrated Relative Pose from an Affine Correspondence and Monodepth (semicalibrated relative pose)
- Scalable Group Choreography via Variational Phase Manifold Learning
- Deep Companion Learning: Enhancing Generalization Through Historical Consistency
- Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations<br>:star:code
- Neural Surface Detection for Unsigned Distance Fields
- Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas<br>:star:code
- Platypus: A Generalized Specialist Model for Reading Text in Various Forms<br>:star:code
- RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images<br>:star:code
- Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation
- Affine steerers for structured keypoint description
- SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning<br>:star:code
- MMBench: Is Your Multi-modal Model an All-around Player?<br>:star:code<br>:house:project
- DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs<br>:star:code
- PreLAR: World Model Pre-training with Learnable Action Representation<br>:star:code
- Dataset Enhancement with Instance-Level Augmentations<br>:star:code
- Non-parametric Sensor Noise Modeling and Synthesis
- Stripe Observation Guided Inference Cost-free Attention Mechanism<br>:star:code
- Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation
- Object-Aware NIR-to-Visible Translation<br>:star:code<br>:sunflower:dataset (low-level vision)
Categorized paper collections for 2020: click here
↘️CVPR-2020-Papers ↘️ECCV-2020-Papers
<a name="00"/>Categorized paper collections for 2021: click here
↘️ICCV-2021-Papers ↘️CVPR-2021-Papers
<a name="000"/>Categorized paper collections for 2022: click here
↘️CVPR-2022-Papers ↘️WACV-2022-Papers ↘️ECCV-2022-Papers
<a name="0000"/>Categorized paper collections for 2023: click here
↘️CVPR-2023-Papers ↘️WACV-2023-Papers ↘️ICCV-2023-Papers
Scan the QR code to add CV君 on WeChat (mention: CVPR) and join the WeChat discussion group: