CVPR 2024 Papers and Open-Source Projects (Papers with Code)

CVPR 2024 decisions are now available on OpenReview!

Note 1: Contributions are welcome! Please open an issue to share CVPR 2024 papers and open-source projects.

Note 2: For papers from previous CV conferences and other curated collections of high-quality CV papers, see: https://github.com/amusi/daily-paper-computer-vision


【CVPR 2024 Papers with Open-Source Code: Table of Contents】

<a name="3DGS"></a>

3DGS(Gaussian Splatting)

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

<a name="Avatars"></a>

Avatars

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Real-Time Simulated Avatar from Head-Mounted Sensors

<a name="Backbone"></a>

Backbone

RepViT: Revisiting Mobile CNN From ViT Perspective

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

<a name="CLIP"></a>

CLIP

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

FairCLIP: Harnessing Fairness in Vision-Language Learning

<a name="MAE"></a>

MAE

<a name="Embodied-AI"></a>

Embodied AI

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

LEMON: Learning 3D Human-Object Interaction Relation from 2D Images

<a name="GAN"></a>

GAN

<a name="OCR"></a>

OCR

An Empirical Study of Scaling Law for OCR

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

<a name="NeRF"></a>

NeRF

PIE-NeRF🍕: Physics-based Interactive Elastodynamics with NeRF

<a name="DETR"></a>

DETR

DETRs Beat YOLOs on Real-time Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

<a name="Prompt"></a>

Prompt

<a name="MLLM"></a>

Multimodal Large Language Models (MLLM)

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Link-Context Learning for Multimodal LLMs

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Making Large Multimodal Models Understand Arbitrary Visual Prompts

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

OneLLM: One Framework to Align All Modalities with Language

<a name="LLM"></a>

Large Language Models (LLM)

VTimeLLM: Empower LLM to Grasp Video Moments

<a name="NAS"></a>

NAS

<a name="ReID"></a>

ReID (Re-Identification)

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

<a name="Diffusion"></a>

Diffusion Models

InstanceDiffusion: Instance-level Control for Image Generation

Residual Denoising Diffusion Models

DeepCache: Accelerating Diffusion Models for Free

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

SVGDreamer: Text Guided SVG Generation with Diffusion Model

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

MMA-Diffusion: MultiModal Attack on Diffusion Models

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

<a name="Vision-Transformer"></a>

Vision Transformer

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

RepViT: Revisiting Mobile CNN From ViT Perspective

A General and Efficient Training for Transformer via Token Expansion

<a name="VL"></a>

Vision-Language

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

FairCLIP: Harnessing Fairness in Vision-Language Learning

<a name="Object-Detection"></a>

Object Detection

DETRs Beat YOLOs on Real-time Object Detection

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation

YOLO-World: Real-Time Open-Vocabulary Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

<a name="Anomaly-Detection"></a>

Anomaly Detection

Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection

<a name="VT"></a>

Object Tracking

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking

<a name="Semantic-Segmentation"></a>

Semantic Segmentation

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

<a name="MI"></a>

Medical Image

Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images

<a name="MIS"></a>

Medical Image Segmentation

<a name="Autonomous-Driving"></a>

Autonomous Driving

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Memory-based Adapters for Online 3D Scene Perception

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

A Real-world Large-scale Dataset for Roadside Cooperative Perception

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

Traffic Scene Parsing through the TSP6K Dataset

<a name="3D-Point-Cloud"></a>

3D Point Cloud

<a name="3DOD"></a>

3D Object Detection

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

UniMODE: Unified Monocular 3D Object Detection

<a name="3DSS"></a>

3D Semantic Segmentation

<a name="Image-Editing"></a>

Image Editing

Edit One for All: Interactive Batch Image Editing

<a name="Video-Editing"></a>

Video Editing

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

<a name="LLV"></a>

Low-level Vision

Residual Denoising Diffusion Models

Boosting Image Restoration via Priors from Pre-trained Models

<a name="SR"></a>

Super-Resolution

SeD: Semantic-Aware Discriminator for Image Super-Resolution

APISR: Anime Production Inspired Real-World Anime Super-Resolution

<a name="Denoising"></a>

Denoising

Image Denoising

<a name="3D-Human-Pose-Estimation"></a>

3D Human Pose Estimation

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

<a name="Image-Generation"></a>

Image Generation

InstanceDiffusion: Instance-level Control for Image Generation

ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

Instruct-Imagen: Image Generation with Multi-modal Instruction

Residual Denoising Diffusion Models

UniGS: Unified Representation for Image Generation and Segmentation

Multi-Instance Generation Controller for Text-to-Image Synthesis

SVGDreamer: Text Guided SVG Generation with Diffusion Model

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

Ranni: Taming Text-to-Image Diffusion for Accurate Prompt Following

<a name="Video-Generation"></a>

Video Generation

Vlogger: Make Your Dream A Vlog

VBench: Comprehensive Benchmark Suite for Video Generative Models

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

<a name="3D-Generation"></a>

3D Generation

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

<a name="Video-Understanding"></a>

Video Understanding

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

<a name="KD"></a>

Knowledge Distillation

Logit Standardization in Knowledge Distillation

Efficient Dataset Distillation via Minimax Diffusion

<a name="Stereo-Matching"></a>

Stereo Matching

Neural Markov Random Field for Stereo Matching

<a name="SGG"></a>

Scene Graph Generation

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

<a name="Video-Quality-Assessment"></a>

Video Quality Assessment

KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos

<a name="Datasets"></a>

Datasets

A Real-world Large-scale Dataset for Roadside Cooperative Perception

Traffic Scene Parsing through the TSP6K Dataset

<a name="Others"></a>

Others

Object Recognition as Next Token Prediction

ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks

Seamless Human Motion Composition with Blended Positional Encodings

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

MoMask: Generative Masked Modeling of 3D Human Motions

Amodal Ground Truth and Completion in the Wild

Improved Visual Grounding through Self-Consistent Explanations

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object

Learning from Synthetic Human Group Activities

A Cross-Subject Brain Decoding Framework

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

Contrastive Mean-Shift Learning for Generalized Category Discovery