CVPR 2025 Papers and Open-Source Projects Collection (Papers with Code)
CVPR 2025 decisions are now available on OpenReview! Acceptance rate: 22.1% (2878 / 13008)
Note 1: Issues sharing CVPR 2025 papers and open-source projects are very welcome!
Note 2: For papers from previous top CV conferences and other curated CV paper roundups, see: https://github.com/amusi/daily-paper-computer-vision
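For reference, the acceptance figure quoted above can be reproduced with a couple of lines of Python (the numbers are taken directly from the announcement line):

```python
# Quick check of the acceptance rate quoted above: 2878 accepted out of 13008 submissions.
accepted, submitted = 2878, 13008
print(f"Acceptance rate: {accepted / submitted:.1%}")  # -> Acceptance rate: 22.1%
```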
【CVPR 2025 Paper and Open-Source Project Directory】
- 3DGS(Gaussian Splatting)
- Avatars
- Backbone
- CLIP
- Mamba
- Embodied AI
- GAN
- GNN
- Multimodal Large Language Models (MLLM)
- Large Language Models (LLM)
- NAS
- OCR
- NeRF
- DETR
- Diffusion Models
- ReID (Re-Identification)
- Long-Tail Distribution
- Vision Transformer
- Vision-Language
- Self-supervised Learning
- Data Augmentation
- Object Detection
- Anomaly Detection
- Object Tracking
- Semantic Segmentation
- Instance Segmentation
- Panoptic Segmentation
- Medical Image
- Medical Image Segmentation
- Video Object Segmentation
- Video Instance Segmentation
- Referring Image Segmentation
- Image Matting
- Image Editing
- Low-level Vision
- Super-Resolution
- Denoising
- Deblurring
- Autonomous Driving
- 3D Point Cloud
- 3D Object Detection
- 3D Semantic Segmentation
- 3D Object Tracking
- 3D Semantic Scene Completion
- 3D Registration
- 3D Human Pose Estimation
- 3D Human Mesh Estimation
- Image Generation
- Video Generation
- 3D Generation
- Video Understanding
- Action Detection
- Embodied AI
- Text Detection
- Knowledge Distillation
- Model Pruning
- Image Compression
- 3D Reconstruction
- Depth Estimation
- Trajectory Prediction
- Lane Detection
- Image Captioning
- Visual Question Answering
- Sign Language Recognition
- Video Prediction
- Novel View Synthesis
- Zero-Shot Learning
- Stereo Matching
- Feature Matching
- Low-light Image Enhancement
- Scene Graph Generation
- Style Transfer
- Implicit Neural Representations
- Image Quality Assessment
- Video Quality Assessment
- Datasets
- New Tasks
- Others
3DGS(Gaussian Splatting)
Avatars
Backbone
CLIP
Mamba
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
- Paper: [2407.08083] MambaVision: A Hybrid Mamba-Transformer Vision Backbone
- Code: https://github.com/NVlabs/MambaVision
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
- Paper: [2411.15941] MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
- Code: https://github.com/lewandofskee/MobileMamba
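Each Paper bullet in this collection references an arXiv ID such as 2407.08083 (MambaVision above). For readers who want to pull titles or abstracts for these entries programmatically, here is a minimal sketch using the public arXiv export API; the helper name is illustrative, and network access is assumed:

```python
# Minimal sketch: fetch a paper title from the public arXiv export API
# given one of the arXiv IDs used in the "Paper:" bullets of this list.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv feed

def arxiv_title(arxiv_id: str) -> str:
    """Query the arXiv Atom API and return the title for a single ID."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    entry = feed.find(f"{ATOM}entry")
    title = entry.find(f"{ATOM}title").text
    return " ".join(title.split())  # collapse the line breaks arXiv inserts

if __name__ == "__main__":
    print(arxiv_title("2407.08083"))  # MambaVision entry above
```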
Embodied AI
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
- Project: CityWalker
- Paper: [2411.17820] CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
- Code: https://github.com/ai4ce/CityWalker
GAN
OCR
NeRF
DETR
Prompt
Multimodal Large Language Models (MLLM)
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
- Paper: [2412.01292] LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
- Code: https://github.com/Hoyyyaard/LSceneLLM
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
- Paper: [2405.16071] DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
- Code: https://github.com/callsys/DynRefer
Retrieval-Augmented Personalization for Multimodal Large Language Models
- Project Page: RAP-MLLM
- Paper: [2410.13360] Retrieval-Augmented Personalization for Multimodal Large Language Models
- Code: https://github.com/Hoar012/RAP-MLLM
Large Language Models (LLM)
NAS
ReID (Re-Identification)
From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization
- Paper: [2503.00938] From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization
- Code: https://github.com/yuanc3/Pose2ID
AirRoom: Objects Matter in Room Reidentification
- Project: https://sairlab.org/airroom/
- Paper: [2503.01130] AirRoom: Objects Matter in Room Reidentification
Diffusion Models
TinyFusion: Diffusion Transformers Learned Shallow
- Paper: [2412.01199] TinyFusion: Diffusion Transformers Learned Shallow
- Code: https://github.com/VainF/TinyFusion
Vision Transformer
Vision-Language
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
- Paper: [2412.01256] NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
- Code: https://github.com/qunovo/NLPrompt
Object Detection
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
- Paper: [2501.18954] LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
- Code: https://github.com/iSEE-Laboratory/LLMDet
Anomaly Detection
Object Tracking
Multiple Object Tracking as ID Prediction
Omnidirectional Multi-Object Tracking
Medical Image
Medical Image Segmentation
Autonomous Driving
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
- Project: https://ldkong.com/LiMoE
- Paper: [2501.04004] LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
- Code: https://github.com/Xiangxu-0103/LiMoE
3D Point Cloud
3D Object Detection
3D Semantic Segmentation
Low-level Vision
Super-Resolution
AESOP: Auto-Encoded Supervision for Perceptual Image Super-Resolution
- Paper: [2412.00124] Auto-Encoded Supervision for Perceptual Image Super-Resolution
- Code: https://github.com/2minkyulee/AESOP-Auto-Encoded-Supervision-for-Perceptual-Image-Super-Resolution
Denoising
Image Denoising
3D Human Pose Estimation
Image Generation
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- Paper: [2501.01423] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- Code: https://github.com/hustvl/LightningDiT
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
- Paper: [2412.04852] SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
- Code: https://github.com/taco-group/SleeperMark
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- Homepage: TokenFlow
- Code: https://github.com/ByteFlow-AI/TokenFlow
- Paper: [2412.03069] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
PAR: Parallelized Autoregressive Visual Generation
- Project: Parallelized Autoregressive Visual Generation
- Paper: [2412.15119] Parallelized Autoregressive Visual Generation
- Code: https://github.com/Epiphqny/PAR
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
- Project: Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
- Paper: [2412.02168] Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
- Code: https://github.com/pandayuanyu/generative-photography
Video Generation
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
- Paper: [2411.17440] Identity-Preserving Text-to-Video Generation by Frequency Decomposition
- Code: https://github.com/PKU-YuanGroup/ConsisID
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
- Paper: [2407.15642] Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
- Code: https://github.com/maxin-cn/Cinemo
X-Dyna: Expressive Dynamic Human Image Animation
- Paper: [2501.10021] X-Dyna: Expressive Dynamic Human Image Animation
- Code: https://github.com/bytedance/X-Dyna
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- Project: Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- Paper: [2411.19108] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- Code: https://github.com/ali-vilab/TeaCache
AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
- Project: AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
- Paper: [2503.07418] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
- Code: https://github.com/iva-mzsun/AR-Diffusion
Image Editing
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
- Paper: [2411.16832] Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
- Code: https://github.com/taco-group/FaceLock
h-Edit: Effective and Flexible Diffusion-Based Editing via Doob’s h-Transform
- Paper: [2503.02187] h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
- Code: https://github.com/nktoan/h-edit
Video Editing
3D Generation
Generative Gaussian Splatting for Unbounded 3D City Generation
- Project: GaussianCity
- Paper: [2406.06526] Generative Gaussian Splatting for Unbounded 3D City Generation
- Code: https://github.com/hzxie/GaussianCity
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
- Project: StdGEN
- Paper: [2411.05738] StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
- Code: https://github.com/hyz317/StdGEN
3D Reconstruction
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
- Project: Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
- Paper: [2501.13928] Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Human Motion Generation
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
- Project: SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
- Paper: [2503.01291] SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
- Code: https://github.com/4DVLab/SemGeoMo
Video Understanding
Number it: Temporal Grounding Videos like Flipping Manga
- Paper: [2411.10332] Number it: Temporal Grounding Videos like Flipping Manga
- Code: https://github.com/yongliang-wu/NumPro
Embodied AI
Universal Actions for Enhanced Embodied Foundation Models
- Project: Universal Actions for Enhanced Embodied Foundation Models
- Paper: [2501.10105] Universal Actions for Enhanced Embodied Foundation Models
- Code: https://github.com/2toinf/UniAct
Knowledge Distillation
Depth Estimation
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
- Project: https://depthcrafter.github.io
- Paper: [2409.02095] DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
- Code: https://github.com/Tencent/DepthCrafter
MonSter: Marry Monodepth to Stereo Unleashes Power
- Paper: [2501.08643] MonSter: Marry Monodepth to Stereo Unleashes Power
- Code: https://github.com/Junda24/MonSter
Stereo Matching
MonSter: Marry Monodepth to Stereo Unleashes Power
- Paper: [2501.08643] MonSter: Marry Monodepth to Stereo Unleashes Power
- Code: https://github.com/Junda24/MonSter
Low-light Image Enhancement
HVI: A New Color Space for Low-light Image Enhancement
- Paper: [2502.20272] HVI: A New Color Space for Low-light Image Enhancement
- Code: https://github.com/Fediory/HVI-CIDNet
- Demo: https://huggingface.co/spaces/Fediory/HVI-CIDNet_Low-light-Image-Enhancement_
ReDDiT: Efficient Diffusion as Low Light Enhancer
- Paper: [2410.12346] Efficient Diffusion as Low Light Enhancer
- Code: https://github.com/lgz-0713/ReDDiT
Scene Graph Generation
Style Transfer
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
- Project: StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
- Paper: [2412.08503] StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
- Code: https://github.com/Westlake-AGI-Lab/StyleStudio
Video Quality Assessment
Datasets
Others
Note: All content is copied from [GitHub - amusi/CVPR2025-Papers-with-Code: CVPR 2025 Papers and Open-Source Projects Collection]