CVPR 2024: 10 Papers on 3D Human Pose Estimation
Multiple View Geometry Transformers for 3D Human Pose Estimation
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/2a3f36ffd0
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Liao_Multiple_View_Geometry_Transformers_for_3D_Human_Pose_Estimation_CVPR_2024_paper.html
Abstract
In this work, we aim to improve the 3D reasoning ability of Transformers in multi-view 3D human pose estimation.
Recent works have focused on end-to-end learning-based transformer designs, which struggle to resolve geometric information accurately, particularly during occlusion.
Instead, we propose a novel hybrid model, MVGFormer, which has a series of geometric and appearance modules organized in an iterative manner.
The geometry modules are learning-free and handle all viewpoint-dependent 3D tasks geometrically, which notably improves the model's generalization ability.
The appearance modules are learnable and are dedicated to estimating 2D poses from image signals end-to-end, which enables them to achieve accurate estimates even when occlusion occurs, leading to a model that is both accurate and generalizable to new cameras and geometries.
We evaluate our approach in both in-domain and out-of-domain settings, where our model consistently outperforms state-of-the-art methods, especially by a significant margin in the out-of-domain setting.
We will release the code and models: https://github.com/XunshanMan/MVGFormer
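The split between learning-free geometry modules and learnable appearance modules is the core of the abstract above. As a generic illustration of what such a learning-free geometric step can look like (not the authors' implementation), the sketch below triangulates a single joint from N calibrated views with classical DLT; all names and shapes are assumptions.

```python
# Hypothetical sketch: classical DLT triangulation of one joint from N
# calibrated views -- a generic stand-in for a learning-free geometry module.
import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """points_2d: (N, 2) pixel coords; proj_mats: (N, 3, 4) projections.

    Returns the (3,) joint position minimizing the algebraic DLT error.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])  # x-constraint: u * P_row3 - P_row1
        rows.append(v * P[2] - P[1])  # y-constraint: v * P_row3 - P_row2
    A = np.stack(rows)                # (2N, 4) homogeneous linear system
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                        # null vector = homogeneous 3D point
    return X[:3] / X[3]               # dehomogenize
```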
A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/388f98e96f
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Peng_A_Dual-Augmentor_Framework_for_Domain_Generalization_in_3D_Human_Pose_CVPR_2024_paper.html
Abstract
3D human pose data collected in controlled laboratory settings presents challenges for pose estimators that must generalize across diverse scenarios.
To address this, domain generalization is employed.
Current methodologies in domain generalization for 3D human pose estimation typically utilize adversarial training to generate synthetic poses for training.
Nonetheless, these approaches exhibit several limitations.
First, the lack of prior information about the target domain complicates the application of suitable augmentation through a single pose augmentor, affecting generalization on target domains.
Moreover, adversarial training's discriminator tends to enforce similarity between source and synthesized poses, impeding the exploration of out-of-source distributions.
Furthermore, the pose estimator's optimization is not exposed to domain shifts, limiting its overall generalization ability.
To address these limitations, we propose a novel framework featuring two pose augmentors: a weak and a strong augmentor.
Our framework employs differential strategies for the generation and discrimination processes, facilitating the preservation of knowledge about source poses and the exploration of out-of-source distributions without any prior information about target poses.
In addition, we leverage meta-optimization to simulate domain shifts in the optimization process of the pose estimator, thereby improving its generalization ability.
Our proposed approach significantly outperforms existing methods, as demonstrated through comprehensive experiments on various benchmark datasets.
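As a toy sketch of the meta-optimization idea described above (simulating a domain shift during the estimator's optimization), the following first-order loop adapts a linear stand-in model on weakly augmented poses and takes the meta-update from the loss on strongly augmented poses. The model, the noise-based augmentations, and the learning rates are illustrative assumptions, not the paper's procedure.

```python
# Toy first-order meta-optimization sketch: inner step on the "weak" domain,
# meta step evaluated on the "strong" domain. All components are placeholders.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3))     # linear stand-in pose estimator
lr_inner, lr_meta = 0.1, 0.05

def loss_and_grad(W, x, y):
    err = x @ W - y
    return (err ** 2).mean(), 2.0 * x.T @ err / err.size

for step in range(200):
    x = rng.normal(size=(32, 3))                       # toy source poses
    y = x                                              # toy identity target
    x_weak = x + 0.01 * rng.normal(size=x.shape)       # weak augmentation
    x_strong = x + 0.30 * rng.normal(size=x.shape)     # strong augmentation

    _, g = loss_and_grad(W, x_weak, y)                 # adapt on weak domain
    W_adapted = W - lr_inner * g

    # the adapted estimator must also fit the shifted (strong) domain
    _, g_meta = loss_and_grad(W_adapted, x_strong, y)
    W -= lr_meta * g_meta                              # first-order meta step
```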
ChatPose: Chatting about 3D Human Pose
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/45e73967da
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Feng_ChatPose_Chatting_about_3D_Human_Pose_CVPR_2024_paper.html
Abstract
We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions.
Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language.
Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities.
ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs.
Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose estimation and generation tasks while offering user interactions.
Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation.
These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images.
We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods.
Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks.
Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.
Code and data are available for research at https://yfeng95.github.io/ChatPose.
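The abstract's central mechanism is embedding SMPL poses as signal tokens inside a multimodal LLM. The schematic below shows one plausible decoding path under that design: the hidden state of a dedicated pose token is linearly projected to SMPL pose parameters. The hidden size, the 72-dimensional axis-angle output, and the projection itself are illustrative assumptions, not ChatPose's actual code.

```python
# Hypothetical pose-token decoding head: LLM hidden state -> SMPL parameters.
import numpy as np

HIDDEN = 4096       # assumed LLM hidden size
SMPL_POSE = 72      # 24 joints x 3 axis-angle parameters (body pose)

rng = np.random.default_rng(0)
proj = rng.normal(scale=0.02, size=(HIDDEN, SMPL_POSE))  # learned in practice

def decode_pose(pose_token_hidden):
    """Project the hidden state of the special pose token to an SMPL pose."""
    return pose_token_hidden @ proj          # (72,) axis-angle vector

h = rng.normal(size=HIDDEN)                  # stand-in for an LLM embedding
theta = decode_pose(h)                       # feed to an SMPL layer downstream
```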
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/4b77f398ce
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Li_Hourglass_Tokenizer_for_Efficient_Transformer-Based_3D_Human_Pose_Estimation_CVPR_2024_paper.html
Abstract
Transformers have been successfully applied in the field of video-based 3D human pose estimation.
However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices.
In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos.
Our HoT begins with pruning the pose tokens of redundant frames and ends with recovering full-length tokens, resulting in only a few pose tokens in the intermediate transformer blocks and thus improving model efficiency.
To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames.
In addition, we develop a token recovering attention (TRA) to restore detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference.
Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method achieves both high efficiency and high estimation accuracy compared to the original VPT models.
For instance, applied to MotionBERT and MixSTE on Human3.6M, our HoT saves nearly 50% of the FLOPs without sacrificing accuracy and nearly 40% of the FLOPs with only a 0.2% accuracy drop, respectively.
Code and models are available at https://github.com/NationalGAILab/HoT.
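A rough sketch of the prune-then-recover pattern described above, under heavy simplification: farthest-point sampling stands in for the token pruning cluster's diversity-aware selection, and a single softmax cross-attention stands in for the token recovering attention. The shapes, the sampling rule, and the attention form are assumptions.

```python
# Simplified hourglass pipeline: prune frame tokens, (run the expensive
# middle blocks on the kept tokens -- omitted), then recover full length.
import numpy as np

def prune_tokens(tokens, k):
    """tokens: (T, C). Farthest-point sampling of k diverse token indices."""
    keep = [0]
    d = np.linalg.norm(tokens - tokens[0], axis=1)
    for _ in range(k - 1):
        keep.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(tokens - tokens[keep[-1]], axis=1))
    return np.array(keep)

def recover_tokens(queries, kept):
    """Cross-attention from full-length queries (T, C) to kept tokens (k, C)."""
    logits = queries @ kept.T / np.sqrt(queries.shape[1])
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    att = np.exp(logits)
    att /= att.sum(axis=1, keepdims=True)
    return att @ kept                                    # (T, C)

T, C, k = 243, 256, 122                                  # prune ~50% of frames
x = np.random.default_rng(0).normal(size=(T, C))
kept = x[prune_tokens(x, k)]
x_full = recover_tokens(x, kept)                         # full-length output
```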
Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/54ea8aa524
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Jeong_Multi-agent_Long-term_3D_Human_Pose_Forecasting_via_Interaction-aware_Trajectory_Conditioning_CVPR_2024_paper.html
Abstract
Human pose forecasting garners attention for its diverse applications.
However, challenges persist in modeling the multi-modal nature of human motion and the intricate interactions among agents, particularly over longer timescales and with more agents.
In this paper, we propose an interaction-aware, trajectory-conditioned, long-term multi-agent human pose forecasting model that uses a coarse-to-fine prediction approach: multi-modal global trajectories are forecasted first, followed by local pose forecasts conditioned on each mode.
To this end, our Trajectory2Pose model introduces a graph-based, agent-wise interaction module for the reciprocal forecasting of local-motion-conditioned global trajectories and trajectory-conditioned local poses.
Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions, improving performance in complex environments.
Furthermore, we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations, enabling a comprehensive evaluation of our proposed model.
State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method.
The code is available at https://github.com/Jaewoo97/T2P.
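The coarse-to-fine split reads naturally as two stages: forecast multi-modal global trajectories, then condition a local pose forecast on each mode. The toy sketch below uses linear extrapolation plus per-mode perturbation in place of the paper's learned trajectory and pose modules; every component is an illustrative assumption.

```python
# Toy coarse-to-fine forecast: multi-modal global trajectories first, then a
# local pose forecast conditioned on each trajectory mode.
import numpy as np

def forecast_trajectories(hist_traj, n_modes=3, horizon=30):
    """hist_traj: (T, 3) root positions -> (n_modes, horizon, 3)."""
    rng = np.random.default_rng(0)
    v = hist_traj[-1] - hist_traj[-2]              # last root velocity
    steps = np.arange(1, horizon + 1)[:, None]
    modes = []
    for _ in range(n_modes):
        v_m = v + 0.05 * rng.normal(size=3)        # per-mode heading jitter
        modes.append(hist_traj[-1] + steps * v_m)
    return np.stack(modes)

def forecast_pose(local_pose, traj_mode):
    """Attach a (frozen, toy) local skeleton to every trajectory point."""
    return traj_mode[:, None, :] + local_pose[None]      # (horizon, J, 3)

hist = np.cumsum(np.full((10, 3), 0.1), axis=0)          # toy walking root
pose = np.zeros((17, 3))                                 # toy local skeleton
futures = [forecast_pose(pose, m) for m in forecast_trajectories(hist)]
```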
KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/84a46d0c61
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Peng_KTPFormer_Kinematics_and_Trajectory_Prior_Knowledge-Enhanced_Transformer_for_3D_Human_CVPR_2024_paper.html
Abstract
This paper presents a novel Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer), which overcomes a weakness of existing transformer-based methods for 3D human pose estimation: the derivation of the Q, K, and V vectors in their self-attention mechanisms is based on simple linear mapping.
We propose two prior attention modules, namely Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA), which take advantage of the known anatomical structure of the human body and of motion trajectory information to facilitate effective learning of global dependencies and features in the multi-head self-attention.
KPA models kinematic relationships in the human body by constructing a kinematic topology, while TPA builds a trajectory topology to learn joint motion trajectory information across frames.
By yielding Q, K, and V vectors imbued with prior knowledge, the two modules enable KTPFormer to model both spatial and temporal correlations simultaneously.
Extensive experiments on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva) show that KTPFormer achieves superior performance in comparison to state-of-the-art methods.
More importantly, our KPA and TPA modules have lightweight, plug-and-play designs and can be integrated into various transformer-based networks (e.g., diffusion-based ones) to improve performance with only a very small increase in computational overhead.
The code is available at: https://github.com/JihuaPeng/KTPFormer
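The weakness named above is that Q, K, and V come from plain linear maps of the tokens. One way to picture the prior-attention fix (a loose sketch, not KPA's exact formulation): diffuse per-joint features over a kinematic adjacency before deriving Q, K, and V, so the attention already carries skeletal structure. The partial bone list and all shapes are assumptions.

```python
# Loose KPA-style sketch: kinematics-aware features feed Q/K/V instead of a
# plain linear map of the raw joint tokens.
import numpy as np

J, C = 17, 64
A = np.eye(J)
bones = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8)]
for i, j in bones:                       # partial Human3.6M-style skeleton
    A[i, j] = A[j, i] = 1.0
A /= A.sum(axis=1, keepdims=True)        # row-normalized kinematic graph

rng = np.random.default_rng(0)
x = rng.normal(size=(J, C))              # per-joint tokens
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(C, C)) for _ in range(3))

x_prior = A @ x                          # diffuse features along the skeleton
Q, K, V = x_prior @ Wq, x_prior @ Wk, x_prior @ Wv
logits = Q @ K.T / np.sqrt(C)
logits -= logits.max(axis=1, keepdims=True)
att = np.exp(logits)
att /= att.sum(axis=1, keepdims=True)
out = att @ V                            # (J, C) prior-enhanced attention
```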
FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/8830af23a5
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Wang_FreeMan_Towards_Benchmarking_3D_Human_Pose_Estimation_under_Real-World_Conditions_CVPR_2024_paper.html
Abstract
Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception.
3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction, serving as a crucial technique for understanding and interacting with human actions in real-world settings.
However, the current datasets, often collected under single laboratory conditions using complex motion-capture equipment and unvarying backgrounds, are insufficient.
The absence of datasets covering variable conditions is stalling the progress of this crucial task.
To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale multi-view dataset collected under real-world conditions.
FreeMan was captured by synchronizing 8 smartphones across diverse scenarios.
It comprises 11M frames from 8000 sequences viewed from different perspectives.
These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions.
We have also established a semi-automated pipeline with error detection to reduce the workload of manual checking and ensure precise annotation.
We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan.
Further evaluations on standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes.
FreeMan is publicly available at https://wangjiongw.github.io/freeman
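The semi-automated pipeline above includes error detection to cut down manual checking. A plausible, purely hypothetical form of such a check is sketched below: flag any frame whose triangulated joints reproject poorly into one of the 8 views, so only flagged frames need a human look. The threshold and shapes are assumptions.

```python
# Hypothetical annotation-QA check: flag frames with high reprojection error.
import numpy as np

def reprojection_errors(X, proj_mats, obs_2d):
    """X: (J, 3) joints; proj_mats: (V, 3, 4); obs_2d: (V, J, 2) detections."""
    Xh = np.concatenate([X, np.ones((len(X), 1))], axis=1)   # homogeneous
    errs = []
    for P, obs in zip(proj_mats, obs_2d):
        p = Xh @ P.T                                         # (J, 3)
        p = p[:, :2] / p[:, 2:3]                             # to pixels
        errs.append(np.linalg.norm(p - obs, axis=1))
    return np.stack(errs)                                    # (V, J) in px

def needs_manual_check(X, proj_mats, obs_2d, px_thresh=10.0):
    return bool(reprojection_errors(X, proj_mats, obs_2d).max() > px_thresh)
```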
PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/8cb5eaf48d
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Cai_PoseIRM_Enhance_3D_Human_Pose_Estimation_on_Unseen_Camera_Settings_CVPR_2024_paper.html
Abstract
Camera-parameter-free multi-view pose estimation is an emerging technique for 3D human pose estimation (HPE).
Such methods infer the camera settings implicitly or explicitly to mitigate the impact of depth uncertainty, showing significant potential in real applications.
However, due to the limited diversity of camera settings in the available datasets, the inferred camera parameters are effectively hardcoded into the model during training and are not adaptable to the input at inference, so the learned models cannot generalize well under unseen camera settings.
A natural solution is to artificially synthesize samples, i.e., 2D-3D pose pairs, under a massive number of new camera settings.
Unfortunately, to prevent over-fitting to the existing camera setting, the number of synthesized samples for each new camera setting should be comparable with that for the existing one, which multiplies the scale of training and can even make it computationally prohibitive.
In this paper, we propose a novel HPE approach under the invariant risk minimization (IRM) paradigm.
Specifically, we first synthesize 2D poses from myriad camera settings.
We then train our model under the IRM paradigm, which aims to learn a common optimal model across all camera settings and thus forces the model to automatically learn the camera parameters from the input data.
This allows the model to accurately infer 3D poses on unseen data by training on only a handful of samples from each synthesized setting, thus avoiding an unbearable increase in training cost.
Another appealing feature of our method is that, benefiting from IRM's capability to identify invariant features, its performance on the seen camera settings is enhanced as well.
Comprehensive experiments verify the superiority of our approach.
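To make the IRM idea concrete: each synthesized camera setting acts as a training environment, and an IRMv1-style objective adds, per environment, the squared gradient of the risk with respect to a dummy scale w = 1.0 on the predictions, so a model is preferred only if it is simultaneously optimal in every environment. The toy linear model and data below are illustrative assumptions, not the paper's setup.

```python
# Toy IRMv1-style objective over synthesized camera-setting "environments".
import numpy as np

def irm_objective(W, environments, lam=1.0):
    total = 0.0
    for x, y in environments:
        pred = x @ W
        risk = ((pred - y) ** 2).mean()
        # d/dw of mean((w*pred - y)^2), evaluated at the dummy scale w = 1
        grad_w = 2.0 * ((pred - y) * pred).mean()
        total += risk + lam * grad_w ** 2
    return total / len(environments)

rng = np.random.default_rng(0)
W_true = rng.normal(size=(2, 3))              # toy 2D-pose -> 3D-pose map
envs = []
for e in range(4):                            # four synthesized settings
    x = rng.normal(size=(64, 2)) + 0.2 * e    # setting-specific input shift
    y = x @ W_true + 0.05 * rng.normal(size=(64, 3))
    envs.append((x, y))

W = rng.normal(scale=0.1, size=(2, 3))        # candidate model
print(irm_objective(W, envs))                 # risk + invariance penalty
```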
FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/baeda7808b
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Xu_FinePOSE_Fine-Grained_Prompt-Driven_3D_Human_Pose_Estimation_via_Diffusion_Models_CVPR_2024_paper.html
Abstract
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space.
Despite recent advancements, deep learning-based methods mostly ignore the possibility of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task.
Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting the fine-grained guidance hidden in different body parts.
To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE.
It consists of three core blocks that enhance the reverse process of the diffusion model: (1) a Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts by coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance;
(2) a Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communication between the learned part-aware prompts and poses to improve denoising quality;
(3) a Prompt-driven Timestamp Stylization (PTS) block integrates the learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step.
Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods.
We further extend FinePOSE to multi-human pose estimation.
Achieving 34.3 mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios.
Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024
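The PTS block, as described, fuses the learned prompt embedding with noise-level information at each denoising step. Below is a minimal sketch of one way to do that (FiLM-style scale/shift on a sinusoidal timestep embedding); the dimensions, the FiLM choice, and the weights are assumptions rather than FinePOSE's actual block.

```python
# Sketch of prompt-conditioned timestep stylization for a diffusion denoiser.
import numpy as np

def timestep_embedding(t, dim=64):
    """Standard sinusoidal embedding of the diffusion timestep t."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])        # (dim,)

rng = np.random.default_rng(0)
W_scale = rng.normal(scale=0.1, size=(64, 64))   # learned in practice
W_shift = rng.normal(scale=0.1, size=(64, 64))   # learned in practice

def stylize(t, prompt_emb):
    """FiLM-style modulation of the noise-level embedding by the prompt."""
    te = timestep_embedding(t)
    return (1.0 + prompt_emb @ W_scale) * te + prompt_emb @ W_shift

cond = stylize(t=500, prompt_emb=rng.normal(size=64))  # per-step conditioning
```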
3D Human Pose Perception from Egocentric Stereo Videos
Paper interpretation: http://www.studyai.com/xueshu/paper/detail/cefb08cdb0
Paper link: https://openaccess.thecvf.com/content/CVPR2024/html/Akada_3D_Human_Pose_Perception_from_Egocentric_Stereo_Videos_CVPR_2024_paper.html
Abstract
While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusion of the device user.
Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views.
In this work, we propose a new transformer-based framework for improving egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos.
Specifically, we utilize 1) depth features from our 3D scene reconstruction module, obtained with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by the temporal features of the video inputs.
Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting.
Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (Real World).
UnrealEgo2 is a large-scale in-the-wild dataset captured in synthetic 3D scenes.
UnrealEgo-RW is a real-world dataset captured with our newly developed device.
The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods.
Our extensive experiments show that the proposed approach significantly outperforms previous methods.
UnrealEgo2, UnrealEgo-RW, and the trained models are available on our project page and benchmark challenge.
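Since the framework builds depth features from scene reconstruction over the stereo pair, a small geometric helper illustrates the underlying relation: depth = f·B/d for a rectified stereo pair. The focal length and baseline below are placeholders, and this is generic stereo geometry, not the paper's reconstruction module.

```python
# Generic rectified-stereo helper: disparity map -> metric depth map.
import numpy as np

def disparity_to_depth(disparity, focal_px=320.0, baseline_m=0.12):
    """depth = f * B / d; invalid (non-positive) disparities map to inf."""
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

disp = np.random.default_rng(0).uniform(0.5, 8.0, size=(256, 256))
depth = disparity_to_depth(disp)         # (256, 256) depth in meters
```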