Weak Perspective Projection in 3D Human Pose Estimation Methods such as SPIN and VIBE

youthy333

Article link: https://blog.csdn.net/qq_37099774/article/details/124399583


Weak Perspective Projection

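Weak perspective projection assumes that the focal length and the object distance are both large enough that the object's extent along the $z$ axis (the optical axis) is negligible compared with its distance from the camera.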
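
To make the assumption concrete, here is a minimal sketch that compares full perspective projection with its weak perspective approximation as the object moves away from the camera. The function names and the sample values (`x`, `dz`, the list of mean depths) are illustrative assumptions, not taken from SPIN or VIBE.

```python
def perspective(x, z, f):
    """Full perspective projection of a point at lateral offset x and depth z."""
    return f * x / z


def weak_perspective(x, z_mean, f):
    """Weak perspective: every point is assumed to sit at the mean depth z_mean."""
    return f * x / z_mean


# A point 0.5 units off-axis and 0.3 units behind the object's mean depth.
x, dz = 0.5, 0.3
errors = []
for z_mean in (2.0, 10.0, 50.0):
    f = z_mean  # grow the focal length with the distance so the image size stays fixed
    full = perspective(x, z_mean + dz, f)
    weak = weak_perspective(x, z_mean, f)
    rel_err = abs(full - weak) / abs(weak)
    errors.append(rel_err)
    print(f"z_mean={z_mean:5.1f}  full={full:.4f}  weak={weak:.4f}  rel_err={rel_err:.4f}")
```

The relative error equals `dz / (z_mean + dz)`, so it shrinks as the object distance grows while the depth offset stays fixed.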

Weak Perspective Projection in SPIN, VIBE, and Similar 3D Human Pose Estimation Methods

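To begin with, the 3D keypoints already lie inside a $[-1, 1]^3$ cube. The camera sits at the center of the cube (the origin of the world coordinate system), and the camera coordinate system is fully aligned with the world coordinate system, as shown in Figure 1.
Figure 1. Initial state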

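To apply weak perspective projection, the object distance has to be increased. It is set according to
$t_z = \frac{2 \times f}{Res \times s}$
where $f$ is the focal length and $Res$ is the size of the image after cropping and resizing, i.e. the network input size, which is usually 224 in these papers. $s$ is one of the cam parameters predicted by the network, $cam = (s, t_x, t_y)$: $t_x, t_y$ are the offsets applied to the keypoints inside the $[-1, 1]^3$ cube, and $s$ is the scale of the person within the $224 \times 224$ image. Figure 2 illustrates how the formula for $t_z$ is derived.
Figure 2. Illustration of how the object distance $t_z$ is computed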
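
As a quick numerical check of this formula, the sketch below computes $t_z$ for a few scales and projects the edge of the cube ($x = 1$) with a pinhole model. The values $f = 5000$ and $Res = 224$ match the code in this article. The helper names `object_distance` and `project_x` are made up for illustration.

```python
FOCAL_LENGTH = 5000.0  # focal length used in the code of this article
IMG_RES = 224.0        # side length of the cropped and resized input image


def object_distance(s, f=FOCAL_LENGTH, res=IMG_RES):
    """t_z = 2 f / (Res * s): the depth at which the body is placed."""
    return 2.0 * f / (res * s)


def project_x(x, t_z, f=FOCAL_LENGTH):
    """Pixel offset from the image center for a point at lateral position x and depth t_z."""
    return f * x / t_z


for s in (0.5, 1.0, 2.0):
    t_z = object_distance(s)
    edge_px = project_x(1.0, t_z)  # x = 1 is the edge of the [-1, 1]^3 cube
    print(f"s={s}: t_z={t_z:.3f}, cube edge lands {edge_px:.1f} px from the center")
```

With $s = 1$ the cube edge lands $Res / 2 = 112$ pixels from the center, so the cube exactly fills the crop. A smaller $s$ pushes the body further away and shrinks it in the image.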

Projection steps

1. Translate the keypoints by $[t_x, t_y, t_z]$.
2. Construct the camera intrinsic matrix and apply it to the keypoints to obtain pixel coordinates.
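
Written out, the two steps look as follows. The symbols $(X, Y, Z)$ for a keypoint in the cube and $(X_c, Y_c, Z_c)$ for the same point in camera coordinates are notation assumed here for illustration. The rotation is taken to be the identity because the camera and world frames are aligned.

$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
=
\begin{bmatrix} X + t_x \\ Y + t_y \\ Z + t_z \end{bmatrix},
\qquad
Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
$$

so that $u = f \frac{X_c}{Z_c} + u_0$ and $v = f \frac{Y_c}{Z_c} + v_0$, where $(u_0, v_0)$ is the principal point.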

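Note: judging from the code, there is only one difference from the procedure above: $u_0$ and $v_0$ are both set to 0. This is because the ground-truth 2D joints already take the center of the cropped image as their origin and are normalized to $[-1, 1]$.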

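import torch


def projection(pred_joints, pred_camera):
    # pred_camera holds [s, t_x, t_y]; build the translation [t_x, t_y, t_z]
    # with t_z = 2 * f / (Res * s), where f = 5000 and Res = 224.
    pred_cam_t = torch.stack([pred_camera[:, 1],
                              pred_camera[:, 2],
                              2 * 5000. / (224. * pred_camera[:, 0] + 1e-9)], dim=-1)
    batch_size = pred_joints.shape[0]
    # Principal point (u_0, v_0) = (0, 0), created on the same device as the joints.
    camera_center = torch.zeros(batch_size, 2, device=pred_joints.device)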
    pred_keypoints_2d = perspective_projection(pred_joints,
                                               rotation=torch.eye(3).unsqueeze(0).expand(batch_size, -1, -1).to(pred_joints.device),
                                               translation=pred_cam_t,
                                               focal_length=5000.,
                                               camera_center=camera_center)
    # Normalize keypoints to [-1,1]
    pred_keypoints_2d = pred_keypoints_2d / (224. / 2.)
    return pred_keypoints_2d

def perspective_projection(points, rotation, translation,
                           focal_length, camera_center):
    """
    This function computes the perspective projection of a set of points.
    Input:
        points (bs, N, 3): 3D points
        rotation (bs, 3, 3): Camera rotation
        translation (bs, 3): Camera translation
        focal_length (bs,) or scalar: Focal length
        camera_center (bs, 2): Camera center
    """
    batch_size = points.shape[0]
    K = torch.zeros([batch_size, 3, 3], device=points.device)
    K[:,0,0] = focal_length
    K[:,1,1] = focal_length
    K[:,2,2] = 1.
    K[:,:-1, -1] = camera_center

    # Transform points
    points = torch.einsum('bij,bkj->bki', rotation, points)
    points = points + translation.unsqueeze(1)

    # Perspective division: divide by the depth Z_c before applying K
    projected_points = points / points[:,:,-1].unsqueeze(-1)

    # Apply camera intrinsics
    projected_points = torch.einsum('bij,bkj->bki', K, projected_points)

    return projected_points[:, :, :-1]
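
Because $|Z| \le 1$ is small compared with $t_z$, the normalized output of this projection is approximately $s \cdot (X + t_x)$ and $s \cdot (Y + t_y)$, which is the weak perspective form. The sketch below checks this numerically for one point. It is a scalar, dependency-free restatement written for illustration, not the original SPIN or VIBE code, and the sample `point` and `cam` values are arbitrary assumptions.

```python
FOCAL_LENGTH = 5000.0
IMG_RES = 224.0


def project_normalized(point, cam):
    """Scalar restatement of the projection for a single 3D point.

    point: (X, Y, Z) inside the [-1, 1]^3 cube
    cam:   (s, t_x, t_y) as predicted by the network
    Returns (u, v) normalized by half the image size.
    """
    s, t_x, t_y = cam
    t_z = 2.0 * FOCAL_LENGTH / (IMG_RES * s + 1e-9)
    x_c, y_c, z_c = point[0] + t_x, point[1] + t_y, point[2] + t_z
    # Perspective division followed by the intrinsics with u_0 = v_0 = 0.
    u = FOCAL_LENGTH * x_c / z_c
    v = FOCAL_LENGTH * y_c / z_c
    return u / (IMG_RES / 2.0), v / (IMG_RES / 2.0)


def weak_perspective(point, cam):
    """Closed-form weak perspective: scale the translated x and y by s."""
    s, t_x, t_y = cam
    return s * (point[0] + t_x), s * (point[1] + t_y)


point = (0.3, -0.5, 0.2)   # arbitrary keypoint inside the cube
cam = (0.9, 0.05, -0.1)    # arbitrary (s, t_x, t_y)
full = project_normalized(point, cam)
weak = weak_perspective(point, cam)
max_diff = max(abs(a - b) for a, b in zip(full, weak))
print(full, weak, max_diff)
```

The two results differ only by the factor $t_z / (Z + t_z)$, which stays close to 1 for the large focal length used here.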