WIDER_Face dataset notes
Overall, because the faces in WIDER_Face are small, the data quality is not high.
The 68/5-point landmark coordinates provided by the authors are actually 3D perspective-projected versions.
datasets
└─WIDER_Face
   ├─WIDER_train
   │  └─images (62 folders, 12880 jpg images)
   │     ├─0--Parade
   │     │  └─0_Parade_marchingband_1_100.jpg, etc.
   │     ├─...
   │     └─61--Street_Battle
   ├─WIDER_val
   │  └─images (62 folders, 3226 jpg images)
   │     ├─0--Parade
   │     ├─...
   │     └─61--Street_Battle
   └─wider_face_split (not used by the authors)
      ├─readme.txt
      └─...
annotations
├─WIDER_train_annotations.txt (12880 lines; each line gives the path of one json file)
├─WIDER_val_annotations.txt (3226 lines)
├─WIDER_train
│  ├─0--Parade
│  │  └─0_Parade_marchingband_1_100.json, etc.
│  ├─...
│  └─61--Street_Battle
└─WIDER_val
   ├─0--Parade
   ├─...
   └─61--Street_Battle
Example json file content
{
"image_path": "0--Parade/0_Parade_marchingband_1_100.jpg",
"bboxes": [
[433.0, 189.0, 467.0, 231.0],
[80.0, 188.0, 142.0, 262.0],
[5.0, 203.0, 36.0, 236.0],
[296.0, 174.0, 341.0, 226.0],
[213.0, 151.0, 259.0, 214.0],
[900.0, 274.0, 981.0, 376.0],
[780.0, 189.0, 805.0, 224.0],
[576.0, 161.0, 616.0, 206.0],
[529.0, 180.0, 563.0, 220.0]
],
"landmarks": [
    each element has shape [5, 2] or [68, 2]
]
}
Example code
# Understanding the raw data
import json
import os

import cv2
import numpy as np
import pandas as pd

if 0:
    dataset_path = './datasets/WIDER_Face/WIDER_train/images'
    json_list = './annotations/WIDER_train_annotations.txt'
    image_paths = pd.read_csv(json_list, delimiter=" ", header=None)
    image_paths = np.asarray(image_paths).squeeze()
    index = np.random.randint(len(image_paths))
    # index = 4881
    print('index =', index)
    image_path = image_paths[index]
    with open(image_path) as f:
        image_json = json.load(f)
    img_path = image_json["image_path"]
    img_path = os.path.join(dataset_path, img_path)
    print('img_path =', img_path)
    img = cv2.imread(img_path)
    bboxes = image_json["bboxes"]
    landmarks = image_json["landmarks"]
    print('\nlen(bboxes) =', len(bboxes))
    print('len(landmarks) =', len(landmarks), '\n')
    # bboxes: 2D list; each element has length 4 or 5
    #   e.g. [387, 322, 410, 348, 0], length 5, int type -- what does the last number mean?
    #   e.g. [211.0, 319.0, 236.0, 353.0], length 4, float type, but the values are integral
    #   according to the authors' source code, invalid bboxes may exist, i.e. left >= right or top >= bottom
    # landmarks: 2D list; each element has shape [5, 2] or [68, 2]
    #   well-conditioned faces (large, near-frontal, lightly occluded) get 68 landmarks;
    #   other faces only get 5 landmarks, or none at all (values of -1)
    # iterate over every face
    has_landmark = 0
    num_faces = len(bboxes)
    for i in range(num_faces):
        bbox = np.asarray(bboxes[i])[:4].astype(int)
        landmark = np.asarray(landmarks[i])[:, :2].astype(float)
        if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
            print('find invalid bbox:', bboxes[i])
            continue
        if -1 in landmark:
            cv2.rectangle(img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 255), thickness=1)
            continue
        has_landmark += 1
        cv2.rectangle(img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 255, 0), thickness=1)
        for p in landmark.astype(int):
            cv2.circle(img, (p[0], p[1]), radius=1, color=(0, 255, 0), thickness=-1)
    print('has_landmark =', has_landmark)
    cv2.imwrite('debug_dir/display.jpg', img)
Data pipeline
First, the API that loads the raw data in its original format, located in utils/json_loader.py.
FrameJsonList inherits from ImageFolder (inheriting from Dataset would arguably work just as well).
For each sample, it first constructs global_intrinsics, then iterates over all face boxes.
def __getitem__(self, index):
    # skip some lines
    (w, h) = img.size  # w and h differ from image to image
    global_intrinsics = np.array(
        [[w + h, 0, w // 2], [0, w + h, h // 2], [0, 0, 1]]
    )  # is integer division really necessary here?
    for i in range(len(bboxes)):
        # 1. filter out invalid bboxes with x1 >= x2 or y1 >= y2
        # 2. some bboxes have no landmark annotations; their global/local pose is set to -9
        # 3. landmarks are 5/68 points; subtracting the bbox top-left corner converts them
        #    to local (bbox) coordinates
        # build bbox_intrinsics from the bbox size, so w_bb and h_bb vary with each bbox:
        bbox_intrinsics = np.array([
            [w_bb + h_bb,           0, w_bb / 2],
            [          0, w_bb + h_bb, h_bb / 2],
            [          0,           0,        1],
        ])
        P, pose = get_pose(reference_3d_points, landmark, bbox_intrinsics)
        # get_pose is essentially the PnP algorithm; the returned pose is [rvecs, tvecs]
        # in local coordinates
        # P is the big transform matrix multiplying bbox_intrinsics with [R|t]; rarely needed
        # note: PnP in local coordinates at best makes the eyes and mouth fit;
        # the face contour generally cannot fit
        # 4. convert the local pose to global using Algorithm 1 of the paper:
        global_pose = pose_bbox_to_full_image(
            pose, global_intrinsics, bbox_is_dict(bbox)
        )
    # finally, 5 values are returned
    return (
        raw_img,             # data[0], raw bytes
        global_pose_labels,  # data[1], list
        bbox_labels,         # data[2], list
        pose_labels,         # data[3], local poses, list
        landmark_labels,     # data[4], list
    )
# The image is bytes and the labels are lists to make serialization with msgpack.dumps(...) easy
# Heads-up: in the LMDB class of data_loader_lmdb.py, data[1] is never used (computed for nothing);
# in the LMDB class of data_loader_lmdb_augmenter.py, data[1] and data[3] are never used
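Since get_pose is described above as essentially PnP, here is a minimal sketch of what it presumably does, assuming it wraps cv2.solvePnP; the solver flag and the zero-distortion assumption are mine, not the repo's exact code:

import cv2
import numpy as np

def get_pose_sketch(threed_points, landmarks_2d, intrinsics):
    # estimate a local 6DoF pose [rvec, tvec] from 2D-3D correspondences via PnP
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(threed_points, dtype=np.float64),
        np.asarray(landmarks_2d, dtype=np.float64),
        np.asarray(intrinsics, dtype=np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP,
    )
    return np.concatenate([rvec.ravel(), tvec.ravel()])  # 6DoF: [rx, ry, rz, tx, ty, tz]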
Q: One could estimate the global pose directly from the global landmarks; why estimate the local pose first and then convert it to global?
A: Probably because a local pose estimated directly via PnP can be used for training as-is, whereas estimating the global pose and converting it to local would introduce error.
To verify: whether every use of PnP in the code is estimating a local pose.
JsonLoader inherits from DataLoader; it simply wraps FrameJsonList, with batch_size=1 by default, num_workers=16, and collate_fn=lambda x: x.
The collate_fn=lambda x: x keeps the samples in their original list format instead of collating them into batch Tensors.
Next comes convert_json_list_to_lmdb.py, which creates a JsonLoader and saves the data in LMDB format (similar to TFRecord: storing the data as bytes makes I/O more efficient).
For the training set (--train), it also computes the mean and standard deviation of the local poses (taking care not to mix in the -9 placeholders), as sketched below.
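A minimal reconstruction of that statistic computation, skipping the -9 placeholders (my assumption, not the repo's exact code):

import numpy as np

def pose_stats(all_poses):
    poses = np.asarray(all_poses, dtype=np.float64)  # [N, 6] local poses
    valid = poses[~np.all(poses == -9, axis=1)]      # drop the -9 placeholder rows
    return valid.mean(axis=0), valid.std(axis=0)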
The Dataset and DataLoader defined next are the ones actually used during training:
class LMDB in data_loader_lmdb.py (can only do noise/contrast augmentation)
class LMDB in data_loader_lmdb_augmenter.py (can also do random_flip and random_crop)
First, class LMDB in data_loader_lmdb.py, the version used for the validation set.
def __getitem__(self, index):
    # skip some lines
    # extract the image from data[0]
    imgbuf = data[0]
    buf = six.BytesIO()
    buf.write(imgbuf)
    buf.seek(0)
    img = Image.open(buf).convert("RGB")
    # forcing RGB is necessary: the dataset contains some old photos that may be grayscale
    # data[1] (global pose) is not used
    bbox_labels = np.asarray(data[2])
    pose_labels = np.asarray(data[3])
    landmark_labels = data[4]
    # simple augmentation of the image only -- simple because bboxes and landmarks are passed as None
    for augmentation_method in self.augmentation_methods:
        img, _, _ = augmentation_method(img, None, None)
    # build global_intrinsics from the image size (w, h)
    (w, h) = img.size
    global_intrinsics = np.array(
        [[w + h, 0, w // 2], [0, w + h, h // 2], [0, 0, 1]]
    )
    # then iterate over every face box
    for i in range(len(pose_labels)):
        # skip some lines
        # some bboxes have no landmark annotations: the landmarks are -1, the pose is -9
        # the authors paint these bboxes black, hoping to mask them out
        if -1 in lms:
            img[int(bbox[1]) : int(bbox[3]), int(bbox[0]) : int(bbox[2]), :] = 0
            continue
        # convert the local pose to a global pose
        # this step is actually redundant: as long as the image size is unchanged,
        # the resulting global pose is exactly what was stored in data[1]
        pose_label = pose_bbox_to_full_image(pose_label, global_intrinsics, bbox)
        # the global pose is used to compute the 2D projected points, which refine the bbox;
        # it is also what is eventually returned as the "dofs" label
        projected_lms, _ = plot_3d_landmark(
            self.threed_68_points, pose_label, global_intrinsics
        )
        projected_bbox = expand_bbox_rectangle(
            w, h, 1.1, 1.1, projected_lms, roll=pose_label[2]
        )
    # self.transform is usually just transforms.Compose([transforms.ToTensor()]);
    # no mean subtraction or division by stddev is involved
    if self.transform is not None:
        img = self.transform(img)
    target = {
        "dofs": torch.from_numpy(np.asarray(new_pose_labels)).float(),  # the global poses
        "boxes": torch.from_numpy(np.asarray(projected_bbox_labels)).float(),
        "labels": torch.ones((len(projected_bbox_labels),), dtype=torch.int64),
    }
    # returns img plus the 3 entries of target:
    #   dofs:   [num_faces, 6]
    #   boxes:  [num_faces, 4]
    #   labels: [num_faces]
    return img, target
# Heads-up: the dataset returns global poses, yet the RoIHeads losses need local poses as targets,
# so fastrcnn_loss in losses.py converts the global poses back to local poses (why?!)
Note that self.pose_label_transform is a function that normalizes the labels, but it is not used in __getitem__.
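Presumably it is plain standardization with the pose mean/stddev computed earlier; a hedged sketch (my guess, not verified against the repo):

def pose_label_transform_sketch(pose, pose_mean, pose_stddev):
    # normalize a 6DoF pose label; works on numpy arrays or torch Tensors
    return (pose - pose_mean) / pose_stddev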
Next, class LMDBDataLoader, which supports distributed training and defines how batches are collated:
def collate_fn(batch):
    return tuple(zip(*batch))
# a single sample is
#   img: [3, H, W]
#   target: {'dofs': [num_faces, 6], 'boxes': [num_faces, 4], 'labels': [num_faces]}
# collated into a batch this becomes
#   img: [ [3, H, W], [3, H, W], [3, H, W], ... ]
#   target: [
#     {'dofs': [num_faces, 6], 'boxes': [num_faces, 4], 'labels': [num_faces]},
#     {'dofs': [num_faces, 6], 'boxes': [num_faces, 4], 'labels': [num_faces]},
#     ...
#   ]
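A toy illustration (hypothetical values) of what this collate_fn does:

batch = [("img_a", {"labels": 1}), ("img_b", {"labels": 2})]
imgs, targets = tuple(zip(*batch))
assert imgs == ("img_a", "img_b")
assert targets == ({"labels": 1}, {"labels": 2})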
Now, class LMDB in data_loader_lmdb_augmenter.py, the version used for the training set, typically run with --random_flip --random_crop.
def __getitem__(self, index):
    # skip some lines
    # extract the image from data[0]
    imgbuf = data[0]
    buf = six.BytesIO()
    buf.write(imgbuf)
    buf.seek(0)
    img = Image.open(buf).convert("RGB")
    # forcing RGB is necessary: the dataset contains some old photos that may be grayscale
    # data[1] (global pose) is not used
    bbox_labels = np.asarray(data[2])
    # data[3] (local pose) is not used
    landmark_labels = data[4]
    # heavier augmentation that also updates bbox_labels and landmark_labels;
    # random_crop changes the image size, sampling the crop ratio from [0.7, 1]
    for augmentation_method in self.augmentation_methods:
        img, bbox_labels, landmark_labels = augmentation_method(
            img, bbox_labels, landmark_labels
        )
    # because the image size changed, the whole PnP pipeline must be rerun:
    # estimate local poses, convert them to global poses, compute the 2D projected
    # points, refine the bboxes, and return the global poses
    (img_w, img_h) = img.size
    global_intrinsics = np.array(
        [[img_w + img_h, 0, img_w // 2], [0, img_w + img_h, img_h // 2], [0, 0, 1]]
    )
    # remaining code omitted
The authors implemented their own augmentation suite in utils/augmentation.py, covering random_flip, random_crop, noise_augmentation, and contrast_augmentation.
random_crop shrinks the image; the crop size is computed as follows (see the sketch after the snippet):
crop_size = random.uniform(0.7, 1)
crop_x = int(w * crop_size)
crop_y = int(h * crop_size)
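A hedged sketch of how the rest of such a crop might look; the random offsets and the bbox/landmark shifts are my assumptions, not the repo's exact code:

import random
import numpy as np

def random_crop_sketch(img, bboxes, landmarks):
    # img: PIL.Image; bboxes: [N, 4]; landmarks: list of [5, 2] or [68, 2] arrays
    w, h = img.size
    crop_size = random.uniform(0.7, 1)
    crop_x, crop_y = int(w * crop_size), int(h * crop_size)
    x0 = random.randint(0, w - crop_x)
    y0 = random.randint(0, h - crop_y)
    img = img.crop((x0, y0, x0 + crop_x, y0 + crop_y))
    bboxes = np.asarray(bboxes, dtype=float) - [x0, y0, x0, y0]
    landmarks = [np.asarray(lms, dtype=float) - [x0, y0] for lms in landmarks]
    return img, bboxes, landmarks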
Visualization
Visualizing pose_references/vertices_trans.npy (right cheek tinted red):
c=0, [-0.891652, 0.890319], span=1.781972
c=1, [-0.975868, 1.000126], span=1.975995
c=2, [-0.751428, 0.774013], span=1.525441
center = [-0.00005079 -0.00001977 -0.00001119]
Orientation: left ear +x, top of head -y, nose tip -z.
A single 180° rotation about the x axis brings it back to the canonical pose.
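The statistics above can be reproduced with a short script along these lines (the .npy path comes from the repo; the rest is plain NumPy):

import numpy as np

verts = np.load('pose_references/vertices_trans.npy')  # [N, 3] mesh vertices
for c in range(3):
    lo, hi = verts[:, c].min(), verts[:, c].max()
    print(f'c={c}, [{lo:.6f}, {hi:.6f}], span={hi - lo:.6f}')
print('center =', verts.mean(axis=0))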
Model hierarchy overview
img2pose.py
  class img2poseModel
    self.fpn_model -----> models.py (generalized_rcnn.py)
      class FasterDoFRCNN(GeneralizedRCNN)
        self.transform -----> torchvision.models.detection.transform.py
          class GeneralizedRCNNTransform
        self.backbone -----> torchvision.models.detection.backbone_utils.py
          def resnet_fpn_backbone
        self.rpn -----> rpn.py
          class RegionProposalNetwork
        self.roi_heads -----> models.py (torchvision.models.detection.roi_heads.py)
          class DOFRoIHeads(RoIHeads)
Model initialization walkthrough
Enter train.py, lines 42-52 (link):
# creates model
self.img2pose_model = img2poseModel(
depth=self.config.depth, # 18
min_size=self.config.min_size, # [640, 672, 704, 736, 768, 800]
max_size=self.config.max_size, # 1400
device=self.config.device, # device(type='cuda')
pose_mean=self.config.pose_mean, # 6D vector
pose_stddev=self.config.pose_stddev, # 6D vector
distributed=self.config.distributed, # False
gpu=self.config.gpu, # 0
threed_68_points=np.load(self.config.threed_68_points), # (68, 3)
threed_5_points=np.load(self.config.threed_5_points), # (5, 3)
)
Enter img2pose.py (link):
class img2poseModel:
def __init__(
self,
depth, # 18
min_size, # [640, 672, 704, 736, 768, 800]
max_size, # 1400
model_path=None,
device=None, # device(type='cuda')
pose_mean=None, # 6D vector
pose_stddev=None, # 6D vector
distributed=False, # False
gpu=0, # 0
threed_68_points=None, # (68, 3)
threed_5_points=None, # (5, 3)
rpn_pre_nms_top_n_test=6000,
rpn_post_nms_top_n_test=1000,
bbox_x_factor=1.1,
bbox_y_factor=1.1,
expand_forehead=0.3,
):
# skip some lines
# create network backbone
backbone = resnet_fpn_backbone(f"resnet{self.depth}", pretrained=True)
# note: from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
# skip some lines
# create the feature pyramid network
self.fpn_model = FasterDoFRCNN(
backbone,
2,
min_size=self.min_size, # [640, 672, 704, 736, 768, 800]
max_size=self.max_size, # 1400
pose_mean=pose_mean, # 6D Tensor
pose_stddev=pose_stddev, # 6D Tensor
threed_68_points=threed_68_points, # [68, 3] Tensor
threed_5_points=threed_5_points, # [5, 3] Tensor
rpn_pre_nms_top_n_test=rpn_pre_nms_top_n_test, # 6000
rpn_post_nms_top_n_test=rpn_post_nms_top_n_test, # 1000
bbox_x_factor=bbox_x_factor, # 1.1
bbox_y_factor=bbox_y_factor, # 1.1
expand_forehead=expand_forehead, # 0.3
)
Enter models.py (link), where each component is created:
class FasterDoFRCNN(GeneralizedRCNN):
def __init__(
self,
backbone, # <class 'torchvision.models.detection.backbone_utils.BackboneWithFPN'>
num_classes=None, # 2
# transform parameters
min_size=800, # [640, 672, 704, 736, 768, 800]
max_size=1333, # 1400
image_mean=None,
image_std=None,
# RPN parameters
rpn_anchor_generator=None,
rpn_head=None,
rpn_pre_nms_top_n_train=6000,
rpn_pre_nms_top_n_test=6000, # 6000
rpn_post_nms_top_n_train=2000,
rpn_post_nms_top_n_test=1000, # 1000
rpn_nms_thresh=0.4,
rpn_fg_iou_thresh=0.5,
rpn_bg_iou_thresh=0.3,
rpn_batch_size_per_image=256,
rpn_positive_fraction=0.5,
# Box parameters
box_roi_pool=None,
box_head=None,
box_predictor=None,
box_score_thresh=0.05,
box_nms_thresh=0.5,
box_detections_per_img=1000,
box_fg_iou_thresh=0.5,
box_bg_iou_thresh=0.5,
box_batch_size_per_image=512,
box_positive_fraction=0.25,
bbox_reg_weights=None,
pose_mean=None, # 6D Tensor
pose_stddev=None, # 6D Tensor
threed_68_points=None, # [68, 3] Tensor
threed_5_points=None, # [5, 3] Tensor
bbox_x_factor=1.1, # 1.1
bbox_y_factor=1.1, # 1.1
expand_forehead=0.3, # 0.3
):
# skip some lines
if rpn_anchor_generator is None:
anchor_sizes = ((16,), (32,), (64,), (128,), (256,), (512,))
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes) # repeated 6 times
rpn_anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)
if rpn_head is None:
rpn_head = RPNHead(
out_channels, rpn_anchor_generator.num_anchors_per_location()[0]
)
# note: out_channels = 256
# rpn_anchor_generator.num_anchors_per_location() = [3, 3, 3, 3, 3, 3]
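# For reference, torchvision's FasterRCNN builds the two top-n dicts used below as
# (this is torchvision's convention, not code shown in this file):
# rpn_pre_nms_top_n = dict(training=rpn_pre_nms_top_n_train, testing=rpn_pre_nms_top_n_test)
# rpn_post_nms_top_n = dict(training=rpn_post_nms_top_n_train, testing=rpn_post_nms_top_n_test)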
rpn = RegionProposalNetwork(
rpn_anchor_generator, # defined above
rpn_head, # defined above
rpn_fg_iou_thresh, # 0.5
rpn_bg_iou_thresh, # 0.3
rpn_batch_size_per_image, # 256
rpn_positive_fraction, # 0.5
rpn_pre_nms_top_n, # {'training': 6000, 'testing': 6000}
rpn_post_nms_top_n, # {'training': 2000, 'testing': 1000}
rpn_nms_thresh, # 0.4
)
# from rpn import AnchorGenerator, RegionProposalNetwork
if box_roi_pool is None:
box_roi_pool = MultiScaleRoIAlign(
featmap_names=["0", "1", "2", "3"], output_size=7, sampling_ratio=2
)
# note: from torchvision.ops import MultiScaleRoIAlign
if box_head is None:
resolution = box_roi_pool.output_size[0]
representation_size = 1024
box_head = TwoMLPHead(out_channels * resolution ** 2, representation_size)
# note: box_roi_pool.output_size = [7, 7]
# from torchvision.models.detection.faster_rcnn import TwoMLPHead
if box_predictor is None:
representation_size = 1024
box_predictor = FastRCNNDoFPredictor(representation_size, num_classes)
roi_heads = DOFRoIHeads(
# Box
box_roi_pool, # defined above
box_head, # defined above
box_predictor, # defined above
box_fg_iou_thresh, # 0.5
box_bg_iou_thresh, # 0.5
box_batch_size_per_image, # 512
box_positive_fraction, # 0.25
bbox_reg_weights, # None
box_score_thresh, # 0.05
box_nms_thresh, # 0.5
box_detections_per_img, # 1000
out_channels, # 256
pose_mean=pose_mean, # 6D Tensor
pose_stddev=pose_stddev, # 6D Tensor
threed_68_points=threed_68_points, # [68, 3] Tensor
threed_5_points=threed_5_points, # [5, 3] Tensor
bbox_x_factor=bbox_x_factor, # 1.1
bbox_y_factor=bbox_y_factor, # 1.1
expand_forehead=expand_forehead, # 0.3
)
# unlike an ordinary PyTorch transform, this one comes
# from torchvision.models.detection.transform import GeneralizedRCNNTransform
transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)
# note: min_size = [640, 672, 704, 736, 768, 800]
# max_size = 1400
super(FasterDoFRCNN, self).__init__(backbone, rpn, roi_heads, transform)
# summary: the final call to the parent class shows clearly that
# self.transform = transform
# self.backbone = backbone
# self.rpn = rpn
# self.roi_heads = roi_heads
# these 4 components make up the whole FasterDoFRCNN (which inherits GeneralizedRCNN)
Forward pass walkthrough
In train.py (link):
for idx, data in enumerate(self.train_loader):
imgs, targets = data
imgs = [image.to(self.config.device) for image in imgs]
targets = [
{k: v.to(self.config.device) for k, v in t.items()} for t in targets
]
# note: imgs: [ [3, 575, 767] Tensor, [3, 587, 783] Tensor ], values in [0, 1]
# targets: a list of length 2; each element is a dict with dofs, boxes, labels
# dofs [num_faces, 6]
# boxes [num_faces, 4], in the range 0~w, 0~h
# labels [num_faces], always 1
self.optimizer.zero_grad()
# forward pass
losses = self.img2pose_model.forward(imgs, targets)
# note: losses is a dict with 5 loss terms:
# ['loss_classifier', 'loss_dof_reg', 'loss_points', 'loss_objectness', 'loss_rpn_box_reg']
Enter img2pose.py; the forward-pass methods of the img2poseModel class:
def run_model(self, imgs, targets=None):
outputs = self.fpn_model(imgs, targets)
return outputs
def forward(self, imgs, targets):
losses = self.run_model(imgs, targets)
return losses
The forward pass of FasterDoFRCNN is implemented in its parent class GeneralizedRCNN.
Enter generalized_rcnn.py (link), the forward function of GeneralizedRCNN:
# torch.jit.annotate declares the element type of this empty list for TorchScript
original_image_sizes = torch.jit.annotate(List[Tuple[int, int]], [])
for img in images:
val = img.shape[-2:]
assert len(val) == 2
original_image_sizes.append((val[0], val[1]))
# note: [(575, 767), (587, 783)]
images, targets = self.transform(images, targets)
# note: images is now a <class 'torchvision.models.detection.image_list.ImageList'>
# targets keeps its original format
# Check for degenerate boxes: validates that the bbox format is legal
# 1. first, the backbone forward pass
# images.tensors: [2, 3, 768, 1024] -- the image sizes are now unified, value range [-2.11, 2.64]
features = self.backbone(images.tensors)
# note: the keys are ['0', '1', '2', '3', 'pool']
# understanding why requires reading the source of resnet_fpn_backbone
# 2. next, the RPN forward pass: predict candidate bboxes and compute the loss against bbox_gt
proposals, proposal_losses = self.rpn(images, features, targets)
# note: proposals: [ [2000, 4], [2000, 4] ]
# proposal_losses has the keys loss_objectness, loss_rpn_box_reg
# for the details of the RPN forward pass, see the forward function in rpn.py
# 3. finally, the DoF head forward pass, computing the DoF-related losses
detections, detector_losses = self.roi_heads(
features, proposals, images.image_sizes, targets
)
Understanding self.backbone
# this is how it is defined
backbone = resnet_fpn_backbone('resnet18', pretrained=False)
Enter torchvision/models/detection/backbone_utils.py (link). First:
def resnet_fpn_backbone(
backbone_name,
pretrained,
norm_layer=misc_nn_ops.FrozenBatchNorm2d,
trainable_layers=3,
returned_layers=None,
extra_blocks=None
):
# first, fetch a standard resnet
backbone = resnet.__dict__[backbone_name](
pretrained=pretrained,
norm_layer=norm_layer)
# skip some lines
if extra_blocks is None:
extra_blocks = LastLevelMaxPool()
if returned_layers is None:
returned_layers = [1, 2, 3, 4]
assert min(returned_layers) > 0 and max(returned_layers) < 5
return_layers = {f'layer{k}': str(v) for v, k in enumerate(returned_layers)}
# evaluates to {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'}
# the keys are the features' original names, the values are their new names
# skip some lines
return BackboneWithFPN(backbone, return_layers, in_channels_list, out_channels, extra_blocks=extra_blocks)
# note: in_channels_list=[64, 128, 256, 512], out_channels=256
# extra_blocks is a <class 'torchvision.ops.feature_pyramid_network.LastLevelMaxPool'>
Then:
class BackboneWithFPN(nn.Module):
def __init__(self, backbone, return_layers, in_channels_list, out_channels, extra_blocks=None):
# skip some lines
self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
# IntermediateLayerGetter is defined in torchvision/models/_utils.py;
# it is a utility that collects intermediate Tensors during the forward pass
# suppose x is [N, 3, 224, 224]
# x = self.body(x)
# x = {
#   '0': [N, 64, 56, 56],   after resnet.layer1, spatial size 1/4
#   '1': [N, 128, 28, 28],  after resnet.layer2, spatial size 1/8
#   '2': [N, 256, 14, 14],  after resnet.layer3, spatial size 1/16
#   '3': [N, 512, 7, 7]     after resnet.layer4, spatial size 1/32
# }
self.fpn = FeaturePyramidNetwork(
in_channels_list=in_channels_list,
out_channels=out_channels,
extra_blocks=extra_blocks,
)
# see torchvision/ops/feature_pyramid_network.py for the details of FeaturePyramidNetwork
# roughly:
# x = self.fpn(x)
# x = {
#   '0': [N, 256, 56, 56],  <- [N, 64, 56, 56]
#   '1': [N, 256, 28, 28],  <- [N, 128, 28, 28]
#   '2': [N, 256, 14, 14],  <- [N, 256, 14, 14]
#   '3': [N, 256, 7, 7],    <- [N, 512, 7, 7]
#   'pool': [N, 256, 4, 4]
# }
# in short: every resnet feature is raised to 256 channels, plus one extra pooled level
Overall, the backbone forward pass takes an image, extracts the layer1~layer4 features, feeds them into the FeaturePyramidNetwork, and outputs a feature pyramid at scales {1/4, 1/8, 1/16, 1/32, 1/64}, with the channel count unified to 256.
features = self.backbone(images.tensors)
# e.g. images.tensors: [2, 3, 768, 1024]
# features = {
# '0': [2, 256, 192, 256], 1/4
# '1': [2, 256, 96, 128], 1/8
# '2': [2, 256, 48, 64], 1/16
# '3': [2, 256, 24, 32], 1/32
# 'pool': [2, 256, 12, 16] 1/64
# }
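A quick sanity check of these pyramid shapes (the exact keys/shapes may vary slightly across torchvision versions):

import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone('resnet18', pretrained=False)
feats = backbone(torch.rand(1, 3, 224, 224))
for name, f in feats.items():
    print(name, tuple(f.shape))
# expected: 0 (1, 256, 56, 56), 1 (1, 256, 28, 28), 2 (1, 256, 14, 14),
#           3 (1, 256, 7, 7), pool (1, 256, 4, 4)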
Understanding self.rpn
Model evaluation
evaluation/evaluate_wider.py runs inference on the WIDER_FACE dataset; it only evaluates the bbox predictions and ignores the landmarks and dofs.
To climb the leaderboard it uses a few tricks:
it sets min_size = [200, 300, 500, 800, 1100, 1400, 1700] and flip_list = [False, True],
forming a double loop that produces 14 images for inference, then votes over the 14 sets of bbox predictions (sketched below).
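A hedged sketch of that double loop; run_detector is a hypothetical stand-in for the actual model call, and the voting/merging step is omitted:

from PIL import Image

def run_detector(img):  # hypothetical: returns a list of [x1, y1, x2, y2] boxes
    return []

def multi_scale_flip_inference(img):
    all_boxes = []
    for flip in [False, True]:
        run_img = img.transpose(Image.FLIP_LEFT_RIGHT) if flip else img
        for min_size in [200, 300, 500, 800, 1100, 1400, 1700]:
            # the real script calls fpn_model.module.set_max_min_size(...) per scale
            boxes = run_detector(run_img)
            if flip:  # map the boxes back to the un-flipped image
                w = img.size[0]
                boxes = [[w - x2, y1, w - x1, y2] for x1, y1, x2, y2 in boxes]
            all_boxes.extend(boxes)
    return all_boxes  # the real script then votes over these 14 sets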
Understanding forward at a high level
Reading the forward of class GeneralizedRCNN in generalized_rcnn.py is fairly easy to follow.
Take inference on a single 1024x678 image as an example.
Set beforehand:
min_size = 200
max_size = 302
self.img2pose_model.fpn_model.module.set_max_min_size(max_size, min_size)
# images: [ [3, 678, 1024] ]
# targets: None
def forward(self, images, targets=None):
# skip some lines
# save the sizes of images into original_image_sizes
images, targets = self.transform(images, targets)
# images is an ImageList; tensors.shape = [1, 3, 224, 320]
# note: rounding 199 and 302 up to the nearest multiples of 32 gives 224 and 320
# skip some lines
features = self.backbone(images.tensors)
# features为dict型
# '0': [1, 256, 56, 80] 1/4
# '1': [1, 256, 28, 40] 1/8
# '2': [1, 256, 14, 20] 1/16
# '3': [1, 256, 7, 10] 1/32
# 'pool': [1, 256, 4, 5] 1/64
proposals, proposal_losses = self.rpn(images, features, targets)
# proposals: [ [1000, 4] ] # candidate boxes at the 224x320 scale
# proposal_losses: {}
# features: dict, '0', '1', '2', '3', 'pool'
# proposals: [ [1000, 4] ]
# images.image_sizes: [ (199, 302) ]
detections, detector_losses = self.roi_heads(
features, proposals, images.image_sizes, targets
)
# detections: [
# {'boxes': [num_faces, 4], 'labels': [num_faces], 'scores': [num_faces], 'dofs': [num_faces, 6]}
# ]
# note: boxes are the predicted face boxes at the 224x320 scale; were the dofs denormalized? they should have been
# detector_losses: {}
# map the predicted boxes back to the original image size
detections = self.transform.postprocess(
detections, images.image_sizes, original_image_sizes
)
return self.eager_outputs(losses, detections, targets is not None)
# which effectively just returns detections
To write your own inference script: given an arbitrary image, first pad it so that its size is a multiple of 32 (see the sketch below), then call set_max_min_size on fpn_model, and finally run inference to get the results.
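A hedged sketch of the padding step (the helper name is hypothetical; set_max_min_size is then called exactly as shown above):

import numpy as np

def pad_to_multiple_of_32(img):
    # img: HxWx3 uint8 array; zero-pad at the bottom/right up to multiples of 32
    h, w = img.shape[:2]
    H = -(-h // 32) * 32  # ceil to a multiple of 32
    W = -(-w // 32) * 32
    out = np.zeros((H, W, 3), dtype=img.dtype)
    out[:h, :w] = img
    return out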
A messy aspect of the img2pose code: the Dataset side works with global 6DoF throughout, and only when the loss is computed is it converted to local 6DoF.
In fastrcnn_loss you can see the call to the pose_full_image_to_bbox method, which converts dof_regression_targets from global to local, then normalizes them, and finally computes dof_loss against dof_regression (the network's prediction).
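A hedged sketch of that label path; the conversion and normalization follow the notes above, while the smooth-L1 choice and the to_local stand-in (for pose_full_image_to_bbox) are my assumptions:

import torch.nn.functional as F

def dof_loss_sketch(dof_regression, dof_targets_global, pose_mean, pose_stddev, to_local):
    dof_targets_local = to_local(dof_targets_global)  # global -> local
    dof_targets_local = (dof_targets_local - pose_mean) / pose_stddev  # normalize
    return F.smooth_l1_loss(dof_regression, dof_targets_local)  # vs the network prediction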