WIDER_Face dataset notes
Overall, because the faces in WIDER_Face are small, the data quality is not high.
The 68/5-point landmark coordinates provided by the authors are actually 3D perspective-projected versions.
datasets
└─WIDER_Face
   ├─WIDER_train
   │  └─images (62 folders, 12880 jpg images)
   │     ├─0--Parade
   │     │  └─0_Parade_marchingband_1_100.jpg, etc.
   │     ├─...
   │     └─61--Street_Battle
   ├─WIDER_val
   │  └─images (62 folders, 3226 jpg images)
   │     ├─0--Parade
   │     ├─...
   │     └─61--Street_Battle
   └─wider_face_split (not used by the authors)
      ├─readme.txt
      └─...
annotations
├─WIDER_train_annotations.txt (12880 lines; each line gives the path of one json file)
├─WIDER_val_annotations.txt (3226 lines)
├─WIDER_train
│  ├─0--Parade
│  │  └─0_Parade_marchingband_1_100.json, etc.
│  ├─...
│  └─61--Street_Battle
└─WIDER_val
   ├─0--Parade
   ├─...
   └─61--Street_Battle
Example json file content
{
"image_path": "0--Parade/0_Parade_marchingband_1_100.jpg",
"bboxes": [
[433.0, 189.0, 467.0, 231.0],
[80.0, 188.0, 142.0, 262.0],
[5.0, 203.0, 36.0, 236.0],
[296.0, 174.0, 341.0, 226.0],
[213.0, 151.0, 259.0, 214.0],
[900.0, 274.0, 981.0, 376.0],
[780.0, 189.0, 805.0, 224.0],
[576.0, 161.0, 616.0, 206.0],
[529.0, 180.0, 563.0, 220.0]
],
"landmarks": [
    each element has shape [5, 2] or [68, 2]
]
}
Example code
# Understanding the raw data
import json
import os

import cv2
import numpy as np
import pandas as pd

if 0:
    dataset_path = './datasets/WIDER_Face/WIDER_train/images'
    json_list = './annotations/WIDER_train_annotations.txt'
    image_paths = pd.read_csv(json_list, delimiter=" ", header=None)
    image_paths = np.asarray(image_paths).squeeze()
    index = np.random.randint(len(image_paths))
    # index = 4881
    print('index =', index)
    image_path = image_paths[index]
    with open(image_path) as f:
        image_json = json.load(f)
    img_path = image_json["image_path"]
    img_path = os.path.join(dataset_path, img_path)
    print('img_path =', img_path)
    img = cv2.imread(img_path)
    bboxes = image_json["bboxes"]
    landmarks = image_json["landmarks"]
    print('\nlen(bboxes) =', len(bboxes))
    print('len(landmarks) =', len(landmarks), '\n')
    # bboxes: 2D list; each element has length 4 or 5
    #   e.g. [387, 322, 410, 348, 0], length 5, int type -- what does the last number mean?
    #   e.g. [211.0, 319.0, 236.0, 353.0], length 4, float type, but the values are integral
    #   according to the authors' source code, invalid bboxes may exist, i.e. left >= right or top >= bottom
    # landmarks: 2D list; each element has shape [5, 2] or [68, 2]
    #   well-conditioned faces (large, near-frontal, lightly occluded) get 68 landmarks;
    #   other faces only get 5 landmarks, or none at all (values of -1)
    # iterate over every face
    has_landmark = 0
    num_faces = len(bboxes)
    for i in range(num_faces):
        bbox = np.asarray(bboxes[i])[:4].astype(int)
        landmark = np.asarray(landmarks[i])[:, :2].astype(float)
        if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
            print('find invalid bbox:', bboxes[i])
            continue
        if -1 in landmark:
            cv2.rectangle(img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 255), thickness=1)
            continue
        has_landmark += 1
        cv2.rectangle(img, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (255, 255, 0), thickness=1)
        for p in landmark.astype(int):
            cv2.circle(img, (p[0], p[1]), radius=1, color=(0, 255, 0), thickness=-1)
    print('has_landmark =', has_landmark)
    cv2.imwrite('debug_dir/display.jpg', img)
Data pipeline
First, the API that loads the raw data in its original format, located in utils/json_loader.py.
FrameJsonList inherits from ImageFolder (inheriting from Dataset would arguably work just as well).
For each sample, it first constructs global_intrinsics, then iterates over all face boxes.
def __getitem__(self, index):
    # skip some lines
    (w, h) = img.size  # w and h differ from image to image
    global_intrinsics = np.array(
        [[w + h, 0, w // 2], [0, w + h, h // 2], [0, 0, 1]]
    )  # is integer division really necessary here?
    for i in range(len(bboxes)):
        # 1. filter out invalid bboxes with x1 >= x2 or y1 >= y2
        # 2. some bboxes have no landmark annotations; their global/local pose is set to -9
        # 3. landmarks are 5/68 points; subtracting the bbox top-left corner converts them
        #    to local (bbox) coordinates
        # build bbox_intrinsics from the bbox size, so w_bb and h_bb vary with each bbox:
        bbox_intrinsics = np.array([
            [w_bb + h_bb,           0, w_bb / 2],
            [          0, w_bb + h_bb, h_bb / 2],
            [          0,           0,        1],
        ])
        P, pose = get_pose(reference_3d_points, landmark, bbox_intrinsics)
        # get_pose is essentially the PnP algorithm; the returned pose is [rvecs, tvecs]
        # in local coordinates
        # P is the big transform matrix multiplying bbox_intrinsics with [R|t]; rarely needed
        # note: PnP in local coordinates at best makes the eyes and mouth fit;
        # the face contour generally cannot fit
        # 4. convert the local pose to global using Algorithm 1 of the paper:
        global_pose = pose_bbox_to_full_image(
            pose, global_intrinsics, bbox_is_dict(bbox)
        )
    # finally, 5 values are returned
    return (
        raw_img,             # data[0], raw bytes
        global_pose_labels,  # data[1], list
        bbox_labels,         # data[2], list
        pose_labels,         # data[3], local poses, list
        landmark_labels,     # data[4], list
    )
# The image is bytes and the labels are lists to make serialization with msgpack.dumps(...) easy
# Heads-up: in the LMDB class of data_loader_lmdb.py, data[1] is never used (computed for nothing);
# in the LMDB class of data_loader_lmdb_augmenter.py, data[1] and data[3] are never used
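Since get_pose is described above as essentially PnP, here is a minimal sketch of what it presumably does, assuming it wraps cv2.solvePnP; the solver flag and the zero-distortion assumption are mine, not the repo's exact code:

import cv2
import numpy as np

def get_pose_sketch(threed_points, landmarks_2d, intrinsics):
    # estimate a local 6DoF pose [rvec, tvec] from 2D-3D correspondences via PnP
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(threed_points, dtype=np.float64),
        np.asarray(landmarks_2d, dtype=np.float64),
        np.asarray(intrinsics, dtype=np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP,
    )
    return np.concatenate([rvec.ravel(), tvec.ravel()])  # 6DoF: [rx, ry, rz, tx, ty, tz]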
Q: One could estimate the global pose directly from the global landmarks; why estimate the local pose first and then convert it to global?
A: Probably because a local pose estimated directly via PnP can be used for training as-is, whereas estimating the global pose and converting it to local would introduce error.
To verify: whether every use of PnP in the code is estimating a local pose.
JsonLoader inherits from DataLoader; it simply wraps FrameJsonList, with batch_size=1 by default, num_workers=16, and collate_fn=lambda x: x.
The collate_fn=lambda x: x keeps the samples in their original list format instead of collating them into batch Tensors.
Next comes convert_json_list_to_lmdb.py, which creates a JsonLoader and saves the data in LMDB format (similar to TFRecord: storing the data as bytes makes I/O more efficient).
For the training set (--train), it also computes the mean and standard deviation of the local poses (taking care not to mix in the -9 placeholders), as sketched below.
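A minimal reconstruction of that statistic computation, skipping the -9 placeholders (my assumption, not the repo's exact code):

import numpy as np

def pose_stats(all_poses):
    poses = np.asarray(all_poses, dtype=np.float64)  # [N, 6] local poses
    valid = poses[~np.all(poses == -9, axis=1)]      # drop the -9 placeholder rows
    return valid.mean(axis=0), valid.std(axis=0)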
The Dataset and DataLoader defined next are the ones actually used during training:
class LMDB in data_loader_lmdb.py (can only do noise/contrast augmentation)
class LMDB in data_loader_lmdb_augmenter.py (can also do random_flip and random_crop)
First, class LMDB in data_loader_lmdb.py, the version used for the validation set.
def __getitem__(self, index):
    # skip some lines
    # extract the image from data[0]
    imgbuf = data[0]
    buf = six.BytesIO()
    buf.write(imgbuf)
    buf.seek(0)
    img = Image.open(buf).convert("RGB")
    # forcing RGB is necessary: the dataset contains some old photos that may be grayscale
    # data[1] (global pose) is not used
    bbox_labels = np.asarray(data[2])
    pose_labels = np.asarray(data[3])
    landmark_labels = data[4]
    # simple augmentation of the image only -- simple because bboxes and landmarks are passed as None
    for augmentation_method in self.augmentation_methods:
        img, _, _ = augmentation_method(img, None, None)
    # build global_intrinsics from the image size (w, h)
    (w, h) = img.size
    global_intrinsics = np.array(
        [[w + h, 0, w // 2], [0, w + h, h // 2], [0, 0, 1]]
    )
    # then iterate over every face box
    for i in range(len(pose_labels)):
        # skip some lines
        # some bboxes have no landmark annotations: the landmarks are -1, the pose is -9
        # the authors paint these bboxes black, hoping to mask them out
        if -1 in lms:
            img[int(bbox[1]) : int(bbox[3]), int(bbox[0]) : int(bbox[2]), :] = 0
            continue
        # convert the local pose to a global pose
        # this step is actually redundant: as long as the image size is unchanged,
        # the resulting global pose is exactly what was stored in data[1]
        pose_label = pose_bbox_to_full_image(pose_label, global_intrinsics, bbox)
        # the global pose is used to compute the 2D projected points, which refine the bbox;
        # it is also what is eventually returned as the "dofs" label
        projected_lms, _ = plot_3d_landmark(
            self.threed_68_points, pose_label, global_intrinsics
        )
        projected_bbox = expand_bbox_rectangle(
            w, h, 1.1, 1.1, projected_lms, roll=pose_label[2]
        )
    # self.transform is usually just transforms.Compose([transforms.ToTensor()]);
    # no mean subtraction or division by stddev is involved
    if self.transform is not None:
        img = self.transform(img)
    target = {
        "dofs": torch.from_numpy(np.asarray(new_pose_labels)).float(),  # the global poses
        "boxes": torch.from_numpy(np.asarray(projected_bbox_labels)).float(),
        "labels": torch.ones((len(projected_bbox_labels),), dtype=torch.int64),
    }
    # returns img plus the 3 entries of target:
    #   dofs:   [num_faces, 6]
    #   boxes:  [num_faces, 4]
    #   labels: [num_faces]
    return img, target
# Heads-up: the dataset returns global poses, yet the RoIHeads losses need local poses as targets,
# so fastrcnn_loss in losses.py converts the global poses back to local poses (why?!)
Note that self.pose_label_transform is a function that normalizes the labels, but it is not used in __getitem__.
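Presumably it is plain standardization with the pose mean/stddev computed earlier; a hedged sketch (my guess, not verified against the repo):

def pose_label_transform_sketch(pose, pose_mean, pose_stddev):
    # normalize a 6DoF pose label; works on numpy arrays or torch Tensors
    return (pose - pose_mean) / pose_stddev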
Next, class LMDBDataLoader, which supports distributed training and defines how batches are collated:
def collate_fn(batch):
    return tuple(zip(*batch))
# a single sample is
#   img: [3, H, W]
#   target: {'dofs': [num_faces, 6], 'boxes': [num_faces, 4], 'labels': [num_faces]}
# collated into a batch this becomes
#   img: [ [3, H, W], [3, H, W], [3, H, W], ... ]
#   target: [
#     {'dofs': [num_faces, 6], 'boxes': [num_faces, 4], 'labels': [num_faces]},
#     {'dofs': [num_faces, 6], 'boxes': [num_faces, 4], 'labels': [num_faces]},
#     ...
#   ]
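A toy illustration (hypothetical values) of what this collate_fn does:

batch = [("img_a", {"labels": 1}), ("img_b", {"labels": 2})]
imgs, targets = tuple(zip(*batch))
assert imgs == ("img_a", "img_b")
assert targets == ({"labels": 1}, {"labels": 2})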
Now, class LMDB in data_loader_lmdb_augmenter.py, the version used for the training set, typically run with --random_flip --random_crop.
def __getitem__(self, index):
    # skip some lines
    # extract the image from data[0]
    imgbuf = data[0]
    buf = six.BytesIO()
    buf.write(imgbuf)
    buf.seek(0)
    img = Image.open(buf).convert("RGB")
    # forcing RGB is necessary: the dataset contains some old photos that may be grayscale
    # data[1] (global pose) is not used
    bbox_labels = np.asarray(data[2])
    # data[3] (local pose) is not used
    landmark_labels = data[4]
    # heavier augmentation that also updates bbox_labels and landmark_labels;
    # random_crop changes the image size, sampling the crop ratio from [0.7, 1]
    for augmentation_method in self.augmentation_methods:
        img, bbox_labels, landmark_labels = augmentation_method(
            img, bbox_labels, landmark_labels
        )
    # because the image size changed, the whole PnP pipeline must be rerun:
    # estimate local poses, convert them to global poses, compute the 2D projected
    # points, refine the bboxes, and return the global poses
    (img_w, img_h) = img.size
    global_intrinsics = np.array(
        [[img_w + img_h, 0, img_w // 2], [0, img_w + img_h, img_h // 2], [0, 0, 1]]
    )
    # remaining code omitted
The authors implemented their own augmentation suite in utils/augmentation.py, covering random_flip, random_crop, noise_augmentation, and contrast_augmentation.
random_crop shrinks the image; the crop size is computed as follows (see the sketch after the snippet):
crop_size = random.uniform(0.7, 1)
crop_x = int(w * crop_size)
crop_y = int(h * crop_size)
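A hedged sketch of how the rest of such a crop might look; the random offsets and the bbox/landmark shifts are my assumptions, not the repo's exact code:

import random
import numpy as np

def random_crop_sketch(img, bboxes, landmarks):
    # img: PIL.Image; bboxes: [N, 4]; landmarks: list of [5, 2] or [68, 2] arrays
    w, h = img.size
    crop_size = random.uniform(0.7, 1)
    crop_x, crop_y = int(w * crop_size), int(h * crop_size)
    x0 = random.randint(0, w - crop_x)
    y0 = random.randint(0, h - crop_y)
    img = img.crop((x0, y0, x0 + crop_x, y0 + crop_y))
    bboxes = np.asarray(bboxes, dtype=float) - [x0, y0, x0, y0]
    landmarks = [np.asarray(lms, dtype=float) - [x0, y0] for lms in landmarks]
    return img, bboxes, landmarks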
Visualization
Visualizing pose_references/vertices_trans.npy (right cheek tinted red):
c=0, [-0.891652, 0.890319], span=1.781972
c=1, [-0.975868, 1.000126], span=1.975995
c=2, [-0.751428, 0.774013], span=1.525441
center = [-0.00005079 -0.00001977 -0.00001119]
Orientation: left ear +x, top of head -y, nose tip -z.
A single 180° rotation about the x axis brings it back to the canonical pose.
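The statistics above can be reproduced with a short script along these lines (the .npy path comes from the repo; the rest is plain NumPy):

import numpy as np

verts = np.load('pose_references/vertices_trans.npy')  # [N, 3] mesh vertices
for c in range(3):
    lo, hi = verts[:, c].min(), verts[:, c].max()
    print(f'c={c}, [{lo:.6f}, {hi:.6f}], span={hi - lo:.6f}')
print('center =', verts.mean(axis=0))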
Model hierarchy overview
img2pose.py
  class img2poseModel
    self.fpn_model -----> models.py (generalized_rcnn.py)
      class FasterDoFRCNN(GeneralizedRCNN)
        self.transform -----> torchvision.models.detection.transform.py
          class GeneralizedRCNNTransform
        self.backbone -----> torchvision.models.detection.backbone_utils.py
          def resnet_fpn_backbone
        self.rpn -----> rpn.py
          class RegionProposalNetwork
        self.roi_heads -----> models.py (torchvision.models.detection.roi_heads.py)
          class DOFRoIHeads(RoIHeads)
Model initialization walkthrough
Enter train.py, lines 42-52 (link):
# creates model
self.img2pose_model = img2poseModel(
depth=self.config.depth, # 18
min_size=self.config.min_size, # [640, 672, 704, 736, 768, 800]
max_size=self.config.max_size, # 1400
device=self.config.device, # device(type='cuda')
pose_mean=self.config.pose_mean, # 6D vector
pose_stddev=self.config.pose_stddev, # 6D vector
distributed=self.config.distributed, # False
gpu=self.config.gpu, # 0
threed_68_points=np.load(self.config.threed_68_points), # (68, 3)
threed_5_points=np.load(self.config.threed_5_points), # (5, 3)
)
Enter img2pose.py (link):
class img2poseModel:
def __init__(
self,
depth, # 18
min_size, # [640, 672, 704, 736, 768, 800]
max_size, # 1400
model_path=None,
device=None, # device(type='cuda')
pose_mean=None, # 6D vector
pose_stddev=None, # 6D vector
distributed=False, # False
gpu=0, # 0
threed_68_points=None, # (68, 3)
threed_5_points=None, # (5, 3)
rpn_pre_nms_top_n_test=6000,
rpn_post_nms_top_n_test=1000,
bbox_x_factor=1.1,
bbox_y_factor=1.1,
expand_forehead=0.3,
):
# skip some lines
# create network backbone
backbone = resnet_fpn_backbone(f"resnet{self.depth}", pretrained=True)
# note: from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
# skip some lines
# create the feature pyramid network
self.fpn_model = FasterDoFRCNN(
backbone,
2,
min_size=self.min_size, # [640, 672, 704, 736, 768, 800]
max_size=self.max_size, # 1400
pose_mean=pose_mean, # 6D Tensor
pose_stddev=pose_stddev, # 6D Tensor
threed_68_points=threed_68_points, # [68, 3] Tensor
threed_5_points=threed_5_points, # [5, 3] Tensor
rpn_pre_nms_top_n_test=rpn_pre_nms_top_n_test, # 6000
rpn_post_nms_top_n_test=rpn_post_nms_top_n_test, # 1000
bbox_x_factor=bbox_x_factor, # 1.1
bbox_y_factor=bbox_y_factor, # 1.1
expand_forehead=expand_forehead, # 0.3
)
Enter models.py (link), where each component is created:
class FasterDoFRCNN(GeneralizedRCNN):
def __init__(
self,
backbone, # <class 'torchvision.models.detection.backbone_utils.BackboneWithFPN'>
num_classes=None, # 2
# transform parameters
min_size=800, # [640, 672, 704, 736, 768, 800]
max_size=1333, # 1400
image_mean=None,
image_std=None,
# RPN parameters
rpn_anchor_generator=None,
rpn_head=None,
rpn_pre_nms_top_n_train=6000,
rpn_pre_nms_top_n_test=6000, # 6000
rpn_post_nms_top_n_train=2000,
rpn_post_nms_top_n_test=1000, # 1000
rpn_nms_thresh=0.4,
rpn_fg_iou_thresh=0.5,
rpn_bg_iou_thresh=0.3,
rpn_batch_size_per_image=256,
rpn_positive_fraction=0.5,
# Box parameters
box_roi_pool=None,
box_head=None,
box_predictor=None,
box_score_thresh=0.05,
box_nms_thresh=0.5,
box_detections_per_img=1000,
box_fg_iou_thresh=0.5,
box_bg_iou_thresh=0.5,
box_batch_size_per_image=512,
box_positive_fraction=0.25,
bbox_reg_weights=None,
pose_mean=None, # 6D Tensor
pose_stddev=None, # 6D Tensor
threed_68_points=None, # [68, 3] Tensor
threed_5_points=None, # [5, 3] Tensor
bbox_x_factor=1.1, # 1.1
bbox_y_factor=1.1, # 1.1
expand_forehead=0.3, # 0.3
):
# skip some lines
if rpn_anchor_generator is None:
anchor_sizes = ((16,), (32,), (64,), (128,), (256,), (512,))
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes) # repeated 6 times
rpn_anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)
if rpn_head is None:
rpn_head = RPNHead(
out_channels, rpn_anchor_generator.num_anchors_per_location()[0]
)
# note: out_channels = 256
# rpn_anchor_generator.num_anchors_per_location() = [3, 3, 3, 3, 3, 3]
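# For reference, torchvision's FasterRCNN builds the two top-n dicts used below as
# (this is torchvision's convention, not code shown in this file):
# rpn_pre_nms_top_n = dict(training=rpn_pre_nms_top_n_train, testing=rpn_pre_nms_top_n_test)
# rpn_post_nms_top_n = dict(training=rpn_post_nms_top_n_train, testing=rpn_post_nms_top_n_test)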
rpn = RegionProposalNetwork(
rpn_anchor_generator, # defined above
rpn_head, # defined above
rpn_fg_iou_thresh, # 0.5
rpn_bg_iou_thresh, # 0.3
rpn_batch_size_per_image, # 256
rpn_positive_fraction, # 0.5
rpn_pre_nms_top_n, # {'training': 6000, 'testing': 6000}
rpn_post_nms_top_n, # {'training': 2000, 'testing': 1000}
rpn_nms_thresh, # 0.4
)
# from rpn import AnchorGenerator, RegionProposalNetwork
if box_roi_pool is None:
box_roi_pool = MultiScaleRoIAlign(
featmap_names=["0", "1", "2", "3"], output_size=7, sampling_ratio=2
)
# note: from torchvision.ops import MultiScaleRoIAlign
if box_head is None:
resolution = box_roi_pool.output_size[0]
representation_size = 1024
box_head = TwoMLPHead(out_channels * resolution ** 2, representation_size)
# note: box_roi_pool.output_size = [7, 7]
# from torchvision.models.detection.faster_rcnn import TwoMLPHead
if box_predictor is None:
representation_size = 1024
box_predictor = FastRCNNDoFPredictor(representation_size, num_classes)
roi_heads = DOFRoIHeads(
# Box
box_roi_pool, # defined above
box_head, # defined above
box_predictor, # defined above
box_fg_iou_thresh, # 0.5
box_bg_iou_thresh, # 0.5
box_batch_size_per_image, # 512
box_positive_fraction, # 0.25
bbox_reg_weights, # None
box_score_thresh, # 0.05
box_nms_thresh, # 0.5
box_detections_per_img, # 1000
out_channels, # 256
pose_mean=pose_mean, # 6D Tensor
pose_stddev=pose_stddev, # 6D Tensor
threed_68_points=threed_68_points, # [68, 3] Tensor
threed_5_points=threed_5_points, # [5, 3] Tensor
bbox_x_factor=bbox_x_factor, # 1.1
bbox_y_factor=bbox_y_factor, # 1.1
expand_forehead=expand_forehead, # 0.3
)
# unlike an ordinary PyTorch transform, this one comes
# from torchvision.models.detection.transform import GeneralizedRCNNTransform
transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)
# note: min_size = [640, 672, 704, 736, 768, 800]
# max_size = 1400
super(FasterDoFRCNN, self).__init__(backbone, rpn, roi_heads, transform)
# summary: the final call to the parent class shows clearly that
# self.transform = transform
# self.backbone = backbone
# self.rpn = rpn
# self.roi_heads = roi_heads
# these 4 components make up the whole FasterDoFRCNN (which inherits GeneralizedRCNN)
Forward pass walkthrough
In train.py (link):
for idx, data in enumerate(self.train_loader):
imgs, targets = data
imgs = [image.to(self.config.device) for image in imgs]
targets = [
{k: v.to(self.config.device) for k, v in t.items()} for t in targets
]
# note: imgs: [ [3, 575, 767] Tensor, [3, 587, 783] Tensor ], values in [0, 1]
# targets: a list of length 2; each element is a dict with dofs, boxes, labels
# dofs [num_faces, 6]
# boxes [num_faces, 4], in the range 0~w, 0~h
# labels [num_faces], always 1
self.optimizer.zero_grad()
# forward pass
losses = self.img2pose_model.forward(imgs, targets)
# note: losses is a dict with 5 loss terms:
# ['loss_classifier', 'loss_dof_reg', 'loss_points', 'loss_objectness', 'loss_rpn_box_reg']
Enter img2pose.py; the forward-pass methods of the img2poseModel class:
def run_model(self, imgs, targets=None):
outputs = self.fpn_model(imgs, targets)
return outputs
def forward(self, imgs, targets):
losses = self.run_model(imgs, targets)
return losses
The forward pass of FasterDoFRCNN is implemented in its parent class GeneralizedRCNN.
Enter generalized_rcnn.py (link), the forward function of GeneralizedRCNN:
# torch.jit.annotate declares the element type of this empty list for TorchScript
original_image_sizes = torch.jit.annotate(List[Tuple[int, int]], [])
for img in images:
val = img.shape[-2:]
assert len(val) == 2
original_image_sizes.append((val[0], val[1]))
# note: [(575, 767), (587, 783)]
images, targets = self.transform(images, targets)
# note: images is now a <class 'torchvision.models.detection.image_list.ImageList'>
# targets keeps its original format
# Check for degenerate boxes: validates that the bbox format is legal
# 1. first, the backbone forward pass
# images.tensors: [2, 3, 768, 1024] -- the image sizes are now unified, value range [-2.11, 2.64]
features = self.backbone(images.tensors)
# note: the keys are ['0', '1', '2', '3', 'pool']
# understanding why requires reading the source of resnet_fpn_backbone
# 2. next, the RPN forward pass: predict candidate bboxes and compute the loss against bbox_gt
proposals, proposal_losses = self.rpn(images, features, targets)
# note: proposals: [ [2000, 4], [2000, 4] ]
# proposal_losses has the keys loss_objectness, loss_rpn_box_reg
# for the details of the RPN forward pass, see the forward function in rpn.py
# 3. finally, the DoF head forward pass, computing the DoF-related losses
detections, detector_losses = self.roi_heads(
features, proposals, images.image_sizes, targets
)
Understanding self.backbone
# this is how it is defined
backbone = resnet_fpn_backbone('resnet18', pretrained=False)
Enter torchvision/models/detection/backbone_utils.py (link). First:
def resnet_fpn_backbone(
backbone_name,
pretrained,
norm_layer=misc_nn_ops.FrozenBatchNorm2d,
trainable_layers=3,
returned_layers=None,
extra_blocks=None
):
# first, fetch a standard resnet
backbone = resnet.__dict__[backbone_name](
pretrained=pretrained,
norm_layer=norm_layer)
# skip some lines
if extra_blocks is None:
extra_blocks = LastLevelMaxPool()
if returned_layers is None:
returned_layers = [1, 2, 3, 4]
assert min(returned_layers) > 0 and max(returned_layers) < 5
return_layers = {f'layer{k}': str(v) for v, k in enumerate(returned_layers)}
# evaluates to {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'}
# the keys are the features' original names, the values are their new names
# skip some lines
return BackboneWithFPN(backbone, return_layers, in_channels_list, out_channels, extra_blocks=extra_blocks)
# note: in_channels_list=[64, 128, 256, 512], out_channels=256
# extra_blocks is a <class 'torchvision.ops.feature_pyramid_network.LastLevelMaxPool'>
Then:
class BackboneWithFPN(nn.Module):
def __init__(self, backbone, return_layers, in_channels_list, out_channels, extra_blocks=None):
# skip some lines
self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
# IntermediateLayerGetter is defined in torchvision/models/_utils.py;
# it is a utility that collects intermediate Tensors during the forward pass
# suppose x is [N, 3, 224, 224]
# x = self.body(x)
# x = {
#   '0': [N, 64, 56, 56],   after resnet.layer1, spatial size 1/4
#   '1': [N, 128, 28, 28],  after resnet.layer2, spatial size 1/8
#   '2': [N, 256, 14, 14],  after resnet.layer3, spatial size 1/16
#   '3': [N, 512, 7, 7]     after resnet.layer4, spatial size 1/32
# }
self.fpn = FeaturePyramidNetwork(
in_channels_list=in_channels_list,
out_channels=out_channels,
extra_blocks=extra_blocks,
)
# see torchvision/ops/feature_pyramid_network.py for the details of FeaturePyramidNetwork
# roughly:
# x = self.fpn(x)
# x = {
#   '0': [N, 256, 56, 56],  <- [N, 64, 56, 56]
#   '1': [N, 256, 28, 28],  <- [N, 128, 28, 28]
#   '2': [N, 256, 14, 14],  <- [N, 256, 14, 14]
#   '3': [N, 256, 7, 7],    <- [N, 512, 7, 7]
#   'pool': [N, 256, 4, 4]
# }
# in short: every resnet feature is raised to 256 channels, plus one extra pooled level
Overall, the backbone forward pass takes an image, extracts the layer1~layer4 features, feeds them into the FeaturePyramidNetwork, and outputs a feature pyramid at scales {1/4, 1/8, 1/16, 1/32, 1/64}, with the channel count unified to 256.
features = self.backbone(images.tensors)
# e.g. images.tensors: [2, 3, 768, 1024]
# features = {
# '0': [2, 256, 192, 256], 1/4
# '1': [2, 256, 96, 128], 1/8
# '2': [2, 256, 48, 64], 1/16
# '3': [2, 256, 24, 32], 1/32
# 'pool': [2, 256, 12, 16] 1/64
# }
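A quick sanity check of these pyramid shapes (the exact keys/shapes may vary slightly across torchvision versions):

import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone('resnet18', pretrained=False)
feats = backbone(torch.rand(1, 3, 224, 224))
for name, f in feats.items():
    print(name, tuple(f.shape))
# expected: 0 (1, 256, 56, 56), 1 (1, 256, 28, 28), 2 (1, 256, 14, 14),
#           3 (1, 256, 7, 7), pool (1, 256, 4, 4)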
Understanding self.rpn
Model evaluation
evaluation/evaluate_wider.py runs inference on the WIDER_FACE dataset; it only evaluates the bbox predictions and ignores the landmarks and dofs.
To climb the leaderboard it uses a few tricks:
it sets min_size = [200, 300, 500, 800, 1100, 1400, 1700] and flip_list = [False, True],
forming a double loop that produces 14 images for inference, then votes over the 14 sets of bbox predictions (sketched below).
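A hedged sketch of that double loop; run_detector is a hypothetical stand-in for the actual model call, and the voting/merging step is omitted:

from PIL import Image

def run_detector(img):  # hypothetical: returns a list of [x1, y1, x2, y2] boxes
    return []

def multi_scale_flip_inference(img):
    all_boxes = []
    for flip in [False, True]:
        run_img = img.transpose(Image.FLIP_LEFT_RIGHT) if flip else img
        for min_size in [200, 300, 500, 800, 1100, 1400, 1700]:
            # the real script calls fpn_model.module.set_max_min_size(...) per scale
            boxes = run_detector(run_img)
            if flip:  # map the boxes back to the un-flipped image
                w = img.size[0]
                boxes = [[w - x2, y1, w - x1, y2] for x1, y1, x2, y2 in boxes]
            all_boxes.extend(boxes)
    return all_boxes  # the real script then votes over these 14 sets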
Understanding forward at a high level
Reading the forward of class GeneralizedRCNN in generalized_rcnn.py is fairly easy to follow.
Take inference on a single 1024x678 image as an example.
Set beforehand:
min_size = 200
max_size = 302
self.img2pose_model.fpn_model.module.set_max_min_size(max_size, min_size)
# images: [ [3, 678, 1024] ]
# targets: None
def forward(self, images, targets=None):
# skip some lines
# save the sizes of images into original_image_sizes
images, targets = self.transform(images, targets)
# images is an ImageList; tensors.shape = [1, 3, 224, 320]
# note: rounding 199 and 302 up to the nearest multiples of 32 gives 224 and 320
# skip some lines
features = self.backbone(images.tensors)
# features为dict型
# '0': [1, 256, 56, 80] 1/4
# '1': [1, 256, 28, 40] 1/8
# '2': [1, 256, 14, 20] 1/16
# '3': [1, 256, 7, 10] 1/32
# 'pool': [1, 256, 4, 5] 1/64
proposals, proposal_losses = self.rpn(images, features, targets)
# proposals: [ [1000, 4] ] # candidate boxes at the 224x320 scale
# proposal_losses: {}
# features: dict, '0', '1', '2', '3', 'pool'
# proposals: [ [1000, 4] ]
# images.image_sizes: [ (199, 302) ]
detections, detector_losses = self.roi_heads(
features, proposals, images.image_sizes, targets
)
# detections: [
# {'boxes': [num_faces, 4], 'labels': [num_faces], 'scores': [num_faces], 'dofs': [num_faces, 6]}
# ]
# note: boxes are the predicted face boxes at the 224x320 scale; were the dofs denormalized? they should have been
# detector_losses: {}
# map the predicted boxes back to the original image size
detections = self.transform.postprocess(
detections, images.image_sizes, original_image_sizes
)
return self.eager_outputs(losses, detections, targets is not None)
# which effectively just returns detections
To write your own inference script: given an arbitrary image, first pad it so that its size is a multiple of 32 (see the sketch below), then call set_max_min_size on fpn_model, and finally run inference to get the results.
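A hedged sketch of the padding step (the helper name is hypothetical; set_max_min_size is then called exactly as shown above):

import numpy as np

def pad_to_multiple_of_32(img):
    # img: HxWx3 uint8 array; zero-pad at the bottom/right up to multiples of 32
    h, w = img.shape[:2]
    H = -(-h // 32) * 32  # ceil to a multiple of 32
    W = -(-w // 32) * 32
    out = np.zeros((H, W, 3), dtype=img.dtype)
    out[:h, :w] = img
    return out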
A messy aspect of the img2pose code: the Dataset side works with global 6DoF throughout, and only when the loss is computed is it converted to local 6DoF.
In fastrcnn_loss you can see the call to the pose_full_image_to_bbox method, which converts dof_regression_targets from global to local, then normalizes them, and finally computes dof_loss against dof_regression (the network's prediction).
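A hedged sketch of that label path; the conversion and normalization follow the notes above, while the smooth-L1 choice and the to_local stand-in (for pose_full_image_to_bbox) are my assumptions:

import torch.nn.functional as F

def dof_loss_sketch(dof_regression, dof_targets_global, pose_mean, pose_stddev, to_local):
    dof_targets_local = to_local(dof_targets_global)  # global -> local
    dof_targets_local = (dof_targets_local - pose_mean) / pose_stddev  # normalize
    return F.smooth_l1_loss(dof_regression, dof_targets_local)  # vs the network prediction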