I have recently been studying the occluded person re-identification paper "Pose-Guided Feature Disentangling for Occluded Person Re-identification Based on Transformer" and its source code. The backbone of the PFD network in this paper is TransReID, the backbone network from "TransReID: Transformer-based Object Re-Identification", so both are discussed together below.
This post covers the part of the source code that calls HRNet to detect human keypoints and return heatmaps (the part marked with red arrows in the network diagram). The code lives in the class SimpleHRNet in model\pose_net.py.
The entry point is the predict function; _predict_single(self, image) runs HRNet directly on a single image, and reading this function is enough to understand the calling process.
def _predict_single(self, image):
    # Single-person branch; only this branch is used in this paper
    if not self.multiperson:
        # Record the original image resolution
        old_res = image.shape
        # If a target resolution is set, resize the image to it
        if self.resolution is not None:
            image = cv2.resize(
                image,
                (self.resolution[1], self.resolution[0]),  # (width, height)
                interpolation=self.interpolation
            )
        # Convert BGR -> RGB, apply the preprocessing transform, and add a batch dimension
        images = self.transform(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)).unsqueeze(dim=0)
        # A single bounding box covering the whole (original-resolution) image
        boxes = np.asarray([[0, 0, old_res[1], old_res[0]]], dtype=np.float32)
        # All-zero array that will hold the joint heatmaps
        heatmaps = np.zeros((1, self.nof_joints, self.resolution[0] // 4, self.resolution[1] // 4),
                            dtype=np.float32)
    else:
        # Multi-person branch; not used in this paper
        detections = self.detector.predict_single(image)
        # Number of detected people
        nof_people = len(detections) if detections is not None else 0
        # Pre-allocate the bounding-box, image, and heatmap arrays
        boxes = np.empty((nof_people, 4), dtype=np.int32)
        images = torch.empty((nof_people, 3, self.resolution[0], self.resolution[1]))
        heatmaps = np.zeros((nof_people, self.nof_joints, self.resolution[0] // 4, self.resolution[1] // 4),
                            dtype=np.float32)
        # If any people were detected, process each one
        if detections is not None:
            for i, (x1, y1, x2, y2, conf, cls_conf, cls_pred) in enumerate(detections):
                x1 = int(round(x1.item()))
                x2 = int(round(x2.item()))
                y1 = int(round(y1.item()))
                y2 = int(round(y2.item()))
                # Adjust the box to match the aspect ratio of the HRNet input
                correction_factor = self.resolution[0] / self.resolution[1] * (x2 - x1) / (y2 - y1)
                if correction_factor > 1:
                    # Grow the box along y
                    center = y1 + (y2 - y1) // 2
                    length = int(round((y2 - y1) * correction_factor))
                    y1 = max(0, center - length // 2)
                    y2 = min(image.shape[0], center + length // 2)
                elif correction_factor < 1:
                    # Grow the box along x
                    center = x1 + (x2 - x1) // 2
                    length = int(round((x2 - x1) * 1 / correction_factor))
                    x1 = max(0, center - length // 2)
                    x2 = min(image.shape[1], center + length // 2)
                boxes[i] = [x1, y1, x2, y2]
                images[i] = self.transform(image[y1:y2, x1:x2, ::-1])

    # If there is at least one image, run the model
    if images.shape[0] > 0:
        images = images.to(self.device)
        with torch.no_grad():
            # Predict in one shot, or in batches of max_batch_size
            if len(images) <= self.max_batch_size:
                out = self.model(images)
            else:
                out = torch.empty(
                    (images.shape[0], self.nof_joints, self.resolution[0] // 4, self.resolution[1] // 4),
                    device=self.device
                )
                for i in range(0, len(images), self.max_batch_size):
                    out[i:i + self.max_batch_size] = self.model(images[i:i + self.max_batch_size])

        # Move the model output to a numpy array
        out = out.detach().cpu().numpy()
        pts = np.empty((out.shape[0], out.shape[1], 3), dtype=np.float32)
        # For each person and each joint, find the heatmap maximum and
        # map it back to original-image coordinates plus a confidence
        for i, human in enumerate(out):
            heatmaps[i] = human
            for j, joint in enumerate(human):
                pt = np.unravel_index(np.argmax(joint), (self.resolution[0] // 4, self.resolution[1] // 4))
                pts[i, j, 0] = pt[0] * 1. / (self.resolution[0] // 4) * (boxes[i][3] - boxes[i][1]) + boxes[i][1]
                pts[i, j, 1] = pt[1] * 1. / (self.resolution[1] // 4) * (boxes[i][2] - boxes[i][0]) + boxes[i][0]
                pts[i, j, 2] = joint[pt]
    else:
        # No people detected: return an empty keypoint array
        pts = np.empty((0, 0, 3), dtype=np.float32)

    # Assemble the return value: heatmaps and bounding boxes are optional
    res = list()
    if self.return_heatmaps:
        res.append(heatmaps)
    if self.return_bounding_boxes:
        res.append(boxes)
    res.append(pts)
    if len(res) > 1:
        return res
    else:
        return res[0]
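The aspect-ratio correction in the multi-person branch is the least obvious step above, so here is a self-contained sketch of just that logic. The function name adjust_box_to_aspect is hypothetical, and the target resolution 256×128 (matching the [bs, 17, 64, 32] heatmap shape mentioned later) is an assumption:

```python
import numpy as np

def adjust_box_to_aspect(x1, y1, x2, y2, img_h, img_w, out_h=256, out_w=128):
    """Expand a detection box so its aspect ratio matches the HRNet
    input (out_h / out_w), mirroring the logic in _predict_single."""
    correction_factor = out_h / out_w * (x2 - x1) / (y2 - y1)
    if correction_factor > 1:
        # Box is too wide relative to its height -> grow it along y
        center = y1 + (y2 - y1) // 2
        length = int(round((y2 - y1) * correction_factor))
        y1 = max(0, center - length // 2)
        y2 = min(img_h, center + length // 2)
    elif correction_factor < 1:
        # Box is too tall relative to its width -> grow it along x
        center = x1 + (x2 - x1) // 2
        length = int(round((x2 - x1) / correction_factor))
        x1 = max(0, center - length // 2)
        x2 = min(img_w, center + length // 2)
    return x1, y1, x2, y2

# A square 100x100 box is too wide for a 2:1 target, so y is expanded:
print(adjust_box_to_aspect(100, 100, 200, 200, 500, 500))  # (100, 50, 200, 250)
# A 100x300 box is too tall, so x is expanded instead:
print(adjust_box_to_aspect(100, 100, 200, 400, 500, 500))  # (75, 100, 225, 400)
```

Note that `max(0, ...)` and `min(img_h, ...)` clip the grown box at the image border, so boxes near the edge may end up slightly off the target aspect ratio.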
In this work, all we actually need is the heatmap. The corresponding call site is the forward method in model\make_pfd.py:
def forward(self, x, label=None, cam_label=None, view_label=None):  # ht optional
    bs, c, h, w = x.shape  # [batch, 3, 256, 128]
    # HRNet:
    heatmaps, joints = self.pose.predict(x)
    heatmaps = torch.from_numpy(heatmaps).cuda()  # [bs, 17, 64, 32]
So we should focus on the following part of the predict function. Here out holds the raw heatmaps with shape (n_people, nof_joints, height // 4, width // 4), while pts stores (joint_y, joint_x, confidence) for every person and joint.
if images.shape[0] > 0:
    images = images.to(self.device)
    with torch.no_grad():
        if len(images) <= self.max_batch_size:
            out = self.model(images)
        else:
            out = torch.empty(
                (images.shape[0], self.nof_joints, self.resolution[0] // 4, self.resolution[1] // 4),
                device=self.device
            )
            for i in range(0, len(images), self.max_batch_size):
                out[i:i + self.max_batch_size] = self.model(images[i:i + self.max_batch_size])

    out = out.detach().cpu().numpy()
    pts = np.empty((out.shape[0], out.shape[1], 3), dtype=np.float32)
    # For each human, for each joint: y, x, confidence
    for i, human in enumerate(out):
        heatmaps[i] = human
        for j, joint in enumerate(human):
            pt = np.unravel_index(np.argmax(joint), (self.resolution[0] // 4, self.resolution[1] // 4))
            # 0: pt_y / (height // 4) * (bb_y2 - bb_y1) + bb_y1
            # 1: pt_x / (width // 4) * (bb_x2 - bb_x1) + bb_x1
            # 2: confidence
            pts[i, j, 0] = pt[0] * 1. / (self.resolution[0] // 4) * (boxes[i][3] - boxes[i][1]) + boxes[i][1]
            pts[i, j, 1] = pt[1] * 1. / (self.resolution[1] // 4) * (boxes[i][2] - boxes[i][0]) + boxes[i][0]
            pts[i, j, 2] = joint[pt]
heatmaps has shape [bs, 17, 64, 32]. I printed one joint's heatmap, shown below. It confirms that what we get really is a heatmap (of course it should be, but I wanted to see for myself): each cell covers a 4×4 pixel block of the input image and holds the confidence that the corresponding body part lies there. This makes it much clearer what we actually obtain from HRNet.
print(heatmaps[0][1])
tensor([[ 2.3842e-05,  2.3842e-05,  2.3842e-05,  ...,  2.3842e-05,  2.3842e-05,  2.3842e-05],
        [ 2.3842e-05,  2.3842e-05,  2.3842e-05,  ...,  2.3842e-05,  2.3842e-05,  2.3842e-05],
        [ 2.3842e-05,  2.3842e-05,  2.3842e-05,  ...,  2.3842e-05,  2.3842e-05,  2.3842e-05],
        ...,
        [ 2.3842e-05,  2.3842e-05,  2.2471e-05,  ...,  5.1856e-06,  2.3842e-05,  2.3842e-05],
        [ 2.2411e-05,  2.2233e-05,  2.9743e-05,  ...,  9.9003e-05, -9.8050e-05, -9.8050e-05],
        [ 2.0862e-05,  2.1398e-05,  1.8537e-05,  ..., -3.6538e-05, -1.1623e-04, -9.8050e-05]],
       device='cuda:0')
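The same argmax lookup that predict uses to turn one of these heatmaps into a keypoint can be reproduced on a synthetic heatmap. All names below are illustrative; the 64×32 grid matches the [bs, 17, 64, 32] shape above, and the box is the whole 256×128 input:

```python
import numpy as np

# One 64x32 heatmap (input 256x128, stride 4) with a synthetic peak.
h4, w4 = 64, 32
joint = np.zeros((h4, w4), dtype=np.float32)
joint[40, 10] = 0.9  # pretend this joint fires at heatmap row 40, col 10

# Same lookup as in predict(): flatten-argmax, then unravel back to (y, x).
pt = np.unravel_index(np.argmax(joint), (h4, w4))

# Map the heatmap cell back into the bounding box; here the box is the
# full image, i.e. boxes[i] = [0, 0, 128, 256] as (x1, y1, x2, y2).
x1, y1, x2, y2 = 0, 0, 128, 256
y = float(pt[0] / h4 * (y2 - y1) + y1)  # y == 160.0
x = float(pt[1] / w4 * (x2 - x1) + x1)  # x == 40.0
conf = float(joint[pt])
```

Each heatmap cell thus maps back onto a 4×4 block of input pixels, which is exactly the "every 4×4 block carries one confidence" reading of the printout above.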
Each of the 17 joint heatmaps is then filtered with a threshold (0.3), which corresponds to the part of the paper below. Note that the code does not match the paper here: it sets values below the threshold to 1 and values above it to 0, the opposite of the paper's convention. Because the subsequent processing follows the same inverted convention throughout, the final result is still correct.
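To see why the inversion is harmless, compare the two conventions on some hypothetical per-joint scores (the 0.3 threshold is from the text; the variable names are illustrative):

```python
import numpy as np

# Hypothetical per-joint scores (e.g. heatmap maxima); four values are
# enough to show the point, the real code has 17 joints.
scores = np.array([0.05, 0.29, 0.31, 0.80], dtype=np.float32)

label_paper = (scores > 0.3).astype(np.float32)  # paper: 1 = above threshold
label_code = (scores < 0.3).astype(np.float32)   # code:  1 = below threshold

# The two labelings are exact complements (no score equals 0.3 here),
# so any downstream step that consistently sticks to one convention
# selects exactly the same set of joints.
assert np.array_equal(label_paper + label_code, np.ones_like(scores))
```

In other words, the bug would only surface if later code mixed the two conventions, which it does not.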