main
main consists of four parts: Dataloaders, Compile Loss, Initialize networks, optimizers and lr_schedulers, and Start Training.
Dataloaders
Compile Loss
loss = VIBELoss(
    e_loss_weight=cfg.LOSS.KP_2D_W,
    e_3d_loss_weight=cfg.LOSS.KP_3D_W,
    e_pose_loss_weight=cfg.LOSS.POSE_W,
    e_shape_loss_weight=cfg.LOSS.SHAPE_W,
    d_motion_loss_weight=cfg.LOSS.D_MOTION_LOSS_W,
)
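The loss combines a 2D keypoint loss, a 3D keypoint loss, a pose parameter loss, a shape parameter loss, and a motion loss.
- The 2D/3D keypoint losses use a weighted MSELoss.
- The pose and shape parameter losses use MSELoss. For the pose loss, the axis-angle representation is first expanded with the Rodrigues formula into rotation matrices of shape [B, 24, 3, 3], and MSELoss is then computed element-wise against the gt rotation matrices.
- The motion discriminator loss constrains how plausible the motion is over a video clip.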
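The pose loss described above can be sketched as follows. This is a minimal illustration, not the repository's own code: the names `batch_rodrigues` and `pose_loss` are chosen here for the example, and the predicted pose is assumed to arrive as a flat [B, 72] axis-angle tensor (24 joints x 3), which is inferred from the [B, 24, 3, 3] shape mentioned in the text.

```python
import torch
import torch.nn.functional as F


def batch_rodrigues(axis_angle: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Convert axis-angle vectors [N, 3] to rotation matrices [N, 3, 3]."""
    angle = torch.norm(axis_angle, dim=1, keepdim=True)   # [N, 1]
    axis = axis_angle / (angle + eps)                     # unit axis, [N, 3]
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    zeros = torch.zeros_like(x)
    # Skew-symmetric cross-product matrix K of each unit axis.
    K = torch.stack([zeros, -z, y,
                     z, zeros, -x,
                     -y, x, zeros], dim=1).view(-1, 3, 3)
    eye = torch.eye(3, dtype=axis_angle.dtype, device=axis_angle.device).unsqueeze(0)
    sin = torch.sin(angle).unsqueeze(-1)                  # [N, 1, 1]
    cos = torch.cos(angle).unsqueeze(-1)                  # [N, 1, 1]
    # Rodrigues formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return eye + sin * K + (1.0 - cos) * torch.bmm(K, K)


def pose_loss(pred_pose: torch.Tensor, gt_rotmat: torch.Tensor) -> torch.Tensor:
    """pred_pose: [B, 72] axis-angle (assumed layout); gt_rotmat: [B, 24, 3, 3]."""
    batch_size = pred_pose.shape[0]
    pred_rotmat = batch_rodrigues(pred_pose.reshape(-1, 3)).view(batch_size, 24, 3, 3)
    # Element-wise MSE between predicted and ground-truth rotation matrices.
    return F.mse_loss(pred_rotmat, gt_rotmat)
```

Comparing rotation matrices rather than raw axis-angle vectors avoids penalising two different axis-angle vectors that encode the same rotation.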
Initialize networks, optimizers and lr_schedulers
networks
VIBE
MotionDiscriminator
The input has shape [batch_size, seq_len, input_size]. For example, an input written as [2, 16, 6:75] means that the batch holds two video sequences, each sequence is 16 frames long, and the input for each frame is elements 6:75 of the predicted pose parameters, i.e. 69 values per frame.
batchsize, seqlen, input_size = sequence.shape
sequence = torch.transpose(sequence, 0, 1)  # [b, s, i] => [s, b, i]
outputs, state = self.gru(sequence)  # [s, b, i] => [s, b, hidden_size], [num_layers, b, hidden_size]
outputs = F.relu(outputs)
# permute to [b, hidden_size, s], then pool over the time axis
avg_pool = F.adaptive_avg_pool1d(outputs.permute(1, 2, 0), 1).view(batchsize, -1)  # [b, hidden_size]
max_pool = F.adaptive_max_pool1d(outputs.permute(1, 2, 0), 1).view(batchsize, -1)  # [b, hidden_size]
output = self.fc(torch.cat([avg_pool, max_pool], dim=1))  # [b, 2*hidden_size] => [b, output_size]
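To check the tensor shapes in this forward pass, the steps can be wrapped in a small self-contained module. This is only a sketch under assumptions: the class name `MotionDiscriminatorSketch`, the constructor arguments, the small `hidden_size=32`, and the 85-dimensional per-frame parameter vector used to demonstrate the 6:75 slice are all illustrative and not taken from the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionDiscriminatorSketch(nn.Module):
    """GRU over time, then concatenated average and max pooling, then a linear layer."""

    def __init__(self, input_size=69, hidden_size=32, num_layers=2, output_size=1):
        super().__init__()
        # nn.GRU defaults to batch_first=False, so it expects [s, b, i].
        self.gru = nn.GRU(input_size, hidden_size, num_layers=num_layers)
        self.fc = nn.Linear(2 * hidden_size, output_size)

    def forward(self, sequence):
        batchsize, seqlen, input_size = sequence.shape
        sequence = torch.transpose(sequence, 0, 1)      # [b, s, i] => [s, b, i]
        outputs, state = self.gru(sequence)             # [s, b, hidden_size]
        outputs = F.relu(outputs)
        features = outputs.permute(1, 2, 0)             # [b, hidden_size, s]
        # Pool over the time axis, leaving one value per hidden channel.
        avg_pool = F.adaptive_avg_pool1d(features, 1).view(batchsize, -1)  # [b, hidden_size]
        max_pool = F.adaptive_max_pool1d(features, 1).view(batchsize, -1)  # [b, hidden_size]
        return self.fc(torch.cat([avg_pool, max_pool], dim=1))            # [b, output_size]


# Hypothetical per-frame parameter vector of length 85 (an assumption for this demo).
theta = torch.randn(2, 16, 85)
model = MotionDiscriminatorSketch(input_size=69, hidden_size=32)
scores = model(theta[:, :, 6:75])  # slice 6:75 keeps 69 values per frame
print(scores.shape)
```

Pooling over the time axis yields one score per sequence regardless of `seq_len`, so the same module works for clips of different lengths.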