Preface
This reproduction uses the video-handling utilities available in PyTorch 1.7 for training. For details, see: https://blog.csdn.net/qq_36627158/article/details/113791050
Reproduced Paper & Code
GitHub: https://github.com/BizhuWu/C3D_PyTorch (if you find it useful, please give it a star~)
Paper: Learning Spatiotemporal Features with 3D Convolutional Networks
Problems Encountered
1. Learning-rate decay & using lr_scheduler
The standard usage pattern is:

scheduler = torch.optim.lr_scheduler.xxx()
for epoch in range(epochs):
    train(...)
    optimizer.step()
    scheduler.step()

But I ran into a problem:
import torch
from torchvision.models import resnet18

net = resnet18()
optimizer = torch.optim.SGD(net.parameters(), 0.1)
# two schedulers are (mistakenly) attached to the same optimizer here
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3, 6, 9], gamma=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 3, gamma=0.1)

for i in range(10):
    print(i, scheduler.get_lr())  # get_lr() is deprecated in favor of get_last_lr()
    scheduler.step()
The output is wrong:
0 [0.1]
1 [0.1]
2 [0.1]
3 [0.0010000000000000002]
4 [0.010000000000000002]
5 [0.010000000000000002]
6 [0.00010000000000000003]
7 [0.0010000000000000002]
8 [0.0010000000000000002]
9 [1.0000000000000004e-05]
Solution:
- https://blog.csdn.net/lrs1353281004/article/details/97291890#comments_15046605
- https://github.com/pytorch/pytorch/issues/22107
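Based on the two references above, a minimal sketch of the fix (assuming PyTorch ≥ 1.4): attach exactly one scheduler to the optimizer, and read the current rate with get_last_lr() rather than the deprecated get_lr(). The single-parameter optimizer here is only a stand-in for the real model:

```python
import torch

# one parameter is enough to construct an optimizer for the demo
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)

# exactly one scheduler per optimizer; chaining a second one on the
# same optimizer is what produced the garbled output above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

lrs = []
for epoch in range(10):
    # get_last_lr() reports the rate actually used in this epoch;
    # get_lr() returns misleading values when called at milestones
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()

print(lrs)  # 0.1 for epochs 0-2, then 0.01, 0.001, 0.0001 every 3 epochs
```

With this pattern the printed rates decay cleanly by a factor of 10 every 3 epochs, matching what the MultiStepLR/StepLR configuration above was meant to do.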
2. Iterating over the UCF101 DataLoader
When iterating over the dataloader:

for i, (v, a, l) in enumerate(dataloader):  # <- RuntimeError occurs here
    pass

the following error is raised:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 2 and 1 in dimension 1 at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensor.cpp:612
or
RuntimeError: stack expects each tensor to be equal size, but got [2, 28800] at entry 0 and [1, 28800] at entry 6
Root-cause analysis & reference solution: https://github.com/pytorch/vision/issues/2265
Judging from the error messages, the audio tensors are the problem (their sizes differ across samples),
so it is recommended to write a custom collate_fn that filters the returned audio out of the batch.
References on collate_fn usage:
- https://blog.csdn.net/weixin_42464187/article/details/104795574?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522161388068816780271592099%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=161388068816780271592099&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-1-104795574.first_rank_v2_pc_rank_v29&utm_term=collate_fn
- https://blog.csdn.net/AWhiteDongDong/article/details/110233400?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.baidujs&dist_request_id=29ea2b19-257e-4181-a160-49c4e50e2f71&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.baidujs
def custom_collate(batch):
    filtered_batch = []
    for video, _, label in batch:
        filtered_batch.append((video, label))
    return torch.utils.data.dataloader.default_collate(filtered_batch)
Putting it together for loading the UCF101 dataset:

def custom_collate(batch):
    filtered_batch = []
    for video, _, label in batch:
        filtered_batch.append((video, label))
    return torch.utils.data.dataloader.default_collate(filtered_batch)

trainset = datasets.UCF101(
    root='data/UCF101/UCF-101',
    annotation_path='data/UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist',
    frames_per_clip=FRAME_LENGTH,
    num_workers=0,
    transform=transform,
)

trainset_loader = DataLoader(
    trainset,
    batch_size=TRAIN_BATCH_SIZE,
    shuffle=True,
    num_workers=0,
    collate_fn=custom_collate,
)
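The filtering behavior of custom_collate can be checked without downloading UCF101 by feeding it a hand-built batch of (video, audio, label) triples whose audio tensors have mismatched sizes; the tensor shapes below are purely illustrative:

```python
import torch
from torch.utils.data.dataloader import default_collate

def custom_collate(batch):
    # drop the audio element so that tensors of unequal size
    # never reach default_collate
    filtered_batch = []
    for video, _, label in batch:
        filtered_batch.append((video, label))
    return default_collate(filtered_batch)

# a hand-built batch mimicking the UCF101 failure: the audio tensors
# have sizes [2, 28800] and [1, 28800], which default_collate cannot stack
batch = [
    (torch.zeros(3, 16, 112, 112), torch.zeros(2, 28800), 0),
    (torch.zeros(3, 16, 112, 112), torch.zeros(1, 28800), 1),
]
videos, labels = custom_collate(batch)
print(videos.shape)  # torch.Size([2, 3, 16, 112, 112])
print(labels)        # tensor([0, 1])
```

Because the mismatched audio never reaches default_collate, the videos stack into a single batch tensor and no RuntimeError is raised.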
3. Saving loss values for visualization
An error was raised at plt.show():
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Checking the code, I had forgotten to call .item() when saving the loss; .item() copies the scalar out of the CUDA tensor as a plain Python float on the host... (the code on GitHub has been updated)
References:
- How to save and visualize the loss: https://blog.csdn.net/weixin_38324105/article/details/90202840
- If every use of the loss in the code refers to the tensor loss directly, memory usage grows with each iteration until CPU or GPU memory is exhausted; changing every use of loss except loss.backward() to loss.item() solves this: https://blog.csdn.net/StarfishCu/article/details/112473856?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.control&dist_request_id=9c3dad85-c316-46f1-8976-8ec6f5639307&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.control
- Usage of .item(): https://www.jianshu.com/p/79da0eac5f01
plt_loss = []
iteration = []
# Training loop
plt_loss.append(loss.item())  # loss.item() is the scalar loss value output at each iteration
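A minimal runnable sketch of this pattern, with a made-up shrinking loss standing in for the real criterion output (which may live on the GPU):

```python
import torch

plt_loss = []
iteration = []

# stand-in for the training loop; in the real code `loss`
# comes from the criterion and may be a CUDA tensor
for i in range(5):
    loss = torch.tensor(1.0 / (i + 1))  # hypothetical shrinking loss
    plt_loss.append(loss.item())        # .item(): tensor -> plain Python float on the host
    iteration.append(i)

# plotting now works because plt_loss holds floats, not CUDA tensors:
#   import matplotlib.pyplot as plt
#   plt.plot(iteration, plt_loss)
#   plt.show()
print(plt_loss)
```

Appending .item() instead of the tensor also avoids the growing-memory problem described above, since no computation graph is kept alive by the saved values.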