Reference:
0. Environment
Ubuntu 16.04
Python 3.6
# Install with pip
torch==1.1.0
torchvision==0.3.0
numpy>=1.15.2
onnx>=1.3.0
opencv-python>=3.4.3.18
pandas>=0.23.4
tensorboardX>=1.4
tqdm>=4.26.0
pretrainedmodels>=0.7.4
networkx==2.3
# Install with apt
sudo apt-get install ffmpeg
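1. Data preparation
(1) Download the data
Official site: https://www.crcv.ucf.edu/data/UCF101.php. Two downloads on that page are needed: the first is the AVI video data, the second is the train/test split lists.
UCF101 video classification dataset: http://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar
Create a data directory and extract UCF101.rar into it:
sudo apt-get install unrar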
unrar x UCF101.rar
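Download the second archive, https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-DetectionTask.zip, and put it wherever you like; I put it under the data directory.
Reference: https://blog.csdn.net/qq_41185868/article/details/108474259
DataFrame.ix was removed in pandas 1.0, so in utils/ucf101_json.py replace every occurrence of:
data.ix
with
data.iloc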
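The replacement can also be scripted instead of done by hand. The sketch below is one possible way, using a plain text substitution; the helper name replace_ix is hypothetical (it is not part of the repository), and it assumes every data.ix in the file really should become data.iloc, as described above.

```python
from pathlib import Path


def replace_ix(path):
    """Rewrite every `data.ix` in the file at `path` to `data.iloc`.

    Returns the number of occurrences that were replaced.
    Hypothetical helper, not part of the repository.
    """
    source = Path(path).read_text(encoding="utf-8")
    count = source.count("data.ix")
    if count:
        patched = source.replace("data.ix", "data.iloc")
        Path(path).write_text(patched, encoding="utf-8")
    return count
```

From the repository root it would be called as replace_ix("utils/ucf101_json.py"). Running it a second time returns 0, because nothing is left to replace.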
(2) Generate the JSON files
Run the script to generate the JSON files:
python utils/ucf101_json.py ./data/ucf-101/ucfTrainTestlist
The generated files are written to ./data/ucf-101/ucfTrainTestlist.
(3) Convert the videos to JPG frames
CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_01.json \
--raw_dir ./data/ucf-101/UCF-101/ \
--destination_dir ./data/data/UCF101_jpg \
--video-size 480 \
--video-format frames \
--video-quality 1 \
--threads 6
CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_02.json \
--raw_dir ./data/ucf-101/UCF-101/ \
--destination_dir ./data/data/UCF101_jpg \
--video-size 480 \
--video-format frames \
--video-quality 1 \
--threads 6
CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_03.json \
--raw_dir ./data/ucf-101/UCF-101/ \
--destination_dir ./data/data/UCF101_jpg \
--video-size 480 \
--video-format frames \
--video-quality 1 \
--threads 6
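The three commands above differ only in the annotation file (ucf101_01.json to ucf101_03.json), so they can be generated in a loop. This is a minimal sketch that reuses exactly the flags and paths shown above; the helper names build_preprocess_commands and run_all are hypothetical, and the GPU selection via CUDA_VISIBLE_DEVICES is left out.

```python
import subprocess


def build_preprocess_commands(splits=(1, 2, 3), threads=6):
    """Return one preprocess_videos.py argument list per UCF101 split."""
    commands = []
    for split in splits:
        annotation = "./data/ucf-101/ucfTrainTestlist/ucf101_{:02d}.json".format(split)
        commands.append([
            "python", "utils/preprocess_videos.py",
            "--annotation_file", annotation,
            "--raw_dir", "./data/ucf-101/UCF-101/",
            "--destination_dir", "./data/data/UCF101_jpg",
            "--video-size", "480",
            "--video-format", "frames",
            "--video-quality", "1",
            "--threads", str(threads),
        ])
    return commands


def run_all(commands):
    """Run the commands one after another and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_all(build_preprocess_commands()) from the repository root would then process all three splits in sequence.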
(4) Generate n_frames
Reference: https://github.com/kenshohara/3D-ResNets-PyTorch/tree/CVPR2018
Create the following script for generating the n_frames files and put it in the utils directory:
# -*- coding: UTF-8 -*-
'''
@author: mengting gu
@contact: 1065504814@qq.com
@time: 2021/1/7 15:47
@file: n_frames_ucf101_hmdb51.py
@desc:
'''
from __future__ import print_function, division
import os
import sys
import subprocess


def class_process(dir_path, class_name):
    class_path = os.path.join(dir_path, class_name)
    if not os.path.isdir(class_path):
        return

    # Each sub-directory of a class holds the extracted frames of one video
    for file_name in os.listdir(class_path):
        video_dir_path = os.path.join(class_path, file_name)
        image_indices = []
        for image_file_name in os.listdir(video_dir_path):
            if 'image' not in image_file_name:
                continue
            # Characters 6-10 of the file name hold the 5-digit frame index
            image_indices.append(int(image_file_name[6:11]))

        if len(image_indices) == 0:
            print('no image files', video_dir_path)
            n_frames = 0
        else:
            # The largest frame index is the number of frames
            image_indices.sort(reverse=True)
            n_frames = image_indices[0]
            print(video_dir_path, n_frames)
        with open(os.path.join(video_dir_path, 'n_frames'), 'w') as dst_file:
            dst_file.write(str(n_frames))


if __name__ == "__main__":
    dir_path = sys.argv[1]
    for class_name in os.listdir(dir_path):
        class_process(dir_path, class_name)
Run it:
python utils/n_frames_ucf101_hmdb51.py ./data/data/UCF101_jpg
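Since a missing n_frames file leads to errors later on, it can be worth checking the result before training. The sketch below walks the same class/video directory layout as the script above and lists the video directories that have no n_frames file; the function name find_missing_n_frames is hypothetical.

```python
import os


def find_missing_n_frames(dir_path):
    """Return the video directories under dir_path that lack an n_frames file.

    Assumes the layout dir_path/<class_name>/<video_name>/, the same layout
    the n_frames script above iterates over.
    """
    missing = []
    for class_name in sorted(os.listdir(dir_path)):
        class_path = os.path.join(dir_path, class_name)
        if not os.path.isdir(class_path):
            continue
        for video_name in sorted(os.listdir(class_path)):
            video_dir_path = os.path.join(class_path, video_name)
            if not os.path.isdir(video_dir_path):
                continue
            if not os.path.isfile(os.path.join(video_dir_path, "n_frames")):
                missing.append(video_dir_path)
    return missing
```

For example, find_missing_n_frames("./data/data/UCF101_jpg") should return an empty list once the script above has finished.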
(5) Directory layout
Once the data is prepared, the directory layout is:
.../
    data/ (root dir)
        data/
            UCF101_jpg/ (jpg files)
        ucf-101/
            UCF-101/ (video files)
            ucfTrainTestlist/ (annotation path)
                classInd.txt
                ucf101_01.json
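A quick way to confirm this layout before training is to check that the expected paths exist. This is only a sketch: the helper name missing_paths is hypothetical, and the relative paths are the ones listed above, taken relative to the root data directory.

```python
import os

# Paths from the layout above, relative to the root data directory
EXPECTED_PATHS = (
    "data/UCF101_jpg",
    "ucf-101/UCF-101",
    "ucf-101/ucfTrainTestlist/classInd.txt",
    "ucf-101/ucfTrainTestlist/ucf101_01.json",
)


def missing_paths(root_dir, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root_dir."""
    return [rel for rel in expected
            if not os.path.exists(os.path.join(root_dir, rel))]
```

missing_paths("./data") returns an empty list when everything is in place; anything it does return is a path that still needs to be created or moved.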
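2. Training and evaluation
Just as image classification starts from ImageNet-pretrained models, video classification mostly starts from models pretrained on Kinetics.
(1) Prepare the pretrained model
This walkthrough uses ResNet34-VTN as the example. Download the corresponding pretrained model and put it under models.
(2) Directory layout
.../
    data/ (root dir)
        ...
    models/
        resnet_34_vtn_rgbd_kinetics.pth
        se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth
(3) Training
Pretrained model: resnet_34_vtn_rgbd_kinetics.pth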
CUDA_VISIBLE_DEVICES="0" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model resnet34_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --pretrain-path ./models/resnet_34_vtn_rgbd_kinetics.pth --video-path ./UCF101_jpg
(4) Evaluation
Pretrained model used for evaluation: se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth
CUDA_VISIBLE_DEVICES="2" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model se-resnext101-32x4d_vtn_rgbdiff -b128 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --no-train --no-val --test --pretrain-path ./models/se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth --video-path ./UCF101_jpg
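(5) Parameters
Which model and parameters to use is read off the table of released models (its Model, Input, and Checkpoint columns are referred to below):
--dataset: ucf101 or kinetics;
--model: determined by the Model and Input columns; for the best-performing model (93.44%) it is se-resnext101-32x4d_vtn_rgbdiff;
-b: batch size;
--lr: learning rate;
--pretrain-path: the model downloaded from the Checkpoint column;
--video-path: set this if you are not using the default "./data/data/ucf-101/jpeg2".
If --model and --pretrain-path do not correspond to each other, the run will fail.
Training se_resnext_101_32x4d_vtn_rgbd with b=2 needs roughly 7-8 GB of GPU memory.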
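Because a mismatch between --model and --pretrain-path is easy to make, a rough sanity check on the names can catch some mistakes before a long run. The sketch below is purely a naming heuristic that I inferred from the two model/checkpoint pairs used in this post; it is not part of the repository, the function names are hypothetical, and a False result is only a hint to double-check.

```python
import os
import re


def normalize(name):
    """Lowercase a name, drop separators, and shorten 'rgbdiff' to 'rgbd'."""
    compact = re.sub(r"[^a-z0-9]", "", name.lower())
    return compact.replace("rgbdiff", "rgbd")


def looks_compatible(model, pretrain_path):
    """Heuristic: does the checkpoint file name start with the model name?

    Assumption: checkpoint names follow the pattern seen in this post,
    e.g. resnet_34_vtn_rgbd_kinetics.pth for --model resnet34_vtn_rgbdiff.
    """
    stem = os.path.splitext(os.path.basename(pretrain_path))[0]
    return normalize(stem).startswith(normalize(model))


print(looks_compatible("resnet34_vtn_rgbdiff",
                       "./models/resnet_34_vtn_rgbd_kinetics.pth"))
print(looks_compatible("resnet34_vtn_rgbdiff",
                       "./models/se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth"))
```

The first call prints True and the second prints False. The check is deliberately loose: a plain RGB model name is also a prefix of the matching RGB-difference checkpoint name, so that particular mix-up is not detected.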
3. Single-frame image inference
import sys
import time
from argparse import ArgumentParser
from collections import deque
from copy import deepcopy
import cv2
import numpy as np
import torch
import torch.nn.functional as F
from action_recognition.model import create_model
from action_recognition.options import add_input_args
from action_recognition.spatial_transforms import (CenterCrop, Compose,
                                                   Normalize, Scale, ToTensor, MEAN_STATISTICS, STD_STATISTICS)
from action_recognition.utils import load_state, generate_args
import os
TEXT_COLOR = (255, 255, 255)
TEXT_FONT_FACE = cv2.FONT_HERSHEY_DUPLEX
TEXT_FONT_SIZE = 1
TEXT_VERTICAL_INTERVAL = 45
NUM_LABELS_TO_DISPLAY = 2
class TorchActionRecognition:
    def __init__(self, encoder, checkpoint_path, num_classes=400, **kwargs):
        # The "_vtn" suffix is no longer appended here, so --encoder must be
        # the full model name, e.g. resnet34_vtn
        # model_type = "{}_vtn".format(encoder)
        model_type = "{}".format(encoder)
        args, _ = generate_args(model=model_type, n_classes=num_classes, layer_norm=False, **kwargs)
        self.args = args
        self.model, _ = create_model(args, model_type)
        self.model = self.model.module
        self.model.eval()
        self.model.cuda()

        checkpoint = torch.load(str(checkpoint_path))
        load_state(self.model, checkpoint['state_dict'])

        self.preprocessing = make_preprocessing(args)
        # Rolling buffer of the most recent per-frame embeddings
        self.embeds = deque(maxlen=(args.sample_duration * args.temporal_stride))

    def preprocess_frame(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        return self.preprocessing(frame)

    def infer_frame(self, frame):
        # Embed the new frame, add it to the buffer, and classify the buffered sequence
        embedding = self._infer_embed(self.preprocess_frame(frame))
        self.embeds.append(embedding)
        sequence = self.get_seq()
        return self._infer_logits(sequence)

    def _infer_embed(self, frame):
        with torch.no_grad():
            frame_tensor = frame.unsqueeze(0).to('cuda')
            tensor = self.model.resnet(frame_tensor)
            tensor = self.model.reduce_conv(tensor)
            embed = F.avg_pool2d(tensor, 7)
        return embed.squeeze(-1).squeeze(-1)

    def _infer_logits(self, embeddings):
        with torch.no_grad():
            ys = self.model.self_attention_decoder(embeddings)
            ys = self.model.fc(ys)
            ys = ys.mean(1)
        return ys.cpu()

    def _infer_seq(self, frame):
        with torch.no_grad():
            result = self.model(frame.view(1, self.args.sample_duration, 3,
                                           self.args.sample_size, self.args.sample_size).to('cuda'))
        return result.cpu()

    def get_seq(self):
        sequence = torch.stack(tuple(self.embeds), 1)
        if self.args.temporal_stride > 1:
            sequence = sequence[:, ::self.args.temporal_stride, :]

        # Until enough frames have been seen, repeat the short sequence to length n
        n = self.args.sample_duration
        if sequence.size(1) < n:
            num_repeats = (n - 1) // sequence.size(1) + 1
            sequence = sequence.repeat(1, num_repeats, 1)[:, :n, :]

        return sequence


def make_preprocessing(args):
    return Compose([
        Scale(args.sample_size),
        CenterCrop(args.sample_size),
        ToTensor(args.norm_value),
        Normalize(MEAN_STATISTICS[args.mean_dataset], STD_STATISTICS[args.mean_dataset])
    ])


def draw_rect(image, bottom_left, top_right, color=(0, 0, 0), alpha=1.):
    # Blend a filled rectangle of the given color into the image
    xmin, ymin = bottom_left
    xmax, ymax = top_right

    image[ymin:ymax, xmin:xmax, :] = image[ymin:ymax, xmin:xmax, :] * (1 - alpha) + np.asarray(color) * alpha
    return image


def render_frame(frame, probs, labels):
    order = probs.argsort(descending=True)

    status_bar_coordinates = (
        (0, 0),  # top left
        (650, 25 + TEXT_VERTICAL_INTERVAL * NUM_LABELS_TO_DISPLAY)  # bottom right
    )

    draw_rect(frame, status_bar_coordinates[0], status_bar_coordinates[1], alpha=0.5)

    for i, imax in enumerate(order[:NUM_LABELS_TO_DISPLAY]):
        text = '{} - {:.1f}%'.format(labels[imax], probs[imax] * 100)
        text = text.upper().replace("_", " ")
        # cv2.putText takes the font face first, then the font scale
        cv2.putText(frame, text, (15, TEXT_VERTICAL_INTERVAL * (i + 1)), TEXT_FONT_FACE,
                    TEXT_FONT_SIZE, TEXT_COLOR)

    return frame


def run_demo(model, labels, input_path, save_path):
    fps = 30
    tick = time.time()
    # cv2.imwrite does not create missing directories, so create the output directory first
    os.makedirs(save_path, exist_ok=True)
    for file in sorted(os.listdir(input_path)):
        frame = cv2.imread(os.path.join(input_path, file))
        if frame is None:
            print("Could not read image: " + os.path.join(input_path, file))
            break
        print("Now processing file : {}".format(file))
        # while video_cap.isOpened():
        #     ok, frame = video_cap.read()
        #
        #     if not ok:
        #         break

        logits = model.infer_frame(frame)
        probs = F.softmax(logits[0], dim=0)
        frame = render_frame(frame, probs, labels)

        # delay is only needed by the commented-out cv2.imshow preview below
        tock = time.time()
        expected_time = tick + 1 / fps
        if tock < expected_time:
            delay = max(1, int((expected_time - tock) * 1000))
        tick = tock

        cv2.imwrite(os.path.join(save_path, file), frame)
        # cv2.imshow("demo", frame)
        # key = cv2.waitKey(delay)
        # if key == 27 or key == ord('q'):
        #     break


def main():
    parser = ArgumentParser()
    parser.add_argument("--encoder", help="What encoder to use ", default='resnet34')
    parser.add_argument("--checkpoint", help="Path to pretrained model (.pth) file", required=True)
    parser.add_argument("--input-video", type=str, help="Path to input img or video", required=True)
    parser.add_argument("--save-path", type=str, help="Path to save img", required=True)
    parser.add_argument("--labels", help="Path to labels file (new-line separated file with label names)", type=str,
                        required=True)
    add_input_args(parser)
    args = parser.parse_args()

    with open(args.labels) as fd:
        labels = fd.read().strip().split('\n')

    # Keep only the options that belong to the parser's last argument group
    extra_args = deepcopy(vars(args))
    input_data_params = set(x.dest for x in parser._action_groups[-1]._group_actions)
    for name in list(extra_args.keys()):
        if name not in input_data_params:
            del extra_args[name]

    input_path = args.input_video
    save_path = args.save_path
    try:
        model = TorchActionRecognition(args.encoder, args.checkpoint, num_classes=len(labels), **extra_args)
        # cap = cv2.VideoCapture(args.input_video)
        run_demo(model, labels, input_path, save_path)
    except Exception as error:
        # Concatenating a str with an exception raises TypeError, so format it instead
        print("an error occurred: {}".format(error))


if __name__ == '__main__':
    sys.exit(main())
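The least obvious part of the script is get_seq: it keeps every temporal_stride-th buffered embedding and, while fewer than sample_duration of them are available, repeats the short sequence until it is long enough. The sketch below mirrors that logic on a plain Python list instead of a tensor, so the behaviour can be checked without a GPU; the function name build_sequence is hypothetical.

```python
def build_sequence(items, sample_duration, temporal_stride=1):
    """List-based mirror of get_seq: subsample by stride, then pad by repetition.

    `items` stands in for the buffered per-frame embeddings and must not be
    empty (in the demo a frame is always appended before get_seq is called).
    """
    sequence = list(items)
    if temporal_stride > 1:
        sequence = sequence[::temporal_stride]

    n = sample_duration
    if len(sequence) < n:
        # Smallest repeat count that reaches at least n items, then cut to n
        num_repeats = (n - 1) // len(sequence) + 1
        sequence = (sequence * num_repeats)[:n]
    return sequence


print(build_sequence([0, 1, 2], sample_duration=8))
# [0, 1, 2, 0, 1, 2, 0, 1]
print(build_sequence([0, 1, 2, 3, 4], sample_duration=4, temporal_stride=2))
# [0, 2, 4, 0]
```

So for the first few frames the model sees the same embeddings several times over, which is why the early predictions should be treated with some caution.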
Save the script above as vtn_jpg_demo.py, prepare a directory of input images and a model, then run:
CUDA_VISIBLE_DEVICES="0" python vtn_jpg_demo.py --encoder resnet34_vtn --checkpoint ./data/models/resnet_34_vtn_rgb_ucf101_s1.pth --input-video ./data/ourdata/input --save-path ./data/ourdata/output --labels ./data/ucf-101/ucfTrainTestlist/classInd.txt
Result: the annotated frames are written to the directory given by --save-path.
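One detail about the labels: UCF101's classInd.txt stores an index and a class name on each line (for example "1 ApplyEyeMakeup"), and the script uses each whole line as the label, so the index shows up in the rendered text. If only the class names are wanted, something like the following sketch could be used to parse the file contents; parse_class_names is a hypothetical helper, not part of the demo.

```python
def parse_class_names(text):
    """Return the class names from classInd.txt-style text.

    Each non-empty line is expected to look like "<index> <name>"; a line
    that holds only a name is kept unchanged.
    """
    names = []
    for line in text.strip().splitlines():
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        names.append(parts[-1].strip())
    return names


print(parse_class_names("1 ApplyEyeMakeup\n2 ApplyLipstick\n"))
# ['ApplyEyeMakeup', 'ApplyLipstick']
```

The line order is preserved, so the position of each name in the list still corresponds to its class index.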
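4. Notes
1) If you hit all sorts of model-related errors, the most likely cause is that the checkpoint being loaded does not match the model name given earlier;
2) If the data is clearly in the right place under data but the run still fails, it is because the code builds the path internally as data/data + (video_path);
3) If the n_frames file is reported missing, see step 1 (4);
4) Possible causes of errors when testing a pretrained model:
During data preparation, python utils/preprocess_videos.py defaults to -q=4, which saves the images at the lowest quality;
At inference time the arguments are wrong: --no-mean-norm --no-std-norm were passed when they should not be, or left out when they were needed.
Note: each of these cost me a lot of time to track down, and there are a few more that I no longer remember.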