[Video Classification] Reproducing training_extensions/action_recognition

Reference:

https://github.com/openvinotoolkit/training_extensions/tree/develop/pytorch_toolkit/action_recognition

0. Environment

Ubuntu 16.04
Python 3.6

# install via pip
torch==1.1.0
torchvision==0.3.0
numpy>=1.15.2
onnx>=1.3.0
opencv-python>=3.4.3.18
pandas>=0.23.4
tensorboardX>=1.4
tqdm>=4.26.0
pretrainedmodels>=0.7.4
networkx==2.3


# install via apt
sudo apt-get install ffmpeg
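
A quick way to confirm the environment before going further (a minimal sketch; it only checks that the key packages import and that a GPU is visible to PyTorch):

import torch
import torchvision
import cv2
import pandas

print('torch', torch.__version__, 'CUDA available:', torch.cuda.is_available())
print('torchvision', torchvision.__version__)
print('opencv', cv2.__version__)
print('pandas', pandas.__version__)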

 

1. Data preparation

(1) Download the data

Official site: https://www.crcv.ucf.edu/data/UCF101.php. Two downloads from this page are needed: the first is the AVI video data, the second is the official train/test split lists.

UCF101 video classification dataset: http://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar

Create a data directory and extract UCF101.rar into it:

apt-get install unrar
unrar x UCF101.rar

 

Download the second file, https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-DetectionTask.zip, and unpack it wherever you like; I placed it under the data directory.

Reference: https://blog.csdn.net/qq_41185868/article/details/108474259

In utils/ucf101_json.py, replace every occurrence of

data.ix

with

data.iloc
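
The reason is that the .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, leaving only the label-based .loc and the position-based .iloc. A minimal sketch of the positional access the script relies on (assuming the split lists, e.g. trainlist01.txt, are space-delimited, as in the 3D-ResNets scripts):

import pandas as pd

# Read a split list with one "ClassName/video.avi label" row per line.
data = pd.read_csv('./data/ucf-101/ucfTrainTestlist/trainlist01.txt',
                   delimiter=' ', header=None)

# Old, removed API:   data.ix[0, 0]
# Version-proof replacement, purely positional:
print(data.iloc[0, 0])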

(2) Generate the json annotation files

Run the script to generate the json files:

python utils/ucf101_json.py ./data/ucf-101/ucfTrainTestlist

The generated files end up in the ./data/ucf-101/ucfTrainTestlist directory.
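
As a quick sanity check (a sketch, assuming the script produces the 3D-ResNets-style layout with top-level 'labels' and 'database' keys), load one of the generated files and count its entries:

import json

with open('./data/ucf-101/ucfTrainTestlist/ucf101_01.json') as f:
    anno = json.load(f)

print(list(anno.keys()))                      # e.g. ['labels', 'database']
print('classes:', len(anno.get('labels', [])))

db = anno.get('database', {})
subsets = [v.get('subset') for v in db.values()]
print('videos:', len(db),
      '| training:', subsets.count('training'),
      '| validation:', subsets.count('validation'))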

(3) Convert videos to jpg frames

CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_01.json \
    --raw_dir ./data/ucf-101/UCF-101/ \
    --destination_dir ./data/data/UCF101_jpg \
    --video-size 480 \
    --video-format frames \
    --video-quality 1 \
    --threads 6

CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_02.json \
    --raw_dir ./data/ucf-101/UCF-101/ \
    --destination_dir ./data/data/UCF101_jpg \
    --video-size 480 \
    --video-format frames \
    --video-quality 1 \
    --threads 6

CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_03.json \
    --raw_dir ./data/ucf-101/UCF-101/ \
    --destination_dir ./data/data/UCF101_jpg \
    --video-size 480 \
    --video-format frames \
    --video-quality 1 \
    --threads 6
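
After the three runs finish, every video should have its own directory of frames under ./data/data/UCF101_jpg/<class>/<video>/. The n_frames script in the next step assumes the frames are named image_XXXXX.jpg (it parses the index from characters 6:11 of the file name), so a quick check like the sketch below can catch extraction problems early:

import os

root = './data/data/UCF101_jpg'
some_class = sorted(os.listdir(root))[0]
some_video = sorted(os.listdir(os.path.join(root, some_class)))[0]
video_dir = os.path.join(root, some_class, some_video)

frames = sorted(f for f in os.listdir(video_dir) if f.startswith('image'))
print(video_dir, 'contains', len(frames), 'frames, e.g.', frames[:3])
# Names should look like image_00001.jpg so that int(name[6:11]) in the
# n_frames script below recovers the frame index.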

(4) Generate n_frames files

Reference: https://github.com/kenshohara/3D-ResNets-PyTorch/tree/CVPR2018

Create the following script to generate the n_frames files and put it under the utils directory:

# -*- coding: UTF-8 -*-
'''
@author: mengting gu
@contact: 1065504814@qq.com
@time: 2021/1/7 15:47
@file: n_frames_ucf101_hmdb51.py
@desc: 
'''

from __future__ import print_function, division
import os
import sys

def class_process(dir_path, class_name):
    class_path = os.path.join(dir_path, class_name)
    if not os.path.isdir(class_path):
        return

    for file_name in os.listdir(class_path):
        video_dir_path = os.path.join(class_path, file_name)
        image_indices = []
        for image_file_name in os.listdir(video_dir_path):
            # Frames are named image_XXXXX.jpg; parse the 5-digit index.
            if 'image' not in image_file_name:
                continue
            image_indices.append(int(image_file_name[6:11]))

        if len(image_indices) == 0:
            print('no image files', video_dir_path)
            n_frames = 0
        else:
            # The largest frame index equals the number of frames.
            n_frames = max(image_indices)
            print(video_dir_path, n_frames)
        # Record the count in a file named 'n_frames' inside the video dir.
        with open(os.path.join(video_dir_path, 'n_frames'), 'w') as dst_file:
            dst_file.write(str(n_frames))


if __name__ == "__main__":
    dir_path = sys.argv[1]
    for class_name in os.listdir(dir_path):
        class_process(dir_path, class_name)

Run:

python utils/n_frames_ucf101_hmdb51.py ./data/data/UCF101_jpg
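
Once it finishes, every video directory should contain an n_frames text file holding its frame count. A small sketch to verify the output across the whole dataset:

import os

root = './data/data/UCF101_jpg'
counts = []
for class_name in os.listdir(root):
    class_dir = os.path.join(root, class_name)
    if not os.path.isdir(class_dir):
        continue
    for video_name in os.listdir(class_dir):
        n_frames_path = os.path.join(class_dir, video_name, 'n_frames')
        if os.path.isfile(n_frames_path):
            with open(n_frames_path) as f:
                counts.append(int(f.read().strip()))

print('videos with an n_frames file:', len(counts))
print('min / max frames per video:', min(counts), max(counts))
# UCF101 has 13320 clips in total; a much smaller count here usually
# means some videos failed to decode during the previous step.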

 

(5) Directory layout

After the data is prepared, the directory layout looks like this:

.../
    data/ (root dir)
        data/
            UCF101_jpg/  (jpg files)
        ucf-101/
            UCF-101/ (video files)
            ucfTrainTestlist/ (annotation path)
                classInd.txt
                ucf101_01.json

2. Training and evaluation

Just as image classification starts from ImageNet-pretrained weights, video classification typically starts from models pretrained on Kinetics.

(1) Prepare the pretrained model

Taking ResNet34-VTN as an example, download the corresponding pretrained checkpoint from the repository's model table and place it under the models directory.

 

(2) Directory layout

.../
    data/ (root dir)
        ...
        models/
            resnet_34_vtn_rgbd_kinetics.pth
            se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth

(3) Training

Pretrained model: resnet_34_vtn_rgbd_kinetics.pth

CUDA_VISIBLE_DEVICES="0" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model resnet34_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --pretrain-path ./models/resnet_34_vtn_rgbd_kinetics.pth --video-path ./UCF101_jpg

 

(4) Evaluation

Pretrained model used for evaluation: se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth

CUDA_VISIBLE_DEVICES="2" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model se-resnext101-32x4d_vtn_rgbdiff -b128 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --no-train --no-val --test --pretrain-path ./models/se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth --video-path ./UCF101_jpg

(5) Arguments

Which model and which arguments to use follows the model table in the repository (not reproduced here); each row lists the Model, the Input type, the accuracy, and a Checkpoint download link.

--dataset: ucf101 or kinetics;

--model: determined by the Model and Input columns; for the best-performing 93.44% model this is se-resnext101-32x4d_vtn_rgbdiff;

-b: batch size;

--lr: learning rate;

--pretrain-path: the checkpoint downloaded from the Checkpoint column;

--video-path: set this yourself if the frames are not in the default location "./data/data/utf-101/jpeg2".

If --model and --pretrain-path do not correspond, loading will fail.

Training se_resnext_101_32x4d_vtn_rgbd with b=2 needs roughly 7-8 GB of GPU memory.
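
When a checkpoint refuses to load, it helps to look at what is actually inside it before blaming the model definition. A minimal sketch (the 'state_dict' key is what the inference code in the next section also reads; a 'module.' prefix on parameter names means the checkpoint was saved from a DataParallel-wrapped model):

import torch

ckpt = torch.load('./data/models/resnet_34_vtn_rgbd_kinetics.pth',
                  map_location='cpu')
print(list(ckpt.keys()))          # usually includes 'state_dict'

state_dict = ckpt.get('state_dict', ckpt)
# The first few parameter names should match the layers of the --model
# you pass (resnet34_vtn_* checkpoints will not load into se-resnext101-*).
for name in list(state_dict.keys())[:5]:
    print(name, tuple(state_dict[name].shape))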

3. Single-frame image inference

Save the script below as vtn_jpg_demo.py; it feeds a folder of images to the model one frame at a time and writes annotated copies to the output directory:

import sys
import time
from argparse import ArgumentParser
from collections import deque
from copy import deepcopy

import cv2
import numpy as np
import torch
import torch.nn.functional as F

from action_recognition.model import create_model
from action_recognition.options import add_input_args
from action_recognition.spatial_transforms import (CenterCrop, Compose,
                                                   Normalize, Scale, ToTensor, MEAN_STATISTICS, STD_STATISTICS)
from action_recognition.utils import load_state, generate_args
import os

TEXT_COLOR = (255, 255, 255)
TEXT_FONT_FACE = cv2.FONT_HERSHEY_DUPLEX
TEXT_FONT_SIZE = 1
TEXT_VERTICAL_INTERVAL = 45
NUM_LABELS_TO_DISPLAY = 2


class TorchActionRecognition:
    def __init__(self, encoder, checkpoint_path, num_classes=400, **kwargs):
        # model_type = "{}_vtn".format(encoder)
        model_type = "{}".format(encoder)
        args, _ = generate_args(model=model_type, n_classes=num_classes, layer_norm=False, **kwargs)
        self.args = args
        self.model, _ = create_model(args, model_type)

        self.model = self.model.module
        self.model.eval()
        self.model.cuda()

        checkpoint = torch.load(str(checkpoint_path))
        load_state(self.model, checkpoint['state_dict'])

        self.preprocessing = make_preprocessing(args)
        self.embeds = deque(maxlen=(args.sample_duration * args.temporal_stride))

    def preprocess_frame(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        return self.preprocessing(frame)

    def infer_frame(self, frame):
        embedding = self._infer_embed(self.preprocess_frame(frame))
        self.embeds.append(embedding)
        sequence = self.get_seq()
        return self._infer_logits(sequence)

    def _infer_embed(self, frame):
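        # Run the 2D backbone on a single frame, reduce the channels, and
        # average-pool the 7x7 feature map down to a per-frame embedding.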
        with torch.no_grad():
            frame_tensor = frame.unsqueeze(0).to('cuda')
            tensor = self.model.resnet(frame_tensor)
            tensor = self.model.reduce_conv(tensor)
            embed = F.avg_pool2d(tensor, 7)
        return embed.squeeze(-1).squeeze(-1)

    def _infer_logits(self, embeddings):
        with torch.no_grad():
            ys = self.model.self_attention_decoder(embeddings)
            ys = self.model.fc(ys)
            ys = ys.mean(1)
        return ys.cpu()

    def _infer_seq(self, frame):
        with torch.no_grad():
            result = self.model(frame.view(1, self.args.sample_duration, 3,
                                           self.args.sample_size, self.args.sample_size).to('cuda'))
        return result.cpu()

    def get_seq(self):
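        # Stack the buffered per-frame embeddings, apply the temporal stride,
        # and repeat the sequence if fewer than sample_duration frames have
        # been seen so far (e.g. right after start-up).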
        sequence = torch.stack(tuple(self.embeds), 1)
        if self.args.temporal_stride > 1:
            sequence = sequence[:, ::self.args.temporal_stride, :]

        n = self.args.sample_duration
        if sequence.size(1) < n:
            num_repeats = (n - 1) // sequence.size(1) + 1
            sequence = sequence.repeat(1, num_repeats, 1)[:, :n, :]

        return sequence


def make_preprocessing(args):
    return Compose([
        Scale(args.sample_size),
        CenterCrop(args.sample_size),
        ToTensor(args.norm_value),
        Normalize(MEAN_STATISTICS[args.mean_dataset], STD_STATISTICS[args.mean_dataset])
    ])


def draw_rect(image, bottom_left, top_right, color=(0, 0, 0), alpha=1.):
    xmin, ymin = bottom_left
    xmax, ymax = top_right

    image[ymin:ymax, xmin:xmax, :] = image[ymin:ymax, xmin:xmax, :] * (1 - alpha) + np.asarray(color) * alpha
    return image


def render_frame(frame, probs, labels):
    order = probs.argsort(descending=True)

    status_bar_coordinates = (
        (0, 0),  # top left
        (650, 25 + TEXT_VERTICAL_INTERVAL * NUM_LABELS_TO_DISPLAY)  # bottom right
    )

    draw_rect(frame, status_bar_coordinates[0], status_bar_coordinates[1], alpha=0.5)

    for i, imax in enumerate(order[:NUM_LABELS_TO_DISPLAY]):
        text = '{} - {:.1f}%'.format(labels[imax], probs[imax] * 100)
        text = text.upper().replace("_", " ")
        cv2.putText(frame, text, (15, TEXT_VERTICAL_INTERVAL * (i + 1)), TEXT_FONT_FACE,
                    TEXT_FONT_SIZE, TEXT_COLOR)

    return frame


def run_demo(model, labels, input_path, save_path):

    fps = 30
    tick = time.time()

    for file in sorted(os.listdir(input_path)):
        frame = cv2.imread(os.path.join(input_path, file))
        if frame is None:
            print("Failed to read image: " + os.path.join(input_path, file))
            break
        print("Now processing file : {}".format(file))
    # while video_cap.isOpened():
    #     ok, frame = video_cap.read()
    #
    #     if not ok:
    #         break

        logits = model.infer_frame(frame)
        probs = F.softmax(logits[0], dim=0)
        frame = render_frame(frame, probs, labels)

        # Pacing for the (commented-out) live display below; 'delay' is the
        # waitKey interval needed to hold the nominal fps.
        tock = time.time()
        expected_time = tick + 1 / fps
        delay = 1
        if tock < expected_time:
            delay = max(1, int((expected_time - tock) * 1000))
        tick = tock

        cv2.imwrite(os.path.join(save_path, file), frame)
        # cv2.imshow("demo", frame)
        # key = cv2.waitKey(delay)
        # if key == 27 or key == ord('q'):
        #     break


def main():
    parser = ArgumentParser()
    parser.add_argument("--encoder", help="What encoder to use ", default='resnet34')
    parser.add_argument("--checkpoint", help="Path to pretrained model (.pth) file", required=True)
    parser.add_argument("--input-video", type=str, help="Path to input img or video", required=True)
    parser.add_argument("--save-path", type=str, help="Path to save img", required=True)
    parser.add_argument("--labels", help="Path to labels file (new-line separated file with label names)", type=str,
                        required=True)
    add_input_args(parser)
    args = parser.parse_args()

    with open(args.labels) as fd:
        labels = fd.read().strip().split('\n')

    extra_args = deepcopy(vars(args))
    input_data_params = set(x.dest for x in parser._action_groups[-1]._group_actions)
    for name in list(extra_args.keys()):
        if name not in input_data_params:
            del extra_args[name]

    input_path = args.input_video
    save_path = args.save_path
    try:
        model = TorchActionRecognition(args.encoder, args.checkpoint, num_classes=len(labels), **extra_args)
        # cap = cv2.VideoCapture(args.input_video)
        run_demo(model, labels, input_path, save_path)
    except Exception as error:
        print("An error occurred: " + str(error))

if __name__ == '__main__':
    sys.exit(main())

Prepare a folder of input images and a checkpoint, then run:

CUDA_VISIBLE_DEVICES="0" python vtn_jpg_demo.py --encoder  resnet34_vtn --checkpoint ./data/models/resnet_34_vtn_rgb_ucf101_s1.pth --input-video ./data/ourdata/input --save-path ./data/ourdata/output --labels ./data/ucf-101/ucfTrainTestlist/classInd.txt

The annotated images, with the top predicted labels overlaid, are written to the --save-path directory (result screenshots omitted).

4. Notes

1) All kinds of model-related errors most likely mean the --model name does not match the checkpoint being loaded;

2) If the data is in the right place under data/ but the run still errors out, it is because the code internally prepends data/data to the --video-path;

3) If you get errors about missing n_frames files, see step (4) of section 1;

4) Possible causes of errors when evaluating the pretrained models:

      during data preparation, python utils/preprocess_videos.py defaults to -q=4, the lowest jpg quality setting;

      at inference time, the normalization flags are wrong: --no-mean-norm and --no-std-norm are either missing or added when they should not be.

Note: each of these cost a lot of time to track down, and there are a few more I no longer remember.

 
