Training SlowFast on a custom AVA dataset (with the MMAction2 toolbox)

Preface: an introduction to the SlowFast model

  1. SlowFast is a deep learning model that performs very well on video action recognition. Its core innovation is a two-pathway architecture that handles the spatial and temporal information in video effectively.

  2. The model consists of a Slow pathway and a Fast pathway. The Slow pathway samples the video at a low frame rate (typically one frame every fixed number of frames) and focuses on spatial semantics: object appearance, scene layout, and other relatively stable features. The Fast pathway samples at a high frame rate and captures rapid movements and momentary motion information, but it has a low channel capacity (fewer convolution channels) to keep its computation cheap.

  3. To combine the strengths of both pathways, the model uses lateral connections that fuse the Fast pathway's features into the Slow pathway, so the Slow pathway ends up with rich spatial information plus the dynamic information captured by the Fast pathway.

  4. This design balances computational cost against performance: the model recognizes complex actions efficiently and accurately, achieves strong results on several public benchmarks, and is widely used in video surveillance, sports analysis, smart entertainment, and other areas.

  5. In short, it is a model that can recognize the various actions occurring in a video.

  6. This article took over a month to put together and describes how to use SlowFast in detail. I hope you find it useful!

1. Preparing the dataset videos

  • Prepare at least one video; each video should be 15 minutes long. The AVA dataset uses timestamps 900-1800, i.e. a 15-minute span; if you need a different span, adjust those timestamp parameters accordingly. If your recordings are longer than 15 minutes, trim them first (a sketch follows below).
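One way to trim a recording to the first 15 minutes is with OpenCV, which this tutorial already uses for frame extraction. This is only a minimal sketch; the file names input.mp4 and 1.mp4 are placeholders (not from the original write-up), and a tool such as ffmpeg would do the same job without re-encoding.

import cv2

# Copy the first 15 minutes (900 s) of input.mp4 into 1.mp4
cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("1.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

max_frames = int(fps * 15 * 60)  # number of frames in 15 minutes
written = 0
while written < max_frames:
    ret, frame = cap.read()
    if not ret:
        break
    writer.write(frame)
    written += 1

cap.release()
writer.release()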

2. Extracting frames from the videos

The videos need to be broken into frames in two ways: one frame per second and 30 frames per second.

  • The 1 fps frames are used for annotation

  • The 30 fps frames are used for training

Below is the code for extracting one frame per second; remember to adjust the paths for your setup. Images are named "videoname_framenumber" (the frame number is zero-padded to six digits).

import os
import cv2

count = 0
def extract_frames(video_path, output_folder):
    # Open the video file
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)  # get the video frame rate
    frame_interval = int(fps)  # extract one frame per second
    global count
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    frame_count = 0
    while True:
        ret, frame = cap.read()

        if not ret:
            break

        if frame_count % frame_interval == 0:
            count+=1
            # "5" is the video/folder name; change it for each video
            frame_filename = os.path.join(output_folder,
                                          f"5_{count:06d}.jpg")
            cv2.imwrite(frame_filename, frame)
            print(f"Saved {frame_filename}")

        frame_count += 1

    cap.release()


def process_videos_in_folder(input_folder, output_folder):
    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):  # supported video formats
                video_path = os.path.join(root, file)
                print(f"Processing {video_path}...")
                extract_frames(video_path, output_folder)


if __name__ == "__main__":
    input_folder = r"D:\AI+edu\teacher dataset\5"  # input video folder path
    output_folder = r"D:\AI+edu\data\ava\labelframes\5"  # output image folder path
    process_videos_in_folder(input_folder, output_folder)

After frame extraction, the frames of each video go into a folder named "1", "2", "3", ... and these folders all live under the labelframes folder; the 30 fps frames extracted later all go under the rawframes folder.

Both labelframes and rawframes sit under the ava folder.

 

Below is the code for extracting 30 frames per second; again, adjust the paths for your setup. Images are named "videoname_framenumber" (the frame number is zero-padded to six digits).

import os
import cv2
count = 1
def extract_frames(video_path, output_folder, fps=30):
    # Open the video file
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Cannot open video file: {video_path}")
        return

    # Get the video frame rate
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    if video_fps <= 0:
        print(f"Video file {video_path} has an invalid frame rate; falling back to 30 fps")
        video_fps = 30  # default frame rate
    frame_interval = int(video_fps / fps)  # interval between saved frames
    if frame_interval <= 0:
        frame_interval = 1  # make sure frame_interval is at least 1

    frame_count = 0
    saved_frame_count = 0
    global count
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Save one frame every frame_interval frames
        if frame_count % frame_interval == 0:
            saved_frame_count += 1
            # "5" is the video/folder name; change it for each video
            frame_name = f"5_{count:06d}.jpg"
            frame_path = os.path.join(output_folder, frame_name)
            cv2.imwrite(frame_path, frame)
            print(f"Saved frame: {frame_path}")
            count += 1

        frame_count += 1

    cap.release()
    print(f"Frame extraction finished; saved {saved_frame_count} frames")

def process_folder(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for filename in os.listdir(input_folder):
        if filename.endswith(".mp4") or filename.endswith(".avi"):  # common video formats
            video_path = os.path.join(input_folder, filename)
            extract_frames(video_path, output_folder)

if __name__ == "__main__":
    input_folder = r"D:\AI+edu\teacher dataset\5"  # replace with your video folder path
    output_folder = r"D:\AI+edu\data\ava\rawframes\5"  # replace with the folder for the extracted frames
    process_folder(input_folder, output_folder)

The file structure looks like the following:
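(This tree is reconstructed from the description above; the original post showed it as a screenshot. The numbered folders are examples, one per video.)

ava
├── labelframes
│   ├── 1
│   │   ├── 1_000001.jpg
│   │   └── ...
│   └── 5
│       ├── 5_000001.jpg
│       └── ...
└── rawframes
    ├── 1
    │   ├── 1_000001.jpg
    │   └── ...
    └── 5
        ├── 5_000001.jpg
        └── ...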

3. Annotating the dataset

Annotation is done with VIA (I used version via-3.0.11). The steps are as follows:

  1. Download VIA from https://gitlab.com/vgg/via/tree/master

  2. After downloading, open the folder and open via_image_annotator.html in a browser

  3. On the page, click the plus icon and import all of the images under labelframes

  4. Click the attribute icon to create a new attribute

  5. For anchor choose the second option and for input type choose checkbox; under options define the four actions stand,sit,talk to,listen separated by English commas, then tick all four actions in the preview

  6. Start annotating: draw a box around each person in the image, click the box, and tick the actions that person is performing

  7. When all images are annotated, click the export icon:

Keep the default options and click "Export" to export the CSV file. Note: it is best not to open or edit this CSV file with Excel!

You will then have a single CSV file.

 

4. Converting the VIA annotations to the SlowFast (AVA) format

  • SlowFast expects the dataset in AVA format, together with proposal pkl files. The following Python script generates all of the required annotation files in one go (note that the AVA version used here is 2.1).

"""
Theme:ava format data transformer
author:Hongbo Jiang
time:2022/3/14/1:51:51
description:
    
    This is a data format converter. Following mmaction2's AVA data format rules,
    it converts an annotated video-understanding CSV file exported from
    https://www.robots.ox.ac.uk/~vgg/software/via/app/via_video_annotator.html
    into the data format that mmaction2 expects.
    Conversion rules:
        # AVA Annotation Explained
        In this section, we explain the annotation format of AVA in details:
        ```
        mmaction2
        ├── data
        │   ├── ava
        │   │   ├── annotations
        │   │   |   ├── ava_dense_proposals_train.FAIR.recall_93.9.pkl
        │   │   |   ├── ava_dense_proposals_val.FAIR.recall_93.9.pkl
        │   │   |   ├── ava_dense_proposals_test.FAIR.recall_93.9.pkl
        │   │   |   ├── ava_train_v2.1.csv
        │   │   |   ├── ava_val_v2.1.csv
        │   │   |   ├── ava_train_excluded_timestamps_v2.1.csv
        │   │   |   ├── ava_val_excluded_timestamps_v2.1.csv
        │   │   |   ├── ava_action_list_v2.1.pbtxt
        ```
        ## The proposals generated by human detectors
        In the annotation folder, ava_dense_proposals_[train/val/test].FAIR.recall_93.9.pkl are human proposals generated by a human detector. They are used in training, validation and testing respectively. Take ava_dense_proposals_train.FAIR.recall_93.9.pkl as an example. It is a dictionary of size 203626. The key consists of the videoID and the timestamp. For example, the key -5KQ66BBWC4,0902 means the values are the detection results for the frame at the $$902_{nd}$$ second in the video -5KQ66BBWC4. The values in the dictionary are numpy arrays with shape $$N \times 5$$ , $$N$$ is the number of detected human bounding boxes in the corresponding frame. The format of bounding box is $$[x_1, y_1, x_2, y_2, score], 0 \le x_1, y_1, x_2, w_2, score \le 1$$. $$(x_1, y_1)$$ indicates the top-left corner of the bounding box, $$(x_2, y_2)$$ indicates the bottom-right corner of the bounding box; $$(0, 0)$$ indicates the top-left corner of the image, while $$(1, 1)$$ indicates the bottom-right corner of the image.
        ## The ground-truth labels for spatio-temporal action detection
        In the annotation folder, ava_[train/val]_v[2.1/2.2].csv are ground-truth labels for spatio-temporal action detection, which are used during training & validation. Take ava_train_v2.1.csv as an example, it is a csv file with 837318 lines, each line is the annotation for a human instance in one frame. For example, the first line in ava_train_v2.1.csv is '-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1': the first two items -5KQ66BBWC4 and 0902 indicate that it corresponds to the $$902_{nd}$$ second in the video -5KQ66BBWC4. The next four items ($$[0.077(x_1), 0.151(y_1), 0.283(x_2), 0.811(y_2)]$$) indicates the location of the bounding box, the bbox format is the same as human proposals. The next item 80 is the action label. The last item 1 is the ID of this bounding box.
        ## Excluded timestamps
        ava_[train/val]_excludes_timestamps_v[2.1/2.2].csv contains excluded timestamps which are not used during training or validation. The format is video_id, second_idx .
        ## Label map
        ava_action_list_v[2.1/2.2]_for_activitynet_[2018/2019].pbtxt contains the label map of the AVA dataset, which maps the action name to the label index.
"""
 
import csv
import os
import pickle

import numpy as np
import cv2

def transformer(origin_csv_path, frame_image_dir,
                train_output_pkl_path, train_output_csv_path,
                valid_output_pkl_path, valid_output_csv_path,
                exclude_train_output_csv_path, exclude_valid_output_csv_path,
                out_action_list, out_labelmap_path, dataset_percent=0.8):
    """
    Inputs:
    origin_csv_path: path to the CSV file exported from the VIA website.
    frame_image_dir: directory of images named "videoname_second.jpg", read out at one frame per second.
    output_pkl_path: output path of the pkl proposal file
    output_csv_path: output path of the csv annotation file
    out_labelmap_path: output path of the labelmap.txt file
    dataset_percent: train/validation split ratio

    Returns: nothing
    
    """
 
    # -----------------------------------------------------------------------------------------------
    get_label_map(origin_csv_path, out_action_list, out_labelmap_path)
    # -----------------------------------------------------------------------------------------------
    information_array = [[], [], []]
    # Read the bounding-box information section of the input csv file
    with open(origin_csv_path, 'r') as csvfile:
        count = 0
        content = csv.reader(csvfile)
        for line in content:
            if count >= 10:
                try:
                    frame_image_name = eval(line[1])[0]  # str: image file name
                    location_info = eval(line[4])[1:]  # list: [x, y, w, h]
                    action_list_str = line[5]
                    if action_list_str and action_list_str != 'None':
                        action_list = list(eval(action_list_str).values())[0].split(',')
                        action_list = [int(x) for x in action_list]
                    else:
                        action_list = []
                except Exception as e:
                    print(f"Error processing line {line}: {e}")
                    continue

                information_array[0].append(frame_image_name)
                information_array[1].append(location_info)
                information_array[2].append(action_list)
            count += 1
    # Gather the frame image names, box locations and action labels into one array
    information_array = np.array(information_array, dtype=object).transpose()
    information_array = np.array(information_array)
    # -----------------------------------------------------------------------------------------------
    num_train = int(dataset_percent * len(information_array))
    train_info_array = information_array[:num_train]
    valid_info_array = information_array[num_train:]
    get_pkl_csv(train_info_array, train_output_pkl_path, train_output_csv_path, exclude_train_output_csv_path, frame_image_dir)
    get_pkl_csv(valid_info_array, valid_output_pkl_path, valid_output_csv_path, exclude_valid_output_csv_path, frame_image_dir)
 
 
def get_label_map(origin_csv_path, out_action_list, out_labelmap_path):
    classes_list = 0
    classes_content = ""
    labelmap_strings = ""
    # Extract the action class definitions from line 9 of the csv
    with open(origin_csv_path, 'r') as csvfile:
        count = 0
        content = csv.reader(csvfile)
        for line in content:
            if count == 8:
                classes_list = line
                break
            count += 1
    # Slice out the class dictionary segment
    st = 0
    ed = 0
    for i in range(len(classes_list)):
        if classes_list[i].startswith('options'):
            st = i
        if classes_list[i].startswith('default_option_id'):
            ed = i
    for i in range(st, ed):
        if i == st:
            classes_content = classes_content + classes_list[i][len('options:'):] + ','
        else:
            classes_content = classes_content + classes_list[i] + ','
    classes_dict = eval(classes_content)[0]
    # Write the action_list (.pbtxt) and labelmap files
    with open(out_action_list, 'w') as f:  # write the action_list file
        for v, k in classes_dict.items():
            labelmap_strings = labelmap_strings + "label {{\n  name: \"{}\"\n  label_id: {}\n  label_type: PERSON_MOVEMENT\n}}\n".format(k, int(v)+1)
        f.write(labelmap_strings)
    labelmap_strings = ""
    with open(out_labelmap_path, 'w') as f:  # write the label_map file
        for v, k in classes_dict.items():
            labelmap_strings = labelmap_strings + "{}: {}\n".format(int(v)+1, k)
        f.write(labelmap_strings)
 
 
def get_pkl_csv(information_array, output_pkl_path, output_csv_path, exclude_output_csv_path, frame_image_dir):
    # Initialise the dictionaries before iterating
    pkl_data = dict()  # dict of pkl key/value pairs (values are plain lists)
    csv_data = []  # 2d list to be written out as the csv file
    read_data = {}  # dict of pkl key/value pairs with values converted to numpy arrays
 
    for i in range(len(information_array)):
        img_name = information_array[i][0]
        # -------------------------------------------------------------------------------------------
        video_name, frame_name = '_'.join(img_name.split('_')[:-1]), format(int(img_name.split('_')[-1][:-4]), '04d')  # my naming format is "videoname_framenumber"; change this if yours differs
        # -------------------------------------------------------------------------------------------
        pkl_key = video_name + ',' + frame_name
        pkl_data[pkl_key] = []
    # Iterate over all images, read their info and fill in the pkl data
    for i in range(len(information_array)):
        img_name = information_array[i][0]
        # -------------------------------------------------------------------------------------------
        video_name, frame_name = '_'.join(img_name.split('_')[:-1]), str(int(img_name.split('_')[-1][:-4]))  # my naming format is "videoname_framenumber"; change this if yours differs
        # -------------------------------------------------------------------------------------------
        imgpath = frame_image_dir + '/' + img_name  # assumes the images sit directly under frame_image_dir; add the per-video subfolder if yours are nested
        location_list = information_array[i][1]
        action_info = information_array[i][2]
        image_array = cv2.imread(imgpath)
        h, w = image_array.shape[:2]
        # Normalise the coordinates
        location_list[0] /= w
        location_list[1] /= h
        location_list[2] /= w
        location_list[3] /= h
        location_list[2] = location_list[2]+location_list[0]
        location_list[3] = location_list[3]+location_list[1]
        # The confidence score is set to 1
        # Assemble the pkl data
 
        for kind_idx in action_info:
            csv_info = [video_name, frame_name, *location_list, kind_idx+1, 1]
            csv_data.append(csv_info)
 
        location_list = location_list + [1]  # append the confidence score 1 -> [x1, y1, x2, y2, score]
        pkl_key = video_name + ',' + format(int(frame_name), '04d')
        pkl_value = location_list
        pkl_data[pkl_key].append(pkl_value)
 
    for k, v in pkl_data.items():
        read_data[k] = np.array(v)
 
    with open(output_pkl_path, 'wb') as f:  # write the pkl file
        pickle.dump(read_data, f)
 
    with open(output_csv_path, 'w', newline='') as f:  # write the csv file; newline='' avoids blank lines between rows
        f_csv = csv.writer(f)
        f_csv.writerows(csv_data)
 
    with open(exclude_output_csv_path, 'w', newline='') as f:  # write the (empty) excluded-timestamps csv file
        f_csv = csv.writer(f)
        f_csv.writerows([])
 
def showpkl(pkl_path):
    with open(pkl_path, 'rb') as f:
        content = pickle.load(f)
    return content
 
 
def showcsv(csv_path):
    output = []
    with open(csv_path, 'r') as f:
        content = csv.reader(f)
        for line in content:
            output.append(line)
    return output
 
 
def showlabelmap(labelmap_path):
    classes_dict = dict()
    with open(labelmap_path, 'r') as f:
        content = (f.read().split('\n'))[:-1]
        for item in content:
            mid_idx = -1
            for i in range(len(item)):
                if item[i] == ":":
                    mid_idx = i
            classes_dict[item[:mid_idx]] = item[mid_idx + 1:]
    return classes_dict
 
 
os.makedirs('./ava/annotations', exist_ok=True)
transformer("./Unnamed-VIA Project13Jan2025_20h10m01s_export.csv", './ava/labelframes',
            './ava/annotations/ava_dense_proposals_train.FAIR.recall_93.9.pkl', './ava/annotations/ava_train_v2.1.csv',
            './ava/annotations/ava_dense_proposals_val.FAIR.recall_93.9.pkl', './ava/annotations/ava_val_v2.1.csv',
            './ava/annotations/ava_train_excluded_timestamps_v2.1.csv', './ava/annotations/ava_val_excluded_timestamps_v2.1.csv',
            './ava/annotations/ava_action_list_v2.1.pbtxt', './ava/annotations/labelmap.txt', 0.9)
print(showpkl('./ava/annotations/ava_dense_proposals_train.FAIR.recall_93.9.pkl'))
print(showcsv('././ava/annotations/ava_train_v2.1.csv'))
print(showlabelmap('././ava/annotations/labelmap.txt'))

The dataset_percent parameter controls the train/validation split and can be changed as needed (the driver call above uses 0.9).

 

5. Deploying the SlowFast environment

This uses the SlowFast model from MMAction2 (adapted by 杨帆老师, whose mmaction2_YF fork is cloned below). I trained on a GPU rented from AutoDL; when creating the instance, pick a PyTorch 1.8 / CUDA 11.1 image to match the mmcv-full build installed below.

  • The deployment commands on AutoDL are as follows:

    git clone https://gitee.com/YFwinston/mmaction2_YF.git
    pip install mmcv-full==1.3.17 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
    pip install opencv-python-headless==4.1.2.30
    pip install moviepy
    cd mmaction2_YF
    pip install -r requirements/build.txt
    pip install -v -e .
    mkdir -p ./data/ava
    cd ..
    git clone https://gitee.com/YFwinston/mmdetection.git
    cd mmdetection
    pip install -r requirements/build.txt
    pip install -v -e .
    cd ../mmaction2_YF
    wget https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth -P ./Checkpionts/mmdetection/
    wget https://download.openmmlab.com/mmaction/recognition/slowfast/slowfast_r50_8x8x1_256e_kinetics400_rgb/slowfast_r50_8x8x1_256e_kinetics400_rgb_20200716-73547d2b.pth -P ./Checkpionts/mmaction/

     

  • Configuration file

Create my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb.py under /mmaction2_YF/configs/detection/ava/ with the following content:

# model setting
model = dict(
    type='FastRCNN',
    backbone=dict(
        type='ResNet3dSlowFast',
        pretrained=None,
        resample_rate=8,
        speed_ratio=8,
        channel_ratio=8,
        slow_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=True,
            conv1_kernel=(1, 7, 7),
            dilations=(1, 1, 1, 1),
            conv1_stride_t=1,
            pool1_stride_t=1,
            inflate=(0, 0, 1, 1),
            spatial_strides=(1, 2, 2, 1)),
        fast_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=False,
            base_channels=8,
            conv1_kernel=(5, 7, 7),
            conv1_stride_t=1,
            pool1_stride_t=1,
            spatial_strides=(1, 2, 2, 1))),
    roi_head=dict(
        type='AVARoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor3D',
            roi_layer_type='RoIAlign',
            output_size=8,
            with_temporal_pool=True),
        bbox_head=dict(
            type='BBoxHeadAVA',
            in_channels=2304,
            num_classes=81,  # set this to (number of actions + 1); see the notes after the config
            multilabel=True,
            dropout_ratio=0.5)),
    train_cfg=dict(
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssignerAVA',
                pos_iou_thr=0.9,
                neg_iou_thr=0.9,
                min_pos_iou=0.9),
            sampler=dict(
                type='RandomSampler',
                num=32,
                pos_fraction=1,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=1.0,
            debug=False)),
    test_cfg=dict(rcnn=dict(action_thr=0.002)))

dataset_type = 'AVADataset'
data_root = '/home/Custom-ava-dataset_Custom-Spatio-Temporally-Action-Video-Dataset/Dataset/rawframes'
anno_root = '/home/Custom-ava-dataset_Custom-Spatio-Temporally-Action-Video-Dataset/Dataset/annotations'


#ann_file_train = f'{anno_root}/ava_train_v2.1.csv'
ann_file_train = f'{anno_root}/train.csv'
#ann_file_val = f'{anno_root}/ava_val_v2.1.csv'
ann_file_val = f'{anno_root}/val.csv'

#exclude_file_train = f'{anno_root}/ava_train_excluded_timestamps_v2.1.csv'
#exclude_file_val = f'{anno_root}/ava_val_excluded_timestamps_v2.1.csv'

exclude_file_train = f'{anno_root}/train_excluded_timestamps.csv'
exclude_file_val = f'{anno_root}/val_excluded_timestamps.csv'

#label_file = f'{anno_root}/ava_action_list_v2.1_for_activitynet_2018.pbtxt'
label_file = f'{anno_root}/action_list.pbtxt'

proposal_file_train = (f'{anno_root}/dense_proposals_train.pkl')
proposal_file_val = f'{anno_root}/dense_proposals_val.pkl'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)

train_pipeline = [
    dict(type='SampleAVAFrames', clip_len=32, frame_interval=2),
    dict(type='RawFrameDecode'),
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=256),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW', collapse=True),
    # Rename is needed to use mmdet detectors
    dict(type='Rename', mapping=dict(imgs='img')),
    dict(type='ToTensor', keys=['img', 'proposals', 'gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key=['proposals', 'gt_bboxes', 'gt_labels'], stack=False)
        ]),
    dict(
        type='Collect',
        keys=['img', 'proposals', 'gt_bboxes', 'gt_labels'],
        meta_keys=['scores', 'entity_ids'])
]
# The testing is w/o. any cropping / flipping
val_pipeline = [
    dict(type='SampleAVAFrames', clip_len=32, frame_interval=2),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW', collapse=True),
    # Rename is needed to use mmdet detectors
    dict(type='Rename', mapping=dict(imgs='img')),
    dict(type='ToTensor', keys=['img', 'proposals']),
    dict(type='ToDataContainer', fields=[dict(key='proposals', stack=False)]),
    dict(
        type='Collect',
        keys=['img', 'proposals'],
        meta_keys=['scores', 'img_shape'],
        nested=True)
]

data = dict(
    #videos_per_gpu=9,
    #workers_per_gpu=2,
    videos_per_gpu=5,
    workers_per_gpu=2,
    val_dataloader=dict(videos_per_gpu=1),
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        exclude_file=exclude_file_train,
        pipeline=train_pipeline,
        label_file=label_file,
        proposal_file=proposal_file_train,
        person_det_score_thr=0.9,
        data_prefix=data_root,
        start_index=1,),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        exclude_file=exclude_file_val,
        pipeline=val_pipeline,
        label_file=label_file,
        proposal_file=proposal_file_val,
        person_det_score_thr=0.9,
        data_prefix=data_root,
        start_index=1,))
data['test'] = data['val']

#optimizer = dict(type='SGD', lr=0.1125, momentum=0.9, weight_decay=0.00001)
optimizer = dict(type='SGD', lr=0.0125, momentum=0.9, weight_decay=0.00001)
# this lr is used for 8 gpus

optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy

lr_config = dict(
    policy='step',
    step=[10, 15],
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=5,
    warmup_ratio=0.1)
#total_epochs = 20
total_epochs = 100
checkpoint_config = dict(interval=1)
workflow = [('train', 1)]
evaluation = dict(interval=1, save_best='mAP@0.5IOU')
log_config = dict(
    interval=20, hooks=[
        dict(type='TextLoggerHook'),
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = ('./work_dirs/ava/'
            'slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb')
load_from = ('https://download.openmmlab.com/mmaction/recognition/slowfast/'
             'slowfast_r50_4x16x1_256e_kinetics400_rgb/'
             'slowfast_r50_4x16x1_256e_kinetics400_rgb_20200704-bcde7ed7.pth')
resume_from = None
find_unused_parameters = False


Pay attention to the following four points in the config above:

  • The file paths (data_root, anno_root and the annotation file names must match your setup)

  • num_classes must equal the number of actions + 1, and it has to be changed in several places (see the sketch at the end of this list)

num_classes also has to be added inside the data dict (train and val), because the dataset default is 80; if you leave it out, training fails with a dimension mismatch.

  • The file names under annotations need to be changed

Rename the generated annotation files (or edit the paths in the config) so that they match the names the config refers to, e.g. train.csv, val.csv, action_list.pbtxt and dense_proposals_train.pkl.

  • The number of training epochs

Set total_epochs to a suitable value (it is set to 100 in this config).
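As a reference for the num_classes point above, here is a minimal sketch for the four actions defined earlier (stand, sit, talk to, listen), i.e. num_classes = 4 + 1 = 5. It only illustrates where the value goes; the surrounding keys mirror the config above and it is not meant as a drop-in patch.

# in the model section:
bbox_head=dict(
    type='BBoxHeadAVA',
    in_channels=2304,
    num_classes=5,  # 4 custom actions + 1
    multilabel=True,
    dropout_ratio=0.5)

# inside the data dict, for both train and val (test copies val):
train=dict(
    type=dataset_type,
    # ... other keys unchanged ...
    num_classes=5),
val=dict(
    type=dataset_type,
    # ... other keys unchanged ...
    num_classes=5)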

 

6. Common problems and solutions

  1. OpenCV version conflicts

Run pip list in the terminal and check the three OpenCV packages (opencv-python, opencv-contrib-python and opencv-python-headless); their versions need to be unified to 4.1.2.30 (opencv-python-headless was already pinned to that version during setup):

pip install opencv-contrib-python==4.1.2.30 
pip install opencv-python==4.1.2.30
  2. yapf version problem

If the installed yapf version is too new, downgrading it fixes the error:

pip install yapf==0.31.0
  3. Renaming the rawframes images

The names of the extracted frames do not match what training expects, so the images need to be renamed. For example: original name: rawframes/1/1_000001.jpg, target name: rawframes/1/img_00001.jpg

The following code renames the images; remember to adjust the path.

import os
count = 1
def rename_images(folder_path, prefix="image_"):
    # Get the list of all files in the folder
    files = os.listdir(folder_path)
    # Keep only image files (add more extensions if needed)
    image_extensions = ['.jpg', '.jpeg', '.png', '.gif']
    image_files = [f for f in files if any(f.lower().endswith(ext) for ext in image_extensions)]
    image_files.sort()  # sort by name so the frames keep their original order
    global count
    # Rename the images in order
    for i, image in enumerate(image_files):
        # Get the file extension
        file_extension = os.path.splitext(image)[1]
        # Build the new file name
        new_name = f"img_{count:05d}.jpg"
        count += 1
        # Full path of the old file
        old_path = os.path.join(folder_path, image)
        # Full path of the new file
        new_path = os.path.join(folder_path, new_name)
        # Rename the file
        os.rename(old_path, new_path)
        print(f"Renamed {image} to {new_name}")

if __name__ == "__main__":
    # Folder to process
    folder_path = "mmaction2_YF/data/ava/rawframes/5"
    # Call the rename function
    rename_images(folder_path)
  4. Out of GPU memory: RuntimeError: CUDA out of memory.

Adding the following code to train.py (to release cached memory) can help:

import gc
import torch

gc.collect()
torch.cuda.empty_cache()

 

7. Training

Run the following:

cd mmaction2_YF
python tools/train.py configs/detection/ava/my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb.py --validate

8. Testing

Run the following:

cd mmaction2_YF
python demo/demo_spatiotemporal_det.py --config configs/detection/ava/my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb.py --checkpoint work_dirs/ava/slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb/best_mAP@0.5IOU_epoch_10.pth --det-config demo/faster_rcnn_r50_fpn_2x_coco.py  --det-checkpoint Checkpionts/mmdetection/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth   --video data/1.mp4  --out-filename demo/det_1.mp4   --det-score-thr 0.5 --action-score-thr 0.5 --output-stepsize 4  --output-fps 6 --label-map tools/data/ava/label_map.txt
  • Here best_mAP@0.5IOU_epoch_10.pth is the checkpoint produced by training; it is saved under mmaction2_YF/work_dirs/ava/slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb.

  • The test video goes in the data folder and is named 1.mp4; the annotated result is written to the demo folder.

 

This article drew on the following reference posts:

Custom AVA dataset: building, training and testing a spatio-temporal action video dataset (complete guide) with yolov5, deep sort, VIA, MMAction, SlowFast - CSDN blog

Training SlowFast on your own dataset - CSDN blog

[mmaction2] Problems encountered while processing the AVA dataset - CSDN blog

 
