Pedestrian Action Detection Based on YOLOv8 and MMAction2

Aobu＆

Last modified: 2024-08-01 14:49:18

Tags: YOLO, object detection, computer vision

First published: 2024-07-30 15:28:56

Article link: https://blog.csdn.net/qq_54534211/article/details/140769565

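Introduction

Pedestrian action detection is a key technical challenge in intelligent video surveillance and behavior analysis. With the rapid progress of deep learning, video-based action recognition has advanced considerably. This post describes a pedestrian action detection system that combines YOLOv8 object detection with a temporal action recognition model from MMAction2. The system detects the pedestrians in a video and recognizes what each of them is doing.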

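MMAction2 Installation and Environment Setup

For installing mmpose, see:

open-mmlab / mmpose installation and usage tutorial (CSDN blog)

For installing mmdetection, see:

Get Started (MMDetection 3.3.0 documentation)

If the installation fails with ModuleNotFoundError: No module named 'setuptools.command.build',

upgrade the packaging tools with the following command: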

python -m pip install --upgrade pip setuptools wheel

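Choosing a Pretrained Model

The dataset I collected for these behaviors is small, only about twenty mp4 clips of five or six seconds each, so I chose the Temporal Shift Module (TSM) as my pretrained model. TSM is a network design for video understanding: it shifts part of the feature channels along the time dimension, which lets a 2D backbone capture motion information while staying efficient to train. This matters for a small dataset, because an efficient pretrained model tends to generalize better when little data is available.

TSM can handle clips of different lengths, and its temporal shift module makes the features more sensitive to temporal information. This makes it a good fit as a pretrained model for small datasets: the feature representations learned during pretraining can be carried over to the specific action detection task through transfer learning.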
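The temporal shift itself is simple enough to sketch in a few lines. The following is a minimal, dependency-free illustration of the idea, not the MMAction2 implementation: it assumes per-frame feature vectors stored as plain lists, and the function name temporal_shift is made up for this example. With shift_div=8, one eighth of the channels is taken from the next frame, one eighth from the previous frame, and the rest stays in place; positions with no neighboring frame are zero-filled.

```python
def temporal_shift(frames, shift_div=8):
    """Shift a fraction of the channels along the time axis.

    frames: list of T per-frame feature vectors, each a list of C floats.
    Returns a new T x C nested list; the input is not modified.
    """
    t, c = len(frames), len(frames[0])
    fold = c // shift_div  # number of channels shifted in each direction
    out = [[0.0] * c for _ in range(t)]
    for i in range(t):
        for ch in range(c):
            if ch < fold:
                src = i + 1      # first group: read from the next frame
            elif ch < 2 * fold:
                src = i - 1      # second group: read from the previous frame
            else:
                src = i          # remaining channels are left untouched
            if 0 <= src < t:     # out-of-range neighbors stay zero
                out[i][ch] = frames[src][ch]
    return out


# Toy input: 3 frames, 8 channels, value = 10 * frame_index + channel_index
frames = [[float(10 * i + ch) for ch in range(8)] for i in range(3)]
shifted = temporal_shift(frames, shift_div=8)
print(shifted[1])
```

After the shift, each frame's feature vector mixes information from its temporal neighbors, which is how a 2D network gets access to motion cues without any 3D convolution.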

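If the project involves complex human motion analysis and the dataset allows it, a skeleton-based model such as PoseC3D may be a better fit. If the dataset is larger or compute resources are plentiful, 3D-convolution models such as I3D are also worth considering, since they can learn richer spatiotemporal features from video.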

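Dataset and Preprocessing

To train and validate the pedestrian action detection model, I cut a little over twenty key action clips from about 4 GB of video footage. Each clip shows exactly one behavior and is named in the form "action_index.mp4" to keep the naming consistent, as shown in the figure below.

There are two behaviors to recognize: cross and walk.

Because the dataset is small, I split the videos by hand into a training set of 17 mp4 files and a validation set of 6 mp4 files.

The label file format is as follows, using train.txt as an example:

The dataset directory structure is shown in the figure above.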
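For reference, VideoDataset in MMAction2 reads a plain-text annotation file with one video per line: the file path relative to the data prefix, a space, and an integer class index. Below is a minimal sketch of generating those lines from the "action_index.mp4" naming scheme. It assumes walk maps to 0 and cross maps to 1, which is the mapping implied by the inference results reported later in this post; the helper name and the sample file names are illustrative only.

```python
import os

# Assumed class indices: walk -> 0, cross -> 1
LABELS = {"walk": 0, "cross": 1}


def annotation_lines(filenames, labels=LABELS):
    """Build 'filename label' lines for files named '<action>_<index>.mp4'."""
    lines = []
    for name in sorted(filenames):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".mp4":
            continue  # ignore anything that is not a video clip
        action = stem.rsplit("_", 1)[0]  # text before the last underscore
        lines.append(f"{name} {labels[action]}")
    return lines


# Hypothetical directory listing
sample = ["walk_8.mp4", "cross_1.mp4", "notes.txt"]
lines = annotation_lines(sample)
print("\n".join(lines))
```

Writing the returned lines to train.txt and val.txt (one list per split) gives files in the layout the config below expects.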

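Finally, following the official documentation, modify the config file under F:\mmaction2\configs\recognition\tsm.

I based my changes on the tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py config file.

This is the config file I ended up training with: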

# Dataset paths, changed to point at the custom dataset's annotation files and video files.
ann_file_train = 'data/walk_cross/train.txt'
ann_file_val = 'data/walk_cross/val.txt'
# Disable automatic learning-rate scaling and set the base batch size.
auto_scale_lr = dict(base_batch_size=128, enable=False)
# Dataset root directories, pointing at the folders that hold the training and validation videos.
data_root = 'data/walk_cross/train'
data_root_val = 'data/walk_cross/val'
# Use the VideoDataset dataset type, which reads raw video files.
dataset_type = 'VideoDataset'
# Hooks used during training: checkpoint saving, logging, and so on.
default_hooks = dict(
    checkpoint=dict(
        interval=3, max_keep_ckpts=3, save_best='auto', type='CheckpointHook'),
    logger=dict(ignore_last=False, interval=20, type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    runtime_info=dict(type='RuntimeInfoHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    sync_buffers=dict(type='SyncBuffersHook'),
    timer=dict(type='IterTimerHook'))
default_scope = 'mmaction'
# Environment settings, such as the cuDNN benchmark flag and the distributed training backend.
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
# File client arguments; io_backend='disk' loads data from local disk.
file_client_args = dict(io_backend='disk')
# launcher='none' means distributed training is not used.
launcher = 'none'
# Checkpoint to load weights from; here it points to a previously trained ResNetTSM checkpoint.
load_from = 'work_dirs/tsm_r50_8xb16_u48_240e/best_acc_top1_epoch_27.pth'
log_level = 'INFO'
log_processor = dict(by_epoch=True, type='LogProcessor', window_size=20)
# Model definition: ResNetTSM backbone and TSMHead classification head.
model = dict(
    backbone=dict(
        depth=50,
        norm_eval=False,
        pretrained=
        '/var/www/ai/mmaction2-main/configs/_base_/models/resnet50-0676ba61.pth',
        shift_div=8,
        type='ResNetTSM'),
    cls_head=dict(
        average_clips='prob',
        consensus=dict(dim=1, type='AvgConsensus'),
        dropout_ratio=0.5,
        in_channels=2048,
        init_std=0.001,
        is_shift=True,
        num_classes=2,
        spatial_type='avg',
        type='TSMHead'),
    data_preprocessor=dict(
        mean=[
            123.675,
            116.28,
            103.53,
        ],
        std=[
            58.395,
            57.12,
            57.375,
        ],
        type='ActionDataPreprocessor'),
    test_cfg=None,
    train_cfg=None,
    type='Recognizer2D')
# Optimizer wrapper: constructor and parameters such as learning rate, momentum, and weight decay.
optim_wrapper = dict(
    clip_grad=dict(max_norm=20, norm_type=2),
    constructor='TSMOptimWrapperConstructor',
    optimizer=dict(lr=0.02, momentum=0.9, type='SGD', weight_decay=0.0001),
    paramwise_cfg=dict(fc_lr5=True))
# Parameter schedulers: linear learning-rate warmup followed by multi-step decay.
param_scheduler = [
    dict(begin=0, by_epoch=True, end=5, start_factor=0.1, type='LinearLR'),
    dict(
        begin=0,
        by_epoch=True,
        end=50,
        gamma=0.1,
        milestones=[
            25,
            45,
        ],
        type='MultiStepLR'),
]
# Preprocessing settings: per-channel mean and standard deviation.
preprocess_cfg = dict(
    mean=[
        123.675,
        116.28,
        103.53,
    ], std=[
        58.395,
        57.12,
        57.375,
    ])
# Whether to resume training from a previous checkpoint.
resume = False
# Test settings; TestLoop is used as the test loop.
test_cfg = dict(type='TestLoop')
# Test dataloader: batch size, dataset settings, and so on.
test_dataloader = dict(
    batch_size=1,
    dataset=dict(
        ann_file='data/walk_cross/val.txt',
        data_prefix=dict(video='data/walk_cross/val'),
        pipeline=[
            dict(io_backend='disk', type='DecordInit'),
            dict(
                clip_len=1,
                frame_interval=1,
                num_clips=8,
                test_mode=True,
                type='SampleFrames'),
            dict(type='DecordDecode'),
            dict(scale=(
                -1,
                256,
            ), type='Resize'),
            dict(crop_size=224, type='TenCrop'),
            dict(input_format='NCHW', type='FormatShape'),
            dict(type='PackActionInputs'),
        ],
        test_mode=True,
        type='VideoDataset'),
    num_workers=8,
    persistent_workers=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
# Use the accuracy metric as the test evaluator.
test_evaluator = dict(type='AccMetric')
# Test pipeline: video decoding, TenCrop cropping, input formatting, and so on.
test_pipeline = [
    dict(io_backend='disk', type='DecordInit'),
    dict(
        clip_len=1,
        frame_interval=1,
        num_clips=8,
        test_mode=True,
        type='SampleFrames'),
    dict(type='DecordDecode'),
    dict(scale=(
        -1,
        256,
    ), type='Resize'),
    dict(crop_size=224, type='TenCrop'),
    dict(input_format='NCHW', type='FormatShape'),
    dict(type='PackActionInputs'),
]
# Training settings: maximum number of epochs, the epoch at which validation starts, and the validation interval.
train_cfg = dict(
    max_epochs=50, type='EpochBasedTrainLoop', val_begin=1, val_interval=1)
# Training dataloader: batch size, dataset settings, and so on.
train_dataloader = dict(
    batch_size=16,
    dataset=dict(
        ann_file='data/walk_cross/train.txt',
        data_prefix=dict(video='data/walk_cross/train'),
        pipeline=[
            dict(io_backend='disk', type='DecordInit'),
            dict(
                clip_len=1, frame_interval=1, num_clips=8,
                type='SampleFrames'),
            dict(type='DecordDecode'),
            dict(scale=(
                -1,
                256,
            ), type='Resize'),
            dict(
                input_size=224,
                max_wh_scale_gap=1,
                num_fixed_crops=13,
                random_crop=False,
                scales=(
                    1,
                    0.875,
                    0.75,
                    0.66,
                ),
                type='MultiScaleCrop'),
            dict(keep_ratio=False, scale=(
                224,
                224,
            ), type='Resize'),
            dict(flip_ratio=0.5, type='Flip'),
            dict(input_format='NCHW', type='FormatShape'),
            dict(type='PackActionInputs'),
        ],
        type='VideoDataset'),
    num_workers=8,
    persistent_workers=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
train_pipeline = [
    dict(io_backend='disk', type='DecordInit'),
    dict(clip_len=1, frame_interval=1, num_clips=8, type='SampleFrames'),
    dict(type='DecordDecode'),
    dict(scale=(
        -1,
        256,
    ), type='Resize'),
    dict(
        input_size=224,
        max_wh_scale_gap=1,
        num_fixed_crops=13,
        random_crop=False,
        scales=(
            1,
            0.875,
            0.75,
            0.66,
        ),
        type='MultiScaleCrop'),
    dict(keep_ratio=False, scale=(
        224,
        224,
    ), type='Resize'),
    dict(flip_ratio=0.5, type='Flip'),
    dict(input_format='NCHW', type='FormatShape'),
    dict(type='PackActionInputs'),
]
val_cfg = dict(type='ValLoop')
# Validation dataloader; similar to the test dataloader but used for the validation set.
val_dataloader = dict(
    batch_size=16,
    dataset=dict(
        ann_file='data/walk_cross/val.txt',
        data_prefix=dict(video='data/walk_cross/val'),
        pipeline=[
            dict(io_backend='disk', type='DecordInit'),
            dict(
                clip_len=1,
                frame_interval=1,
                num_clips=8,
                test_mode=True,
                type='SampleFrames'),
            dict(type='DecordDecode'),
            dict(scale=(
                -1,
                256,
            ), type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(input_format='NCHW', type='FormatShape'),
            dict(type='PackActionInputs'),
        ],
        test_mode=True,
        type='VideoDataset'),
    num_workers=8,
    persistent_workers=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))
# Use the accuracy metric as the validation evaluator.
val_evaluator = dict(type='AccMetric')
# Validation pipeline; similar to the test pipeline but uses CenterCrop instead of TenCrop.
val_pipeline = [
    dict(io_backend='disk', type='DecordInit'),
    dict(
        clip_len=1,
        frame_interval=1,
        num_clips=8,
        test_mode=True,
        type='SampleFrames'),
    dict(type='DecordDecode'),
    dict(scale=(
        -1,
        256,
    ), type='Resize'),
    dict(crop_size=224, type='CenterCrop'),
    dict(input_format='NCHW', type='FormatShape'),
    dict(type='PackActionInputs'),
]
# Visualization backends and visualizer settings, used to display results.
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='ActionVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
# Working directory where all files and model weights produced during training are stored.
work_dir = './work_dirs/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb'

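The main changes to the config file are the dataset paths and the path of the pretrained weights to load. The remaining settings, such as num_classes=2 in cls_head, were adjusted to match.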
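To get a feel for what the param_scheduler entries in the config do to the learning rate, here is a rough, idealized calculation. It assumes the two schedulers simply multiply (a linear warmup from 0.1x to 1x over the first 5 epochs, then a 0.1x drop at epochs 25 and 45) and it ignores the exact step bookkeeping inside MMEngine, so treat the numbers as approximate. lr_at_epoch is a hypothetical helper, not an MMEngine function.

```python
def lr_at_epoch(epoch, base_lr=0.02, warmup_epochs=5, start_factor=0.1,
                milestones=(25, 45), gamma=0.1):
    """Approximate learning rate at the start of a given epoch."""
    # Linear warmup: factor grows from start_factor to 1.0 over warmup_epochs
    progress = min(epoch, warmup_epochs) / warmup_epochs
    warmup = start_factor + (1.0 - start_factor) * progress
    # Multi-step decay: multiply by gamma once for every milestone reached
    decay = gamma ** sum(1 for m in milestones if epoch >= m)
    return base_lr * warmup * decay


for epoch in (0, 5, 24, 25, 45):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```

With only 17 training clips, the warmup phase covers a very small number of iterations, so the base learning rate may be worth revisiting if training turns out to be unstable.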

Training the Model

Training command:

python tools/train.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py --work-dir work_dirs/new_tsm

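The first argument is the path to the config file, and --work-dir is the directory where the training outputs are written.

Once training starts, the output looks like the figure below:

The resulting work-dir structure after training is shown in the figure above.

Test command: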

python tools/test.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
    work_dirs/tsm_r50_8xb16_u48_240e/best_acc_top1_epoch_27.pth

The test results are shown in the figure below:

Inference command:

python demo/demo_inferencer.py demo/walk_8.mp4 --vid-out-dir output/ --model-config configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py --model-weights work_dirs/tsm_r50_8xb16_u48_240e/best_acc_top1_epoch_27.pth --print-result

The inference results are shown in the figure below:

For an mp4 file containing the walk behavior, inference returns label 0. The printed result is:

{'predictions': [{'rec_labels': [[0]], 'rec_scores': [[0.728128969669342, 0.27187103033065796]]}]}

For an mp4 file containing the cross behavior, inference returns label 1. The printed result is:

{'predictions': [{'rec_labels': [[1]], 'rec_scores': [[0.0561964325606823, 0.9438035488128662]]}]}

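Even with such a small training set, the model classifies both of these clips correctly, with scores of about 0.73 for walk and 0.94 for cross.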
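The printed dictionary can be turned into a readable label with a small helper. This is a sketch that assumes the class order walk=0, cross=1 and the result layout shown in the printed output above; top_prediction is a hypothetical helper name.

```python
# Assumed class order, matching the label indices reported above
CLASS_NAMES = ["walk", "cross"]


def top_prediction(result, class_names=CLASS_NAMES):
    """Return (class name, score of that class) for the first video."""
    pred = result["predictions"][0]
    label = pred["rec_labels"][0][0]
    # rec_scores holds one score per class, so index it with the label
    score = pred["rec_scores"][0][label]
    return class_names[label], score


walk_result = {'predictions': [{'rec_labels': [[0]],
                                'rec_scores': [[0.728128969669342, 0.27187103033065796]]}]}
cross_result = {'predictions': [{'rec_labels': [[1]],
                                 'rec_scores': [[0.0561964325606823, 0.9438035488128662]]}]}

print(top_prediction(walk_result))
print(top_prediction(cross_result))
```

Indexing rec_scores with the predicted label matters: taking the first entry would always report the score of the walk class, even when the prediction is cross.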

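Integrating YOLOv8 for Object Detection

Once the action recognition model has been trained and validated, the next step is to locate the pedestrians in each video frame so that their actions can be recognized. For this I used YOLOv8 for detection, with the yolov8x model.

YOLOv8 integration steps

Object detection: YOLOv8 is applied to the video frame by frame to detect and localize every pedestrian in each frame.

Pedestrian extraction: after detection, the system crops the pedestrian regions out of the frame using the detected bounding boxes. These regions of interest (RoIs) contain only the pedestrian's image content and serve as the input for action recognition.

Action recognition: the extracted pedestrian crops are passed to the MMAction2 model, which classifies the action of each pedestrian using the weights trained earlier.

Result fusion: finally, the system combines the YOLOv8 detections with the MMAction2 recognition results, so each detected pedestrian is annotated with a bounding box, an action class, and a confidence score. The final output is the video with this information drawn on every frame.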
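The pedestrian extraction step depends on box coordinates that stay inside the frame and describe a non-empty region. A minimal sketch of a guard for that step, assuming boxes in (x1, y1, x2, y2) pixel format; clamp_box is a hypothetical helper and not part of YOLOv8 or MMAction2.

```python
def clamp_box(x1, y1, x2, y2, width, height):
    """Clip a box to the frame; return None if nothing is left to crop."""
    x1 = max(0, min(x1, width))
    x2 = max(0, min(x2, width))
    y1 = max(0, min(y1, height))
    y2 = max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        return None  # degenerate box, or box entirely outside the frame
    return x1, y1, x2, y2


# Example with a 640x480 frame
print(clamp_box(-5, 10, 50, 700, 640, 480))
print(clamp_box(700, 10, 800, 50, 640, 480))
```

Skipping detections for which the helper returns None avoids passing an empty crop to the resize and recognition steps.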

The code is as follows:

import cv2
import torch
import tempfile
import os
from collections import defaultdict
from ultralytics import YOLO
from mmaction.apis.inferencers import MMAction2Inferencer

def main():
    # Initialize the YOLOv8 model
    yolo_model = YOLO('yolov8x.pt')

    # Initialize the MMAction2 model
    action_model = MMAction2Inferencer(
        rec='configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py',
        rec_weights='work_dirs/tsm_r50_8xb16_u48_240e/best_acc_top1_epoch_27.pth',
        device='cuda:0',
        label_file='tools/data/kinetics/label_1.txt'
    )

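    # Open the video file
    cap = cv2.VideoCapture('demo/cross_8.mp4')

    # Get the width and height of the video
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Create the video writer (make sure the output directory exists first)
    os.makedirs('output', exist_ok=True)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter('output/output_video.mp4', fourcc, 20.0, (width, height))

    # Count how many detections are assigned to each action
    action_count = defaultdict(int)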

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Run YOLOv8 object detection on the frame
        results = yolo_model(frame)

        # Process the detection results
        for result in results:
            boxes = result.boxes.cpu().numpy()
            for box in boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0])
                confidence = box.conf[0]
                class_id = int(box.cls[0])

                if class_id == 0 and confidence > 0.5:  # class 0 is 'person' in the COCO classes
                    person_frame = frame[y1:y2, x1:x2]
                    if person_frame.size == 0:
                        continue  # skip degenerate boxes with nothing to crop

                    # Resize the person crop to the full frame resolution
                    person_frame = cv2.resize(person_frame, (width, height))

                    # Reserve a temporary file name and close the handle
                    # before OpenCV writes to that path
                    with tempfile.NamedTemporaryFile(delete=False, suffix='.mp4') as temp_file:
                        temp_filename = temp_file.name
                    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
                    out_temp = cv2.VideoWriter(temp_filename, fourcc, 20.0, (width, height))
                    out_temp.write(person_frame)
                    out_temp.release()

                    # Run action recognition on this person's clip
                    action_results = action_model(temp_filename, print_result=False)

                    # Delete the temporary file
                    os.remove(temp_filename)

                    # Read the recognition result; rec_scores holds one score
                    # per class, so index it with the predicted label
                    action_label = action_results['predictions'][0]['rec_labels'][0][0]
                    action_score = action_results['predictions'][0]['rec_scores'][0][action_label]

                    # Count the recognized action
                    action_count[action_label] += 1

                    # Draw the action label and bounding box on the frame
                    label = f"Action: {action_label}, Score: {action_score:.2f}"
                    cv2.putText(frame, label, (x1, y2 + 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
                    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)

        # Write the annotated frame to the output video
        out.write(frame)

    # Release resources
    cap.release()
    out.release()

    # Print the recognized actions and their counts
    print("Recognized Actions and Their Counts:")
    for action, count in action_count.items():
        print(f"{action}: {count}")

if __name__ == '__main__':
    main()

A frame captured from the output video:

The frame is labeled Action: 1, which means the cross (crossing) action was recognized.
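One limitation of the script above is that each recognition call sees a one-frame clip, while the TSM config samples 8 frames per video (num_clips=8). A possible refinement is to buffer the crops of each pedestrian over consecutive frames and only run recognition once a full window is available. The sketch below shows the buffering logic only. It assumes each detection carries a stable track ID (for example from a tracker), which the script above does not provide, and ClipBuffer is a hypothetical helper.

```python
from collections import defaultdict, deque


class ClipBuffer:
    """Keep the most recent crops for each tracked pedestrian."""

    def __init__(self, clip_len=8):
        self.clip_len = clip_len
        # One fixed-length queue per track ID; old crops drop out automatically
        self._buffers = defaultdict(lambda: deque(maxlen=clip_len))

    def push(self, track_id, crop):
        """Add a crop; return the full window once clip_len crops exist."""
        buf = self._buffers[track_id]
        buf.append(crop)
        if len(buf) == self.clip_len:
            return list(buf)
        return None


# Toy run with strings standing in for image crops
buffer = ClipBuffer(clip_len=3)
windows = [buffer.push(7, f"crop{i}") for i in range(4)]
print(windows)
```

Each returned window could then be written to a temporary video and passed to the recognizer in place of the single-frame clip, at the cost of a short delay before the first prediction for each pedestrian.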
