MESNet: A Convolutional Neural Network for Spotting Multi-Scale Micro-Expression Intervals (Reading Notes)

A paper published in IEEE TIP that uses a convolutional neural network to extract micro-expression clips from long videos. It is the work of Wang Su-Jing's (王甦菁) team at the Institute of Psychology, Chinese Academy of Sciences, and is well worth a careful read.

Abstract:

This paper proposes a novel network based on a convolutional neural network (CNN) for spotting multi-scale spontaneous micro-expression intervals in long videos.

The paper proposes the Micro-Expression Spotting Network (MESNet), a novel CNN-based network for spotting multi-scale spontaneous micro-expressions in long videos.

It is composed of three modules.

MESNet consists of three modules: a 2+1D spatiotemporal convolutional network that extracts spatial and temporal features; a Clip Proposal Network that generates candidate micro-expression clips; and a Classification Regression Network that classifies the candidates as micro-expression or not and regresses their temporal boundaries.

1. Introduction

Compared with the common expressions called "macro-expressions", there are three distinguishing characteristics of MEs: short duration, low intensity, and local movements.

Compared with common "macro-expressions", micro-expressions have three distinguishing characteristics: short duration, low intensity, and local movements.

It is challenging for human beings to spot and recognize such brief and subtle expressions by naked eyes [9].

It is challenging for humans to spot and recognize such brief and subtle expressions with the naked eye.

In computer vision, generally, ME analysis includes two major steps: spotting and recognition. Spotting is to find the temporal location of the ME clip in a given video. And recognition is the emotional classification of the ME clip.

In computer vision, micro-expression (ME) analysis generally involves two major steps: spotting and recognition. Spotting finds the temporal location of the ME clip in a given video, while recognition performs emotional classification of the ME clip.

Onset is the time when the ME starts. Apex is the time when the ME reaches its maximum muscular contraction. Offset is the time when the ME ends.

Onset is when the micro-expression begins, apex is when it reaches maximum muscular contraction, and offset is when it ends.

According to the different kinds of outputs, ME spotting methods are divided into apex frame spotting [21]–[24] and sequence spotting.

Depending on the type of output, ME spotting methods are divided into apex frame spotting and sequence spotting.

The traditional algorithm spots the ME interval by comparing feature difference (FD) in a fixed-length time window.

Traditional algorithms spot ME intervals by comparing feature differences (FD) within a fixed-length time window.

The main idea is to extract hand-crafted features and then use a classifier to recognize ME and non-ME frames [34]–[38].

In recent years, ME spotting methods combined with machine learning have been developed. The main idea is to extract hand-crafted features and then use a classifier to recognize ME and non-ME frames.

ME spotting combined with machine learning is divided into frame-based classification [38] and interval-based classification [35], which means to determine whether the frame is a ME frame or whether the interval is a ME segment.

ME spotting combined with machine learning is divided into frame-based classification and interval-based classification. Interval-based spotting can avoid the influence of inaccurate ME annotations and reduce the number of true negatives in the spotting results.

Through multi-scale analysis, such as normalization of different lengths or multi-scale video sampling, the interval-based spotting method can adapt and detect ME fragments of different lengths, and better distinguish ME from other kinds of facial movements. Here, multi-scale means the difference in the length of ME clips.

Through multi-scale analysis, such as normalizing clips of different lengths or multi-scale video sampling, interval-based spotting methods can adapt to and detect ME clips of different lengths, and better distinguish MEs from other kinds of facial movements. Here, multi-scale refers to the difference in the lengths of ME clips.

At the start of micro-expression research, since only the micro-expression sequence was considered for recognition when collecting micro-expression samples, the video clip when the micro-expression occurred was recorded (maybe also a few frames before the onset and a few frames after the offset), which became the so-called short video.

In the early days of micro-expression research, only the micro-expression sequence was considered when collecting samples, so only the video clip in which the micro-expression occurred was recorded (perhaps with a few frames before the onset and after the offset). These are the so-called short videos.

In contrast, in long videos, participants inevitably have a lot of head movements such as blinking, swallowing, weak head rotation, and macro-expressions. Furthermore, there will be noise caused by environmental changes.

In contrast, in long videos participants inevitably show many movements such as blinking, swallowing, slight head rotation, and macro-expressions, and there is also noise caused by environmental changes. These factors greatly degrade ME spotting performance.

Sliding windows are used to split long videos into short videos, making it easy for the algorithm to focus on extracting the features of micro-expressions.

Sliding windows are used to split long videos into short videos, which makes it easier for the algorithm to focus on extracting micro-expression features.

It is still a challenge for current research to effectively extract or learn the most representative spatio-temporal features of ME from limited data and thereby accurately locate its temporal position in long videos.

Current research still faces the challenge of effectively extracting or learning the most representative spatio-temporal features of MEs from limited data, and thereby accurately locating their temporal positions in long videos.

MESNet includes three modules: 2+1D Spatiotemporal Convolutional Network, Clip Proposal Network, and Classification Regression Network. They extract spatial features, provide proposed clips, and further regress temporal boundaries of these proposed clips.

MESNet includes three modules: the 2+1D Spatiotemporal Convolutional Network, the Clip Proposal Network, and the Classification Regression Network. They respectively extract spatial features, generate candidate clips, and further regress the temporal boundaries of those proposed clips.
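Putting the three modules together, the overall data flow can be sketched roughly as below. This is an illustrative Python sketch, not the paper's code; the module interfaces and names (backbone, cpn, crn) are assumptions made only to show the two-stage structure.

```python
# Rough data flow through MESNet's three modules (illustrative, not the paper's code).
def mesnet_pipeline(clip_frames, backbone, cpn, crn):
    """clip_frames: frames of one video clip; backbone/cpn/crn: the three modules."""
    spatial_feats, spatiotemporal_feats = backbone(clip_frames)  # 2+1D Spatiotemporal Conv Net
    proposals = cpn(spatiotemporal_feats)                        # candidate (a, b) intervals
    spotted = []
    for (a, b) in proposals:
        label, (r1, r2) = crn(spatial_feats[:, a:b + 1])         # classify + regress boundaries
        if label == "ME":
            spotted.append((a + r1 * (b - a), b + r2 * (b - a)))
    return spotted  # refined ME intervals, before NMS
```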

• We propose a CNN-based method for spotting multi-scale ME intervals in long videos.

• There are several special tricks to deal with the small sample size and sample imbalance problems of ME.

• We propose a novel evaluation metric for ME spotting.

Three main contributions:

A CNN-based method is proposed for spotting multi-scale micro-expression intervals in long videos.

Several special tricks are used to handle the small sample size and sample imbalance problems of micro-expressions.

A novel evaluation metric is proposed for assessing micro-expression spotting performance.

2. Related Work

In computer vision, there are two similar problems: object detection and temporal action localization.

In computer vision there are two similar problems: object detection and temporal action localization.

Object detection is to determine where objects are located in a given image and which category each object belongs to [47]. Object detection includes two basic tasks: classification and regression.

Object detection determines where objects are located in a given image and which category each object belongs to. It involves two basic tasks: classification and regression.

Object detection is to locate in the spatial domain, while ME spotting is to spot in the temporal domain.

Object detection localizes in the spatial domain, whereas micro-expression spotting localizes in the temporal domain.

The proposed network also uses a network module to produce clip proposals in the temporal domain.

Inspired by the object detection methods above, the proposed network likewise uses a network module to produce clip proposals in the temporal domain.

The proposed network also has two parallel output layers, which are responsible for classification and regression.

The proposed network also has two parallel output layers, responsible for classification and regression respectively.

Temporal action localization (TAL) is to find the temporal location of actions in a video.

Temporal action localization (TAL) finds the temporal locations of actions in a video.

TAL's methods can be divided into three categories [52]: (1) methods performing frame- or segment-level classification, where smoothing and merging steps are required to obtain the temporal boundaries [53], [54]; (2) methods using a two-step framework including proposal production, classification and boundary regression [55], [56]; (3) methods developing end-to-end architectures integrating the proposal production and classification [57], [58].

TAL methods fall into three categories: (1) methods performing frame- or segment-level classification, which require smoothing and merging steps to obtain the temporal boundaries; (2) methods using a two-step framework consisting of proposal generation, classification, and boundary regression; (3) methods developing end-to-end architectures that integrate proposal generation and classification.

3. Proposed Method

Four 2D convolution layers with max pooling layers extract spatial features of each frame of micro-expressions, and two 1D convolution layers extract temporal features of these spatial features.

In the 2+1D Spatiotemporal Convolutional Network, four 2D convolution layers with max pooling extract spatial features of each micro-expression frame, and two 1D convolution layers then extract temporal features from these spatial features.

The N spatial features f_n form a matrix F = (f_1, f_2, ..., f_N) ∈ R^{L×N}.

The N spatial features f_n are stacked into a matrix F = (f_1, f_2, ..., f_N) ∈ R^{L×N}.

Here, a 1D CNN with two layers is implemented on the rows of F to extract spatiotemporal features S = (s_1, s_2, ..., s_N) ∈ R^{128×N} of the video clip.

Here, a two-layer 1D CNN is applied along the rows of F to extract the spatiotemporal features S = (s_1, s_2, ..., s_N) ∈ R^{128×N} of the video clip.
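A minimal sketch of this 2+1D backbone, assuming PyTorch. The paper specifies four 2D convolution layers with max pooling, two 1D convolution layers, and 128-dimensional spatiotemporal features per frame; the channel widths, kernel sizes, and spatial feature length L used below are my own assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporal2Plus1D(nn.Module):
    def __init__(self, feat_len=256):  # feat_len plays the role of L (assumed value)
        super().__init__()
        chans = [3, 16, 32, 64, 64]    # assumed channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]                       # four 2D conv + max-pool blocks
        self.spatial = nn.Sequential(*blocks,
                                     nn.AdaptiveAvgPool2d(1))  # (c, 1, 1) per frame
        self.proj = nn.Linear(chans[-1], feat_len)              # frame feature f_n in R^L
        self.temporal = nn.Sequential(                           # two 1D convs over time
            nn.Conv1d(feat_len, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, 3, padding=1), nn.ReLU())

    def forward(self, clip):              # clip: (N, 3, H, W), N frames of one video clip
        f = self.spatial(clip).flatten(1)         # (N, c)
        f = self.proj(f)                          # (N, L)
        s = self.temporal(f.t().unsqueeze(0))     # (1, 128, N): spatiotemporal features S
        return f.t(), s.squeeze(0)                # F in R^{L x N}, S in R^{128 x N}
```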

S is flattened to a vector to feed the network. The output of the network is denoted as ŷ_i = (ŷ_1, ŷ_2)^T.

S is flattened into a vector and fed into the network; the network's output is denoted ŷ_i = (ŷ_1, ŷ_2)^T.

where the first item represents the cross-entropy loss function, and the second item represents the L2 regularization loss.

The optimization loss consists of a cross-entropy loss and an L2 regularization loss.
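Written out, a plausible form of this pre-training loss under the usual conventions (the regularization weight λ and the exact indexing are assumptions; w denotes the regularized parameters described next):

```latex
L_{pre}(w) \;=\; -\sum_{i} \sum_{k=1}^{2} y_{i,k}\,\log \hat{y}_{i,k} \;+\; \lambda\,\lVert w \rVert_2^2
```

where y_{i,k} is the one-hot ME / non-ME label of training clip i.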

All the trainable parameters w include four 2D convolution layers, two 1D convolution layers, and two fully-connected layers, except the last fully-connected layer parameters.

The trainable parameters w include those of the four 2D convolution layers, the two 1D convolution layers, and two fully-connected layers, excluding the parameters of the last fully-connected layer.

All clips are sampled to N_pre frames.

That is, the clip length is fixed at this stage.

The Clip Proposal Network (CPN) takes spatiotemporal features S as input and outputs a set of clip proposals.

The Clip Proposal Network (CPN) takes the spatiotemporal features S as input and outputs a set of clip proposals.

To get clip proposals with different scales, we need to use different sizes of receptive fields on S.

To obtain clip proposals at different scales, receptive fields of different sizes are needed on the spatiotemporal features S.

A natural idea is expanding the receptive field's size to a desirable size by stacking more 1D convolution layers.

A natural idea is to expand the receptive field to the desired size by stacking more 1D convolution layers; different numbers of stacked layers yield receptive fields of different sizes.

With the increasing number of stacked layers, the number of weight parameters will dramatically increase. It is unsuitable for the small sample size problem of ME.

However, as the number of stacked layers increases, the number of weight parameters grows dramatically, which is unsuitable given the small sample size of ME data.

CPN includes five parallel sub-networks with two 1D dilated convolutional layers. Each sub-network is the equivalent of a fixed-length window sliding on videos. Its output is a set of probabilities that the video clip corresponding to the sliding window belongs to ME.

CPN includes five parallel sub-networks, each with two 1D dilated convolutional layers. Each sub-network is equivalent to a fixed-length window sliding over the video, and its output is a set of probabilities that the video clip corresponding to the sliding window is an ME.
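A minimal sketch of one such sub-network, assuming PyTorch. The dilation rates, channel widths, and kernel size below are my own assumptions; the point is only to show how two stacked dilated 1D convolutions enlarge the temporal receptive field without adding extra layers or parameters.

```python
import torch
import torch.nn as nn

class CPNBranch(nn.Module):
    """One of the five parallel CPN sub-networks (sketch).

    Two dilated 1D convolutions give a receptive field of
    1 + 2*(k-1)*d frames for kernel size k and dilation d,
    so each branch behaves like a sliding window of a fixed scale.
    """
    def __init__(self, in_ch=128, hidden=64, dilation=1, k=3):
        super().__init__()
        pad = (k - 1) // 2 * dilation  # keep the temporal length N unchanged
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hidden, k, padding=pad, dilation=dilation), nn.ReLU(),
            nn.Conv1d(hidden, 1, k, padding=pad, dilation=dilation))

    def forward(self, s):                  # s: (1, 128, N) spatiotemporal features
        return torch.sigmoid(self.net(s))  # (1, 1, N): ME probability per window position

# Five branches with growing dilation cover five proposal scales (assumed rates).
branches = [CPNBranch(dilation=d) for d in (1, 2, 4, 8, 16)]
```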

It is introduced into the loss function in order to alleviate the sample imbalance problem of ME and non-ME.

This parameter is introduced into the loss function to alleviate the sample imbalance between ME and non-ME.
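The note does not reproduce the exact form of this term. One common way such a weighting parameter enters a binary cross-entropy loss, shown purely as an assumption, is to scale the rare ME class by a factor α > 1:

```latex
L_{CPN} \;=\; -\sum_{i} \Big[\, \alpha\, y_i \log \hat{p}_i \;+\; (1 - y_i)\,\log\!\big(1 - \hat{p}_i\big) \Big]
```

where y_i ∈ {0, 1} marks ME windows and p̂_i is the probability output by the corresponding CPN sub-network.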

CPN will propose a set of temporal segment proposals. The spatial features f_n corresponding to the proposals are fed into the last module, the Classification Regression Network.

CPN produces a set of temporal segment proposals. The spatial features f_n corresponding to these proposals are fed into the last module, the Classification Regression Network.

The Classification Regression Network (CRN) classifies the spatial features f_n corresponding to the proposals into ME or non-ME and further regresses the temporal boundaries of proposals belonging to ME.

The Classification Regression Network (CRN) classifies the spatial features f_n of each proposal as ME or non-ME and further regresses the temporal boundaries of the proposals classified as ME.

The corresponding spatial features (f_a, f_{a+1}, ..., f_b) are normalized into a fixed temporal length N_CR, and then are fed into CRN.

The corresponding spatial features (f_a, f_{a+1}, ..., f_b) are normalized to a fixed temporal length N_CR and then fed into CRN.
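A small sketch of one way to do this length normalization, assuming PyTorch and linear interpolation along the temporal axis; the paper only states that the proposal features are resampled to N_CR frames, so the interpolation mode here is an assumption.

```python
import torch
import torch.nn.functional as F

def normalize_length(feats: torch.Tensor, n_cr: int) -> torch.Tensor:
    """Resample proposal features of shape (L, b - a + 1) to (L, n_cr)."""
    # interpolate expects (batch, channels, length)
    return F.interpolate(feats.unsqueeze(0), size=n_cr,
                         mode="linear", align_corners=False).squeeze(0)

# e.g. a 23-frame proposal with L = 256 spatial features per frame -> 32 frames
proposal = torch.randn(256, 23)
fixed = normalize_length(proposal, n_cr=32)   # shape (256, 32)
```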

The first item is the cross-entropy loss function of classification and the Smooth L1 loss of regression, and the second item is the L2 regularization loss of trainable parameters in all layers except the last two fully-connected layers.

The first term consists of the classification cross-entropy loss and the regression Smooth L1 loss; the second term is the L2 regularization loss over the trainable parameters of all layers except the last two fully-connected layers.

If the prediction value y_cr reveals that the probability of ME is not less than a threshold T_CRN, then the proposed clip is output with the interval [a + r1(b − a), b + r2(b − a)].

If the prediction y_cr shows that the probability of ME is not less than the threshold T_CRN, the proposal is output with the refined interval [a + r1(b − a), b + r2(b − a)].
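A small sketch of this boundary refinement step, following the interval formula above; the default threshold value and the helper name are illustrative only.

```python
def refine_proposal(a, b, r1, r2, p_me, t_crn=0.5):
    """Apply CRN's regressed offsets (r1, r2) to a proposal [a, b].

    Both offsets are expressed as fractions of the proposal length (b - a),
    so the refined interval is [a + r1*(b - a), b + r2*(b - a)].
    Returns None when the ME probability is below the threshold T_CRN.
    """
    if p_me < t_crn:
        return None
    length = b - a
    return (a + r1 * length, b + r2 * length)

# e.g. shrink the start a little and pull in the end of a proposal [40, 70]
print(refine_proposal(40, 70, r1=0.1, r2=-0.05, p_me=0.83))  # (43.0, 68.5)
```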

The first is that we sample the spatial features instead of the corresponding spatiotemporal features, in order to obtain a fixed-length input.

The first trick is to sample the spatial features rather than the corresponding spatiotemporal features in order to obtain a fixed-length input.

The second trick is that CRN doesn't share parameters of the two 1D CNN layers with the 2+1D Spatiotemporal Convolutional Network.

The second trick is that CRN does not share the parameters of its two 1D CNN layers with the 2+1D Spatiotemporal Convolutional Network.

The total loss of MESNet is computed by

This is the overall loss function of the network.
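The note does not copy the formula itself. A plausible form, assuming the total loss simply combines the losses of the proposal and classification-regression stages (any balancing weights used in the paper are not reproduced here), would be:

```latex
L_{total} \;=\; L_{CPN} + L_{CRN}
```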

The direct output of MESNet is probabilities and regression values.

MESNet's direct outputs are probabilities and regression values.

A Non-Maximum Suppression (NMS) algorithm with threshold T_nms is then applied to remove some overlapping intervals, as in [56], [64].

The NMS algorithm removes overlapping temporal intervals.
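A minimal sketch of temporal (1D) NMS in its usual greedy formulation; the threshold name t_nms mirrors T_nms above, and the scoring and sorting details are assumptions.

```python
def temporal_iou(x, y):
    """IoU of two intervals x = (s1, e1), y = (s2, e2)."""
    inter = max(0.0, min(x[1], y[1]) - max(x[0], y[0]))
    union = (x[1] - x[0]) + (y[1] - y[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(intervals, scores, t_nms=0.4):
    """Greedy NMS: keep the highest-scoring interval, drop overlapping ones."""
    order = sorted(range(len(intervals)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(intervals[i], intervals[j]) < t_nms for j in keep):
            keep.append(i)
    return [intervals[i] for i in keep]

print(temporal_nms([(10, 40), (12, 38), (80, 100)], [0.9, 0.7, 0.8]))
# -> [(10, 40), (80, 100)]
```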

The overlap makes sure that every ME can be completely contained in a certain video clip when splitting the long video.

By choosing the overlap length appropriately, every ME is guaranteed to be fully contained in at least one clip when the long video is split.
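A short sketch of this splitting scheme; the window length and overlap below are illustrative values, but the containment guarantee holds whenever the overlap is at least the maximum ME length.

```python
def split_long_video(num_frames, win=128, overlap=32):
    """Split a long video into overlapping clips [start, end).

    If overlap >= the longest possible ME, every ME falls entirely
    inside at least one clip.
    """
    step = win - overlap
    clips, start = [], 0
    while start + win < num_frames:
        clips.append((start, start + win))
        start += step
    clips.append((max(0, num_frames - win), num_frames))  # last clip reaches the end
    return clips

print(split_long_video(300, win=128, overlap=32))
# -> [(0, 128), (96, 224), (172, 300)]
```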

4. Experiments

The experiments include comparative experiments with other methods (naturally, MESNet comes out best), ablation tests of each component, and tests of the hyper-parameters.

5. Conclusion

This paper first proposes the CNN-based method to spot multi-scale spontaneous ME intervals in long videos.

This paper is the first to propose a CNN-based method for spotting multi-scale spontaneous ME intervals in long videos.

The proposed MESNet contains two-stage predictions: one is the prediction of CPN; the other is the further prediction of CRN.

The proposed MESNet makes two-stage predictions: the prediction of CPN, followed by the further prediction of CRN.

Experiment results prove that the two-stage design can effectively enhance the F1-score metric. And the proposed MESNet outperforms the published state-of-the-art ME spotting methods regardless of the occurrence of overfitting. Especially in SAMM, the performance improvement is very significant.

The two-stage design effectively improves the F1-score, and the proposed MESNet outperforms the published state-of-the-art ME spotting methods despite the occurrence of overfitting. The improvement is especially significant on SAMM.

It reveals the potential of the proposed method to achieve superior performance when there are more data available in the future.

This suggests that the proposed method has the potential to achieve even better performance when more data become available in the future.
