Dual-Branch Cross-Attention Network for Micro-Expression Recognition with Transformer Variants: Reading Notes

A micro-expression recognition paper built on Transformers; notes written up after reading.
Abstract:
However, the main drawback of the current methods is their inability to fully extract holistic contextual information from ME images.
In other words, the biggest weakness of traditional methods is that they cannot fully extract the holistic contextual information in ME images.
This paper uses Transformer variants as the main backbone and a dual-branch architecture as the main framework to extract meaningful multi-modal contextual features for ME recognition (MER).
The paper takes Transformer variants as the backbone and a dual-branch structure as the framework, extracting multi-modal contextual features for micro-expression recognition.
The first branch leverages an optical flow operator to facilitate the motion information extraction between ME sequences, and the corresponding optical flow maps are fed into the Swin Transformer to acquire motion–spatial representation.
The first branch uses optical flow to obtain motion information, then feeds the optical flow maps into a Swin Transformer to get a motion-spatial representation.
The second branch directly sends the apex frame in one ME clip to MobileViT (Vision Transformer), which can capture the local–global features of MEs.
The second branch feeds the apex frame directly into MobileViT to obtain the local and global features of the micro-expression.
To achieve the optimal feature stream fusion, a CAB (cross attention block) is designed to let the features extracted by each branch interact for adaptive learning fusion.
To fuse the two feature streams optimally, a CAB (cross attention block) is designed so the features from the two branches interact and are fused adaptively.
1. Introduction
These studies empirically designed several manual descriptors, such as Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) [4], Histogram of Image Gradient Orientation (HIGO) [5], and Histograms of Oriented Optical Flow (HOOF) [6] and their variants, that can effectively capture representative information on the edges, textures, and facial movements to distinguish MEs.
These are hand-crafted features that, combined with machine learning classifiers, complete the micro-expression recognition pipeline.
However, because CNN mechanisms filter images layer by layer using weight-sharing convolutional kernels, they tend to focus on the local information and neglect contextual connectivity, ignoring the effects of the global information.
CNN's drawback: because of layer-by-layer filtering with weight-sharing convolution kernels, CNNs favor local information and easily overlook contextual connectivity and the role of global information.
In contrast to CNN, ViT can model global features by using the entire image as input to better capture the contextual relations in images.
Unlike CNNs, ViT models global features by taking the entire image as input (as patch tokens), which captures contextual relations in images better; a toy sketch of this patchification is below.
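A minimal sketch of the patch-token idea referenced above, not code from the paper; the patch size and dimensions are illustrative assumptions:

```python
# Cut an image into non-overlapping patch tokens, ViT-style. Once flattened
# into tokens, self-attention can relate any two image regions directly.
import torch

img = torch.randn(1, 3, 224, 224)               # (B, C, H, W), assumed size
p = 16                                          # assumed patch size
tokens = img.unfold(2, p, p).unfold(3, p, p)    # (1, 3, 14, 14, 16, 16)
tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * p * p)
print(tokens.shape)                             # (1, 196, 768): 196 patch tokens
```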
In summary, the core novelty is exploring the optimal combination of Transformer variants suitable for multimodal MER.
In short, the paper's core contribution is exploring the optimal combination of Transformer variants for multimodal MER.
2. Related Work
Its feature representation is scene-constrained and lacks concern for ME nuances.
That is, the feature representations of traditional hand-crafted methods are scene-constrained and pay little attention to the subtle nuances of MEs; this is their fundamental flaw.
To the best of our knowledge, different basic frameworks emphasize specific facial features.
As the authors put it, different basic frameworks emphasize different specific facial features.
3. Proposed Method
3.1 Network Architecture
Specifically, this scheme adopts a dual-branch framework, where one branch sends the face image to a Swin Transformer after optical flow processing to extract the temporal–spatial information of ME, and the other branch sends the apex frame image to MobileViT to acquire the local–global information. More importantly, the multiple-mode features from the two branches interact through the CAB module for adaptive learning fusion.
This is the core of the whole paper: of the two branches, one extracts temporal-spatial features via optical flow, the other extracts global-local features from the apex frame, and a CAB finally fuses them.
3.2 Branch 1: Temporal-Spatial Information Extraction
...which takes the onset frame and apex frame of the ME sequence as input and computes the motion information of pixels between these two frames using the optical flow operator.
So the optical flow here is computed from the onset frame and the apex frame.
This work selects the onset frame and the apex frame in the ME sequence to calculate the optical flow map, which can capture the movement information of MEs.
In this method the onset and apex frames are picked manually; a minimal sketch of computing such a flow map is below.
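A minimal sketch, assuming OpenCV's Farneback operator as a stand-in (the paper does not necessarily use this exact operator) and hypothetical file paths:

```python
# Dense optical flow between the onset and apex frames, packed into an
# image-like flow map that a backbone such as Swin Transformer could consume.
import cv2
import numpy as np

onset = cv2.imread("onset.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
apex = cv2.imread("apex.png", cv2.IMREAD_GRAYSCALE)    # hypothetical path

# Per-pixel (dx, dy) displacement from onset to apex; positional args are
# pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(onset, apex, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Encode direction as hue and magnitude as value: a common visualization
# that also serves as a 3-channel network input.
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
hsv = np.zeros((*onset.shape, 3), dtype=np.uint8)
hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)
hsv[..., 1] = 255
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
flow_map = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```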
On one hand, optical flow analysis can capture the subtle dynamic changes in MEs by providing high-resolution information. On the other hand, the optical flow information provides continuity information between frames, effectively maintaining the consistency of ME sequences and reducing noise and incoherence.
The benefits of adding optical flow: on one hand, it captures the subtle dynamic changes of micro-expressions through high-resolution information; on the other hand, it supplies frame-to-frame continuity, effectively keeping the ME sequence consistent and suppressing noise and incoherence.
The optical flow operator is based on the principle of constant luminance to calculate the motion characteristics between two frames.
That is, the optical flow operator computes the motion between two frames from the brightness constancy principle; the underlying constraint is sketched below.
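A standard statement of that principle (textbook form, not copied from the paper): a pixel is assumed to keep its intensity as it moves, and a first-order Taylor expansion turns this into the optical flow constraint equation.

```latex
% Brightness constancy: intensity is preserved along the motion.
I(x, y, t) = I(x + \Delta x,\ y + \Delta y,\ t + \Delta t)
% First-order Taylor expansion and division by \Delta t give the
% optical flow constraint, with u = dx/dt and v = dy/dt:
I_x u + I_y v + I_t = 0
```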
Swin Transformer partitions the input into windows to compute self-attention without losing position information, which greatly improves the computational efficiency and model accuracy.
Swin Transformer computes self-attention within divided windows, without losing position information, which greatly improves computational efficiency and model accuracy.
It captures the dynamic features and temporal dependencies in ME sequences by employing techniques such as patch-based processing, self-attention mechanism, multi-scale hierarchical modeling, and positional encoding, providing strong modeling capabilities for ME analysis.
Swin Transformer captures the dynamic features and temporal dependencies in ME sequences via patch-based processing, self-attention, multi-scale hierarchical modeling, and positional encoding, giving it strong modeling power for ME analysis; a toy sketch of the windowed attention follows.
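A toy sketch of Swin-style window partitioning, assuming a (56, 56, 96) feature map and window size 7 (typical Swin-T numbers, but assumptions here); it shows why the attention cost grows with the number of windows rather than quadratically in image size:

```python
# Self-attention is computed only inside each ws x ws window.
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows * B, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

B, H, W, C, ws = 2, 56, 56, 96, 7
feat = torch.randn(B, H, W, C)
windows = window_partition(feat, ws)                  # (128, 49, 96)
# Single-head, unprojected attention per window (illustration only).
attn = torch.softmax(windows @ windows.transpose(1, 2) / C ** 0.5, dim=-1)
out = attn @ windows                                  # (128, 49, 96)
print(windows.shape, out.shape)
```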
3.3 Branch 2: Local-Global Information Extraction
In the development of deep learning, CNNs excel at extracting local features from images, while Transformers are good at extracting global features.
As deep learning has developed, CNNs have proven strong at local features, Transformers at global features.
The MobileViT block (Figure 4) is the core component of the MobileViT model, comprising the local representation, the global representation, and the fusion module; a rough sketch of this local-global-fuse pattern follows.
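A rough, simplified sketch of that conv-then-attend-then-fuse idea, not the paper's or MobileViT's exact configuration: real MobileViT unfolds the feature map into patches before the Transformer, whereas this toy version treats every pixel as a token. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    def __init__(self, c: int = 64, d: int = 96):
        super().__init__()
        # Local representation: plain convolutions (local features).
        self.local = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(), nn.Conv2d(c, d, 1))
        # Global representation: Transformer over the token sequence.
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.global_rep = nn.TransformerEncoder(enc, num_layers=2)
        self.proj = nn.Conv2d(d, c, 1)
        # Fusion: concatenate with the block input, then mix with a conv.
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, x):
        y = self.local(x)                            # (B, d, H, W)
        B, d, H, W = y.shape
        tokens = y.flatten(2).transpose(1, 2)        # (B, H*W, d)
        y = self.global_rep(tokens).transpose(1, 2).reshape(B, d, H, W)
        return self.fuse(torch.cat([x, self.proj(y)], dim=1))

print(MobileViTBlockSketch()(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```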
3.4 Cross Attention Module
To facilitate effective interaction between different information types, we design the cross attention block (CAB), which is an attention mechanism used for interaction and information propagation between different feature representations.
To enable effective interaction between different types of information, the authors design the cross attention block (CAB), an attention mechanism for interaction and information propagation between different feature representations.
(The figure for this part seems to contain an error.)
Then, the Q vector of Branch 1 and the K and V vectors of Branch 2 are used to calculate the first cross-attention. With a similar method, the Q vector of Branch 2 and the K and V vectors of Branch 1 are responsible for the second cross-attention calculation.
So the Q vector of Branch 1 with the K and V vectors of Branch 2 gives the first cross-attention, and symmetrically the Q vector of Branch 2 with the K and V vectors of Branch 1 gives the second.
In CAB, there are typically two inputs: query (Q) vectors and key–value (K-V) pair vectors. The Q vectors are used to specify the target of attention, while the K-V pair vectors contain the content to be attended to.
In the CAB there are two kinds of inputs: query (Q) vectors, which specify what to attend to, and key-value (K-V) pair vectors, which carry the content being attended to. A minimal sketch of the bidirectional cross-attention is below.
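A minimal single-head sketch of the bidirectional cross-attention just described; token counts, dimensions, and the projection layout are assumptions, not the paper's exact CAB:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.scale = dim ** -0.5
        # Separate Q/K/V projections for each branch's features.
        self.q1 = nn.Linear(dim, dim)
        self.k1 = nn.Linear(dim, dim)
        self.v1 = nn.Linear(dim, dim)
        self.q2 = nn.Linear(dim, dim)
        self.k2 = nn.Linear(dim, dim)
        self.v2 = nn.Linear(dim, dim)

    def forward(self, f1, f2):
        # f1: (B, N1, dim) branch-1 tokens; f2: (B, N2, dim) branch-2 tokens.
        # First cross-attention: branch 1 queries branch 2's content.
        a12 = F.softmax(self.q1(f1) @ self.k2(f2).transpose(1, 2) * self.scale, dim=-1)
        out1 = a12 @ self.v2(f2)
        # Second cross-attention: branch 2 queries branch 1's content.
        a21 = F.softmax(self.q2(f2) @ self.k1(f1).transpose(1, 2) * self.scale, dim=-1)
        out2 = a21 @ self.v1(f1)
        return out1, out2

f1, f2 = torch.randn(2, 49, 256), torch.randn(2, 16, 256)
o1, o2 = CrossAttentionSketch()(f1, f2)
print(o1.shape, o2.shape)  # (2, 49, 256) (2, 16, 256)
```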
4. Experiments
Not much to say about the experiments: comparison and ablation studies were run, and naturally the authors' method comes out best.
In addition, since the dataset of ME is unbalanced in terms of sentiment categories, we employed the F1 score evaluation metric to mitigate potential class bias.
Also, since ME datasets are imbalanced across emotion categories, the F1 score is used as the evaluation metric to mitigate potential class bias; a tiny illustration of why is below.
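A small toy illustration (hypothetical labels, not from the paper) of why unweighted F1 exposes class bias that accuracy hides:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 8 + [1] * 1 + [2] * 1   # imbalanced 3-class toy labels
y_pred = [0] * 10                      # classifier that ignores minority classes
print(accuracy_score(y_true, y_pred))                              # 0.8, looks fine
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.30, not fine
```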
5. Conclusion
Efficient multimodality fusion with depth information in ME is a promising direction for future research.
Efficient multimodal fusion that incorporates depth information is an important direction for future ME research.