Paper Reading: Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network

hei_hei_hei_

Category: Paper Reading. Tags: deep learning, machine learning

Article link: https://blog.csdn.net/hei_hei_hei_/article/details/125706792

Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network

Overview

Published: ICCV 2019
Code: Controllable_XGating
Idea: use POS (part-of-speech) information to guide caption generation, and use a feature fusion network to fuse multimodal features.

Detailed Design

In the architecture figure, Ⓖ denotes the cross gating (CG) mechanism proposed in the paper and Ⓐ denotes soft attention.

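Gated Fusion Network: fuses the two kinds of features with the CG module to obtain the fused features x;
POS Sequence Generator: feeds x into an LSTM with soft attention to obtain the POS information;
Description Generator: finally feeds the POS information together with x into a two-layer LSTM, which decodes them into words.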

1. Gated Fusion Network

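Pretrained networks extract the video's content features and motion features. Each feature stream is then aggregated along the temporal dimension by an LSTM, and finally a CG module fuses the two.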

Encoding the extracted features with LSTMs
The CNNs extract content features $R = \{r_1, r_2, ..., r_m\}$ and motion features $F = \{f_1, f_2, ..., f_m\}$.

This stage yields the high-level content and motion features, i.e. the LSTM hidden states at every step.
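To make "the hidden states at every step" concrete, here is a minimal sketch of an LSTM run over per-frame features that keeps one hidden vector per frame. It uses only the Python standard library. The weights are random, and the feature sizes and helper names (`init_lstm`, `lstm_encode`) are assumptions for illustration, not taken from the paper or its code.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_lstm(input_dim, hidden_dim, seed=0):
    """Random weights for the four LSTM gates: input, forget, output, candidate."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {name: (mat(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
            for name in ("i", "f", "o", "g")}


def lstm_encode(features, params, hidden_dim):
    """Run an LSTM over per-frame features; return the hidden state of every step."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    hidden_states = []
    for x in features:
        z = list(x) + h  # concatenate the input with the previous hidden state
        pre = {name: [a + b for a, b in zip(matvec(W, z), bias)]
               for name, (W, bias) in params.items()}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        hidden_states.append(h)
    return hidden_states


# Toy stand-ins for CNN features: m = 5 frames, 4 dimensions each (made-up sizes).
rng = random.Random(1)
R = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
R_high = lstm_encode(R, init_lstm(input_dim=4, hidden_dim=3), hidden_dim=3)
print(len(R_high), len(R_high[0]))
```

The same function would be applied separately to the motion features, giving one sequence of hidden states per modality.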

Cross Gating

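Here $\bigoplus$ and $\bigotimes$ denote element-wise addition and multiplication, respectively.
Formula explanation: the cross-gated content and motion features are concatenated and then passed through a fully connected layer, giving the final fused features $X = \{x_1, x_2, ..., x_m\}$.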
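The cross gating equation itself is not reproduced in this note, so the sketch below is only one plausible reading of the description. It assumes each modality is gated by a sigmoid gate computed from the other modality, the gated result is added back to the original feature, and the two outputs are concatenated and passed through a fully connected layer. The element-wise gate weights, the function names, and all numbers are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def cross_gate(source, target, w, b):
    """Gate `target` with a sigmoid gate computed from `source`.

    Assumed form: target (+) (sigmoid(w * source + b) (x) target),
    where (+) and (x) are element-wise addition and multiplication.
    """
    gate = [sigmoid(wi * si + bi) for wi, si, bi in zip(w, source, b)]
    return [ti + gi * ti for ti, gi in zip(target, gate)]


def fuse(r, f, gate_params, fc_weight, fc_bias):
    """Cross-gate both modalities, concatenate, then apply a fully connected layer."""
    (w_r, b_r), (w_f, b_f) = gate_params
    r_gated = cross_gate(f, r, w_r, b_r)  # motion gates content
    f_gated = cross_gate(r, f, w_f, b_f)  # content gates motion
    concat = r_gated + f_gated            # list concatenation
    return [sum(w * v for w, v in zip(row, concat)) + bias
            for row, bias in zip(fc_weight, fc_bias)]


# Toy example with 2-dimensional features (all numbers are made up).
r_i = [1.0, -2.0]
f_i = [0.0, 0.0]
gates = (([1.0, 1.0], [0.0, 0.0]), ([1.0, 1.0], [0.0, 0.0]))
fc_w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # selects the gated content part
fc_b = [0.0, 0.0]
x_i = fuse(r_i, f_i, gates, fc_w, fc_b)
print(x_i)
```

With zero motion features the gate is 0.5 everywhere, so the gated content is 1.5 times the original content in this toy setup.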

2. POS Sequence Generator

Once again, an LSTM with soft attention generates the POS sequence.
In its formulation, $c_{t-1}$ denotes the POS tag predicted at the previous step, $E_{pos}$ is the POS tag embedding matrix, and $\phi_t$ denotes the soft attention.
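As a reminder of what soft attention computes, here is a minimal sketch: score each fused feature against a query (for example the previous hidden state), normalize the scores with a softmax, and take the weighted sum. The dot-product score is an assumption chosen for brevity; the note does not say which scoring function the paper uses, and the names below are hypothetical.

```python
import math


def soft_attention(query, features):
    """Return (weights, context) for dot-product soft attention.

    The dot-product score is an assumption; an additive (MLP-based)
    score is another common choice.
    """
    scores = [sum(q * x for q, x in zip(query, feat)) for feat in features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context


# Toy fused features X (3 steps, 2 dimensions) and a made-up query vector.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [0.5, -0.5]
weights, context = soft_attention(h_prev, X)
print(weights, context)
```

The weights always sum to 1, so the context vector is a convex combination of the fused features and changes at every decoding step as the query changes.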

The hidden state at the last step, $\psi = h_n^{(T)}$, contains the global POS information and is used to guide sentence generation.

3. Description Generator

The POS tag information is cross-fused with the embedding of the word generated at the previous step.
$X$ and $\bar{\psi}$ are then fed into a two-layer LSTM.

4. Training

First, the POS sequence generator is trained while the parameters of the description generator are frozen.
Once the POS sequence generator has converged, the description generator is trained.
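The two-stage schedule can be sketched as a pair of flags that are switched once the POS loss stops improving. Everything here is hypothetical: the module names, the convergence test, and in particular whether the POS generator keeps training in the second stage, which this note does not state (it is exposed as the `tune_pos_in_stage_2` parameter).

```python
def set_stage(modules, stage, tune_pos_in_stage_2=False):
    """Mark which modules receive gradient updates in a given training stage.

    modules: dict mapping a module name to a dict with a 'trainable' flag.
    """
    if stage == 1:
        modules["pos_generator"]["trainable"] = True
        modules["description_generator"]["trainable"] = False
    elif stage == 2:
        # Assumption: the note does not say whether the POS generator
        # continues to be updated in this stage.
        modules["pos_generator"]["trainable"] = tune_pos_in_stage_2
        modules["description_generator"]["trainable"] = True
    else:
        raise ValueError("stage must be 1 or 2")
    return modules


def has_converged(losses, patience=3, tol=1e-3):
    """Hypothetical test: no improvement larger than tol in the last `patience` epochs."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return min(losses[-patience:]) > best_before - tol


modules = {
    "pos_generator": {"trainable": False},
    "description_generator": {"trainable": False},
}
set_stage(modules, 1)
pos_losses = [2.0, 1.2, 0.9, 0.9, 0.9, 0.9]  # made-up per-epoch POS losses
if has_converged(pos_losses):
    set_stage(modules, 2)
print(modules)
```

In a real framework the `trainable` flag would map to whatever mechanism disables gradient updates for a module's parameters.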

Experiments

Ablation Studies

Different fusion strategies
Comparison studies
