Paper Reading: Deep Multimodal Speaker Naming

wlwchina

Category: face. Tags: paper-read, audio, face, deep-learn, CNN

Article link: https://blog.csdn.net/xmzwlw/article/details/47134193

http://herohuyongtao.github.io/research/publications/speaker-naming/
Problem description:

Speaker naming (SN): localizing and identifying each speaking character.

The difficulty of the problem lies in its multimodal nature.
Existing methods process each modality separately and then merge the results with handcrafted heuristics.

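This paper:
- Proposes a CNN-based learning framework that uses face and audio information jointly.
- Needs no face tracking, facial landmark localization, or subtitles/transcripts, and still achieves state-of-the-art performance.
- Trains end to end.
- Uses only cropped face regions and the corresponding audio.
- Runs in real time.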

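Architecture:
- Face feature extractor: a CNN whose final layer outputs a vector.
- Audio feature extractor: MFCC, which also produces a vector.
- The two vectors are concatenated and followed by several fully connected layers of gradually increasing dimension.
- The whole network is trained end to end.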
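
The fusion step above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the vector sizes (8, 4, 16, 32), the random weights, and the helper names `dense`, `random_layer`, and `fuse` are all hypothetical, chosen only to show concatenation followed by fully connected layers whose width grows.

```python
import random

def dense(x, weights, biases):
    """Fully connected layer followed by a rectifier (ReLU)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Hypothetical random parameters, used only to illustrate shapes."""
    weights = [[rng.gauss(0.0, 1.0) / n_in ** 0.5 for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.01] * n_out
    return weights, biases

def fuse(face_vec, audio_vec, layers):
    """Concatenate the two modality vectors, then apply the FC layers."""
    h = list(face_vec) + list(audio_vec)
    for weights, biases in layers:
        h = dense(h, weights, biases)
    return h

rng = random.Random(0)
face_vec = [rng.random() for _ in range(8)]    # stand-in for the CNN output
audio_vec = [rng.random() for _ in range(4)]   # stand-in for the MFCC vector
# Toy widths 12 -> 16 -> 32: the dimension increases layer by layer.
layers = [random_layer(12, 16, rng), random_layer(16, 32, rng)]
fused = fuse(face_vec, audio_vec, layers)
print(len(fused))  # 32
```

In a real system both extractors and the fused layers would share one loss so that gradients reach the face CNN as well, which is what end-to-end training refers to here.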

Experiment:

Three tasks:
- 1) face recognition (using both face and audio information)
- 2) identifying non-matched face-audio pairs
- 3) real-world SN

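Network configuration:
- 2 convolutional layers (15*15 and 5*4) and 2 pooling layers
- A 7*5 filter seems to be used between the last pooling layer and the fully connected layers (?)
- Number of feature maps in the two convolutional layers: 48 and 256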
- Two fully connected layers: 1024 and 2028 units (2028 as noted; 2048 may have been intended)
- pooling factor : 2
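
The 7*5 size in the list above can be checked with simple shape arithmetic. The sketch below assumes a 50*40 (height*width) input face, stride-1 convolutions without padding, and non-overlapping pooling; none of these three assumptions is stated in these notes, so treat the result as a plausibility check rather than a confirmed description of the network.

```python
def conv_valid(size, kernel):
    """Output (height, width) of a stride-1 convolution without padding."""
    return (size[0] - kernel[0] + 1, size[1] - kernel[1] + 1)

def pool(size, factor):
    """Output (height, width) of non-overlapping pooling."""
    return (size[0] // factor, size[1] // factor)

size = (50, 40)                     # ASSUMED input size, not from the notes
size = conv_valid(size, (15, 15))   # (36, 26)
size = pool(size, 2)                # (18, 13)
size = conv_valid(size, (5, 4))     # (14, 10)
size = pool(size, 2)                # (7, 5)
print(size)  # (7, 5)
```

Under these assumptions the last pooling layer outputs 7*5 feature maps, so a 7*5 filter covers each map completely and acts as the transition to the fully connected layers.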

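Initial parameter settings:
- All bias terms are set to 0.01 (to prevent dead units caused by rectifier units)
- All other parameters are drawn from a Gaussian in [-1, 1] and then scaled according to the number of hidden units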
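
One possible reading of this initialization is sketched below. The notes do not say how the Gaussian is restricted to [-1, 1] or what the scale factor is, so both the rejection sampling and the `1 / sqrt(n_in)` scaling are assumptions, and `init_layer` is a hypothetical helper name.

```python
import random

def init_layer(n_in, n_out, rng):
    """Initialize one layer: truncated Gaussian weights, constant biases."""
    scale = 1.0 / n_in ** 0.5          # ASSUMED scaling rule
    weights = []
    for _ in range(n_out):
        row = []
        while len(row) < n_in:
            w = rng.gauss(0.0, 1.0)
            if -1.0 <= w <= 1.0:       # ASSUMED: reject samples outside [-1, 1]
                row.append(w * scale)
        weights.append(row)
    biases = [0.01] * n_out            # small positive bias keeps ReLUs active
    return weights, biases

rng = random.Random(0)
weights, biases = init_layer(n_in=16, n_out=4, rng=rng)
print(len(weights), len(weights[0]), biases[0])  # 4 16 0.01
```

A positive bias makes each rectifier start with a non-zero output for typical inputs, so its gradient is not zero from the first update.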

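Network with audio added:
- During training, both feature extractors are initialized with pre-trained parameters.
- The audio corresponding to each frame uses a 20 ms window (each audio sample yields a 75-dimensional feature).
- For each face, 5 audio samples are randomly selected, so the audio feature has 5 * 75 = 375 dimensions.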
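
The 375-dimensional audio feature can be illustrated as follows. The candidate pool of 20 vectors, sampling without replacement, and the name `build_audio_feature` are assumptions for illustration; the random numbers stand in for real MFCC features.

```python
import random

AUDIO_DIM = 75         # feature dimension of one audio sample (from the notes)
SAMPLES_PER_FACE = 5   # audio samples drawn per face (from the notes)

def build_audio_feature(candidates, rng):
    """Pick 5 audio vectors at random and concatenate them."""
    chosen = rng.sample(candidates, SAMPLES_PER_FACE)  # ASSUMED: no replacement
    feature = []
    for vec in chosen:
        feature.extend(vec)
    return feature

rng = random.Random(0)
# Hypothetical pool of 20 audio vectors belonging to the same speaker.
candidates = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(20)]
audio_feature = build_audio_feature(candidates, rng)
print(len(audio_feature))  # 375
```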

Face recognition accuracy:
- Face only: 86.7%
- Face and audio: 88.5%
- Other methods: <70%

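Identifying non-matched pairs: three SVMs are compared for binary classification (matched / non-matched).
- 1024D feature from the face-audio model (82.2%)
- 1024D feature from the face-alone model + 75D audio feature (82.9%)
- The SVM actually adopted has the same setup as the second one, except that the 1024D face-alone feature is replaced by the 1024D face-audio feature (84.1%)

So the feature from the face-audio model, combined with the audio feature, gives the best performance at distinguishing non-matched pairs.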

Speaker Naming

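This is essentially a combination of the previous two experiments.
First, the SVM is used to discard non-matched pairs.
Then recognition is run on the remaining pairs.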
SN accuracy: Friends 90.5%, BBT 82.9%
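
The two-stage procedure can be sketched as a small pipeline. The functions `is_matched` and `recognize` below are toy stand-ins (hypothetical, based on string suffixes) for the trained SVM and the face-audio recognition network; only the control flow reflects the notes.

```python
def name_speakers(pairs, is_matched, recognize):
    """Drop non-matched face-audio pairs, then recognize the remaining ones."""
    named = []
    for face, audio in pairs:
        if not is_matched(face, audio):   # stage 1: SVM-style filter
            continue
        named.append(recognize(face, audio))  # stage 2: recognition
    return named

# Toy stand-ins: a pair "matches" when both names end with the same letter.
def is_matched(face, audio):
    return face[-1] == audio[-1]

def recognize(face, audio):
    return "speaker_" + face[-1]

pairs = [("face_a", "audio_a"), ("face_b", "audio_x"), ("face_c", "audio_c")]
result = name_speakers(pairs, is_matched, recognize)
print(result)  # ['speaker_a', 'speaker_c']
```

Filtering first means a visible but silent face paired with someone else's voice is not passed to the recognizer at all.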

@inproceedings{hu2015deep,
  title={{Deep Multimodal Speaker Naming}},
  author={Hu, Yongtao and Ren, Jimmy SJ. and Dai, Jingwen and Yuan, Chang and Xu, Li and Wang, Wenping},
  booktitle={Proceedings of the 23rd Annual ACM International Conference on Multimedia},
  pages={xxx--xxx},
  year={2015},
  organization={ACM}
}