Show and Tell: A Neural Image Caption Generator
Year: 2015
Target
Automatically describe the content of an image
Difficulty
Image captioning requires not only identifying the objects in an image, but also expressing the relationships between them
Inspiration
- In machine translation with Recurrent Neural Networks (RNNs), an "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which is in turn used as the initial hidden state of a "decoder" RNN that generates the target sentence
Contribution
- An end-to-end system for image captioning
Idea
- Replace the encoder from machine translation with a pre-trained CNN to extract image features, and use an LSTM as the decoder
- Word embeddings are used to represent the input words
- The loss function updates the encoder, decoder, and word embeddings jointly
- Beam search is used at inference time
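Beam search keeps the top-k partial captions ranked by cumulative log-probability, instead of greedily taking the most likely word at every step. A minimal sketch, assuming a hypothetical `step(seq)` function that stands in for one decoder step and returns `(token, probability)` continuations of a partial sequence:

```python
import math

def beam_search(step, start_token, end_token, beam_size=3, max_len=10):
    """Keep the `beam_size` highest-scoring partial captions.

    `step(seq)` is a stand-in for one LSTM decoding step: it returns
    a list of (token, probability) continuations for `seq`.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                finished.append((seq, score))  # beam is complete
            else:
                for tok, p in step(seq):
                    candidates.append((seq + [tok], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the k best
    # Collect beams that ended exactly at the length limit.
    for seq, score in beams:
        if seq[-1] == end_token:
            finished.append((seq, score))
    if not finished:
        finished = beams
    return max(finished, key=lambda c: c[1])[0]
```

With `beam_size=1` this reduces to greedy decoding; the paper reports using a beam of size 20.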
Model
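Shape-wise, the image feature from the CNN is fed once as the first LSTM input, subsequent inputs are embedded words, and at each step the hidden state is projected to a softmax over the vocabulary. A minimal NumPy sketch of one decoding step with random stand-in weights (the 4096-d feature size, the start-token id, and all parameter values here are assumptions for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, F = 1000, 512, 4096   # vocab size, embedding/LSTM dim, CNN feature dim (assumed)

# Random stand-in parameters; a real model learns these.
W_img = rng.normal(0, 0.01, (F, D))      # projects the CNN feature into the LSTM input space
E = rng.normal(0, 0.01, (V, D))          # word embedding table
W = rng.normal(0, 0.01, (2 * D, 4 * D))  # LSTM gates: [input, hidden] -> i, f, o, g
W_out = rng.normal(0, 0.01, (D, V))      # hidden state -> vocabulary logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One LSTM step (biases omitted for brevity)."""
    z = np.concatenate([x, h]) @ W
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

feat = rng.normal(size=F)                 # pretend CNN output for one image
h = c = np.zeros(D)
h, c = lstm_step(feat @ W_img, h, c)      # t = -1: the image is the first input
h, c = lstm_step(E[1], h, c)              # t = 0: <start> token (id 1, assumed)
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the vocabulary
```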
Loss function
In effect this is just the negative log-likelihood loss of classification over the vocabulary, summed over time steps
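Written out, for an image $I$ and caption $S = (S_1, \dots, S_N)$, the model minimizes the sum of negative log-probabilities of the correct word at each step:

```latex
L(I, S) = -\sum_{t=1}^{N} \log p_t\!\left(S_t \mid I, S_1, \dots, S_{t-1}\right)
```

where $p_t$ is the softmax over the vocabulary produced by the LSTM at step $t$.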
Evaluation Metrics
- subjective score
- BLEU
- perplexity
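Of these, perplexity is the simplest to compute: the exponential of the average per-word negative log-probability the model assigns to a reference caption. A small sketch (the probabilities below are made-up inputs, and the function name is my own):

```python
import math

def perplexity(token_probs):
    """Perplexity of one caption given the model's per-word
    probabilities: exp of the mean negative log-probability."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

Lower is better; a model that assigned probability 1 to every word would score a perplexity of 1.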
Training Details
- Use an ImageNet-pretrained CNN to reduce overfitting
- Initializing the word embeddings from a large text corpus had no clear effect on overfitting, so this was not done
- Dropout and ensembling are used
- SGD with a fixed learning rate and no momentum
- The CNN weights are kept fixed
- Embedding dimension is 512, and so is the LSTM memory size
- Caption preprocessing: keep only words that occur more than five times
Terminology
- NIC: Neural Image Caption
- BLEU: an n-gram precision metric comparing generated captions against reference captions
- perplexity: the geometric mean of the inverse per-word probability under the model