Paper Reading: DEEP CAPTIONING WITH MULTIMODAL RECURRENT NEURAL NETWORKS (M-RNN)

0. Summary

recurrent layer (deep RNN) + word embedding layers = language model part

Future improvements:

1. Use better deep neural networks to extract better word embedding matrices and image features, e.g. language model part: LSTM; vision part: SSD, YOLO.

2. Use pre-computed word vectors to initialize the two word embedding layers.

3. Explore more effective model architectures.

4. Use GPUs.

1. Research Objective

  • Three tasks:

    (1) generating novel sentences

    (2) retrieving images given a sentence

    (3) retrieving sentences given an image

  • Build a multimodal Recurrent Neural Network (m-RNN) for generating **novel** image captions.

  • The m-RNN model also achieves a significant performance improvement over SOTA methods on the retrieval tasks (retrieving images or sentences).

  • Project page of this work (Nearest Neighbor as Reference: A Simple Way to Boost the Performance of Image Captioning):

2. Background and Problems

  • Image captioning (previous work):

    Many previous methods treat it as a retrieval task: they learn a joint embedding to map the features of both sentences and images into the same semantic space, and generate image captions by retrieving them from a sentence database.

    Details:

    Image features: earlier works used deep models to extract global-level features; recent works use object-level image features based on object detection.

    Sentence features: dependency-tree Recursive Neural Networks.

    Embedding model: optimize a ranking cost to learn an embedding model, then use it to map both sentence features and image features into a common semantic feature space.

    Drawback:

    These methods lack the ability to generate novel sentences or to describe images that contain novel combinations of objects and scenes.

  • Three categories of methods for generating novel sentence descriptions for images:

    Method 1:

    parse the sentence -> divide it into several parts -> associate each part with an object (or attribute) in the image.

    Model: Conditional Random Field or Markov Random Field.

    Method 2:

    retrieve similar captioned images from the training data -> generalize and re-compose the retrieved captions -> generate new descriptions.

    Method 3 (our model):

    learn a probability density over the joint space of sentences and images; the probability of generating a sentence with the model can serve as the affinity metric for retrieval.

    Model: an RNN, which can store context information in a recurrent layer.

    Contributions of our model:

    (1) Incorporate a two-layer word embedding system in the m-RNN network structure, which learns the word representation more efficiently than a single-layer word embedding.

    (2) Do not use the recurrent layer to store the visual information: the image representation is fed into the m-RNN model along with every word in the sentence description. This allows SOTA performance with a relatively low-dimensional recurrent layer.

3. Method(s)

3.1 Model architecture

[Figure: the m-RNN model architecture]

The whole m-RNN model contains a language model part, a vision part and a multimodal part. The language model part learns a dense feature embedding for each word in the dictionary and stores the semantic temporal context in recurrent layers. The vision part contains a deep Convolutional Neural Network (CNN) which generates the image representation. The multimodal part connects the language model and the deep CNN together by a one-layer representation.

  • Two word embedding layers: randomly initializing these two layers and learning them from the training data can also produce SOTA results (other works use pre-computed word embedding vectors to initialize their models). See the first sketch after this list.

    Output: the word embedding vector at time $t$, denoted $w(t)$ (256-dim).

  • Recurrent layer:

    Input (time $t$): $w(t)$ and $r(t-1)$.

    Calculation: $r(t) = f_2(U_r\cdot r(t-1)+w(t))$

    Parameter: $U_r$ maps $r(t-1)$ into the same vector space as $w(t)$.

    $f_2(\cdot)$: the Rectified Linear Unit (ReLU).

    $+$: element-wise addition.

    Output (time $t$): $r(t)$ (256-dim).

  • Multimodal layer: connects the language model part and the vision part of the m-RNN model. See the second sketch after this list.

    Three inputs:

    $w(t)$ from the word embedding layer, $r(t)$ from the recurrent layer, and $I$, the image representation from AlexNet or VGGNet.

    Calculation: $m(t)=g_2(V_m\cdot w(t)+V_r\cdot r(t)+V_I\cdot I)$

    $V_m$, $V_r$, $V_I$: each of these parameter matrices can be seen as a mapping from its original space into the multimodal space.

    $g_2(\cdot)$ is the element-wise scaled hyperbolic tangent function:

    $g_2(x)=1.7159\cdot \tanh(\frac{2}{3}x)$

    Output: $m(t)$ (512-dim).

  • Softmax layer: generates the probability distribution of the next word.

    Output: a probability vector whose dimension equals the vocabulary size $M$, which differs across datasets.
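
For concreteness, here is a minimal NumPy sketch of the language model part above (the two word embedding layers plus the ReLU recurrent step). The vocabulary size, the first embedding width, and the initializations are illustrative assumptions; only the 256-dim sizes of $w(t)$ and $r(t)$ come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10000          # vocabulary size: an assumption, it is dataset dependent
d1, d = 128, 256   # d1 (first embedding width) is assumed; d = 256 is from the text

# Two stacked word embedding layers, randomly initialized and learned end-to-end.
E1 = rng.normal(scale=0.01, size=(M, d1))   # embedding layer 1: one row per word
W2 = rng.normal(scale=0.01, size=(d1, d))   # embedding layer 2: projects to 256 dims
U_r = rng.normal(scale=0.01, size=(d, d))   # maps r(t-1) into the space of w(t)

def embed(word_index: int) -> np.ndarray:
    """w(t): the 256-dim embedding of the word fed in at time t."""
    return E1[word_index] @ W2

def recurrent_step(w_t: np.ndarray, r_prev: np.ndarray) -> np.ndarray:
    """r(t) = f2(U_r . r(t-1) + w(t)), with f2 = ReLU and '+' element-wise."""
    return np.maximum(U_r @ r_prev + w_t, 0.0)
```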
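
Continuing that sketch (reusing `rng`, `d`, and `M` from it), the multimodal layer fuses $w(t)$, $r(t)$, and the image feature $I$, and the softmax layer turns $m(t)$ into a distribution over the vocabulary. The 4096-dim image feature is an assumption matching the fc7 layer of AlexNet/VGGNet; the 512-dim multimodal layer and the size-$M$ output are from the text.

```python
d_I, d_m = 4096, 512                           # image feature dim (assumed), multimodal dim
V_m = rng.normal(scale=0.01, size=(d_m, d))    # word embedding  -> multimodal space
V_r = rng.normal(scale=0.01, size=(d_m, d))    # recurrent state -> multimodal space
V_I = rng.normal(scale=0.01, size=(d_m, d_I))  # image feature   -> multimodal space
W_o = rng.normal(scale=0.01, size=(M, d_m))    # multimodal -> vocabulary logits

def g2(x: np.ndarray) -> np.ndarray:
    """Element-wise scaled tanh: g2(x) = 1.7159 * tanh(2x/3)."""
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())                    # shift logits for numerical stability
    return e / e.sum()

def next_word_distribution(w_t, r_t, img_feat):
    """m(t) = g2(V_m w(t) + V_r r(t) + V_I I); returns P(next word | history, image)."""
    m_t = g2(V_m @ w_t + V_r @ r_t + V_I @ img_feat)
    return softmax(W_o @ m_t)                  # length-M probability vector
```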

3.2 Training

Cost function: a log-likelihood cost function.

  • Perplexity: a standard measure for evaluating language models.

    $\log_2 PPL(w_{1:L}|I)=-\frac{1}{L}\sum\limits_{n=1}^{L}\log_2 P(w_n|w_{1:n-1},I)$

    $L$: length of the word sequence.

    $PPL(w_{1:L}|I)$: perplexity of the sentence $w_{1:L}$ given the image $I$.

    $P(w_n|w_{1:n-1},I)$: the probability of generating the word $w_n$ given $I$ and the previous words $w_{1:n-1}$; it corresponds to the activation of the softmax layer of our model.

  • Cost function: the average log-likelihood of the words plus a regularization term:

    $C=\frac{1}{N}\sum\limits_{i=1}^{N_s}L_i\cdot\log_2 PPL(w_{1:L_i}^{(i)}|I^{(i)})+\lambda_{\theta}\cdot\parallel\theta\parallel_2^{2}$

    $N_s$: number of sentences. $N$: number of words. $L_i$: length of the $i^{th}$ sentence. $\theta$: the model's parameters.

  • Training objective: minimize the cost function $C$.
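
A minimal sketch of both quantities, assuming the per-word softmax probabilities $P(w_n|w_{1:n-1},I)$ have already been collected for each sentence; the regularization weight `lam` is an illustrative assumption:

```python
import numpy as np

def log2_ppl(word_probs: np.ndarray) -> float:
    """log2 PPL(w_{1:L} | I) = -(1/L) * sum_n log2 P(w_n | w_{1:n-1}, I)."""
    return float(-np.mean(np.log2(word_probs)))

def cost(prob_lists, params, lam=1e-4):        # lam (lambda_theta) is an assumed value
    """C: length-weighted average log2-perplexity over all N words,
    plus L2 regularization on the model parameters theta."""
    N = sum(len(p) for p in prob_lists)                    # total word count
    data_term = sum(len(p) * log2_ppl(np.asarray(p))       # L_i * log2 PPL(. | I_i)
                    for p in prob_lists) / N
    reg = lam * sum(float(np.sum(th ** 2)) for th in params)  # ||theta||_2^2 term
    return data_term + reg
```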

Vision part: AlexNet or VGGNet, pre-trained on the ImageNet dataset.

Language model part: randomly initialized.

Deep learning platform: Baidu's PADDLE. The m-RNN model takes 25 ms on average to generate a sentence on a single CPU core on Flickr8K.

Datasets: IAPR TC-12, Flickr8K, Flickr30K, MS COCO.

4. Evaluation

Sentence generation:

  • sentence perplexity
  • BLEU scores (B-1, B-2, B-3, B-4)

Sentence & image retrieval:

  • R@K (K = 1, 5, 10)
  • Med r
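
A minimal sketch of the retrieval metrics, assuming `ranks[i]` holds the 1-based rank of the first ground-truth item retrieved for query `i` (the names and example data are illustrative):

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """R@K: fraction of queries whose ground truth appears in the top K results."""
    return float(np.mean(ranks <= k))

def median_rank(ranks: np.ndarray) -> float:
    """Med r: median rank of the first retrieved ground-truth item (lower is better)."""
    return float(np.median(ranks))

ranks = np.array([1, 3, 12, 2, 7])                 # illustrative example
print(recall_at_k(ranks, 5), median_rank(ranks))   # -> 0.6 3.0
```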

5. Conclusion

The m-RNN model consists of a deep RNN and a deep CNN; these two sub-networks interact with each other in a multimodal layer. It performs at the SOTA level on three tasks: sentence generation, sentence retrieval given a query image, and image retrieval given a query sentence. The model is powerful at connecting images and sentences, and is flexible enough to incorporate more complex image representations and more sophisticated language models.
