Paper Reading - Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation ( CVPR 20...

Original link: http://www.cnblogs.com/zlian2016/p/9487548.html

Link of the Paper: https://ieeexplore.ieee.org/document/7298856/

A Related Paper: Learning a Recurrent Visual Representation for Image Caption Generation (Link of the Paper: https://arxiv.org/abs/1411.5654)

Main Points:

A bi-directional mapping model using recurrent neural networks: previous approaches map both sentences and images to a common embedding (and then, I guess, compute a similarity to match or rank them), which may be used for image search or for ranking image captions; this model instead maps directly between images and sentences in both directions.
A bi-directional representation: generates both novel descriptions from images and visual representations from descriptions.
A novel recurrent visual memory: automatically learns to remember long-term visual concepts.
A set of latent variables U_{t-1} that encodes the visual interpretation of the previously generated or read words W_{t-1}. Using U, the goal is to compute P(w_t | V, W_{t-1}, U_{t-1}) and P(V | W_{t-1}, U_{t-1}). Combining these two likelihoods, the global objective is to maximize P(w_t, V | W_{t-1}, U_{t-1}) = P(w_t | V, W_{t-1}, U_{t-1}) P(V | W_{t-1}, U_{t-1}). That is, the model maximizes the likelihood of the word w_t and the observed visual features V given the previous words and their visual interpretation. Note that in previous papers, the objective was only to compute P(w_t | V, W_{t-1}) and not P(V | W_{t-1}).
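
The factorization in the last point is just the chain rule, and a toy numeric check may make it more concrete. The sketch below is not the paper's model: it assumes a single fixed history (W_{t-1}, U_{t-1}), a two-word vocabulary, and a V discretized into two made-up states, with invented probabilities. The names `joint`, `p_visual`, `p_word_given_visual` and `log_objective` are hypothetical.

```python
import math

# Hypothetical joint P(w_t, V | W_{t-1}, U_{t-1}) for one fixed history.
# The vocabulary, the two visual states, and the numbers are all made up.
joint = {
    ("dog", "v_a"): 0.40,
    ("dog", "v_b"): 0.10,
    ("cat", "v_a"): 0.15,
    ("cat", "v_b"): 0.35,
}


def p_visual(v):
    """P(V | W_{t-1}, U_{t-1}): marginalize the joint over words."""
    return sum(p for (_, vis), p in joint.items() if vis == v)


def p_word_given_visual(w, v):
    """P(w_t | V, W_{t-1}, U_{t-1}): condition the joint on V."""
    return joint[(w, v)] / p_visual(v)


def log_objective(w, v):
    """log P(w_t, V | ...) as the sum of the two log-likelihood terms."""
    return math.log(p_word_given_visual(w, v)) + math.log(p_visual(v))


# The product of the two factors recovers the joint for every pair.
for (w, v), p in joint.items():
    assert math.isclose(math.exp(log_objective(w, v)), p)
```

Written in log form, the objective splits into a word-prediction term and a visual-reconstruction term, which is why dropping the second term gives back the P(w_t | V, W_{t-1}) objective of earlier work.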

Other Key Points:

While previous approaches project both semantics and visual features to a common embedding, they are not able to perform the inverse projection. That is, they cannot generate novel sentences or visual depictions from the embedding.
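
To illustrate what a common embedding can and cannot do, here is a minimal sketch of caption ranking by cosine similarity. It assumes the image and each candidate caption have already been projected into the same space; the 3-dimensional vectors, the caption strings, and the helper names `cosine` and `rank_captions` are all hypothetical and not taken from any of the cited papers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Sort the candidate captions from most to least similar to the image."""
    return sorted(
        caption_vecs,
        key=lambda caption: cosine(image_vec, caption_vecs[caption]),
        reverse=True,
    )


# Made-up embeddings in a shared 3-dimensional space.
image_vec = [0.9, 0.1, 0.3]
caption_vecs = {
    "a dog runs on grass": [0.8, 0.2, 0.4],
    "a plate of food": [0.1, 0.9, 0.2],
    "a man rides a bike": [0.3, 0.3, 0.9],
}

ranking = rank_captions(image_vec, caption_vecs)
print(ranking[0])  # prints "a dog runs on grass"
```

The ranking can only reorder captions that already exist in `caption_vecs`. Nothing here maps a point in the embedding back to words or to visual features, which is the missing inverse projection.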

posted on 2018-08-16 15:24 by LZ_Jaja
