Karpathy, Andrej, and Li Fei-Fei. “Deep visual-semantic alignments for generating image descriptions.” Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2015. (Citations: 505).
1 Motivation
Sentences written by people frequently refer to particular, but unspecified, locations in the image. We want to generate sentences for individual image regions.
2 Pipeline
See Fig. The inputs are either the whole raw image or bounding-box regions computed by R-CNN.
Before
Now
Where v is the fc7 feature vector of the image representation, and W_i v is the embedding of each image. Words are represented by word2vec embeddings, which project each word directly into the space of h.
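The embedding step can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the paper's implementation. The 4096-dimensional fc7 size matches AlexNet/VGG-style CNNs. The word2vec size (300), the size of the space of h (16, kept small here), the random weights, and the helpers random_matrix and matvec are all hypothetical. The notes only say that words are projected into the space of h, so the linear map W_w below is an assumed form of that projection.

```python
import random

FC7_DIM = 4096  # fc7 output size in AlexNet/VGG-style CNNs
WORD_DIM = 300  # assumed word2vec dimensionality
H_DIM = 16      # assumed size of the space of h (small, for illustration)

random.seed(0)


def random_matrix(rows, cols, scale=0.01):
    """Hypothetical stand-in for learned weights: small Gaussian entries."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    """Matrix-vector product W x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# W_i maps fc7 features into the space of h; W_w (assumed) does the same for words.
W_i = random_matrix(H_DIM, FC7_DIM)
W_w = random_matrix(H_DIM, WORD_DIM)

# Placeholder inputs: v stands for fc7 features, x for one word2vec vector.
v = [random.gauss(0.0, 1.0) for _ in range(FC7_DIM)]
x = [random.gauss(0.0, 1.0) for _ in range(WORD_DIM)]

image_embedding = matvec(W_i, v)  # W_i v
word_embedding = matvec(W_w, x)   # word projected into the same space
```

Both vectors end up with H_DIM entries, so image and word representations can be combined or compared in the same space.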