Cross-modal retrieval aims to retrieve relevant items whose modality differs from that of the query, for example retrieving images from a text query.
Four Challenges:
1. representation
2. translation
3. alignment
4. co-learning
Challenge: The main challenge is to measure the similarity between data of different modalities.
Approach: map images and texts into a shared latent space F in which they can be compared.
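The shared-space idea can be sketched as follows. This is only a minimal illustration, assuming linear projections and cosine similarity; the names `W_img` and `W_txt`, the feature sizes (512, 300, 64), and the random weights are hypothetical placeholders, not something these notes specify. In practice the projections would be learned.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def embed(features, projection):
    """Project modality-specific features into the shared space F."""
    return l2_normalize(features @ projection)

def retrieve(query_emb, gallery_emb, k=3):
    """Return indices of the k gallery items most similar to the query."""
    scores = gallery_emb @ query_emb  # cosine similarity, since inputs are unit length
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
W_img = rng.normal(size=(512, 64))  # hypothetical image projection (untrained)
W_txt = rng.normal(size=(300, 64))  # hypothetical text projection (untrained)

image_feats = rng.normal(size=(10, 512))  # 10 stand-in image feature vectors
text_feat = rng.normal(size=(300,))       # 1 stand-in text feature vector

image_emb = embed(image_feats, W_img)
text_emb = embed(text_feat, W_txt)
top = retrieve(text_emb, image_emb, k=3)  # text-to-image retrieval
```

Because both modalities end up as unit vectors in the same 64-dimensional space, a single dot product serves as the cross-modal similarity.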
Two strategies for alignment:
1) global alignment methods, which map each modal manifold into F such that semantically similar regions share the same directions in F;
2) local metric learning approaches, which map each modal manifold such that semantically similar items lie close together in F.
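As an illustration of the second strategy only, one common way to pull semantically similar items together is a triplet (hinge) loss. The notes do not name a specific loss, so this is an assumed example; the `margin` value and the toy 2-D embeddings are arbitrary.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the positive is closer than the negative by `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg)

# Toy embeddings in F: an image anchor, a matching caption, a non-matching caption.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])

loss_ok = triplet_loss(anchor, positive, negative)   # matching pair is already closer
loss_bad = triplet_loss(anchor, negative, positive)  # roles swapped: constraint violated
```

Minimizing such a loss over many triplets shortens distances between matching cross-modal pairs while pushing non-matching pairs apart.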
Multimodal alignment faces a number of difficulties:
1) there are few datasets with explicitly annotated alignments;
2) it is difficult to design similarity metrics between modalities;
3) there may exist multiple possible alignments, and not all elements in one modality have correspondences in another.
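The third difficulty can be made concrete with a small sketch. It assumes a similarity matrix between elements of two modalities is already available and uses a fixed threshold to decide correspondences; the matrix values, the threshold, and the helper name `align` are all hypothetical.

```python
import numpy as np

def align(sim, threshold=0.5):
    """For each row element, list the column elements whose similarity meets the threshold."""
    return [np.flatnonzero(row >= threshold).tolist() for row in sim]

# Rows: elements of one modality (e.g. words); columns: elements of the other (e.g. image regions).
sim = np.array([
    [0.9, 0.1, 0.7],  # plausibly aligned with two regions
    [0.2, 0.3, 0.1],  # no region is similar enough
    [0.1, 0.8, 0.2],  # exactly one correspondence
])

matches = align(sim)
print(matches)  # [[0, 2], [], [1]]
```

A strict one-to-one matching would be forced to discard one of the two valid regions in the first row and to invent a match for the second, which is why alignment methods generally need to allow many-to-many and empty correspondences.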