1. CLIP

Learning Transferable Visual Models From Natural Language Supervision
CLIP uses contrastive learning to train a multimodal vision-language model.
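
To make the contrastive objective concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is an illustration under stated assumptions, not the official CLIP implementation: the function names (`l2_normalize`, `cross_entropy_diagonal`, `clip_contrastive_loss`) are hypothetical, the embeddings are toy vectors standing in for encoder outputs, and the temperature is fixed at 0.07 for simplicity, whereas a real training setup would typically learn it.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature value is an assumption fixed here for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to get the text-to-image direction
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image losses
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: two image embeddings and their texts (correctly paired vs. swapped)
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[2.0, 0.0], [0.0, 3.0]]
swapped_texts = [[0.0, 3.0], [2.0, 0.0]]

aligned_loss = clip_contrastive_loss(images, aligned_texts)
swapped_loss = clip_contrastive_loss(images, swapped_texts)
print(aligned_loss < swapped_loss)
```

In this sketch the loss is low when each image is most similar to its own text and high when the pairing is swapped, which is the signal that pushes the two encoders toward a shared embedding space.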
2. Segmentation
LSeg (language-driven semantic segmentation)

GroupViT (semantic segmentation emerges from text supervision)
