一、CLIP
![](https://img-blog.csdnimg.cn/img_convert/ee07f8e7dbd496fe0bf9d67003554db9.png)
Learning Transferable Visual Models From Natural Language Supervision
CLIP是用对比学习的方式去训练一个视觉-语言的多模态模型
二、分割
LSeg (language-driven semantic segmentation)
![](https://img-blog.csdnimg.cn/img_convert/655a1d8675fc579f74afd9dd4f98a58e.png)
GroupViT (semantic segmentation emerges from text supervision)
![](https://img-blog.csdnimg.cn/img_convert/e86efa3f2b7258d8547919ee56cd8cda.png)
Learning Transferable Visual Models From Natural Language Supervision
CLIP是用对比学习的方式去训练一个视觉-语言的多模态模型