Image Recognition with Transformers: Fundamentals and Practice

Author: 幻兽帕鲁

Last modified: 2024-05-03 16:43:50

Category: Large Model Learning (大模型学习). Tags: transformer, deep learning, artificial intelligence

First published: 2024-05-03 16:43:10

Article link: https://blog.csdn.net/m0_49134108/article/details/138417463

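Large models in the text domain, such as ChatGPT, are already very good at understanding text. So how do we get a large model that can not only "speak" but also "see", with multimodal ability across images and text?

Multimodal algorithms have been studied for a long time, but the large-model era brought a qualitative leap: it showed that a single architecture, the Transformer, can be used across modalities. For reasons of space, I only cover one important paper on multimodal understanding here: CLIP.

(I only just realized this paper comes from OpenAI. OpenAI's work really does tend to have a big impact.)

OpenAI official link: https://openai.com/index/clip

Paper: https://arxiv.org/pdf/2103.00020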

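Author: 小小将
Link: https://www.zhihu.com/question/438649654/answer/1669790824
Source: Zhihu
Copyright belongs to the author. For commercial reuse, contact the author for authorization; for non-commercial reuse, please credit the source.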

I think this work is really nice. As OpenAI puts it, deep learning has been very successful in CV, but:

typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts (labeling data is just too hard);
standard vision models are good at one task and one task only, and require significant effort to adapt to a new task (models excel at a single task but are hard to transfer to new ones);
and models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision. (Truly o(╥﹏╥)o: generalization and robustness are both worrying.)

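OpenAI's new work, CLIP, is aimed at these problems, and the idea looks simple. In short, CLIP is trained to match the semantic features that a Text Encoder extracts from text with the semantic features that an Image Encoder extracts from images: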

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image. (This zero-shot trick is genius.)
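To make the zero-shot step concrete for myself, here is a minimal sketch of the "pick the caption that best pairs with the image" logic. It assumes the embeddings are already computed: the 3-dimensional vectors below are made-up numbers, not real CLIP outputs, and `zero_shot_classify` and the `temperature` value are names and settings I chose for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, caption_embs, temperature=0.07):
    """Return (best_index, probabilities) for one image against candidate captions."""
    img = normalize(image_emb)
    sims = []
    for cap in caption_embs:
        cap = normalize(cap)
        # Cosine similarity, sharpened by an assumed temperature.
        sims.append(sum(a * b for a, b in zip(img, cap)) / temperature)
    probs = softmax(sims)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Each class name is turned into a caption, as in the quoted passage.
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
# Hypothetical embeddings; a real system would get these from the two encoders.
caption_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

best, probs = zero_shot_classify(image_emb, caption_embs)
print(captions[best])  # prints "a photo of a dog"
```

No classifier head is trained for the new label set here: changing the list of captions is enough to change the set of classes.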

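I am quoting a Zhihu user's analysis directly here. In short, CLIP matches text embeddings with image embeddings. As a result, the CV field may no longer need elaborate annotation, or large-scale supervised training, for a model to learn what image representations mean.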
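The matching of text and image embeddings can be sketched as a symmetric contrastive loss over a batch of pairs, which is how I understand the training objective described in the CLIP paper. This is a toy version under my own assumptions: the 2-dimensional embeddings are invented, the temperature is fixed rather than learned, and the function names are mine.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match text i, and text i should match image i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] is the scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image -> text direction: row i should put its mass on column i.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text -> image direction: column j should put its mass on row j.
    loss_t2i = sum(cross_entropy_row([logits[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
# Texts aligned with their images give a low loss ...
matched = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
# ... while the same texts in the wrong order give a high loss.
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)  # prints True
```

The only supervision in this sketch is which text came with which image, which is why the pairing itself can stand in for manual class labels.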

For the hands-on part, I followed the section on fine-tuning image understanding in the Llama3 tutorial published by SmartFlowAI.

The tutorial document is here:

Llama3-Tutorial/docs/llava.md at main · SmartFlowAI/Llama3-Tutorial · GitHub

It looks like I skipped a piece: I also need to understand the image projector in the LLaVA architecture:

Paper link: https://arxiv.org/pdf/2304.08485

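So does this image projector also project images into the text space?

What is the difference between an image projector pretrained on top of Llama3 and the pretrained weights released by the authors of the paper above?

And once we have these pretrained weights, do we still need CLIP's matching of text embeddings and image embeddings?

I don't really know how to answer these questions.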
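On the first question, my reading of the LLaVA paper linked above is that a trainable projection maps the visual encoder's features into vectors with the same dimension as the language model's word embeddings, so they can be fed to the LLM next to the text tokens. Below is a minimal sketch of that idea only. The dimensions are tiny made-up numbers, the weights are random, and `make_projection` and `project` are hypothetical helpers, not code from LLaVA or xtuner.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create a random in_dim x out_dim matrix standing in for the trainable projector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]

def project(patch_features, weight):
    """Map each patch feature (length in_dim) to a vector of length out_dim."""
    out_dim = len(weight[0])
    return [[sum(f * weight[k][j] for k, f in enumerate(feat)) for j in range(out_dim)]
            for feat in patch_features]

# Assumed toy sizes; real models use far larger dimensions and many more patches.
VISION_DIM, TEXT_DIM, NUM_PATCHES = 4, 6, 3

rng = random.Random(1)
# Stand-in for the frozen visual encoder's output: one feature vector per image patch.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
# Stand-in for the embeddings of two text tokens from the language model.
text_token_embs = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(2)]

W = make_projection(VISION_DIM, TEXT_DIM)
image_tokens = project(patch_features, W)

# After projection, image "tokens" and text tokens share one dimension,
# so they can be concatenated into a single input sequence.
llm_input = image_tokens + text_token_embs
print(len(llm_input), len(llm_input[0]))  # prints "5 6"
```

In this picture the projector does not replace the visual encoder: the encoder still produces the features, and the projector only adapts them to one particular language model's embedding space.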

If any readers come across this post, would you leave a comment?

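During training, the weights of llama3-8b-instruct and of CLIP's visual encoder are kept frozen and a LoRA weight is trained, which appears to be loaded on the LLaVA side. (This seems to follow LLaVA's instruction fine-tuning structure, and it is a lot like the fine-tuning method for the Taiyi Chinese text-to-image model I used before.)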
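To check my understanding of "freeze the base weights, train a LoRA weight", here is a minimal sketch of a single linear layer with a low-rank update. It is a toy under my own assumptions: the class `LoRALinear`, the rank, and `alpha` are illustrative choices, and this is not how xtuner implements it.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

class LoRALinear:
    """y = x W + (alpha / rank) * x A B, where W is frozen and only A, B are trained."""

    def __init__(self, frozen_weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        in_dim, out_dim = len(frozen_weight), len(frozen_weight[0])
        self.W = frozen_weight  # frozen base weight, never updated
        # A starts small and random, B starts at zero, so the update is initially zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(in_dim)]
        self.B = [[0.0] * out_dim for _ in range(rank)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matmul(x, self.W)
        delta = matmul(matmul(x, self.A), self.B)
        return [[b + self.scale * d for b, d in zip(brow, drow)]
                for brow, drow in zip(base, delta)]

    def trainable_params(self):
        """Count only the low-rank factors; W is excluded because it is frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

DIM = 8
rng = random.Random(42)
W = [[rng.random() for _ in range(DIM)] for _ in range(DIM)]
x = [[rng.random() for _ in range(DIM)]]

layer = LoRALinear(W, rank=2)
# Before any training step, the layer reproduces the frozen layer's output.
print(layer.forward(x) == matmul(x, W))   # prints True
print(layer.trainable_params(), DIM * DIM)  # prints "32 64"
```

With a realistic hidden size the gap between the trainable low-rank factors and the frozen matrix is much larger than in this 8x8 toy, which is what makes fine-tuning an 8B model affordable.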

Pretrained model

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/pretrain_iter_2181_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

At this point we can see that the pretrained model only produces labels for the image and cannot answer questions.

Fine-tuned model

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/iter_1200_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

After fine-tuning, the model can now answer our questions based on the image.
