RAG增强的视觉问答开发框架

最新推荐文章于 2025-04-13 01:12:20 发布

新缸中之脑

最新推荐文章于 2025-04-13 01:12:20 发布

阅读量1.2k

点赞数 15

文章标签： RAG

本文链接：https://blog.csdn.net/shebao3333/article/details/141779740

版权

检索增强生成 (RAG) 是一种强大的技术，可以提高大型语言模型 (LLM) 生成的答案的准确性和可靠性。它还提供了检查模型在特定生成过程中使用的源的可能性，从而使人类用户更容易进行事实核查。此外，RAG 可以使模型知识保持最新状态并整合特定主题的信息，而无需进行微调。总体而言，RAG 提供了许多好处和很少的缺点，并且其工作流程易于实施。正因为如此，它已成为许多需要最新和/或专业知识的 LLM 用例的首选解决方案。

生成式人工智能领域的一些最新发展专注于扩展流行的转换器架构以处理多种输入和/或输出模式，试图复制 LLM 的巨大成功。目前已有多种模型（包括开源和闭源模型）展示了处理多种模式的卓越能力。一种流行的多模态设置，也是最先被解决的问题之一，是视觉语言模型 (VLM)，随着 LLaVA、Idefics 和 Phi-vision 等小型但功能强大的模型的发布，它已经看到了有趣的开源贡献。

为多模态模型设计 RAG 系统比纯文本情况更具挑战性。事实上，LLM 的 RAG 系统设计已经很成熟，并且对一般工作流程有一些共识，因为许多最近的发展都侧重于提高准确性、可靠性和可扩展性，而不是从根本上改变 RAG 架构。另一方面，多模态开辟了多种检索相关信息的方式，因此，可以做出几种不同的架构选择，每种都有自己的优点和缺点。例如，可以使用多模态嵌入模型为不同模态创建共享向量空间，或者选择仅将信息固定在一种模态中。

在这篇博文中，我将讨论一个简单的框架，将 RAG 扩展到视觉语言模型 (VLM)，重点关注视觉问答任务。该方法的核心思想是利用 VLM 理解文本和图像的功能来生成合适的搜索查询，该查询将用于在回答用户的提示之前检索外部信息。

我还将提供一个实用教程，介绍如何实施该框架以使 Phi-3.5-vision 能够访问维基百科信息，讨论实施的要点并展示一些示例。我将在下面的 Git Hub 存储库中分享的完整代码中介绍详细信息。

NSDT工具推荐： Three.js AI纹理开发包 - YOLO合成数据生成器 - GLTF/GLB在线编辑 - 3D模型格式在线转换 - 可编程3D场景编辑器 - REVIT导出3D模型插件 - 3D模型语义搜索引擎 - AI模型在线查看 - Three.js虚拟轴心开发包 - 3D模型在线减面 - STL模型在线切割

1、用于视觉问答的 RAG

在本节中，我将描述介绍中提到的框架的一般工作流程。为了便于说明，我将讨论只有一个用户对一张图片提出提示的情况。例如，对于简单的视觉问答 (VQA) 任务就是这种情况。该方法可以直接推广到多个提示和图像，但流程将变得更加复杂并引入进一步的复杂性。此外，我将仅考虑外部数据仅由文本文档组成的情况。使用多模态嵌入模型进行检索，或者更一般地说，使用多模态搜索引擎，也可以将图像包含在外部数据中。

对于通常的 RAG 工作流程，框架工作流程可以分为两部分：检索相关的外部信息和以提供的外部数据为条件的生成。

在检索阶段，目标是从外部文本文档中检索一些可以提供有用信息来回答用户提示的段落。为了有效地做到这一点，我们必须确保检索到的段落与提供的图像、提示以及更重要的是两者之间的关系相关。事实上，即使检索到的文档包含有关图像的信息，它们也可能不包含为用户提示提供答案所需的具体信息。另一方面，提示只有与它所指的图像配对时才能被正确理解。为了应对这些挑战，本文讨论的框架利用多模态模型来生成适当的搜索查询，该查询经过量身定制，以捕获在提供的图像上下文中回答用户提示所需的信息。搜索引擎将使用生成的查询从外部数据中检索相关信息。

更详细地说，多模态模型接收用户的提示和图像作为输入，并负责创建与它们整体相关的搜索查询。这个过程可以看作是查询转换的一个特例，旨在考虑问题的多模态性质。实际上，该模型将用户的提示转换为搜索查询，同时还考虑其所指的图像。

与其他分别处理每个输入模态的方法（例如使用多模态嵌入模型进行检索或使用生成的图像标题/描述进行语义相似性）相比，这种方法的优势在于它可以更有效地捕获提示和图像之间的关系。

检索阶段的流程图如下所示：

在检索过程中，模型的任务是创建一个查询，搜索引擎将使用该查询来检索相关段落

生成阶段与标准的纯文本 RAG 工作流程非常相似，唯一的区别是，除了提示和检索到的段落之外，模型还会在其上下文中接收图像。此过程如下所示：

在生成过程中，模型接收检索到的段落、提示和图像作为输入

2、使用维基百科增强 Phi-3.5-vision 的能力

在本节中，我将提供实用指南，介绍如何应用讨论的框架通过允许多模态模型访问维基百科来增强多模态模型。我选择了 Phi-3.5-vision 模型，因为它是一个非常强大但轻量级的开源视觉语言模型。

在本节中，我将仅讨论实现的一般方面，将细节留给 GitHub repo 中提供的代码。

2.1 检索

检索阶段的目标是从维基百科收集一些可以提供有用信息的段落，以回答用户关于图像的问题。在代码实现中，我使用 Python 包 wikipedia 搜索和检索维基百科中的内容。

以下是检索相关段落的实施步骤：

使用多模态模型生成关键字，以捕捉有关图像的问题的含义。
使用生成的关键词在维基百科上搜索相关页面。
将每个检索到的页面的内容分成几块。
根据与问题和关键词的语义文本相似性选择排名靠前的块。

第一步利用 Phi-3.5-vision 生成适当的搜索查询，该查询将用于检索相关的维基百科页面。为此，我要求 Phi-3.5-vision 生成与用户问题和图像相关的关键词。然后，我使用维基百科包的内置搜索功能检索一些与生成的关键词相关的页面。

Phi-vision-3.5 的通用单轮单图聊天模板具有以下结构：

<|user|>\n
<|image_1|>\n
{prompt}<|end|>\n
<|assistant|>\n

为了生成关键字，我使用了以下提示：

Your task is to write a few search keywords to find Wikipedia pages containing
the relevant information to answer the question about the provided image. The 
keywords must be as specific as possible and must represent the information 
that is needed to answer the question in relation to the provided image. Don't 
write more than 3 search keywords.
Question: {question}

在推理之前，标签 {question} 被替换为用户问题。

生成关键词后，使用 wikipedia 包的内置搜索功能检索与生成的关键词相关的一些页面。最后，将选定的页面拆分成段落，然后使用嵌入模型和 FAISS 向量存储的 LangChain 实现选择最相关的段落。我使用嵌入模型 snowflake-arctic-embed-l 来嵌入问题和关键词的串联以及检索到的页面块。

实际上，检索阶段实际上是一种“混合搜索”，由两个连续步骤组成：使用 wikipedia 包的内置搜索功能进行关键词搜索，以及使用嵌入模型进行嵌入相似性检索。这样，检索就可在使用关键字搜索选择的最相关页面的段落的较小空间上进行，从而避免了需要构建包含维基百科所有内容的嵌入的庞大向量存储。在不同的设置中，可以重新建模检索阶段，以在整个外部语料库上使用相似性检索或使用不同的检索方法组合。

从多个页面检索段落可以帮助减少选择错误页面的机会，当需要多个页面的信息来生成答案时，它也很有用。

2.2 生成

在生成阶段，用户的问题、检索到的段落和原始图像被用作 Phi-3.5-vision 的输入来生成答案。

我在 Phi-3.5-vision 的通用聊天模板中使用了以下提示：

You are a helpful assistant tasked with answering questions about the provided 
image.
Answer the following question: {question}
You can use the following passages retrieved from Wikipedia to provide your 
answer:
{passages}

在生成时，标签 {question} 将像以前一样替换为用户问题，而标签 {passages} 将替换为检索到的段落和相应页面的名称，格式如下：

From Wikipedia page {page_name} : "{passage1}"\n\n
From Wikipedia page {page_name} : "{passage2}"\n\n
From Wikipedia page {page_name} : "{passage3}"\n\n

提供提取段落的页面名称有助于解决歧义问题，因为后者的内容不足以唯一地确定其所指的主题或话题。

3、示例

在本节中，我将展示使用上一节中讨论的实现所获得的答案的一些示例，并将使用 RAG 赋能的视觉语言模型的输出与基本版本进行比较。

对于下面的每个示例，我将展示提供给模型的图像、一个包含问题和 RAG 增强版和基本 VLM 的答案的块、一个包含模型创建的搜索查询的块以及一个包含从维基百科检索到的段落的块。

3.1 示例 1

Question: How tall are the plants that produce this fruit?

Base VLM: Tomatoes are typically grown on plants that can reach heights of 2 to
4 feet, depending on the variety and growing conditions.

RAG VLM: The tomato plants that produce this fruit typically grow to 1-3 meters
(3-10 feet) in height.

VLM 生成了以下搜索关键词：

Search keywords: tomato plant, tomato plant height

并检索到以下段落：

From page Tomato (https://en.wikipedia.org/wiki/Tomato): 
 "throughout all seasons of the year. Tomato plants typically grow to 1–3 
meters (3–10 ft) in height. They are vines that have a weak stem that sprawls 
and typically needs support. Indeterminate tomato plants are perennials in 
their native habitat, but are cultivated as annuals. (Determinate, or".

From page Tomato (https://en.wikipedia.org/wiki/Tomato): 
 "Numerous varieties of the tomato plant are widely grown in temperate climates
 across the world, with greenhouses allowing for the production of tomatoes 
throughout all seasons of the year. Tomato plants typically grow to 1–3 meters
(3–10 ft) in height. They are vines that have a weak stem that".

From page Early Girl (https://en.wikipedia.org/wiki/Early_Girl): 
 "== See also ==
List of tomato cultivars


== References ==".

From page Black Krim (https://en.wikipedia.org/wiki/Black_Krim): 
 "== References ==


== See also ==
List of tomato cultivars
Heirloom tomato".

From page Celebrity tomato (https://en.wikipedia.org/wiki/Celebrity_tomato): 
 "== See also ==
List of tomato cultivars
Tomato production in Florida


== References ==".

From page Fourth of July tomato 
(https://en.wikipedia.org/wiki/Fourth_of_July_tomato): 
 "== See also ==
List of tomato cultivars


== References ==


== External links ==
Picture Gallery from Google Images".

From page Mortgage Lifter (https://en.wikipedia.org/wiki/Mortgage_Lifter): 
 "== See also ==
List of tomato cultivars


== References ==


== External links ==
Picture Gallery from Google Images".

使用 RAG 增强的模型正确报告了维基百科页面“番茄”上所述的身高范围，而基础模型则给出了较低的范围。

3.2 示例 2

Query: How much does this species weight?

Base VLM: Polar bears can weigh between 900 to 1,600 pounds 
(408 to 727 kilograms).

RAG VLM:  The weight of this species, the polar bear, ranges from 300-800 kg 
(660-1,760 lb) for males and 150-300 kg (330-660 lb) for females.

Search keywords: polar bear, weight

From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
"Males are generally 200–250 cm (6.6–8.2 ft) long with a weight of 300–800 kg
(660–1,760 lb). Females are smaller at 180–200 cm (5.9–6.6 ft) with a weight 
of 150–300 kg (330–660 lb). Sexual dimorphism in the species is particularly 
high compared with most other mammals. Male polar bears also have".

From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
 "== Notes ==


== References ==


== Bibliography ==


== External links ==
Polar Bears International website
ARKive—images and movies of the polar bear (Ursus maritimus)".

From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
 "weight of 150–300 kg (330–660 lb). Sexual dimorphism in the species is 
particularly high compared with most other mammals. Male polar bears also have 
proportionally larger heads than females. The weight of polar bears fluctuates 
during the year, as they can bulk up on fat and increase their mass by".

From page List of ursids (https://en.wikipedia.org/wiki/List_of_ursids): 
 "long, plus a 3–20 cm (1–8 in) tail, though the polar bear is 2.2–2.44 m 
(7–8 ft) long, and some subspecies of brown bear can be up to 2.8 m (9 ft). 
Weights range greatly from the sun bear, which can be as low as 35 kg (77 lb), 
to the polar bear, which can be as high as 726 kg (1,600 lb). Population".

From page Knut (polar bear) (https://en.wikipedia.org/wiki/Knut_(polar_bear)): 
 "== See also ==
Binky (polar bear)
List of individual bears


== References ==


== External links ==".

From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
 "The polar bear is the largest living species of bear and land carnivore, 
though some brown bear subspecies like the Kodiak bear can rival it in size. 
Males are generally 200–250 cm (6.6–8.2 ft) long with a weight of 300–800 kg 
(660–1,760 lb). Females are smaller at 180–200 cm (5.9–6.6 ft) with a".

From page Subspecies of brown bear 
(https://en.wikipedia.org/wiki/Subspecies_of_brown_bear): 
 "males, potentially tripling their average weight within three years' time, 
and can expect to average between 360 and 545 kg (794 and 1,202 lb). 
The reported mean adult body masses for both sexes of the polar bear are very 
similar to the peninsular giant and Kodiak bears. Due to their roughly".

同样，RAG 模型使用 Wikipedia 页面“北极熊”中报告的权重正确回答了问题，而基础模型则使用略有不同的权重范围回答了问题。

3.3 示例 3

在我之前的博客文章使用 LLaVA 创建视觉聊天助手中，我展示了当用户的提示询问有关图像的误导性问题时，LLaVA-1.5 模型如何产生幻觉。让我们看看在该示例中使用 Phi-3.5-vision 模型（带和不带 RAG 增强）会发生什么。特别是，如果检索到任何误导性段落，该模型现在可能更容易产生幻觉。

Question: Tell me the species of the fish in this image

Base VLM: The image does not contain any fish or aquatic environments, so 
I cannot provide a species of fish.

RAG VLM:  The image does not contain any fish or aquatic environments. It is
a photograph of a desert landscape with sand dunes. 
Therefore, I cannot provide the species of fish based on this image.

基础模型和 RAG 模型都没有因误导性提示而产生幻觉，这表明 Phi-3.5 视觉的稳健性可能比 LLaVA-1.5 更好。特别是，RAG VLM 没有从维基百科检索任何页面。事实上，当被要求生成关键词时，该模型回答

There are no fish visible in the image provided. The image shows a desert 
landscape with sand dunes. Therefore, the search keywords related to the image 
content would be more appropriate for a desert or dune-related topic.

此搜索查询未通过 wikipedia 包搜索功能产生任何结果。

注意：在后一个示例中，在早期使用 Idefics 2 进行的实验中，基础 VLM 和 RAG VLM 模型都受到幻觉的影响，回答的是图像中不存在的鱼类物种名称。特别是，RAG 模型检索了与各种鱼类相关的误导性段落。幻觉现象依赖于模型，RAG 框架可能不足以在所有情况下消除它。

4、局限性

该框架的主要局限性在于，它要求模型至少对所提供的图像和问题有一般了解（而确切的细节将通过 RAG 检索），以便生成有用的搜索查询。在某些特定领域的设置中，可能需要对模型进行微调以生成搜索查询。

本文讨论的框架也存在传统纯文本 RAG 系统常见的局限性。生成过程中常见的错误来源是对外部信息的错误检索。在某些情况下，模型无法正确识别检索阶段的失败，并根据不正确、不相关或误导性的信息生成答案。特别是，不相关段落的存在有时会引起幻觉，否则模型不会产生幻觉。

另一个常见的错误来源是忽略检索到的段落或错误地引用它们。这有时会导致检索到的段落和生成的答案之间存在细微的差异，乍一看可能很难发现。

在此示例中，我将展示模型错误引用检索到的段落的实例。

Query: What's the average weight of this animal?

Base VLM: The average male lion can weigh between 160 to 310 kilograms, while 
females typically weigh between 120 to 180 kilograms.

RAG VLM:  Based on the information provided in the passages, the average weight
of a male lion is 187.5 kg (386 lb) in Southern Africa and 174.9 kg (386 lb) in
East Africa.

Search keywords: lion, weight, average weight

From page American lion (https://en.wikipedia.org/wiki/American_lion): 
 "(518lbs-1153lbs) in males and 175kg to 365 kg (385lbs-805lbs) for females. 
In 2008, the American lion was estimated to weigh up to 420 kg (930 lb). A 
study in 2009 showed an average weight of 256 kg (564 lb) for males and 351 kg
(774 lb) for the largest specimen analyzed.".

From page Lion (https://en.wikipedia.org/wiki/Lion): 
 "=== Size ===
Among felids, the lion is second only to the tiger in size. The size and weight
of adult lions vary across its range and habitats. Accounts of a few 
individuals that were larger than average exist from Africa and India.


=== Mane ===".

From page Panthera leo leo (https://en.wikipedia.org/wiki/Panthera_leo_leo): 
 "The lion's fur varies in colour from light buff to dark brown. It has rounded
ears and a black tail tuft. Average head-to-body length of male lions is 
2.47–2.84 m (8 ft 1 in – 9 ft 4 in) with a weight of 148.2–190.9 kg 
(327–421 lb). Females are smaller and less heavy. Zoological lion specimens".

From page Panthera leo melanochaita 
(https://en.wikipedia.org/wiki/Panthera_leo_melanochaita): 
 "Average head-to-body length of male lions is 2.47–2.84 m (8 ft 1 in – 9 ft 
4 in) with a weight ranging from 150–225 kg (331–496 lb) averaging 187.5 kg 
(413 lb) in Southern Africa and 145.4–204.7 kg (321–451 lb) averaging 174.9 kg
 (386 lb) in East Africa. Females average 83–165 kg (183–364 lb) in".

From page Asiatic lion (https://en.wikipedia.org/wiki/Asiatic_lion): 
 "An adult male Asiatic lion weighs 160.1 kg (353 lb) on average with the 
limit being 190 kg (420 lb); a wild female weighs 100 to 130 kg (220 to 285 lb)
.[1]".

From page List of largest mammals 
(https://en.wikipedia.org/wiki/List_of_largest_mammals): 
 "== See also ==
List of largest land carnivorans
Largest organisms
Largest prehistoric animals
List of largest birds
List of largest cats
List of largest fish
List of largest plants
List of largest reptiles
List of largest insects
List of heaviest land mammals
Smallest organisms


== Notes ==".

From page Ancient Mesopotamian units of measurement 
(https://en.wikipedia.org/wiki/Ancient_Mesopotamian_units_of_measurement): 
 "== See also ==
Assyrian lion weights
Babylonian mathematics
Historical weights and measures
Weights and measures


== References ==


=== Citations ===".

虽然以公斤为单位的重量答案是正确的，但该模型错误地将南非雄狮的平均体重转换为磅，即使从维基百科中提取的相应段落报告了正确的重量。

5、结束语

在这篇文章中，我说明了一个简单的框架，可用于通过检索增强生成功能增强视觉问答。该方法的核心思想是利用视觉语言模型生成查询，然后由标准 RAG 管道使用这些查询从外部语料库中检索信息。我还介绍了该框架的实现，该框架授予 Phi-3.5-vision 对维基百科的访问权限。此实现的完整代码可在 GitHub 存储库中找到。

虽然讨论的方法简单有效，但它并不能免受所有 RAG 系统共有的局限性以及多模态设置复杂性带来的新挑战。一方面，检索某些特定问题的相关信息可能很困难。由于搜索查询是使用视觉语言模型创建的，因此检索准确性进一步受到 VLM 识别图像和理解问题所指细节的能力的限制。另一方面，即使检索到了正确的信息，也不能保证模型在生成答案时不会产生幻觉。在多模态设置中，这可能会因模型必须将正确的含义与文本和图像相关联并理解它们之间的相互作用而加剧。

我在本文中讨论的框架是 vanilla RAG 管道的直接扩展，适用于视觉问答任务。可以轻松包含标准的高级 RAG 技术，例如查询转换、对检索到的段落进行重新排序以及假设文档嵌入 (HyDE)，以提高性能。

此外，使用多模态嵌入模型（如 CLIP）会出现新的机会：图像嵌入可用于按相似性搜索相关文本文档，也可以检索与原始图像和问题相似和/或相关的图像。后者可能很有用，例如，当需要从不同的角度来回答提示时。

另一个改进方向是进行微调以获得更专业和更有效的模型。考虑到多模态模型在检索和生成过程中的作用，可以执行两个不同的微调过程：一个是获得专门用于编写搜索查询的模型，另一个是提高模型在基础生成任务上的性能。

最后，该框架可以纳入专门的代理系统，以进一步提高其性能和稳健性。例如，代理系统可以通过对检索到的段落提供反馈并提出后续问题或仅在需要时专注于搜索有关图像特定细节的信息来迭代地细化生成的查询。它还可以处理更复杂问题的多跳问答任务，并决定何时需要检索更多外部信息来回答用户的查询。

原文链接：RAG视觉问答开发框架 - BimAnt