Exploring RAG (4): RAG with Chroma (2)

桔色的猫

Last modified: 2024-05-29 11:02:13

Tags: artificial intelligence, natural language processing

First published: 2024-05-29 10:47:13

Article link: https://blog.csdn.net/weixin_47440313/article/details/139284030

Overview

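The previous part introduced embeddings-based retrieval: we compute the cosine distance between the original query and each document chunk, pick the few chunks nearest to the query, and pass them to the model together with the query so that it can complete the task. As noted at the end of that part, relying on semantic similarity alone to decide which chunks are relevant often gives unsatisfactory results. This part picks up that open problem and looks at retrieval strategies that work better.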
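As a reminder of what "nearest by cosine distance" means, here is a minimal pure-Python sketch. The 3-dimensional vectors are made up for illustration and the function names `cosine_distance` and `top_k_chunks` are hypothetical; a real setup would use the embedding model's vectors and let the vector database do the search.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks closest to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_distance(query_vec, chunk_vecs[i]),
    )
    return ranked[:k]


# Toy "embeddings" (invented values, not produced by any model)
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.2, 0.0]

nearest = top_k_chunks(query_vec, chunk_vecs, k=2)
print(nearest)  # [0, 1]
```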

Query Expansion

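Query expansion: the original query is often ambiguous, conceptually vague, or incomplete. Retrieving with such an imperfect query can return content that is unrelated to what the user actually wants, so query expansion was proposed as a way to enrich the original query, as illustrated below:
[Figure: query expansion with a generated answer]
First, the LLM is asked to produce a hypothetical answer to the original query. The query and this answer are then used together for document retrieval, and finally the retrieved chunks are combined with the query and sent to the LLM to get the real answer. This takes advantage of the LLM's strong generation ability: enriching the query makes it easier to find more relevant documents. The example and the helper tools below make this more concrete.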
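To get an intuition for why appending a hypothetical answer can help, the toy sketch below uses a bag-of-words cosine similarity as a crude stand-in for the real embedding model. The hypothetical answer and the document chunk are written by hand for this illustration; they are not LLM output and not text from the annual report.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


original_query = "Was there significant turnover in the executive team?"
# Hand-written stand-in for the LLM's hypothetical answer (assumption)
hypothetical_answer = (
    "No, the senior leadership remained stable, with no changes "
    "among executive officers during the fiscal year."
)
joint_query = f"{original_query} {hypothetical_answer}"

# Made-up chunk in the style of an annual report (assumption)
chunk = (
    "There were no changes among our executive officers or senior "
    "leadership during the fiscal year."
)

sim_original = cosine_similarity(bag_of_words(original_query), bag_of_words(chunk))
sim_joint = cosine_similarity(bag_of_words(joint_query), bag_of_words(chunk))
print(f"original query: {sim_original:.3f}, joint query: {sim_joint:.3f}")
```

The joint query shares far more vocabulary with the chunk than the bare question does. A dense embedding model behaves differently in detail, but the idea is the same: the generated answer moves the query toward text that looks like the documents.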
Preparatory setup:

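import os

import openai
import umap
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from dotenv import load_dotenv, find_dotenv
from openai import OpenAI

# load_chroma, project_embeddings and word_wrap are not defined in this post.
# Assumption: they are the helper functions used in the previous part,
# importable from a local helper_utils module (module name assumed).
from helper_utils import load_chroma, project_embeddings, word_wrap

# Read the OpenAI API key from the local .env file
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']
openai_client = OpenAI()

# Load the annual report into a Chroma collection
embedding_function = SentenceTransformerEmbeddingFunction()
chroma_collection = load_chroma(filename='microsoft_annual_report_2022.pdf', collection_name='microsoft_annual_report_2022', embedding_function=embedding_function)
chroma_collection.count()

# Fit UMAP on all chunk embeddings so queries and results can be projected to 2D later
embeddings = chroma_collection.get(include=['embeddings'])['embeddings']
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)
projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)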

Use the LLM to expand the original query:

def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            # Prompt the model to write an example answer to the original query
            "content": "You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report. "
        },
        {"role": "user", "content": query}
    ] 

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
# The original query
original_query = "Was there significant turnover in the executive team?"
hypothetical_answer = augment_query_generated(original_query)

joint_query = f"{original_query} {hypothetical_answer}"
print(word_wrap(joint_query))

This gives the expanded query:
[Image: printed output of the expanded query]
Retrieve documents with the expanded query:

results = chroma_collection.query(query_texts=joint_query, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(word_wrap(doc))
    print('')

Output:
[Image: the retrieved chunks]
Compare where the original query and the expanded query land in the vector database's embedding space:

retrieved_embeddings = results['embeddings'][0]
original_query_embedding = embedding_function([original_query])
augmented_query_embedding = embedding_function([joint_query])

projected_original_query_embedding = project_embeddings(original_query_embedding, umap_transform)
projected_augmented_query_embedding = project_embeddings(augmented_query_embedding, umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)

import matplotlib.pyplot as plt

# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_augmented_query_embedding[:, 0], projected_augmented_query_embedding[:, 1], s=150, marker='X', color='orange')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')

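[Figure: 2D projection of the dataset, the two queries, and the retrieved chunks]
The orange X is the expanded query. Compared with the original query (red X), it lands next to a reasonably good cluster of results (green circles), so it retrieves more relevant chunks.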

Expansion with multiple queries

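Besides expanding the query with a generated answer, there is a second form of query expansion, expansion with multiple queries, shown below:
[Figure: expansion with multiple queries]
Unlike the previous approach, the LLM is not asked to answer the original query. Instead, it generates several new queries that are related to the original one.
Use prompt engineering to have the LLM generate up to five new queries: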

def augment_multiple_query(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            # Adjacent string literals are concatenated, so each piece ends with a space
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about an annual report. "
            "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
            "Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic. "
            "Make sure they are complete questions, and that they are related to the original question. "
            "Output one question per line. Do not number the questions."
        },
        {"role": "user", "content": query}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
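    # One query per line; drop blank lines so no empty query is sent to the retriever
    content = [line.strip() for line in content.split("\n") if line.strip()]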
    return content
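
The prompt asks for one question per line without numbering, but a model may not always follow formatting instructions exactly. As a defensive sketch (the helper name `parse_generated_queries` and the sample text are hypothetical, not part of the original code), the raw response could be cleaned like this before it is used for retrieval:

```python
import re


def parse_generated_queries(content, max_queries=5):
    """Split raw LLM text into a clean list of unique queries."""
    queries = []
    for line in content.splitlines():
        # Remove a leading bullet ("-", "*", "•") or numbering ("1.", "2)")
        cleaned = re.sub(r"^\s*(?:[-*•]+|\d+[.)])\s*", "", line).strip()
        # Skip blank lines and exact repeats
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:max_queries]


# Invented sample response with numbering, a bullet, a blank line and a repeat
sample = (
    "1. How did pricing change?\n"
    "\n"
    "- Were there any acquisitions?\n"
    "How did pricing change?\n"
)
parsed = parse_generated_queries(sample)
print(parsed)  # ['How did pricing change?', 'Were there any acquisitions?']
```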

Generate the related queries with the LLM:

original_query = "What were the most important factors that contributed to increases in revenue?"
augmented_queries = augment_multiple_query(original_query)

for query in augmented_queries:
    print(query)

This returned the following related queries:

How did the company’s marketing strategies impact revenue growth?
Were there any significant changes in product pricing that affected revenue?
Did the company introduce any new product lines that contributed to revenue growth?
How did changes in customer demand impact revenue?
Were there any acquisitions or partnerships that positively impacted revenue?

Retrieve with all queries and print the results:

queries = [original_query] + augmented_queries
results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

for i, documents in enumerate(retrieved_documents):
    print(f"Query: {queries[i]}")
    print('')
    print("Results:")
    for doc in documents:
        print(word_wrap(doc))
        print('')
    print('-'*100)

Each query retrieves five relevant chunks:
[Image: printed retrieval results for each query]
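Several queries often retrieve the same chunk, which is why the results are deduplicated. A plain `set` removes duplicates but discards the retrieval order. The sketch below is one possible alternative, not the author's method: it keeps first-seen order and also counts how many queries retrieved each chunk. The function name `merge_results` and the toy chunk labels are assumptions made for illustration.

```python
def merge_results(results_per_query):
    """Merge per-query result lists into {chunk: number of queries that hit it}.

    Python dicts keep insertion order, so the keys stay in first-seen order.
    """
    hit_counts = {}
    for documents in results_per_query:
        seen_in_this_query = set()
        for document in documents:
            if document in seen_in_this_query:
                continue  # count a chunk at most once per query
            seen_in_this_query.add(document)
            hit_counts[document] = hit_counts.get(document, 0) + 1
    return hit_counts


# Toy stand-in for results['documents'] (one inner list per query)
retrieved = [
    ["chunk A", "chunk B"],
    ["chunk B", "chunk C"],
    ["chunk A", "chunk B"],
]
merged = merge_results(retrieved)
unique_documents = list(merged)
most_shared = max(merged, key=merged.get)
print(unique_documents)  # ['chunk A', 'chunk B', 'chunk C']
print(most_shared)       # chunk B
```

A chunk that is returned for many of the generated queries may be a reasonable candidate to place first in the final prompt, although that heuristic is not evaluated here.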
Project everything to 2D with UMAP to see the final effect:

original_query_embedding = embedding_function([original_query])
augmented_query_embeddings = embedding_function(augmented_queries)

project_original_query = project_embeddings(original_query_embedding, umap_transform)
project_augmented_queries = project_embeddings(augmented_query_embeddings, umap_transform)

result_embeddings = results['embeddings']
result_embeddings = [item for sublist in result_embeddings for item in sublist]
projected_result_embeddings = project_embeddings(result_embeddings, umap_transform)

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(project_augmented_queries[:, 0], project_augmented_queries[:, 1], s=150, marker='X', color='orange')
plt.scatter(projected_result_embeddings[:, 0], projected_result_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(project_original_query[:, 0], project_original_query[:, 1], s=150, marker='X', color='r')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')