Exploring RAG (4): RAG with Chroma (2)

Overview

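The previous part introduced embeddings-based retrieval: we compute the cosine distance between the original query and each document chunk, take the few chunks with the smallest distance, and pass them to the model together with the original query so that it can complete a specific task. As noted at the end of that part, relying on semantic similarity alone to decide which chunks are relevant to the query often gives unsatisfactory results. In this part we pick up that open problem and look at retrieval strategies that work better.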
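
As a reminder of what "smallest cosine distance" means, here is a minimal sketch in plain Python. The two-dimensional vectors are invented for illustration, and the function names `cosine_distance` and `nearest_chunks` are hypothetical; in the real pipeline the embedding model and Chroma do this work.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_chunks(query_vec, chunk_vecs, n_results=2):
    """Return the indices of the n_results chunks closest to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_distance(query_vec, chunk_vecs[i]),
    )
    return ranked[:n_results]

# Toy "embeddings", made up for this example
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
top = nearest_chunks(query, chunks)
print(top)
```

The query points almost in the same direction as the first chunk, so that chunk ranks first; a chunk that is close in direction is not necessarily the one that answers the question, which is the weakness this part tries to address.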

Query Expansion

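Query Expansion: the original query is often ambiguous, conceptually vague, or incomplete, and retrieving with such an imperfect query can return content that is unrelated to what was asked. Query expansion was proposed to enrich the original query, as shown in the figure below:
[Figure: Query Expansion workflow]
An LLM first writes a hypothetical answer to the original query. The query and that answer are then joined and used together for document retrieval, and finally the retrieved chunks are combined with the query and sent to the LLM to get the real answer. This takes advantage of the LLM's strong generative ability: enriching the query helps find documents that are more relevant. The example and visualization below make this easier to see.
Preliminary setup: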

import os
import openai
import umap
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv
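from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
# Assumption: load_chroma, word_wrap and project_embeddings are the helper functions from the previous part, kept in a local helper_utils module
from helper_utils import load_chroma, word_wrap, project_embeddings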
embedding_function = SentenceTransformerEmbeddingFunction()
chroma_collection = load_chroma(filename='microsoft_annual_report_2022.pdf', collection_name='microsoft_annual_report_2022', embedding_function=embedding_function)
chroma_collection.count()

_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

embeddings = chroma_collection.get(include=['embeddings'])['embeddings']
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)
projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)

Use the LLM to expand the original query:

def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            # Prompt the model to write an example answer to the original query
            "content": "You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report. "
        },
        {"role": "user", "content": query}
    ] 

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
# The original query
original_query = "Was there significant turnover in the executive team?"
hypothetical_answer = augment_query_generated(original_query)

joint_query = f"{original_query} {hypothetical_answer}"
print(word_wrap(joint_query))

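This gives the expanded query:
[Screenshot: the expanded query printed by the code above]
Retrieve documents with the expanded query: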

results = chroma_collection.query(query_texts=joint_query, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(word_wrap(doc))
    print('')

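Output:
[Screenshot: the retrieved document chunks]
Compare where the original query and the expanded query fall in the vector database's embedding space: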

retrieved_embeddings = results['embeddings'][0]
original_query_embedding = embedding_function([original_query])
augmented_query_embedding = embedding_function([joint_query])

projected_original_query_embedding = project_embeddings(original_query_embedding, umap_transform)
projected_augmented_query_embedding = project_embeddings(augmented_query_embedding, umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)

import matplotlib.pyplot as plt

# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_augmented_query_embedding[:, 0], projected_augmented_query_embedding[:, 1], s=150, marker='X', color='orange')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')

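[Figure: UMAP projection of the dataset, the retrieved chunks, and both queries]
The orange X is the expanded query. Compared with the original query (red), in this example it lands near a better cluster of results, so the retrieved chunks are more relevant.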
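
The last step of the workflow, sending the retrieved chunks to the LLM together with the query, is not shown in the code above. A minimal sketch of it follows. The function name `build_rag_messages`, the prompt wording, and the placeholder chunks are assumptions made for illustration, not the original author's code. Note that the final answer is requested for the original query, not for the joint query that was only used for retrieval.

```python
def build_rag_messages(query, retrieved_documents):
    """Combine the original query with the retrieved chunks into chat messages."""
    information = "\n\n".join(retrieved_documents)
    return [
        {
            "role": "system",
            "content": (
                "You are a helpful expert financial research assistant. "
                "Answer the user's question using only the provided information."
            ),
        },
        {
            "role": "user",
            "content": f"Question: {query}\n\nInformation:\n{information}",
        },
    ]

# Placeholder chunks standing in for retrieved_documents from the query above
example_chunks = [
    "chunk about executive officers",
    "chunk about the board of directors",
]
messages = build_rag_messages(
    "Was there significant turnover in the executive team?", example_chunks
)

# With the client from the setup code, the answer would then come from:
# response = openai_client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
# print(response.choices[0].message.content)
```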

Expansion with multiple queries

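Besides expanding the query with a generated answer, there is another way to expand a query: Expansion with multiple queries, as shown in the figure below:
[Figure: Expansion with multiple queries workflow]
Unlike the previous approach, the LLM is not asked to answer the original query here; instead it generates several new queries that are related to it.
Using prompt engineering, ask the LLM for up to five new queries: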

def augment_multiple_query(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            # Each fragment ends with a space so the concatenated sentences do not run together
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about an annual report. "
            "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
            "Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic. "
            "Make sure they are complete questions, and that they are related to the original question. "
            "Output one question per line. Do not number the questions."
        },
        {"role": "user", "content": query}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    # Split into one query per line and drop blank lines
    content = [line.strip() for line in content.split("\n") if line.strip()]
    return content

Generate the queries with the LLM:

original_query = "What were the most important factors that contributed to increases in revenue?"
augmented_queries = augment_multiple_query(original_query)

for query in augmented_queries:
    print(query)

This produced the following related queries:

  • How did the company’s marketing strategies impact revenue growth?
  • Were there any significant changes in product pricing that affected revenue?
  • Did the company introduce any new product lines that contributed to revenue growth?
  • How did changes in customer demand impact revenue?
  • Were there any acquisitions or partnerships that positively impacted revenue?

Print the retrieval results:

queries = [original_query] + augmented_queries
results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

for i, documents in enumerate(retrieved_documents):
    print(f"Query: {queries[i]}")
    print('')
    print("Results:")
    for doc in documents:
        print(word_wrap(doc))
        print('')
    print('-'*100)

Each query retrieves 5 document chunks:
[Screenshot: the retrieved chunks printed for each query]
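
Because several queries can return the same chunk, the combined results usually contain duplicates. The set used above removes them but loses the retrieval order. A possible alternative, sketched here under assumptions, deduplicates by chunk id and keeps the first occurrence. The function name `merge_query_results` and the sample ids and texts are invented for illustration; the nested list shape mirrors `results['ids']` and `results['documents']` from a Chroma query.

```python
def merge_query_results(ids_per_query, documents_per_query):
    """Flatten per-query results, keeping the first occurrence of each chunk id."""
    seen_ids = set()
    merged = []
    for ids, documents in zip(ids_per_query, documents_per_query):
        for chunk_id, document in zip(ids, documents):
            if chunk_id not in seen_ids:
                seen_ids.add(chunk_id)
                merged.append((chunk_id, document))
    return merged

# Invented ids and texts: two queries, with chunk "40" retrieved by both
example_ids = [["12", "40"], ["40", "7"]]
example_documents = [["chunk A", "chunk B"], ["chunk B", "chunk C"]]
merged = merge_query_results(example_ids, example_documents)

for chunk_id, document in merged:
    print(chunk_id, document)
```

With the real results this would be called as `merge_query_results(results['ids'], results['documents'])`.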
Project everything to 2D with UMAP to see the overall effect:

original_query_embedding = embedding_function([original_query])
augmented_query_embeddings = embedding_function(augmented_queries)

project_original_query = project_embeddings(original_query_embedding, umap_transform)
project_augmented_queries = project_embeddings(augmented_query_embeddings, umap_transform)

result_embeddings = results['embeddings']
result_embeddings = [item for sublist in result_embeddings for item in sublist]
projected_result_embeddings = project_embeddings(result_embeddings, umap_transform)

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(project_augmented_queries[:, 0], project_augmented_queries[:, 1], s=150, marker='X', color='orange')
plt.scatter(projected_result_embeddings[:, 0], projected_result_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(project_original_query[:, 0], project_original_query[:, 1], s=150, marker='X', color='r')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')

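Result:
[Figure: UMAP projection of the dataset, the retrieved chunks, the original query, and the generated queries]
Orange marks the new queries and red the original query. The plot shows that with Expansion with multiple queries the queries cover a wider area of the embedding space, so more related chunks can be found, which matters for giving the model enough context.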

Summary

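This part described two methods, Query Expansion and Expansion with multiple queries, and used worked examples to show that both are feasible. Which of the two to use in practice depends on the specific task.
The next part will introduce cross-encoder re-ranking.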
