RAG in Practice 12 - Implementing Contextual Compression and Filtering in a RAG Pipeline

Reposted from: RAG in Practice with LLMs (12) | Implementing Contextual Compression and Filtering in a RAG Pipeline
https://mp.weixin.qq.com/s/nZ0p0C1lGUALoSbtQRSYhQ



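One of the biggest questions you face in RAG is: what should the retriever retrieve?

In practice, not all of the retrieved context is useful. Often only a very small part of a large retrieved chunk is relevant to the answer, and answering a particular question may require combining several chunks.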


1. What is contextual compression?


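When we split documents into chunks, we usually do not know what the user's query will be. As a result, the information most relevant to a query may be buried in a chunk that also contains a lot of irrelevant text. Passing such chunks to the LLM leads to more expensive LLM calls and poorer responses.

We can therefore compress the context. The general idea is:

  1. Use a basic retriever to retrieve a range of candidate information;
  2. Pass the retrieved information to a document compressor;
  3. The compressor filters and processes this information, extracting only what is useful for answering the question.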
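The three steps above can be sketched in plain Python. This is only a toy illustration: the corpus, the word-overlap ranking and the sentence-level "compressor" are all assumptions made for the example, standing in for the vector retriever and LLM-based compressor used later in this article.

```python
import re

# Hypothetical toy corpus; in the article the chunks come from a PDF.
CHUNKS = [
    "Coinsurance is the share of each bill you pay. Agents must be licensed.",
    "An adjuster evaluates losses. Coinsurance applies after the deductible.",
    "An annuitant receives payments from an annuity.",
]

def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def base_retrieve(query, chunks, k=2):
    """Step 1: rank chunks by word overlap with the query and keep the top k."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def compress(query, chunk):
    """Steps 2-3: keep only the sentences that share a content word with the query."""
    q = tokenize(query) - {"what", "is", "a", "an", "the"}
    sentences = re.split(r"(?<=\.)\s+", chunk)
    return " ".join(s for s in sentences if q & tokenize(s))

query = "What is coinsurance?"
compressed = [compress(query, c) for c in base_retrieve(query, CHUNKS)]
compressed = [c for c in compressed if c]  # drop chunks with nothing relevant
print(compressed)
```

Each retrieved chunk shrinks to the one sentence that mentions the query term, which is the effect the real compressor aims for with far better relevance judgments.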

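2. Steps in contextual compression

  • The contextual compression retriever passes the query to the base retriever;
  • It then takes the initial documents and passes them to the document compressor;
  • The document compressor takes the list of documents and shortens it, either by reducing the content of individual documents or by dropping documents entirely.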
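To make the wiring between these steps concrete, here is a minimal sketch of a wrapper that follows them. `Doc`, `ToyCompressionRetriever`, `fake_base_retriever` and `keep_lines_mentioning` are hypothetical names invented for this example; they are not LangChain classes, only an assumption about how such a wrapper could be organized.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (hypothetical, not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class ToyCompressionRetriever:
    """Wraps a base retriever and a per-document compressor."""

    def __init__(self, base_retriever: Callable[[str], List[Doc]],
                 compressor: Callable[[str, Doc], Optional[Doc]]):
        self.base_retriever = base_retriever
        self.compressor = compressor

    def get_relevant_documents(self, query: str) -> List[Doc]:
        initial = self.base_retriever(query)       # 1. the query goes to the base retriever
        results = []
        for doc in initial:                        # 2. each initial document is compressed
            shorter = self.compressor(query, doc)
            if shorter is not None:                # 3. None means "drop this document"
                results.append(shorter)
        return results

def fake_base_retriever(query: str) -> List[Doc]:
    """Hypothetical base retriever that always returns the same two documents."""
    return [
        Doc("Group life insurance covers a group of people.\n"
            "Group of companies - insurers under common ownership."),
        Doc("Annuitant - a person who receives annuity payments."),
    ]

def keep_lines_mentioning(query: str, doc: Doc) -> Optional[Doc]:
    """Keep only lines containing the query phrase; drop the document if none do."""
    kept = [ln for ln in doc.page_content.splitlines() if query.lower() in ln.lower()]
    return Doc("\n".join(kept), doc.metadata) if kept else None

toy_retriever = ToyCompressionRetriever(fake_base_retriever, keep_lines_mentioning)
docs = toy_retriever.get_relevant_documents("group life insurance")
print([d.page_content for d in docs])
```

The first document is shortened to its relevant line and the second is removed altogether, which covers both ways the list can be shortened.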

3. Prerequisites

LangChain: a framework for building applications with LLMs

Llmware BLING models: (experimental) large language models

Chromadb: vector database


4. Code implementation


4.1 Install the required dependencies

!pip install -qU langchain huggingface_hub chromadb pypdf python-dotenv transformers sentence-transformers

4.2 Import the required packages

from langchain.llms import HuggingFaceHub
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.prompts import PromptTemplate
from dotenv import load_dotenv

4.3 Set the Hugging Face Hub token

import os
from getpass import getpass
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Enter HuggingFace Hub Token:")

4.4 Load the data

loader = PyPDFLoader("/content/CommonInsuranceTerms.pdf")
documents = loader.load()
print(len(documents))
print(documents[0].page_content)

#### RESPONSE ####
16
Glossary of Common Insurance Terms 
NOTICE:  This document is for informational purposes only and is not in tended to alter or replace the 
insurance policy. Additionally, this informational sheet is not  intended to fully set out your rights and 
obligations or the rights and obligations of the insurance comp any. If you have questions about your insurance, 
you should consult your insurance agent, the insurance company,  or the language of the insurance policy. 
A 
Accelerated death benefits  - An insurance policy with an accelerated death benefits provi sion will pay - 
under certain conditions - all or part of the policy death bene fits while the policyholder is still alive. These 
conditions include proof that the policyholder is terminally il l, has a specified life-thr eatening disease or is in a 
long-term care facility such as a nursing home. By accepting an  accelerated benefit payment, a person could be 
ruled ineligible for Medicaid or  other government benefits. The  proceeds may also be taxable. 
Accident  - An unforeseen, unintended event. 
Accident-only policies  - Policies that pay only in cas es arising from an accident or injury. 
Accidental death benefits  - If a life insurance policy includes an accidental death bene fit, the cause of death 
will be examined to determine whether the insured´s death meets  the policy´s definition of accidental. 
Actual cash value (ACV)  - The value of your property, based on the current cost to rep lace it minus 
depreciation. Also see "replacement cost." 
Additional livin g expenses (ALE)  - Reimburses the policyholder for the cost of temporary housin g, food, and 
other essential living expenses, if the home is damaged by a co vered peril that makes the home temporarily 
uninhabitable.  
Adjuster  - An individual employed by an insurer to evaluate losses and settle policyholder claims.  
Administrative expense charge  - An amount deducted, usually monthly, from the policy. 
Agent  - A person who sells insurance policies. Must be licensed by t he Alabama Department of Insurance to 
legally sell and transact insurance business.  
Annuitant  - A person who receives the payments from an annuity during hi s or her lifetime.

4.5 Set up the text splitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700,chunk_overlap=70)
split_documents = text_splitter.split_documents(documents)
print(len(split_documents))
print(split_documents[0])

##### RESPONSE #####
65
Document(page_content='Glossary of Common Insurance Terms \nNOTICE:  This document is for informational purposes only and is not in tended to alter or replace the \ninsurance policy. Additionally, this informational sheet is not  intended to fully set out your rights and \nobligations or the rights and obligations of the insurance comp any. If you have questions about your insurance, \nyou should consult your insurance agent, the insurance company,  or the language of the insurance policy. \nA \nAccelerated death benefits  - An insurance policy with an accelerated death benefits provi sion will pay - \nunder certain conditions - all or part of the policy death bene fits while the policyholder is still alive. These', metadata={'source': '/content/CommonInsuranceTerms (1).pdf', 'page': 0})
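The `chunk_size` and `chunk_overlap` settings above control how long each chunk is and how many characters neighbouring chunks share. The sketch below shows the overlap idea with a plain sliding window. It is a simplification written for this example: `split_with_overlap` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and lines, which this version does not.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap is why two retrieved chunks can repeat part of the same passage, as happens in the retrieval output later on.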

This walkthrough uses the BLING ("Best Little Instruction-following No-GPU-required") model series hosted on HuggingFace:
https://huggingface.co/llmware


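4.6 Set up the embeddings

industry-bert-insurance-v0.1 is one of a series of industry-fine-tuned sentence_transformer embedding models.

industry-bert-insurance-v0.1 is a domain-fine-tuned, BERT-based sentence-transformer model that produces 768-dimensional embeddings and is intended as a "drop-in" replacement for embeddings in the insurance domain. The model was trained on a broad range of publicly available insurance-industry documents.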

embeddings = SentenceTransformerEmbeddings(model_name="llmware/industry-bert-insurance-v0.1")

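4.7 Set up the LLM

bling-sheared-llama-1.3b-0.1 is part of the BLING ("Best Little Instruction-following No-GPU-required") model series and was obtained by instruction fine-tuning the Sheared-LLaMA-1.3B base model.

BLING models are fine-tuned on distilled, high-quality custom instruction datasets that target a specific subset of instruction tasks. The goal is a high-quality instruct model that is "inference-ready" on a CPU laptop, even without any advanced quantization optimizations.

For an introduction to the BLING models, see https://colab.research.google.com/corgiredirector?site=https%3A%2F%2Fmedium.com%2F%40darrenoberst%2Fsmall-instruct-following-llms-for-rag-use-case-54c55e4b41a8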

repo_id ="llmware/bling-sheared-llama-1.3b-0.1"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs = {"temperature":0.3,"max_length":500}) 

4.8 Helper function for printing documents

def pretty_print_docs(docs):  
  print(f"\n{'-'* 100}\n".join([F"Document{i+1}:\n\n" + d.page_content for i,d in enumerate(docs)]))

4.9 Set up the vector store

vectorstore = Chroma.from_documents(split_documents,
                                    embeddings,
                                    collection_metadata={"hnsw:space":"cosine"},
                                    persist_directory="/content/stores/insurance")
vectorstore.persist()

4.10 Set up the retriever

retriever = vectorstore.as_retriever(search_kwargs={"k":2})
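With `k=2` the retriever returns the two chunks whose embeddings are closest to the query embedding, and the vector store above was configured for the cosine space. A minimal sketch of that ranking, assuming tiny hand-made vectors instead of real embeddings (`cosine_similarity` and `top_k` are illustrative helpers, not Chroma functions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, similarity) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional "embeddings"; the article's model produces much longer vectors.
doc_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 1.0, 0.0]
hits = top_k(query_vec, doc_vecs, k=2)
print(hits)
```

Ranking by highest cosine similarity gives the same order as ranking by lowest cosine distance, so either convention selects the same two chunks.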

4.11 Retrieve the context relevant to a query

docs = retriever.get_relevant_documents(query="What is Group life insurance?")
pretty_print_docs(docs)

##### RESPONSE #####
Document1:

or claim payment. Insurance companies also may have grievance p rocedures. 
Group life insurance  - This type of life insurance provides coverage to a group of people under one contract. 
Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life 
insurance can also be sold to associations to cover their membe rs and to lending institutions to cover the 
amounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be 
issued a master policy and each person in the group will receiv e a certificate of insurance. 
Group of companies  - Several insurance companies u nder common ownership and often  common 
management.
----------------------------------------------------------------------------------------------------
Document2:

Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life 
insurance can also be sold to associations to cover their membe rs and to lending institutions to cover the 
amounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be 
issued a master policy and each person in the group will receiv e a certificate of insurance. 
Group of companies  - Several insurance companies u nder common ownership and often  common 
management.

4.12 Add contextual compression with LLMChainExtractor

  • Add an LLMChainExtractor that iterates over the initially returned documents;
  • Extract from each document only the context relevant to the query.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# build the compressor
compressor = LLMChainExtractor.from_llm(llm=llm)
# compression retriever = base retriever + compressor
compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=compressor)

Default compressor prompt

print(compressor.llm_chain.prompt.template)

###### RESPONSE #######
Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return NO_OUTPUT. 

Remember, *DO NOT* edit the extracted parts of the context.

> Question: {question}
> Context:
>>>
{context}
>>>
Extracted relevant parts:
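The prompt above is filled once per retrieved document, and a reply of NO_OUTPUT signals that the document has nothing relevant. The sketch below illustrates that idea only; it is not LangChain's implementation. `extract` and `stub_llm` are hypothetical, and the stub replaces a real model with a simple keyword match so the example can run on its own.

```python
EXTRACT_TEMPLATE = (
    "Given the following question and context, extract any part of the context "
    "*AS IS* that is relevant to answer the question. "
    "If none of the context is relevant return NO_OUTPUT.\n\n"
    "Remember, *DO NOT* edit the extracted parts of the context.\n\n"
    "> Question: {question}\n> Context:\n>>>\n{context}\n>>>\nExtracted relevant parts:"
)

def extract(llm, question, context):
    """Fill the prompt, call the model, and map NO_OUTPUT (or empty text) to None."""
    raw = llm(EXTRACT_TEMPLATE.format(question=question, context=context)).strip()
    return None if raw in ("", "NO_OUTPUT") else raw

def stub_llm(prompt):
    """Hypothetical stand-in for a model: echoes context lines that mention 'coinsurance'."""
    context = prompt.split(">>>\n")[1]
    hits = [ln for ln in context.splitlines() if "coinsurance" in ln.lower()]
    return "\n".join(hits) if hits else "NO_OUTPUT"

question = "What is Coinsurance?"
kept = extract(stub_llm, question,
               "Claimant - A person who makes a claim.\nCoinsurance - The percentage you pay.")
dropped = extract(stub_llm, question, "Adjuster - Evaluates losses.")
print(kept)
print(dropped)
```

A document that yields `None` would simply be left out of the compressed list, while the other is replaced by its extracted part.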

4.13 Add a filter to contextual compression

Use LLMChainFilter to select which of the retrieved documents are passed on to the LLM

from langchain.retrievers.document_compressors import LLMChainFilter
# the filter asks the LLM whether each retrieved document is relevant and drops the ones that are not
_filter = LLMChainFilter.from_llm(llm)
# compression retriever = base retriever + filter; the RetrievalQA chain below uses this retriever
compression_retriever_filter = ContextualCompressionRetriever(base_retriever=retriever,
                                                              base_compressor=_filter)
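Unlike the extractor, an LLM-based filter does not rewrite documents: it keeps or drops each one as a whole. A minimal sketch of that behaviour follows. The prompt wording, `llm_filter` and `stub_llm` are assumptions made for illustration and do not reproduce LLMChainFilter's actual prompt or code.

```python
def llm_filter(llm, question, docs):
    """Ask the model a YES/NO relevance question per document and keep the YES ones unchanged."""
    kept = []
    for doc in docs:
        prompt = (f"Question: {question}\nContext: {doc}\n"
                  "Is the context relevant to the question? Answer YES or NO:")
        if llm(prompt).strip().upper().startswith("YES"):
            kept.append(doc)
    return kept

def stub_llm(prompt):
    """Hypothetical stand-in for a model: says YES when the context mentions 'coinsurance'."""
    context = prompt.split("Context: ")[1]
    return "YES" if "coinsurance" in context.lower() else "NO"

candidates = [
    "Coinsurance - The percentage of each bill you pay.",
    "Adjuster - Evaluates losses.",
]
kept_docs = llm_filter(stub_llm, "What is Coinsurance?", candidates)
print(kept_docs)
```

Because kept documents pass through untouched, a filter costs one short model call per document and never risks altering the source text.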

Implement question answering with a RetrievalQA chain

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=compression_retriever_filter,
                                 verbose=True)
# Ask a question
qa("What is Coinsurance?")

##### RESPONSE #####
> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'What is Coinsurance?',
 'result': ' Coinsurance is the percentage of each health care bill a person must pay out of their own pocket. Non-covered charges and deductibles are in addition to this amount. Coinsurance maximum is the most you will have to pay in coinsurance during a policy period (usually a year) before your health plan begins paying 100 percent of the cost of your covered health services. The coinsurance maximum generally does not apply to copayments or other expenses you might be required to pay.\n\nC'}

qa("What is Group Life Insurance?")
###### RESPONSE #######
> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'What is Group Life Insurance?',
 'result': ' Group life insurance provides coverage to a group of people under one contract. \nMost group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life \ninsurance can also be sold to associations to cover their membe rs and to lending institutions to cover the \namounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be \nissued a master policy and each person in the group will receiv e a certificate of insurance. \nGroup of companies  - Several insurance companies u nder common ownership and often  common \nmanagement'}

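5. Pipeline

Chaining compressors and document transformers together

Parameters of EmbeddingsFilter:

embeddings: langchain_core.embeddings.Embeddings [required]. The embeddings used to embed document contents and queries.

k: Optional[int] = 20. The number of relevant documents to return. Can be set to None, in which case similarity_threshold must be specified. Defaults to 20.

similarity_fn: Callable. The similarity function used to compare documents. It is expected to take two matrices (List[List[float]]) as input and return a matrix of scores in which higher values indicate greater similarity.

similarity_threshold: Optional[float] = None. The threshold for deciding when two documents are similar enough to be considered redundant. Defaults to None; must be specified if k is set to None.

Here we build a pipeline made up of a redundancy filter and a relevance filter: the redundancy filter removes duplicated context, and the relevance filter keeps only the relevant context.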
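How `k` and `similarity_threshold` interact can be sketched with a few lines of Python. This is an approximation written for this article, not the library's code: `embeddings_filter` is a hypothetical function, the vectors are made up, and details such as result ordering may differ from the real EmbeddingsFilter.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embeddings_filter(query_vec, doc_vecs, k=20, similarity_threshold=None):
    """Return indices of documents to keep, most similar to the query first."""
    if k is None and similarity_threshold is None:
        raise ValueError("similarity_threshold must be given when k is None")
    sims = [cosine(query_vec, v) for v in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    if k is not None:
        order = order[:k]                      # keep at most k documents
    if similarity_threshold is not None:
        order = [i for i in order if sims[i] > similarity_threshold]
    return order

# Hypothetical 2-dimensional vectors with similarities of about 1.0, 0.8 and 0.0 to the query.
doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query_vec = [1.0, 0.0]
by_k = embeddings_filter(query_vec, doc_vecs, k=2)
by_threshold = embeddings_filter(query_vec, doc_vecs, k=None, similarity_threshold=0.5)
both = embeddings_filter(query_vec, doc_vecs, k=2, similarity_threshold=0.9)
print(by_k, by_threshold, both)
```

Setting only `k` caps the count, setting only the threshold caps the quality, and setting both applies the cap first and then the quality bar.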

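**EmbeddingsRedundantFilter:** identifies similar documents and filters out the redundant ones;

**EmbeddingsFilter:** embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query's, which is a cheaper and faster alternative to the LLM-based compressors.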
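The redundancy idea can be shown with a greedy pass over document embeddings. The sketch below is a simplification under stated assumptions: `drop_redundant` is a hypothetical helper, the 0.95 threshold is chosen for the example, and the real EmbeddingsRedundantFilter may decide differently which document of a similar pair to discard.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def drop_redundant(vectors, threshold=0.95):
    """Greedy pass: keep a vector only if it is not too similar to any vector kept earlier."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical embeddings: the second vector is almost identical to the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = drop_redundant(vectors)
print(kept_indices)
# [0, 2]
```

Overlapping chunks, like the two near-identical documents returned for the group life insurance query earlier, are the typical case this step removes.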


from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline, EmbeddingsFilter
# drop near-duplicate chunks, then keep up to 5 chunks most similar to the query
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings,k=5)
# build the pipeline
pipeline_compressor = DocumentCompressorPipeline(transformers=[redundant_filter,relevant_filter])
# compression retriever
compression_retriever_pipeline = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=pipeline_compressor)
# print the retriever configuration
print(compression_retriever_pipeline)
## Get relevant documents
compressed_docs = compression_retriever_pipeline.get_relevant_documents(query="What is Coinsurance?")
pretty_print_docs(compressed_docs)


##### RESPONSE #####
Document1:

Claimant  - A person who makes an insurance claim. 
Coinsurance  - The percentage of each health care bill a person must pay ou t of their own pocket. Non-covered 
charges and deductibles are in addition to this amount. 
Coinsurance maximum  - The most you will have to pay in coinsurance during a policy  period (usually a 
year) before your health plan begins paying 100 percent of the cost of your covered health services. The 
coinsurance maximum generally does not apply to copayments or o ther expenses you might be required to pay. 
Collision coverage  - Pays for damage to a car with out regard to who caused an acc ident. The company must



compressed_docs = compression_retriever_pipeline.get_relevant_documents(query="What is Earned premium?")
pretty_print_docs(compressed_docs)

##### RESPONSE #####
Document1:

replacement cost or the actual cash value, which includes depre ciation. 
Replacement cost  - Insurance coverage that pays the dollar amount needed to rep lace the structure or 
damaged personal property without deducting for depreciation bu t limited by the policy's maximum dollar 
amount. 
Rescission  - The termination of an insurance contract by the insurer when  material misrepresentation has 
occurred. 
Return premium  - A portion of the premium returned to a policy owner as a res ult of cancelation, rate 
adjustment, or a calculation that an advance premium was in exc ess of the actual premium.

compressed_docs = compression_retriever_pipeline.get_relevant_documents(query="What is Group Insurance Policy?")
pretty_print_docs(compressed_docs)
##### RESPONSE #######

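5.1 Question answering with the llmware/bling-sheared-llama-1.3b-0.1 model

This model generates short text replies, which is mainly useful for chatbot-style applications where longer replies are not needed. Compared with top models such as Zephyr-beta-7b or OpenAI's, this LLM does not produce equally effective responses and is used here for experimental purposes only. Using a more capable LLM can improve the correctness of the generated responses.

The prompt format for llmware models is as follows: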

from langchain.prompts import PromptTemplate
template ="""
<human>:
Context:{context}

Question:{question}

Use the above Context to answer the user's question.Consider only the Context provided above to formulate response.If the Question asked does not match with the Context provided just say 'I do not know the answer'.
<bot>:

"""
prompt = PromptTemplate(input_variables=["context","question"],template=template)
chain_type_kwargs = {"prompt":prompt}
print(prompt)
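With `chain_type="stuff"`, all retrieved chunks are placed into the single `{context}` slot of this template before the model is called. A rough sketch of that step, assuming the chunks are joined with a blank line (`stuff_prompt` and the shortened template are illustrative, not the chain's actual code):

```python
# Shortened copy of the template above, kept only to make the example self-contained.
TEMPLATE = """
<human>:
Context:{context}

Question:{question}

<bot>:
"""

def stuff_prompt(docs, question, separator="\n\n"):
    """Concatenate all retrieved chunks into one context string and fill the template."""
    return TEMPLATE.format(context=separator.join(docs), question=question)

filled = stuff_prompt(["chunk A", "chunk B"], "What is X?")
print(filled)
```

Since every chunk lands in one prompt, shorter and fewer chunks from the compression step directly reduce the prompt size the model has to process.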

####
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=compression_retriever_pipeline,
                                 chain_type_kwargs=chain_type_kwargs,
                                 return_source_documents=True,
                                 verbose=True)
#
qa("What is Group Insurance Policy?")

###### RESPONSE ############
> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'What is Group Insurance Policy?',
 'result': '<bot>: Group insurance policy is a policy that covers the life of a group of people. \nIt is usually sold to businesses that want to provide life insurance for their employees. \nIt can also be sold to associations to cover their members and to lending institutions to cover the amount of their debtor loans. \nMost group policies are for term insurance.<|endoftext|> Хронологија Хронологија Хронологија Хронологија Хронологија instanceof instanceof instanceof instanceof instanceof Хронологија Хронологија instanceof instanceof instanceof instanceofbolds Хронологија Narodowecka Narodowecka Narodowecka Narodowecka<|endoftext|><|endoftext|>bolds<|endoftext|>boldstrightarrow</boldsightarrow <|endoftext|></trightarrow',
 'source_documents': [_DocumentWithState(page_content='Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life \ninsurance can also be sold to associations to cover their membe rs and to lending institutions to cover the \namounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be \nissued a master policy and each person in the group will receiv e a certificate of insurance. \nGroup of companies  - Several insurance companies u nder common ownership and often  common \nmanagement.', metadata={'page': 5, 'source': '/content/CommonInsuranceTerms (1).pdf'}, state={'embedded_doc': [0.21783578395843506, 0.10463877022266388, -0.027319835498929024, -0.3879217505455017, 0.40784013271331787, 0.26043280959129333, 
 ...
 -0.4364112913608551, -0.06394162029027939, 0.059370845556259155, -0.10004513710737228, 0.41433295607566833, 0.02755933254957199, -0.29557839035987854, 0.8827390670776367], 'query_similarity_score': 0.5921743311683283})]}



response = qa("What is Long-term care benefits?")
print(response['result'].split("<|endoftext|>")[0])

###### RESPONSE ######
 Entering new RetrievalQA chain...

> Finished chain.
Long-term care benefits - Coverage that provides help for people when they are unable to care for themselves because of prolonged illness or disability. Benefits are triggered by specific findings of "cognitive impairment" or inability to perform certain actions known as "Activities of Daily Living." Benefits can range from help with daily activities while recuperating at home to skilled nursing care provided in a nursing home.
print(response)

######### RESPONSE #######
{'query': 'What is Long-term care benefits?',
 'result': 'Long-term care benefits - Coverage that provides help for people when they are unable to care for themselves because of prolonged illness or disability. Benefits are triggered by specific findings of "cognitive impairment" or inability to perform certain actions known as "Activities of Daily Living." Benefits can range from help with daily activities while recuperating at home to skilled nursing care provided in a nursing home.<|endoftext|> Хронологија Хронологија Хронологија Хронологија instanceof instanceof instanceof instanceof instanceof instanceofboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsboldsbolds',
 'source_documents': [_DocumentWithState(page_content='Long-term care benefits  - Coverage that provides help f or people when they are unable to care for themselves \nbecause of prolonged illness or disability. Benefits are trigge red by specific findings of "cognitive impairment" \nor inability to perform certain actions known as "Activities of  Daily Living." Benefits  can range from help \nwith daily activities while recuperating at home to skilled nur sing care provided in a nursing home.', metadata={'page': 7, 'source': '/content/CommonInsuranceTerms (1).pdf'}, state={'embedded_doc': [0.10907948762178421, 0.09334351122379303, 0.41147854924201965, -0.3373779356479645, 0.4558241069316864, 0.5461165308952332, -0.052463024854660034, -0.43229037523269653, -0.06157633662223816, 0.09955484420061111, -0.6033256649971008, -0.05812996253371239, 0.7388888001441956, -0.011550833471119404, -0.5817828178405762, -0.08132432401180267, 
 ...
 -0.04400945082306862, 0.12993864715099335, 0.10278826206922531, -0.3340747654438019, 0.02115345373749733, -0.7229974269866943, 0.07934501022100449, -1.2799934148788452, -0.006999789737164974], 'query_similarity_score': 0.6116580581049508}),
  _DocumentWithState(page_content='because of prolonged illness or disability. Benefits are trigge red by specific findings of "cognitive impairment" \nor inability to perform certain actions known as "Activities of  Daily Living." Benefits  can range from help \nwith daily activities while recuperating at home to skilled nur sing care provided in a nursing home.', metadata={'page': 7, 'source': '/content/CommonInsuranceTerms (1).pdf'}, state={'embedded_doc': [0.2193884253501892, 0.3908122181892395, 0.014659926295280457, -0.2731555104255676, 0.2924192249774933, 0.6941999793052673, 0.2676321566104889, -0.45316028594970703, -0.018273770809173584, 0.05401374399662018, -0.4481824040412903, -0.008003979921340942, 0.7279651165008545, -0.04766766354441643, -0.6827098727226257, -0.015044593252241611, 0.2702178359031677, 
  ...
  0.24355649948120117, 0.014441099017858505, -0.16064004600048065, -0.51854407787323, -0.6001731157302856, 0.21660731732845306, 0.031298816204071045, 0.8995707035064697, 0.33247408270835876, 0.0323009230196476, -0.33297884464263916, -0.855561375617981, -0.6020331978797913, -0.4133726954460144, 0.13149243593215942, -0.20236220955848694, 0.3687146008014679, -0.08876236528158188, -0.2355964183807373, -0.7732605934143066, 0.13224034011363983, -1.1086665391921997, -0.11192735284566879], 'query_similarity_score': 0.5300057624305469})]}
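As the raw results above show, the model keeps generating after its `<|endoftext|>` marker and sometimes prefixes the answer with `<bot>:`. The `split` call used earlier handles the first issue; a small helper can handle both. `clean_answer` is a hypothetical convenience function written for this article, not part of LangChain or llmware.

```python
def clean_answer(raw, stop="<|endoftext|>", prefix="<bot>:"):
    """Keep the text before the first stop marker and strip an optional leading prefix."""
    text = raw.split(stop)[0].strip()
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    return text

raw = "<bot>: Group insurance policy covers a group of people.<|endoftext|> instanceof instanceof"
cleaned = clean_answer(raw)
print(cleaned)
# Group insurance policy covers a group of people.
```

It can be applied to `response['result']` in the same place where the `split` call is used.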

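5.2 Create a new pipeline

New pipeline = compressor + redundancy filter + relevance filter

compressor: an LLMChainExtractor, which iterates over the initially returned documents and extracts from each one only the content relevant to the query.

from langchain.llms import OpenAI
# api_key is assumed to hold an OpenAI API key, read here interactively
api_key = getpass("Enter OpenAI API Key:")
compressor = LLMChainExtractor.from_llm(llm=OpenAI(temperature=0.3,openai_api_key=api_key))
#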
new_pipeline = DocumentCompressorPipeline(transformers=[compressor,redundant_filter,relevant_filter])
new_compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=new_pipeline)
compressed_docs = new_compression_retriever.get_relevant_documents(query="What is Coinsurance?")
pretty_print_docs(compressed_docs)

###### RESPONSE ##########
Document1:

Coinsurance - The percentage of each health care bill a person must pay out of their own pocket. Coinsurance maximum - The most you will have to pay in coinsurance during a policy period (usually a year) before your health plan begins paying 100 percent of the cost of your covered health services.
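The order of the stages in such a pipeline matters, because each stage only sees what the previous one returned. The toy sketch below illustrates this with three stand-in stages; `run_pipeline` and the stage functions are hypothetical and use simple string matching in place of the LLM extractor and embedding filters.

```python
def run_pipeline(stages, docs, query):
    """Feed the output of each stage into the next, in list order."""
    for stage in stages:
        docs = stage(docs, query)
    return docs

def extract_stage(docs, query):
    """Toy compressor: keep only the lines that mention the query term."""
    out = []
    for d in docs:
        lines = [ln for ln in d.splitlines() if query.lower() in ln.lower()]
        if lines:
            out.append("\n".join(lines))
    return out

def dedupe_stage(docs, query):
    """Toy redundancy filter: drop exact duplicates, keeping first occurrences."""
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def top1_stage(docs, query):
    """Toy relevance filter: keep the single document that mentions the term most often."""
    return sorted(docs, key=lambda d: d.lower().count(query.lower()), reverse=True)[:1]

chunks = [
    "Claimant - A person who makes a claim.\nCoinsurance - The percentage you pay.",
    "Coinsurance - The percentage you pay.\nCollision coverage - Pays for car damage.",
]
result = run_pipeline([extract_stage, dedupe_stage, top1_stage], chunks, "coinsurance")
swapped = run_pipeline([dedupe_stage, extract_stage], chunks, "coinsurance")
print(result)
print(len(swapped))
```

Extracting first makes the two overlapping chunks identical, so the redundancy stage can remove one of them; with the order swapped, the duplicate survives.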

Implement the question-answering chain

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=new_compression_retriever,
                                 chain_type_kwargs=chain_type_kwargs,
                                 return_source_documents=True,
                                 verbose=True)
#
response = qa("What is Coinsurance?")
print(response['result'].split("<|endoftext|>")[0])

##### RESPONSE #######
  Finished chain.
 No, Coinsurance is the percentage of each health care bill a person must pay out of their own pocket.

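6. Conclusion

In summary, retrieval from a document store calls for a deliberate approach to efficiency and response quality. Because the specific queries are unknown at ingestion time, retrieved documents often contain irrelevant information, which in turn raises cost and degrades responses when a large language model is used. Contextual compression is a valuable answer to this problem: a base retriever gathers a broad set of information, and a document compressor then filters and processes it so that only the details needed to answer the user's query are kept. This approach makes better use of resources and improves overall system performance and user experience.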


References:

[1] https://medium.aiplanet.com/implement-contextual-compression-and-filtering-in-rag-pipeline-4e9d4a92aa8f?gi=62283e44f70c&source=email-c63e4493b83d-1704307828713-digest.reader-edbc285dc84a-4e9d4a92aa8f----10-98------------------fdc06c81_5ca8_4867_b1e0_aa632ce3289c-1
