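I wanted to build a PDF question-answering feature. If the PDF is very short (less than one page), the Chroma index ends up with too few entries: the index holds fewer entries than the number of results requested at query time (4 by default), and the query fails with an error (chromadb.errors.NotEnoughElementsException: Number of requested results 4 cannot be greater than number of elements in index 1).
The fix is to get the total number of index entries before calling chroma1.as_retriever(). If that number n is less than 4, pass search_kwargs={"k": n} to as_retriever().
For example, if the index holds 3 entries, running docsearch = chroma1.as_retriever(search_kwargs={"k": 3}) no longer raises the error.
Here is an example that gets the total number of entries in a Chroma index: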
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
loader = PyPDFLoader("test.pdf")
pages = loader.load_and_split()
embeddings = OpenAIEmbeddings()
chroma1 = Chroma.from_documents(pages, embeddings, persist_directory="testdir")
# Use chroma1._collection.count() to get the number of index entries
index_count = chroma1._collection.count()
print(index_count)
if index_count <= 3:
    # Fewer entries than the default k of 4: request only as many as exist
    docsearch = chroma1.as_retriever(search_kwargs={"k": index_count})
else:
    docsearch = chroma1.as_retriever()
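query = "What is this article about?"
# The next line raises an error if the retriever requests the default 4 results
# from an index that holds fewer than 4 entries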
docs = docsearch.get_relevant_documents(query)
chain = load_qa_chain(ChatOpenAI(temperature=0), chain_type="stuff")
output = chain.run(input_documents=docs, question=query)
print(output)
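The if/else branch above can also be written as a small helper that caps k at the index size. This is only a sketch: the name clamp_k and the default_k parameter are made up for illustration, and the default of 4 is taken from the error message quoted above.

```python
def clamp_k(index_count: int, default_k: int = 4) -> int:
    """Return the largest k that is safe to request from an index of this size.

    clamp_k and default_k are hypothetical names, not part of LangChain or Chroma.
    """
    if index_count < 1:
        # An empty index has nothing to retrieve, so no positive k is valid.
        raise ValueError("the index is empty; there is nothing to retrieve")
    # Never ask for more results than the index holds.
    return min(default_k, index_count)


# Intended use with the objects from the script above (not executed here):
# k = clamp_k(chroma1._collection.count())
# docsearch = chroma1.as_retriever(search_kwargs={"k": k})
```

Passing k explicitly in both cases removes the branch, and the empty-index check turns a confusing query-time failure into an early, readable one.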