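(20-3-01) A Question-Answering System Based on the Harry Potter Book Series: LangChain Multi-Document Retriever (01) Loading Documents + Text Splitting + Creating Embeddings + Loading the Vector Database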

8.5  A LangChain-Based Multi-Document Retriever

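This section prepares text documents for natural language processing (NLP) tasks and implements a multi-document retriever with LangChain. Specifically, it loads the documents, splits them into chunks, creates embeddings, and saves those embeddings in a vector store. These are common document-processing steps in NLP projects, especially when building chatbots, question-answering systems, or text retrieval systems. They convert unstructured text into a representation that a machine can search and compare, which later, more complex language tasks build on.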

8.5.1  Loading Documents

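(1) Use LangChain's DirectoryLoader class to load the PDF files in the specified directory so that later steps can extract text, generate embeddings, and run similarity searches.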

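# In newer LangChain releases these classes are exposed by langchain_community.document_loaders.
from langchain.document_loaders import DirectoryLoader, PyPDFLoader

# CFG is the configuration object used throughout this project; it is defined outside this section.
loader = DirectoryLoader(
    CFG.PDFs_path,            # directory that contains the PDF files
    glob="./*.pdf",           # match only PDF files in that directory
    loader_cls=PyPDFLoader,   # parse each file with PyPDFLoader (one Document per page)
    show_progress=True,       # show a progress bar while loading
    use_multithreading=True   # load files concurrently
)

documents = loader.load()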

The cell output, including timing information, is:

100%|██████████| 7/7 [02:17<00:00, 19.69s/it]
CPU times: user 2min 18s, sys: 1.83 s, total: 2min 20s
Wall time: 2min 17s

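(2) The following line prints the total number of pages. len(documents) is the length of the documents list; because PyPDFLoader creates one Document per PDF page, this equals the total page count across all loaded files.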

print(f'We have {len(documents)} pages in total')

Running this produces:

We have 4114 pages in total
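
Because each page-level Document records its file path in metadata['source'] (visible in the search results later in this section), the total can also be broken down per book. The following is a minimal sketch that uses a hypothetical stand-in class `Doc` and made-up file names so that it runs without LangChain; with the real documents list, the same Counter expression should apply.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(d.metadata["source"] for d in docs)


# Made-up sample data: two pages from one file, one page from another.
sample_docs = [
    Doc("page text", {"source": "book1.pdf", "page": 0}),
    Doc("page text", {"source": "book1.pdf", "page": 1}),
    Doc("page text", {"source": "book2.pdf", "page": 0}),
]

counts = pages_per_source(sample_docs)
for source, n_pages in sorted(counts.items()):
    print(f"{source}: {n_pages} pages")
```

The per-file counts sum to len(documents), which is a quick sanity check that every PDF was loaded.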

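(3) The following line accesses the element at index 8 of the documents list, that is, the ninth page-level Document. page_content is an attribute that holds all of the text extracted from that page.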

documents[8].page_content

Running this produces:

"8Ron\nP.S. Percy's Head Boy. He got the letter last week.Harry glanced back at the photograph. Percy, who was in his seventh and\nfinal year at Hogwarts, was looking particularly smug. He had pinned hisHead Boy badge to the fez perched jauntily on top of his neat hair, hishorn-rimmed glasses flashing in the Egyptian sun.\nHarry now turned to his present and unwrapped it. Inside was what looked\nlike a miniature glass spinning top. There was another note from Ronbeneath it.\nHarry -- this is a Pocket Sneakoscope. If there's someone untrustworthy\naround, it's supposed to light up and spin. Bill says it's rubbish soldfor wizard tourists and isn't reliable, because it kept lighting up atdinner last night. But he didn't realize Fred and George had put beetlesin his soup.\nBye --RonHarry put the Pocket Sneakoscope on his bedside table, where it stood\nquite still, balanced on its point, reflecting the luminous hands of hisclock. He looked at it happily for a few seconds, then picked up theparcel Hedwig had brought.\nInside this, too, there was a wrapped present, a card, and a letter,\nthis time from Hermione.\nDear Harry,Ron wrote to me and told me about his phone call to your Uncle Vernon. I\ndo hope you're all right.\nI'm on holiday in France at the moment and I didn't know how I was going\nto send this to you -- what if they'd opened it at customs? -- but thenHedwig turned up! I think she wanted to make sure you got something for"

8.5.2  Text Splitting

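In LangChain, a text splitter breaks long text into smaller chunks that are easier to process and compare during similarity search. Splitting is necessary before creating embeddings because large language models and embedding models usually limit the size of their input. The code below converts the original documents into a series of smaller, manageable chunks that can then be embedded and used for similarity search or other NLP tasks. This approach helps with large document collections, especially when a whole document would not fit within a model's input limit.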

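from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,   # maximum size of each chunk
    chunk_overlap = CFG.split_overlap    # amount of text shared by neighbouring chunks
)

# Split every page-level Document into smaller chunk Documents
texts = text_splitter.split_documents(documents)
print(f'We have created {len(texts)} chunks from {len(documents)} pages')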
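
To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm that RecursiveCharacterTextSplitter uses (that class tries to break at separators such as paragraph and line breaks before falling back to smaller units). The function name `split_with_overlap` and the sample values are assumptions chosen for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(repr(c))
```

A larger overlap lowers the chance that a relevant passage is cut in half at a chunk boundary, at the cost of more chunks to embed and store.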

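8.5.3  Creating Embeddings

This step is part of a larger workflow that uses LangChain and Hugging Face's transformers library to build a chatbot that answers questions about the Harry Potter book series. The full workflow runs from loading documents, splitting text, and creating embeddings through to generating answers with a large language model. It also involves some performance considerations, such as the advantages of a FAISS vector store over Chroma.

The code below checks whether a FAISS index file exists. If it does not, the code downloads the configured Hugging Face embedding model, uses it to embed the text chunks, and saves the result as a FAISS vector database for later similarity search and retrieval.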

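import os

# In newer LangChain releases these classes are exposed by langchain_community.
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

# Build the index only if no prebuilt one is found in CFG.Embeddings_path
if not os.path.exists(CFG.Embeddings_path + '/index.faiss'):

    ### download embeddings model
    embeddings = HuggingFaceInstructEmbeddings(
        model_name = CFG.embeddings_model_repo,
        model_kwargs = {"device": "cuda"}  # run the model on the GPU
    )

    ### create embeddings and DB
    vectordb = FAISS.from_documents(
        documents = texts,
        embedding = embeddings
    )

    ### persist vector database
    vectordb.save_local(f"{CFG.Output_folder}/faiss_index_hp")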

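Specifically, the code above performs the following steps:

  1. Check for a FAISS index file: the code first checks whether a file named index.faiss exists under the path given by CFG.Embeddings_path. This file holds the index that FAISS builds over the text embeddings.
  2. Download the embedding model: if index.faiss does not exist, the code downloads a pretrained embedding model through the HuggingFaceInstructEmbeddings class, which wraps a model hosted on Hugging Face, and runs it on the GPU (via model_kwargs = {"device": "cuda"}).
  3. Create the embeddings and the database: FAISS.from_documents builds a FAISS vector database from texts (the chunks produced earlier) and the downloaded embedding model.
  4. Save the vector database: finally, the FAISS vector database is saved into a directory named faiss_index_hp inside the output folder given by CFG.Output_folder.

The purpose of the whole block is to make sure a FAISS vector database is available for similarity search; if none exists, the block builds one, starting from embedding creation. Note that the existence check looks in CFG.Embeddings_path, while a newly built index is written under CFG.Output_folder, so the two settings refer to different locations.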
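
The check-then-build pattern described above can be sketched without FAISS. The example below is a toy under stated assumptions: `load_or_build_index`, `toy_build`, and the JSON file format are all hypothetical stand-ins, used only to show that the expensive step runs once and that later calls reuse the saved file.

```python
import json
import os
import tempfile


def load_or_build_index(index_path, build):
    """Return (index, was_built); call build() only if no saved file exists."""
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            return json.load(f), False
    index = build()  # the expensive step (embedding every chunk in the real code)
    os.makedirs(os.path.dirname(index_path), exist_ok=True)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, True


build_calls = []


def toy_build():
    """Hypothetical builder: records that it ran and returns made-up vectors."""
    build_calls.append(1)
    return {"chunk-0": [0.1, 0.2], "chunk-1": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index", "index.json")
    first, built_first = load_or_build_index(path, toy_build)
    second, built_second = load_or_build_index(path, toy_build)

print(built_first, built_second, len(build_calls))
```

Unlike the original block, this sketch checks and saves at the same path, which keeps the cache check and the saved artifact in sync.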

8.5.4  Loading the Vector Database

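Once the vector database has been saved, it only needs to be loaded from disk (here, from the input dataset). The embedding function used when loading must be the same one that was used to create the embeddings.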
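
A toy illustration of why the embedding function must match: stored vectors only make sense when the query is embedded into the same coordinate system. The sketch below uses hypothetical bag-of-words embedders (`make_embedder`) and made-up sentences. The two embedders use the same words in a different axis order, which stands in for two different embedding models.

```python
def make_embedder(vocab):
    """Return a toy bag-of-words embedder: one count per vocabulary word."""
    def embed(text):
        words = text.lower().split()
        return [float(words.count(w)) for w in vocab]
    return embed


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nearest(query_vec, indexed):
    """Return the text whose stored vector has the largest dot product."""
    return max(indexed, key=lambda item: dot(query_vec, item[1]))[0]


embed_build = make_embedder(["owl", "wand", "broom"])
embed_other = make_embedder(["broom", "wand", "owl"])  # same words, different axis order

texts = ["the owl delivered a letter", "he mounted his broom"]
indexed = [(t, embed_build(t)) for t in texts]  # vectors stored at build time

match_same = nearest(embed_build("owl"), indexed)
match_other = nearest(embed_other("owl"), indexed)
print(match_same)   # query embedded with the build-time function
print(match_other)  # query embedded with a mismatched function
```

With the matching embedder the query "owl" retrieves the owl sentence; with the mismatched one it retrieves the broom sentence, even though nothing in the stored index changed.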

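(1) The code below loads the pretrained embedding model and then loads the FAISS vector database from the specified path, ready for similarity search or question answering.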

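from IPython.display import clear_output

# Must be the same embedding model that was used to build the index
embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG.embeddings_model_repo,
    model_kwargs = {"device": "cuda"}
)

# Newer LangChain releases also require allow_dangerous_deserialization=True here,
# because the saved index includes a pickle file.
vectordb = FAISS.load_local(
    CFG.Embeddings_path, # from input folder
    #CFG.Output_folder + '/faiss_index_hp', # from output folder
    embeddings
)

clear_output()  # clear the notebook cell's output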

(2) The following line runs a similarity search, using the loaded vector database (vectordb) to find the text chunks most relevant to the query string 'magic creatures'.

vectordb.similarity_search('magic creatures')

Running this produces:

[Document(page_content='“Magic?” he repeated in a whisper. \n“That’s right,” said Dumbledore. \n“It’s … it’s magic, what I can do?” \n“What is it that you can do?” \n“All sorts,” breathed Riddle. A flush of excitement was \nrising up his neck into his hollow cheeks; he looked \nfevered. “I can make things move without touching \nthem. I can make animals do what I want them to do, \nwithout training them. I can make bad things happen \nto people who annoy me. I can make them hurt if I \nwant to.”', metadata={'source': '/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/Harry Potter - Book 6 - The Half-Blood Prince.pdf', 'page': 302}),

 Document(page_content='91"Shut up, Malfoy," said Harry quietly. Hagrid was looking downcast and\nHarry wanted Hagrid\'s first lesson to be a success.\n"Righ\' then," said Hagrid, who seemed to have lost his thread, "so -- so\nyeh\'ve got yer books an\' -- an\' - - now yeh need the Magical Creatures.Yeah. So I\'ll go an\' get \'em. Hang on... "\nHe strode away from them into the forest and out of sight."God, this place is going to the dogs," said Malfoy loudly. "That oaf\nteaching classes, my father\'ll have a fit when I tell him\n"Shut up, Malfoy," Harry repeated."Careful, Potter, there\'s a dementor behind you"Oooooooh!" squealed Lavender Brown, pointing toward the opposite side\nof the paddock.\nTrotting toward them were a dozen of the most bizarre creatures Harry', metadata={'source': '/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/Harry Potter - Book 3 - The Prisoner of Azkaban.pdf', 'page': 91}),

 Document(page_content='says Draco Malfoy, a fourth-year student. “We all hate Hagrid, but we’re just too scared to say \nanything.” \nHagrid has no intention of ceasing his campaign \nof intimidation, however. In conversation with a \nDaily Prophet  reporter last month, he admitted \nbreeding creatures he has dubbed “Blast-Ended \nSkrewts,” highly dangerous crosses between manti-\ncores and fire-crabs. The creation of new breeds of magical creature is, of course, an activity usually \nclosely observed by the Department for the Regu-\nlation and Control of Magical Creatures. Hagrid, however, considers himself to be above such petty \nrestrictions.', metadata={'source': '/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/Harry Potter - Book 4 - The Goblet of Fire.pdf', 'page': 453}),

 Document(page_content='Here and there adult wizards and witches were emerging from \ntheir tents and starting to cook breakfast. Some, with furtive looks \naround them, conjured fires with th eir wands; others were striking', metadata={'source': '/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/Harry Potter - Book 4 - The Goblet of Fire.pdf', 'page': 96})]

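Note that the exact results and the search behavior depend on the implementation of the vectordb object and on how the FAISS vector database is configured. For this search to work, the vector database must have been loaded correctly and must contain embeddings of text from the Harry Potter series.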
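
Conceptually, a similarity search embeds the query and returns the k stored chunks closest to it; LangChain's similarity_search takes a k argument that defaults to 4, which matches the four results above. The following is a brute-force sketch under stated assumptions: word-count vectors stand in for the real embedding model, cosine similarity is chosen for illustration (the metric of a real FAISS index depends on how it was built), and the function names and sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_similarity_search(query, chunks, k=4):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks for illustration; not taken from the books.
chunks = [
    "magic creatures live in the forest",
    "the train leaves at eleven",
    "creatures of magic need care",
    "breakfast was served in the hall",
]

top = toy_similarity_search("magic creatures", chunks, k=2)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, which is why the first result above is returned for 'magic creatures' even though it never contains the word "creatures".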

To be continued.
