LLM Sherpa - Accelerating LLM Use Cases

Tags: LLM Sherpa, NLMatics, LayoutPDFReader

Article link: https://blog.csdn.net/lovechris00/article/details/142498975

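I. About LLM Sherpa

GitHub: https://github.com/nlmatics/llmsherpa
Official documentation: https://llmsherpa.readthedocs.io/

LLM Sherpa provides strategic APIs to accelerate large language model (LLM) use cases.

What's New

Important: the llmsherpa backend service is now fully open sourced under the Apache 2.0 license. See https://github.com/nlmatics/nlm-ingestor

You can now run your own server using a Docker image!
Support for additional file formats: DOCX, PPTX, HTML, TXT, XML
Built-in OCR support
Blocks now have coordinates: use the bbox property of blocks such as sections
A new indent parser that better aligns all headings in a document with their corresponding level
The free server and the paid server are not updated with the latest code; users are asked to spawn their own servers using the instructions in nlm-ingestor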

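About NLMatics

GitHub: https://github.com/nlmatics
Website: https://www.nlmatics.com/

Nlmatics uses retrieval augmented generation (RAG) to extract data from large document sets. It can also be used for RAG search over a knowledge base. It comes with an extensive UI for search, data extraction, and PDF viewing. It ingests documents with the llmsherpa/nlm-ingestor backend and indexes them in Elasticsearch, from which they are retrieved using a hybrid search approach.

Nlmatics was founded by Ambika Sukla and Bulent Yener.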

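Nlmatics developed an early RAG-like question answering, semantic search, and data extraction pipeline using layout-aware chunking, vector + BM25 indexing, and language models. The open source codebase was developed by Yi Zhang, Ambika Sukla, Kiran Panicker, Niranjan Borawake, Suhail Kandanur, Junkang Wang, Reshav Abraham, **** Sheikholeslami, Laura Johns, Jasmin Omanovic, Karen Reeves, Sonia Joseph, Evan Li, Batya Stein, Cheyenne Zhang, Ashlan Ahmed, Nicholas Greenspan, Connie Xu, Shivangi Jha, and others, with product management support from Pooja Reddy, Ambika Sukla, and Jane Cai.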

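Nlmatics is grateful to have worked with outstanding early adopters in financial services, legal services, and life sciences, who recognized and leveraged our technology well before the current generative AI wave.

Nlmatics raised seed funding from Felix Anthony, Silvertech Ventures, World Trade Ventures, and ERS Ventures.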

LayoutPDFReader

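Most PDF-to-text parsers do not provide layout information. Often, even sentences are split by arbitrary CR/LF characters, which makes paragraph boundaries very hard to find. This creates problems when chunking PDFs and when attaching long-running contextual information, such as section headers, to passages while indexing or vectorizing PDFs for LLM applications such as retrieval augmented generation (RAG).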
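
To make the problem concrete, the sketch below shows what a consumer of plain-text PDF output is forced to do: guess which line breaks are hard wraps and which are paragraph breaks. The `join_wrapped_lines` helper and its heuristic are assumptions made for illustration and are not part of llmsherpa. The heuristic fails on the heading in the sample, which is exactly the kind of error layout-aware parsing is meant to avoid.

```python
import re

def join_wrapped_lines(raw_text: str) -> list[str]:
    """Naively rebuild paragraphs from text with arbitrary line breaks.

    Heuristic (an assumption for illustration): a line that ends with
    sentence-final punctuation closes the current paragraph.
    """
    paragraphs, current = [], []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        current.append(line)
        if re.search(r"[.!?:]$", line):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

raw = (
    "BART is a denoising\n"
    "autoencoder for pretraining\n"
    "sequence-to-sequence models.\n"
    "3 Fine-tuning BART\n"
    "The representations can be\n"
    "used in several ways."
)
paragraphs = join_wrapped_lines(raw)
for p in paragraphs:
    print(repr(p))
```

The second paragraph comes back with the heading "3 Fine-tuning BART" glued to the sentence that follows it, because nothing in the raw text marks the heading as a separate unit.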

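LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information, such as:

Sections and subsections along with their levels.
Paragraphs, with wrapped lines combined.
Links between sections and paragraphs.
Tables along with the section in which each table is found.
Lists and nested lists.
Content that spreads across pages, joined together.
Removal of repeating headers and footers.
Watermark removal.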
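
One way to picture the hierarchy listed above is as a tree in which every block knows its parent. The `Block` dataclass below is a hypothetical stand-in written for this article, not the class llmsherpa returns; it only shows how section levels and parent links make it possible to recover the chain of headings above any paragraph.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """Hypothetical layout node (illustration only, not the llmsherpa class)."""
    tag: str                      # e.g. "header", "para", "list_item", "table"
    text: str
    level: int = 0                # nesting depth, meaningful for headers
    parent: Optional["Block"] = field(default=None, repr=False)
    children: list["Block"] = field(default_factory=list)

    def add(self, child: "Block") -> "Block":
        """Attach a child block and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def heading_path(self) -> list[str]:
        """Titles of all enclosing sections, outermost first."""
        path, node = [], self.parent
        while node is not None:
            if node.tag == "header":
                path.append(node.text)
            node = node.parent
        return list(reversed(path))

root = Block("header", "3 Fine-tuning BART", level=1)
sub = root.add(Block("header", "3.4 Machine Translation", level=2))
para = sub.add(Block("para", "We also explore using BART to improve machine translation decoders."))
print(para.heading_path())
```

A flat list of text lines cannot answer "which section is this paragraph in?"; a tree with parent links can, and that is the information the chunking described later relies on.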

With LayoutPDFReader, developers can find optimal chunks of text to vectorize, as well as a way to work within the limited context window sizes of LLMs.

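You can experiment with the library directly in Google Colab.

Here is an article explaining the problem and our approach.

Here is a LlamaIndex blog post explaining the need for smart chunking.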

How to use with Google Gemini Pro

How to use with Cohere Embed3

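Important Notes

LayoutPDFReader has been tested on a wide variety of PDFs. That said, parsing every PDF correctly is still challenging.
OCR is currently not supported. Only PDFs with a text layer are supported.

Note: LLMSherpa uses a free and open API server. The server does not store your PDFs except for temporary storage during parsing. This server will be decommissioned soon. Self-host your own private server using the instructions at https://github.com/nlmatics/nlm-ingestor

Important: the private offering on the Microsoft Azure Marketplace will be decommissioned soon. Please move to your own self-hosted instance using the instructions at https://github.com/nlmatics/nlm-ingestor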

II. Usage

1. Installation

pip install llmsherpa

2. Read a PDF file

The first step in using LayoutPDFReader is to give it a URL or file path and get back a document object.

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
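
Because the hosted server is being retired, the same call will usually point at a self-hosted instance and a local file. The sketch below builds such a URL; the host `localhost`, the port `5010`, and the path `/api/parseDocument` are assumptions and should be checked against the nlm-ingestor instructions for your deployment. `build_parse_url` and `read_local_pdf` are hypothetical helper names.

```python
from urllib.parse import urlencode

def build_parse_url(host: str = "localhost", port: int = 5010,
                    path: str = "/api/parseDocument") -> str:
    """Build the parser URL for a self-hosted server.

    Host, port, and path are assumptions; use the values from your own
    nlm-ingestor deployment.
    """
    query = urlencode({"renderFormat": "all"})
    return f"http://{host}:{port}{path}?{query}"

def read_local_pdf(pdf_path: str, api_url: str):
    """Parse a local PDF file through the given server."""
    # Imported lazily so that build_parse_url works without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader
    return LayoutPDFReader(api_url).read_pdf(pdf_path)

local_url = build_parse_url()
print(local_url)
```

With a server running, `read_local_pdf("/home/downloads/xyz.pdf", local_url)` would return the same kind of document object as the example above.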

3. Install LlamaIndex

In the following examples, we will use LlamaIndex for simplicity.

pip install llama-index

4. Set up OpenAI

import openai
openai.api_key = "<Insert API Key>"  # replace the placeholder with your own key
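
Hard-coding a key in a notebook makes it easy to leak. A common alternative, sketched below, is to read it from the `OPENAI_API_KEY` environment variable and fail early when it is missing. The `load_openai_key` helper is a hypothetical name introduced here.

```python
import os
from typing import Mapping

def load_openai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI key from the environment or raise a clear error."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key

# Demonstration with an explicit mapping so the sketch runs anywhere.
print(load_openai_key({"OPENAI_API_KEY": "sk-example"}))
```

In a real session the assignment would then read `openai.api_key = load_openai_key()`.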

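5. Vector search and retrieval augmented generation with smart chunking

LayoutPDFReader does smart chunking, keeping text that is related by document structure together:

All list items are kept together, including the paragraph that precedes the list.
Items in a table are chunked together.
Contextual information from section headers and nested section headers is included.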
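
The first rule above can be illustrated with a few lines of plain Python. This is only a sketch of the idea under simplified assumptions (blocks arrive as `(tag, text)` pairs, and `group_blocks` is a hypothetical name); the actual chunking is performed by the parsing server.

```python
def group_blocks(blocks: list[tuple[str, str]]) -> list[str]:
    """Merge each run of list items into the paragraph that introduces it.

    `blocks` holds (tag, text) pairs with tag "para" or "list_item".
    Illustrative only; not the server's algorithm.
    """
    chunks: list[list[str]] = []
    for tag, text in blocks:
        if tag == "list_item" and chunks:
            # Keep the item in the same chunk as the preceding paragraph.
            chunks[-1].append(f"- {text}")
        else:
            chunks.append([text])
    return ["\n".join(parts) for parts in chunks]

blocks = [
    ("para", "Most PDF parsers do not provide layout information."),
    ("para", "LayoutPDFReader parses layout information such as:"),
    ("list_item", "Sections and subsections"),
    ("list_item", "Tables"),
    ("para", "Developers can then pick chunks to vectorize."),
]
grouped = group_blocks(blocks)
for chunk in grouped:
    print(chunk)
    print("---")
```

A fixed-size splitter could cut between "such as:" and the items; grouping by structure keeps the introduction and its items in a single retrievable chunk.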

The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks:

from llama_index.core import Document
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()
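
The loop above stores each chunk with empty metadata. If answers need to be traced back to a place in the PDF, it can help to collect a little information per chunk first. The sketch below relies only on `doc.chunks()`, `to_context_text()`, and the `bbox` property mentioned earlier; `chunk_records`, `chunk_id`, the stub classes, and the sample coordinates are all hypothetical and exist only so the example runs without a server.

```python
def chunk_records(doc) -> list[dict]:
    """Collect text and position for every chunk of a parsed document."""
    records = []
    for i, chunk in enumerate(doc.chunks()):
        records.append({
            "chunk_id": i,
            "text": chunk.to_context_text(),
            # Fall back to None if a block carries no coordinates.
            "bbox": getattr(chunk, "bbox", None),
        })
    return records

# Stand-ins for a parsed document; a real `doc` comes from read_pdf().
class _StubChunk:
    def __init__(self, text, bbox=None):
        self._text = text
        if bbox is not None:
            self.bbox = bbox

    def to_context_text(self):
        return self._text

class _StubDoc:
    def __init__(self, chunks):
        self._chunks = chunks

    def chunks(self):
        return self._chunks

demo_doc = _StubDoc([
    _StubChunk("3 Fine-tuning BART", bbox=[72, 90, 540, 110]),  # made-up box
    _StubChunk("The representations produced by BART can be used in several ways."),
])
records = chunk_records(demo_doc)
print(records[0]["bbox"], records[1]["bbox"])
```

Each record could then be inserted as `Document(text=r["text"], extra_info={"chunk_id": r["chunk_id"]})` in place of the empty dictionary.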

Let's run a query:

response = query_engine.query("list all the tasks that work with bart")
print(response)

We get the following response:

BART works well for text generation, comprehension tasks, abstractive dialogue, question answering, and summarization tasks.

Let's try another query that has to be answered from a table:

response = query_engine.query("what is the bart performance score on squad")
print(response)

Here is the response we get:

The BART performance score on SQuAD is 88.8 for EM and 94.6 for F1.

Summarize a section using prompts

LayoutPDFReader offers powerful ways to pick sections and subsections from a large document and to use LLMs to extract insights from a section.

The following code looks for the Fine-tuning section of the document:

from IPython.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
# use include_children=True and recurse=True to fully expand the section.
# include_children only returns one sublevel of children, whereas recurse goes through all the descendants
HTML(selected_section.to_html(include_children=True, recurse=True))
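
An exact string comparison on the title is brittle: extra whitespace or different capitalization makes the loop fall through without a match. A slightly more forgiving lookup could be wrapped in a helper like the one below. It uses only `doc.sections()` and `section.title` from the example; `find_section` and the stub classes are hypothetical and let the sketch run without a parsed PDF.

```python
def find_section(doc, title: str):
    """Return the first section whose title matches, ignoring case and
    surrounding whitespace, or None when nothing matches."""
    wanted = title.strip().lower()
    for section in doc.sections():
        if section.title.strip().lower() == wanted:
            return section
    return None

# Stand-ins for a parsed document; a real `doc` comes from read_pdf().
class _StubSection:
    def __init__(self, title):
        self.title = title

class _StubDoc:
    def __init__(self, titles):
        self._sections = [_StubSection(t) for t in titles]

    def sections(self):
        return self._sections

demo_doc = _StubDoc(["1 Introduction", "3 Fine-tuning BART ", "4 Comparing Pre-training Objectives"])
match = find_section(demo_doc, "3 fine-tuning bart")
print(match.title if match else "not found")
print(find_section(demo_doc, "9 Missing"))
```

Returning `None` for a missing title lets the caller decide what to do instead of failing later on an attribute access.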

The HTML(...) call renders the following HTML output:

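3 Fine-tuning BART

The representations produced by BART can be used in several ways for downstream applications.

3.1 Sequence Classification Tasks

For sequence classification tasks, the same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is fed into a new multi-class linear classifier. This approach is related to the CLS token in BERT; however, we add the additional token to the end so that the representation for the token in the decoder can attend to decoder states from the complete input (Figure 3a).

3.2 Token Classification Tasks

For token classification tasks, such as answer endpoint classification for SQuAD, we feed the complete document into the encoder and decoder, and use the top hidden state of the decoder as a representation for each word. This representation is used to classify the token.

3.3 Sequence Generation Tasks

Because BART has an autoregressive decoder, it can be used directly for sequence generation tasks such as abstractive question answering and summarization. In both of these tasks, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. Here, the encoder input is the input sequence, and the decoder generates outputs autoregressively.

3.4 Machine Translation

We also explore using BART to improve machine translation decoders for translating into English. Previous work by Edunov et al. (2019) has shown that models can be improved by incorporating pre-trained encoders, but gains from using pre-trained language models in decoders have been limited. We show that it is possible to use the entire BART model (both encoder and decoder) as a single pretrained decoder for machine translation, by adding a new set of encoder parameters that are learned from bitext (see Figure 3b).

More precisely, we replace BART's encoder embedding layer with a new randomly initialized encoder. The model is trained end-to-end, which trains the new encoder to map foreign words into an input that BART can de-noise to English. The new encoder can use a separate vocabulary from the original BART model.

We train the source encoder in two steps, in both cases backpropagating the cross-entropy loss from the output of the BART model. In the first step, we freeze most of the BART parameters and only update the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of the first layer of BART's encoder. In the second step, we train all model parameters for a small number of iterations.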

Now, let's create a custom summary of this text using a prompt:

from llama_index.llms.openai import OpenAI
context = selected_section.to_html(include_children=True, recurse=True)
question = "list all the tasks discussed and one line about each task"
resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")
print(resp.text)

The code above produces the following output:

Tasks discussed in the text:

1. Sequence Classification Tasks: The same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is used for multi-class linear classification.
2. Token Classification Tasks: The complete document is fed into the encoder and decoder, and the top hidden state of the decoder is used as a representation for each word for token classification.
3. Sequence Generation Tasks: BART can be fine-tuned for tasks like abstractive question answering and summarization, where the encoder input is the input sequence and the decoder generates outputs autoregressively.
4. Machine Translation: BART can be used to improve machine translation decoders by incorporating pre-trained encoders and using the entire BART model as a single pretrained decoder. The new encoder parameters are learned from bitext.
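
A fully expanded section can be long, and the prompt has to fit the model's context window. The sketch below assembles the same prompt string as the example and truncates the context at a character limit. `build_prompt` is a hypothetical helper, and the default of 12000 characters is an arbitrary illustrative value, not a property of any model.

```python
def build_prompt(question: str, context: str, max_chars: int = 12000) -> str:
    """Assemble the prompt from the example, truncating very long context.

    max_chars is an arbitrary illustrative limit, not a model constant.
    """
    if len(context) > max_chars:
        context = context[:max_chars]
    return f"read this text and answer question: {question}:\n{context}"

prompt = build_prompt(
    "list all the tasks discussed and one line about each task",
    "<h3>3 Fine-tuning BART</h3><p>The representations produced by BART ...</p>",
)
print(prompt)
```

Cutting by characters is crude; selecting whole subsections until a budget is reached would preserve more structure, at the cost of a little more code.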

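Analyze a table using prompts

With LayoutPDFReader, you can iterate through all the tables in a document and use the power of LLMs to analyze a table. Let's look at the 6th table in this document. If you are using a notebook, you can display the table as follows:

from IPython.display import display, HTML
HTML(doc.tables()[5].to_html())

The output table structure looks like this: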

	SQuAD 1.1 EM/F1	SQuAD 2.0 EM/F1	MNLI m/mm	SST Acc	QQP Acc	QNLI Acc	STS-B Acc	RTE Acc	MRPC Acc	CoLA Mcc
BERT	84.1/90.9	79.0/81.8	86.6/-	93.2	91.3	92.3	90.0	70.4	88.0	60.6
UniLM	-/-	80.5/83.4	87.0/85.9	94.5	-	92.7	-	70.9	-	61.1
XLNet	89.0/94.5	86.1/88.8	89.8/-	95.6	91.8	93.9	91.8	83.8	89.2	63.6
RoBERTa	88.9/94.6	86.5/89

Now let's ask a question to analyze this table:

from llama_index.llms.openai import OpenAI
context = doc.tables()[5].to_html()
resp = OpenAI().complete(f"read this table and answer question: which model has the best performance on squad 2.0:\n{context}")
print(resp.text)

The question above produces the following output:

The model with the best performance on SQuAD 2.0 is RoBERTa, with an EM/F1 score of 86.5/89.4.
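
An LLM's reading of a table is worth cross-checking when the numbers matter. The sketch below does that for the SQuAD 2.0 column using only values quoted in this article (the rows shown above and the RoBERTa score from the answer); it is not a complete copy of the paper's table. `parse_em_f1` and `best_model` are hypothetical helper names.

```python
# SQuAD 2.0 EM/F1 cells as quoted in this article; not the full table.
squad2 = {
    "BERT": "79.0/81.8",
    "UniLM": "80.5/83.4",
    "XLNet": "86.1/88.8",
    "RoBERTa": "86.5/89.4",
}

def parse_em_f1(cell: str) -> tuple[float, float]:
    """Split an 'EM/F1' cell into two floats."""
    em, f1 = cell.split("/")
    return float(em), float(f1)

def best_model(cells: dict[str, str]) -> str:
    """Pick the model with the highest F1, using EM as a tie-breaker."""
    return max(cells, key=lambda m: parse_em_f1(cells[m])[::-1])

winner = best_model(squad2)
print(winner, squad2[winner])
```

For these four rows the deterministic check agrees with the model's answer.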

That's it! LayoutPDFReader also supports tables with nested headers and header rows.

Here is an example with nested headers:

from IPython.display import display, HTML
HTML(doc.tables()[6].to_html())

	CNN/DailyMail			XSum		-
	R1	R2	RL	R1	R2	RL
—	—	—	—	—	—	—
Lead-3	40.42	17.62	36.67	16.30	1.60	11.95
PTGEN (See et al., 2017)	36.44	15.66	33.42	29.70	9.21	23.24
PTGEN+COV (See et al., 2017)	39.53	17.28	36.38	28.10	8.02	21.72
UniLM	43.33	20.21	40.51	-	-	-
BERTSUMABS (Liu & Lapata, 2019)	41.72	19.39	38.76	38.76	16.33	31.15
BERTSUMEXTABS (Liu & Lapata, 2019)	42.13	19.60	39.18	38.81	16.50	31.27
BART	44.16	21.28	40.90	45.14	22.27	37.25

Now let's ask an interesting question:

from llama_index.llms.openai import OpenAI
context = doc.tables()[6].to_html()
question = "tell me about R1 of bart for different datasets"
resp = OpenAI().complete(f"read this table and answer question: {question}:\n{context}")
print(resp.text)

We get the following answer:

R1 of BART for different datasets:
- For the CNN/DailyMail dataset, the R1 score of BART is 44.16.
- For the XSum dataset, the R1 score of BART is 45.14.
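
The question is only answerable because the dataset names in the top header row stay associated with the R1/R2/RL columns beneath them. The sketch below shows one simple way to flatten such a nested header by forward-filling the spanning cells, using the BART row from the table above. `flatten_headers` is a hypothetical helper and not part of llmsherpa.

```python
def flatten_headers(top: list[str], sub: list[str]) -> list[str]:
    """Forward-fill spanning top-level cells and join them with sub-headers."""
    names, current = [], ""
    for t, s in zip(top, sub):
        current = t or current   # an empty cell continues the previous span
        names.append(f"{current} {s}".strip())
    return names

top_row = ["CNN/DailyMail", "", "", "XSum", "", ""]
sub_row = ["R1", "R2", "RL", "R1", "R2", "RL"]
bart_row = [44.16, 21.28, 40.90, 45.14, 22.27, 37.25]

columns = flatten_headers(top_row, sub_row)
bart = dict(zip(columns, bart_row))
r1_scores = {name: value for name, value in bart.items() if name.endswith(" R1")}
print(r1_scores)
```

The two R1 values recovered this way, 44.16 for CNN/DailyMail and 45.14 for XSum, match the answer above.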