Azure Machine Learning in Practice: Parallel Batch Inference on a Tabular Dataset Partitioned by Column Values

MachineLearningNotebooks: Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft. Project address: https://gitcode.com/gh_mirrors/ma/MachineLearningNotebooks

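Introduction

In modern machine learning applications, running batch inference over large datasets is a common production scenario. Azure Machine Learning pipelines, and the ParallelRunStep component in particular, handle this workload efficiently. This article walks through using an Azure ML pipeline to run parallel batch inference on a tabular dataset partitioned by column values.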

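Batch Inference Overview

Batch inference is an asynchronous way of generating predictions for large datasets. Its main characteristics:

High throughput: can process terabytes of production data
Cost-effective: better suited to large volumes than real-time inference
Scalable: compute resources scale automatically

Choosing an approach
If you need low-latency processing (for example, quickly handling a single document or a small amount of data), choose real-time scoring rather than batch processing.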

Environment Setup

Connecting to the workspace

First, connect to the Azure Machine Learning workspace:

from azureml.core.workspace import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

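Configuring compute resources

Create a compute cluster, or attach to an existing one with the same name:

# ComputeTarget is needed below for ComputeTarget.create
from azureml.core.compute import AmlCompute, ComputeTarget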

compute_name = "cpu-cluster"
compute_min_nodes = 0
compute_max_nodes = 2
vm_size = "STANDARD_D2_V2"

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
else:
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        min_nodes=compute_min_nodes,
        max_nodes=compute_max_nodes)
    
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True)

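Data Processing

Preparing the dataset

The orange juice sales data serves as the example:

import requests

oj_sales_path = "./oj.csv"
# Placeholder: substitute the URL of the orange juice sales data
r = requests.get("<orange juice sales data URL>")
r.raise_for_status()
with open(oj_sales_path, "wb") as f:
    f.write(r.content)
# Upload the local file to the "oj_sales_data" folder on the datastore
datastore.upload_files([oj_sales_path], ".", "oj_sales_data")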

Creating the tabular dataset

from azureml.core import Dataset
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, 'oj_sales_data/*.csv'))

Partitioning by column

Partition the dataset by the Store and Brand columns:

partitioned_dataset = dataset.partition_by(
    partition_keys=['Store', 'Brand'],
    target=(datastore, "partition_by_key_res"),
    name="partitioned_oj_data")

After partitioning, each partition contains all rows for one specific combination of store and brand.
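
To make the effect of partitioning concrete, here is a minimal local sketch that groups rows by the values of the key columns. It uses only the Python standard library and does not call the Azure ML SDK; the sample rows and the helper name partition_rows are hypothetical, and only the column names Store and Brand come from the article.

```python
from collections import defaultdict

# Hypothetical sample rows; only the column names follow the article.
rows = [
    {"Store": 2, "Brand": "tropicana", "Quantity": 10, "Price": 3.5},
    {"Store": 2, "Brand": "dominicks", "Quantity": 4, "Price": 1.5},
    {"Store": 5, "Brand": "tropicana", "Quantity": 7, "Price": 3.0},
    {"Store": 2, "Brand": "tropicana", "Quantity": 6, "Price": 3.25},
]


def partition_rows(rows, partition_keys):
    """Group rows by the tuple of values found in the partition key columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in partition_keys)
        partitions[key].append(row)
    return dict(partitions)


partitions = partition_rows(rows, ["Store", "Brand"])
for key, members in sorted(partitions.items()):
    print(key, len(members))
```

Each distinct (Store, Brand) pair ends up with its own group of rows, which is the unit that later gets handed to the inference script.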

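Building the Batch Pipeline

Inference script

Create the script total_income.py, which adds a total_income column (Quantity × Price) to the rows of each partition:

import pandas as pd

def init():
    # ParallelRunStep entry scripts define init() alongside run();
    # there is nothing to prepare for this example
    pass

def run(mini_batch, context):
    # mini_batch is a DataFrame holding the rows of one (Store, Brand) partition
    mini_batch['total_income'] = mini_batch['Quantity'] * mini_batch['Price']
    # Return a DataFrame so that append_row writes one delimited line per row,
    # matching how the output file is read back later
    return mini_batch[['WeekStarting', 'Quantity', 'logQuantity', 'Advert', 
                      'Price', 'Age60', 'COLLEGE', 'INCOME', 'Hincome150', 
                      'Large HH', 'Minorities', 'WorkingWoman', 'SSTRDIST', 
                      'SSTRVOL', 'CPDIST5', 'CPWVOL5', 'Store', 'Brand', 
                      'total_income']]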
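
The arithmetic in the script can be checked locally before submitting a pipeline. The sketch below mirrors the Quantity × Price calculation on plain dictionaries instead of a DataFrame, so it runs without pandas or the Azure ML SDK. The helper name add_total_income and the sample values are hypothetical. It also shows the difference between the per-row value the script writes and a per-partition sum, which the script itself does not compute.

```python
def add_total_income(records):
    """Return copies of the records with total_income = Quantity * Price."""
    return [{**r, "total_income": r["Quantity"] * r["Price"]} for r in records]


# Hypothetical mini-batch: every row belongs to the same (Store, Brand) pair.
mini_batch = [
    {"Store": 2, "Brand": "tropicana", "Quantity": 10, "Price": 3.5},
    {"Store": 2, "Brand": "tropicana", "Quantity": 6, "Price": 3.25},
]

result = add_total_income(mini_batch)

# Per-row values, as written by the inference script.
per_row = [r["total_income"] for r in result]

# Summing them gives a per-partition figure (an extra step, not in the script).
partition_total = sum(per_row)
print(per_row, partition_total)
```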

Configuring the run environment

from azureml.core import Environment
from azureml.core.runconfig import CondaDependencies

batch_conda_deps = CondaDependencies.create(
    pip_packages=["azureml-core", "azureml-dataset-runtime[fuse,pandas]"])
batch_env = Environment(name="batch_environment")
batch_env.python.conda_dependencies = batch_conda_deps

Parallel run configuration

from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="Code",
    entry_script="total_income.py",
    partition_keys=['Store', 'Brand'],
    error_threshold=5,
    output_action='append_row',
    append_row_file_name="revenue_outputs.txt",
    environment=batch_env,
    compute_target=compute_target,
    node_count=2,
    run_invocation_timeout=600
)

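Key parameters:

partition_keys: the partition keys; each distinct combination of key values becomes one mini-batch
error_threshold: the number of failures tolerated before the job is aborted
output_action: how output is handled; append_row appends every returned row to a single file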
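
The interaction between mini-batches and error_threshold can be illustrated with a small simulation. This is a conceptual sketch only, not how ParallelRunStep is implemented. It assumes the job aborts once the failure count exceeds the threshold and that -1 means failures are never fatal, and it counts failed mini-batches for brevity, whereas the SDK documentation describes the threshold in terms of failed records for tabular input. The function names and sample data are hypothetical.

```python
def run_with_error_threshold(mini_batches, process, error_threshold):
    """Process mini-batches one by one, tolerating up to error_threshold failures.

    A threshold of -1 means failures never abort the run.
    """
    results, failures = [], 0
    for key, batch in mini_batches.items():
        try:
            results.extend(process(batch))
        except Exception:
            failures += 1
            if error_threshold != -1 and failures > error_threshold:
                raise RuntimeError(
                    f"aborted: {failures} failures exceed threshold {error_threshold}"
                )
    return results, failures


def total_income(batch):
    """Compute Quantity * Price for every row; raises KeyError on a bad row."""
    return [row["Quantity"] * row["Price"] for row in batch]


# Hypothetical mini-batches keyed by (Store, Brand); the second one is malformed.
mini_batches = {
    (2, "tropicana"): [{"Quantity": 10, "Price": 3.5}],
    (5, "tropicana"): [{"Quantity": 7}],  # missing Price
    (8, "dominicks"): [{"Quantity": 4, "Price": 1.5}],
}

results, failures = run_with_error_threshold(mini_batches, total_income, error_threshold=5)
print(results, failures)
```

With a threshold of 5 the single bad mini-batch is skipped and the remaining results are still collected; with a threshold of 0 the same input would abort the run.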

Creating the pipeline step

from azureml.pipeline.steps import ParallelRunStep
from azureml.pipeline.core import PipelineData

output_dir = PipelineData(name="inferences", datastore=datastore)

parallel_run_step = ParallelRunStep(
    name='summarize-revenue',
    inputs=[partitioned_dataset.as_named_input("partitioned_tabular_input")],
    output=output_dir,
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)

Running the Pipeline and Viewing Results

Running the pipeline

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
pipeline_run = Experiment(ws, 'tabular-dataset-partition').submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)

Viewing the results

import os  # needed for os.path.join below
import pandas as pd
import tempfile

batch_run = pipeline_run.find_step_run(parallel_run_step.name)[0]
batch_output = batch_run.get_output_data(output_dir.name)

target_dir = tempfile.mkdtemp()
batch_output.download(local_path=target_dir)
result_file = os.path.join(target_dir, batch_output.path_on_datastore, 
                          "revenue_outputs.txt")

df = pd.read_csv(result_file, delimiter=" ", header=None)
df.columns = ["WeekStarting", "Quantity", "logQuantity", "Advert", "Price", 
             "Age60", "COLLEGE", "INCOME", "Hincome150", "Large HH", 
             "Minorities", "WorkingWoman", "SSTRDIST", "SSTRVOL", 
             "CPDIST5", "CPWVOL5", "Store", "Brand", "total_income"]

print("Rows in prediction results:", df.shape[0])
df.head(10)
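
The column names are assigned by hand above because the output file is read with header=None, that is, it carries no header row. The following sketch shows the same idea with the standard-library csv module on a tiny in-memory sample, so it runs without the pipeline output. The sample text and its three-column layout are hypothetical and far narrower than the real file.

```python
import csv
import io

# Hypothetical excerpt of a space-delimited, header-less output file.
sample_output = "2 tropicana 35.0\n2 dominicks 6.0\n"
columns = ["Store", "Brand", "total_income"]

# Pair each positional field with its column name.
reader = csv.reader(io.StringIO(sample_output), delimiter=" ")
records = [dict(zip(columns, row)) for row in reader]

# csv yields strings, so numeric columns are converted explicitly.
for record in records:
    record["Store"] = int(record["Store"])
    record["total_income"] = float(record["total_income"])

print(len(records), records[0])
```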