How to add custom Python code to an Azure ML pipeline

Damien_J

Published 2024-09-13 17:48:25


Article link: https://blog.csdn.net/Damien_J_Scott/article/details/142216051


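Recently I had a requirement to add our own custom Python code to an Azure ML pipeline so that it could integrate with Azure Document Intelligence. While building the POC, working out how to add custom Python code had me stuck for quite a while. I finally got it working, so here are my notes.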

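I started by asking an LLM and followed its answer to the official documentation, which says there is an Execute Python Script component that can be used directly. I looked for it until my head hurt and even suspected a permissions problem.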

Execute Python Script in the designer - Azure Machine Learning | Microsoft Learn

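Later I realized that this is the V1 documentation. On top of that, after reading the description there, I had gone straight to custom components. I was building a custom pipeline, and custom pipelines are V2, which is why I could never find the Execute Python Script component. It shows up once you select Classic instead.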

So if you are on the custom (V2) side, how do you use your own Python code?

Follow this document:
Create and run component-based ML pipelines (UI) - Azure Machine Learning | Microsoft Learn

First, go to GitHub - Azure/azureml-examples: Official community-driven Azure Machine Learning examples, tested with GitHub Actions.

From that repository, download cli/jobs/pipelines-with-components/basics/1b_e2e_registered_components. It contains the sample custom code.

Each custom Python function consists of two parts. The first is the .py code. Taking train from the sample as an example:

import argparse
from pathlib import Path
from uuid import uuid4
from datetime import datetime
import os

parser = argparse.ArgumentParser("train")
parser.add_argument("--training_data", type=str, help="Path to training data")
parser.add_argument("--max_epocs", type=int, help="Max # of epocs for the training")
parser.add_argument("--learning_rate", type=float, help="Learning rate")
parser.add_argument("--learning_rate_schedule", type=str, help="Learning rate schedule")
parser.add_argument("--model_output", type=str, help="Path of output model")

args = parser.parse_args()

print("hello training world...")

lines = [
    f"Training data path: {args.training_data}",
    f"Max epocs: {args.max_epocs}",
    f"Learning rate: {args.learning_rate}",
    f"Learning rate schedule: {args.learning_rate_schedule}",
    f"Model output path: {args.model_output}",
]

for line in lines:
    print(line)

print("mounted_path files: ")
arr = os.listdir(args.training_data)
print(arr)

for filename in arr:
    print("reading file: %s ..." % filename)
    with open(os.path.join(args.training_data, filename), "r") as handle:
        print(handle.read())


# Do the train and save the trained model as a file into the output folder.
# Here only output a dummy data for demo.
curtime = datetime.now().strftime("%b-%d-%Y %H:%M:%S")
model = f"This is a dummy model with id: {str(uuid4())} generated at: {curtime}\n"
(Path(args.model_output) / "model.txt").write_text(model)

This script is where the arguments are read, the actual processing logic runs, and the output is written.
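Because the script receives everything through argparse, the same pattern can be tried locally before it is registered as a component. The following is only a minimal sketch, not the sample's code: the `parse_args` and `run` helpers, the temporary folders, and the `sample.txt` file are hypothetical names introduced for illustration.

```python
import argparse
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Parse the same kind of flags the component's command line passes in."""
    parser = argparse.ArgumentParser("train")
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--model_output", type=str, help="Path of output model")
    return parser.parse_args(argv)


def run(args):
    """List the input folder and write a dummy model file into the output folder."""
    names = sorted(p.name for p in Path(args.training_data).iterdir())
    out_file = Path(args.model_output) / "model.txt"
    out_file.write_text(f"dummy model, lr={args.learning_rate}, inputs={names}\n")
    return out_file


# Local dry run: temporary folders stand in for the paths Azure ML would pass in.
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as out_dir:
    (Path(data_dir) / "sample.txt").write_text("hello")
    args = parse_args(["--training_data", data_dir, "--model_output", out_dir])
    result_text = run(args).read_text()
    print(result_text)
```

Passing an explicit argument list to `parse_args` keeps the logic testable without touching `sys.argv`. When the script runs inside a pipeline, calling `parse_args()` with no arguments reads the real command line.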

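The second part is the matching yml file. It defines where the code for this function lives, what its display name is, which input parameters it takes along with their types and default values, and the command used to run your Python code.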

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: my_train
display_name: Train_upper_case
#version: 1b
type: command
inputs:
  training_data: 
    type: uri_folder
  max_epocs:
    type: integer
    min: 0
    max: 100
  learning_rate: 
    type: number
    default: 0.01
  learning_rate_schedule:
    type: string
    default: time-based
    enum:
      - "step"
      - "time-based"

outputs:
  model_output:
    type: uri_folder
code: ./train_src
environment: azureml://registries/azureml/environments/sklearn-1.5/labels/latest
command: >-
  python train.py 
  --training_data ${{inputs.training_data}} 
  --max_epocs ${{inputs.max_epocs}}   
  --learning_rate ${{inputs.learning_rate}} 
  --learning_rate_schedule ${{inputs.learning_rate_schedule}} 
  --model_output ${{outputs.model_output}}
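
To see how the yml and the script fit together, it can help to expand the `command` template by hand. The sketch below only illustrates the placeholder substitution and is not how Azure ML implements it. The `render_command` helper and the `/mnt/...` paths are hypothetical.

```python
import re
import shlex

# Same template as the yml's command field, joined into one line.
COMMAND_TEMPLATE = (
    "python train.py "
    "--training_data ${{inputs.training_data}} "
    "--max_epocs ${{inputs.max_epocs}} "
    "--learning_rate ${{inputs.learning_rate}} "
    "--learning_rate_schedule ${{inputs.learning_rate_schedule}} "
    "--model_output ${{outputs.model_output}}"
)


def render_command(template, inputs, outputs):
    """Replace each ${{inputs.x}} / ${{outputs.x}} placeholder with a concrete value."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        group, name = match.group(1), match.group(2)
        return shlex.quote(str(values[group][name]))

    return re.sub(r"\$\{\{(inputs|outputs)\.(\w+)\}\}", substitute, template)


# Hypothetical values; the two defaults mirror the yml (0.01 and time-based).
command = render_command(
    COMMAND_TEMPLATE,
    inputs={
        "training_data": "/mnt/data",
        "max_epocs": 10,
        "learning_rate": 0.01,
        "learning_rate_schedule": "time-based",
    },
    outputs={"model_output": "/mnt/out"},
)
print(command)
```

Every flag in the rendered command has to match an `add_argument` name in train.py. If the yml passes a flag the script does not declare, argparse exits with an error when the job runs.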