Azure Machine Learning Hands-On Lab 03: Training a Model with the ScriptRunConfig Object

liyuan2020

Article link: https://blog.csdn.net/m0_46591986/article/details/105180559


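Training a Model with the RunConfiguration and ScriptRunConfig Objects

1. Prerequisites
2. Connect to and initialize the workspace
3. Create an experiment
4. Create the compute resource
5. Create the run scripts
6. Create the run configuration (RunConfiguration)
7. Script run configuration (ScriptRunConfig)
8. Check the run status
9. View the run results
10. Summary

The article "Training Models in Azure Machine Learning with RunConfiguration and ScriptRunConfig Objects" introduced the RunConfiguration and ScriptRunConfig objects. With them you can flexibly configure the environment a training run executes in, then submit a training script to a compute resource. This lab walks through the whole workflow with one complete example. The framework is scikit-learn, and every step runs in a Jupyter notebook.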

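1. Prerequisites

Before starting this lab, you need:

An Azure subscription and an Azure Machine Learning workspace. To create them, see: Azure Machine Learning (Hands-On): Creating the Azure Machine Learning Service
A configured Azure Machine Learning development environment. To set it up, see: Azure Machine Learning (Hands-On): Configuring the Azure Machine Learning Development Environment

Check the Azure Machine Learning SDK version: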

# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)

Output:

SDK version: 1.0.85

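2. Connect to and initialize the workspace

The entire training process takes place inside an Azure Machine Learning workspace, so the first step is to connect to one. If you do not have a workspace yet, create one first; the steps are in part 2 of: Azure Machine Learning (Hands-On): Creating the Azure Machine Learning Service.
Use the following code to connect to the workspace: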

from azureml.core import Workspace

# Load the workspace from the local configuration file
ws = Workspace.from_config('config.json')
print('Workspace name: ' + ws.name, 'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 'Resource group: ' + ws.resource_group, sep = '\n')

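'config.json' is a configuration file stored locally that holds the workspace connection details. For how to generate it, see part 4 of: Azure Machine Learning (Hands-On): Configuring the Azure Machine Learning Development Environment.
The output (a screenshot in the original post) lists the workspace name, Azure region, subscription ID, and resource group.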
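
For reference, the config.json file read by Workspace.from_config is a small JSON document with three keys: subscription_id, resource_group, and workspace_name. The sketch below uses only the standard library to write such a file with placeholder values and to check that the required keys are present before connecting. The placeholder values and the helper name check_workspace_config are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Keys that Workspace.from_config expects to find in config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path):
    """Return the required keys that are missing or empty (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values for illustration only; substitute your own.
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# A temporary directory keeps the demo from touching a real config.json.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=4), encoding="utf-8")
    missing = check_workspace_config(config_path)
    print("Missing keys:", missing)
```

An empty list means the file has the shape the SDK expects; it does not verify that the values point at a real workspace.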

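3. Create an experiment

Create an experiment in the current workspace so that the model can be trained under it.
An experiment is a logical container inside a workspace. It holds the run records and results of every training run.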

from azureml.core import Experiment

experiment_name = 'train-with-RunConfiguration'
experiment = Experiment(ws, name=experiment_name)

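Once the experiment has been created, you can view it in Azure Machine Learning studio.
Azure Machine Learning studio is another way to use Azure Machine Learning, alongside the Python SDK.
Figure 1. Viewing experiment information in Azure Machine Learning studio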

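4. Create the compute resource

Here I use a compute cluster named "cpu-cluster" that already exists in the cloud. The code below first looks for this cluster in the workspace and creates it if it is not found: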

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Output:

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned

5. Create the run scripts

First create a local folder to hold the scripts and the files they depend on.

import os

project_folder = './sklearn-diabetes'
os.makedirs(project_folder, exist_ok=True)

This folder contains two scripts: the training script train.py and the helper script mylib.py. Their code is shown below.
mylib.py:

import numpy as np

def get_alphas():
    # list of numbers from 0.0 to 1.0 with a 0.05 interval
    return np.arange(0.0, 1.0, 0.05)
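
The listing above shows the contents of mylib.py but not how the file ends up in the project folder. One possible way to do that from a notebook, sketched with the standard library only, is to write the source text to disk. The helper name write_script is hypothetical, and a temporary directory stands in for './sklearn-diabetes' so the sketch leaves no files behind.

```python
import tempfile
from pathlib import Path

# Source text of mylib.py, copied from the listing above.
MYLIB_SOURCE = '''import numpy as np


def get_alphas():
    # list of numbers from 0.0 to 1.0 with a 0.05 interval
    return np.arange(0.0, 1.0, 0.05)
'''


def write_script(folder, name, source):
    """Write `source` to `folder/name`, creating the folder if needed (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(source, encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    written = write_script(Path(tmp) / "sklearn-diabetes", "mylib.py", MYLIB_SOURCE)
    script_exists = written.exists()
    print(written.name, written.stat().st_size, "bytes")
```

In the lab itself the first argument would be project_folder, and train.py could be written the same way.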

train.py:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
import joblib  # sklearn.externals.joblib is deprecated; joblib is installed with scikit-learn
import os
import numpy as np
import mylib

os.makedirs('./outputs', exist_ok=True)

X, y = load_diabetes(return_X_y=True)

run = Run.get_context()

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
data = {"train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test}}

# list of numbers from 0.0 to 1.0 with a 0.05 interval
alphas = mylib.get_alphas()

for alpha in alphas:
    # Use Ridge algorithm to create a regression model
    reg = Ridge(alpha=alpha)
    reg.fit(data["train"]["X"], data["train"]["y"])

    preds = reg.predict(data["test"]["X"])
    mse = mean_squared_error(data["test"]["y"], preds)
    run.log('alpha', alpha)
    run.log('mse', mse)

    model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
    # save the model in the outputs folder so it is uploaded automatically
    joblib.dump(value=reg, filename=os.path.join('./outputs/', model_file_name))

    print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))

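The training script calls Run.get_context() to obtain the current run, then uses run.log to record information during training, such as model parameters and model accuracy. After training finishes, these values can be viewed as charts on the studio web page.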
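
To make the logging behaviour concrete: the loop in train.py logs 'alpha' and 'mse' once per iteration, and the values logged under one name accumulate into a list, which is the shape run.get_metrics() returns later in this lab. The class below is a local stand-in written purely for illustration. It is not part of the Azure ML SDK, the name LocalMetricLog is hypothetical, and it models only the repeated-call case. The sample values are the first three results from this lab, rounded to two decimals.

```python
from collections import defaultdict


class LocalMetricLog:
    """Local stand-in (not an SDK class) mimicking how repeated log calls accumulate."""

    def __init__(self):
        self._metrics = defaultdict(list)

    def log(self, name, value):
        # Each call appends to the series stored under `name`.
        self._metrics[name].append(value)

    def get_metrics(self):
        # Return a plain dict mapping each name to its list of logged values.
        return dict(self._metrics)


log = LocalMetricLog()
for alpha, mse in [(0.0, 3424.32), (0.05, 3408.92), (0.1, 3372.65)]:
    log.log("alpha", alpha)
    log.log("mse", mse)

print(log.get_metrics())
```

Because both metrics are logged in the same loop iteration, the two lists stay index-aligned, which is what the later "best alpha" lookup depends on.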

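6. Create the run configuration (RunConfiguration)

RunConfiguration is a basic way to configure the environment in Azure Machine Learning. A RunConfiguration object encapsulates the environment settings needed to submit a training run in an experiment. For details, see the article "Training Models in Azure Machine Learning with RunConfiguration and ScriptRunConfig Objects".
First, create a RunConfiguration object.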

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Create a new runconfig object
run_amlcompute = RunConfiguration()

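Specify the compute target. The Python training environment configured next is set up for this compute target.
Here I use the "cpu-cluster" compute cluster mentioned above. You could also point the target at your local machine, a cloud VM, or another resource.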

# Use the cpu_cluster you created above. 
run_amlcompute.target = cpu_cluster

Set up the run environment: the Python environment and dependencies the training script needs.

from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# to install required packages
env = Environment('diabetes-env')
cd = CondaDependencies.create(pip_packages=['numpy==1.16.2','azureml-dataprep[pandas,fuse]>=1.1.14', 'azureml-defaults'],
                              conda_packages = ['scikit-learn==0.22.1'])

env.python.conda_dependencies = cd

env.docker.enabled = True
# Specify docker steps as a string. Alternatively, load the string from a file.
dockerfile = r"""
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
RUN conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/ && \
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/ && \
conda config --set show_channel_urls yes
RUN pip install -U pip
RUN pip config set global.index-url http://mirrors.aliyun.com/pypi/simple
RUN pip config set install.trusted-host mirrors.aliyun.com
RUN echo "Hello from custom container!"
"""

# Set base image to None, because the image is defined by dockerfile.
env.docker.base_image = None
env.docker.base_dockerfile = dockerfile

# Attach environment to run config
run_amlcompute.environment = env
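
The comment in the listing above notes that the Docker steps can alternatively be loaded from a file. A minimal sketch of that alternative, using only the standard library, follows. The helper name load_dockerfile and the shortened two-line Dockerfile are assumptions for illustration; the file is written to a temporary directory only so the sketch is self-contained.

```python
import tempfile
from pathlib import Path


def load_dockerfile(path):
    """Read Dockerfile steps from disk as a single string (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8")


# A shortened Dockerfile that reuses the base image from the listing above.
steps = (
    "FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04\n"
    'RUN echo "Hello from custom container!"\n'
)

with tempfile.TemporaryDirectory() as tmp:
    dockerfile_path = Path(tmp) / "Dockerfile"
    dockerfile_path.write_text(steps, encoding="utf-8")
    dockerfile = load_dockerfile(dockerfile_path)

print(dockerfile.splitlines()[0])
# In the lab, the loaded string would then be assigned exactly as before:
# env.docker.base_dockerfile = dockerfile
```

Keeping the steps in a file makes them easier to review and reuse than a long string embedded in a notebook cell.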

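7. Script run configuration (ScriptRunConfig)

A script run configuration holds the settings used when submitting a training run in Azure Machine Learning. ScriptRunConfig packages the environment configuration from RunConfiguration together with the training script to create a script run. For more details, see the article "Training Models in Azure Machine Learning with RunConfiguration and ScriptRunConfig Objects".
Create the script run configuration: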

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory = project_folder, script = 'train.py', run_config = run_amlcompute)

Submit the training script:

# submit through the experiment object created in section 3
run = experiment.submit(src)

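8. Check the run status

Evaluating run in a notebook cell displays a web link to Azure Machine Learning studio, where you can check the status of the run.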

run

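Output: a summary of the run with a link to Azure Machine Learning studio (a screenshot in the original post).
The run status can also be checked from code.
The following code refreshes the run status in a Jupyter widget every 10 to 15 seconds.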

from azureml.widgets import RunDetails

RunDetails(run).show()

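Output:

Figure 2. Viewing the run status from code

The alpha and MSE values that the training script logged through the run returned by Run.get_context() are visualized in the Jupyter notebook once the computation finishes.

Use wait_for_completion to print the run's log output: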

# specify show_output to True for a verbose log
run.wait_for_completion(show_output=True)

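9. View the run results

The results of the run are all available through the run object; call its methods to retrieve the information you need.
View the logged parameters and model metrics:

# Get all metrics logged in the run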
run.get_metrics()

Output:

{'alpha': [0.0,
  0.05,
  0.1,
  0.15000000000000002,
  0.2,
  0.25,
  0.30000000000000004,
  0.35000000000000003,
  0.4,
  0.45,
  0.5,
  0.55,
  0.6000000000000001,
  0.65,
  0.7000000000000001,
  0.75,
  0.8,
  0.8500000000000001,
  0.9,
  0.9500000000000001],
 'mse': [3424.3166882137334,
  3408.9153122589296,
  3372.6496278100326,
  3345.1496434741885,
  3325.294679467877,
  3311.5562509289744,
  3302.6736334017255,
  3297.658733944204,
  3295.74106435581,
  3296.316884705675,
  3298.9096058070622,
  3303.140055527517,
  3308.704270772322,
  3315.356839962256,
  3322.8983149039614,
  3331.1656169285875,
  3340.024662032161,
  3349.3646443486023,
  3359.0935697484424,
  3369.1347399130477]}
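
Values such as 0.15000000000000002 in the alpha list are ordinary binary floating-point artifacts, not logging errors: 0.05 has no exact binary representation, so some multiples of it land slightly off the decimal value. Note also that the grid stops at 0.95, because the end value 1.0 is excluded. The sketch below shows the same kind of drift with plain Python floats (an assumption made so that it runs without NumPy) and one way to obtain clean two-decimal values by rounding.

```python
STEP = 0.05
COUNT = 20  # 0.00, 0.05, ..., 0.95; the end value 1.0 is excluded

# Multiplying an integer index by the step shows the drift seen in the logged values.
raw_grid = [i * STEP for i in range(COUNT)]

# Rounding each value to two decimals gives the intended grid.
clean_grid = [round(i * STEP, 2) for i in range(COUNT)]

print(raw_grid[3], clean_grid[3])
print(len(clean_grid), clean_grid[-1])
```

The drift is harmless for training, and the model file names are unaffected because train.py formats alpha with two decimals.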

Find the best parameter and its result:

import numpy as np

metrics = run.get_metrics()
best_alpha = metrics['alpha'][np.argmin(metrics['mse'])]

print('When alpha is {1:0.2f}, we have min MSE {0:0.2f}.'.format(
    min(metrics['mse']), 
    best_alpha
))

Output:

When alpha is 0.40, we have min MSE 3295.74.
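
The selection step relies on the two metric lists being index-aligned: the i-th MSE belongs to the i-th alpha. A dependency-free sketch of the same idea follows. It uses five of the values logged in this lab, rounded to two decimals for brevity, and pairs the lists explicitly so that the alignment is visible.

```python
# A subset of the logged metrics from this lab, rounded to two decimals.
metrics = {
    "alpha": [0.30, 0.35, 0.40, 0.45, 0.50],
    "mse": [3302.67, 3297.66, 3295.74, 3296.32, 3298.91],
}

# zip pairs each alpha with its MSE; min picks the pair with the smallest MSE.
best_alpha, best_mse = min(
    zip(metrics["alpha"], metrics["mse"]), key=lambda pair: pair[1]
)

print("When alpha is {0:0.2f}, we have min MSE {1:0.2f}.".format(best_alpha, best_mse))
```

This is equivalent to the np.argmin lookup for lists of this size and avoids indexing one list with a position computed from another.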

Plot MSE against alpha:

%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

plt.plot(metrics['alpha'], metrics['mse'], marker='o')
plt.ylabel("MSE")
plt.xlabel("Alpha")

Output: a line plot of MSE against alpha (an image in the original post).
List the log and result files saved during training:

results=run.get_file_names()
print(results)

Output:

['azureml-logs/20_image_build_log.txt',
 'azureml-logs/55_azureml-execution-tvmps_c3ce5e0e51c832eab7d5d4f7b758394832fb50305832bc0a3e27c167cf51b9e3_d.txt',
 'azureml-logs/65_job_prep-tvmps_c3ce5e0e51c832eab7d5d4f7b758394832fb50305832bc0a3e27c167cf51b9e3_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_c3ce5e0e51c832eab7d5d4f7b758394832fb50305832bc0a3e27c167cf51b9e3_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/141_azureml.log',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/ridge_0.00.pkl',
 'outputs/ridge_0.05.pkl',
 'outputs/ridge_0.10.pkl',
 'outputs/ridge_0.15.pkl',
 'outputs/ridge_0.20.pkl',
 'outputs/ridge_0.25.pkl',
 'outputs/ridge_0.30.pkl',
 'outputs/ridge_0.35.pkl',
 'outputs/ridge_0.40.pkl',
 'outputs/ridge_0.45.pkl',
 'outputs/ridge_0.50.pkl',
 'outputs/ridge_0.55.pkl',
 'outputs/ridge_0.60.pkl',
 'outputs/ridge_0.65.pkl',
 'outputs/ridge_0.70.pkl',
 'outputs/ridge_0.75.pkl',
 'outputs/ridge_0.80.pkl',
 'outputs/ridge_0.85.pkl',
 'outputs/ridge_0.90.pkl',
 'outputs/ridge_0.95.pkl']

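Download the result files to the local machine.
The following downloads every file under the outputs folder to a local directory:

# download_files fetches everything under the prefix in one call, so no loop is needed
run.download_files(prefix='outputs', output_directory='./outputs/')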

prefix is the folder among the run's files to download; output_directory is the local folder to download into.
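
If only the best model is needed, its file name can be derived from the best alpha, because train.py names each model 'ridge_{alpha:.2f}.pkl'. The sketch below works on a shortened copy of the file list from this lab and is plain string handling; the variable names are chosen for illustration.

```python
# A few of the file names returned by run.get_file_names() in this lab.
results = [
    "azureml-logs/70_driver_log.txt",
    "outputs/ridge_0.35.pkl",
    "outputs/ridge_0.40.pkl",
    "outputs/ridge_0.45.pkl",
]

best_alpha = 0.4  # the best alpha found earlier in this lab

# train.py saves each model as 'ridge_{alpha:.2f}.pkl' under outputs/.
best_model_name = "outputs/ridge_{0:.2f}.pkl".format(best_alpha)

# Keep only the model files, then confirm the best model is among them.
model_files = [name for name in results if name.startswith("outputs/")]
best_present = best_model_name in model_files

print(best_model_name, best_present)
```

With the name in hand, a single file can be fetched with the run object's download_file method instead of downloading the whole folder.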

10. Summary

This hands-on lab showed how to train a model in Azure Machine Learning with the RunConfiguration and ScriptRunConfig objects.
