Machine Learning: A Simple Walkthrough of kedro + mlflow

Scc_hy

Last modified: 2022-10-17 22:32:25


First published: 2022-10-15 16:52:56

This article is the author's original work. To repost it, please contact the author: hyscc1994@foxmail.com

Article link: https://blog.csdn.net/Scc_hy/article/details/127334235


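For how to set up a kedro project, see the author's previous article, 《机器学习_kedro框架使用简单示意》 ("Machine Learning: A Simple Walkthrough of the kedro Framework").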

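Introduction and package installation

kedro is used to build reusable, maintainable, modular machine-learning code. Notebooks are extremely flexible, which makes them convenient for exploring data and algorithms; Kedro instead targets engineering concerns such as version control, reusability, documentation, unit testing and deployment.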

pip install mlflow
pip install mlflow[pipelines]
pip install kedro-mlflow
pip install statsd

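I. Creating a kedro-mlflow project

1.1 Main steps

In the project directory, run kedro mlflow init to initialize the project
Create a hooks.py file in the package directory: src/<package_name>/hooks.py
Write hooks.py based on the official documentation, or simply use the author's version in section 1.2
- Official documentation: https://kedro.readthedocs.io/en/stable/hooks/examples.html
Create or open the settings.py file in the package directory: src/<package_name>/settings.py
Register the hooks in the HOOKS variable in settings.py
Wrap the entries in catalog.yml in kedro_mlflow datasets (MlflowArtifactDataSet) so that they are tracked as artifacts
- Official documentation: https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/03_version_datasets.html
Run mlflow ui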

1.2 Completing hooks.py

# python3
# func: add mlflow
# ==========================================
from typing import Any, Dict
import statsd
import mlflow
import sys
from kedro.framework.hooks import hook_impl
from kedro.pipeline.node import Node


class ModelTrackingHooks:
    # https://kedro.readthedocs.io/en/stable/hooks/examples.html#add-memory-consumption-tracking
    """Namespace for grouping all model-tracking hooks with MLflow together."""
    def __init__(self):
        self._timers = {}
        self._client = statsd.StatsClient(prefix="kedro")

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        """Hook implementation to start an MLflow run
        with the session_id of the Kedro pipeline run.
        """
        mlflow.start_run(run_name=run_params["session_id"], nested=True)
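        for k, v in run_params.items():
            # Skip None and empty values; a plain truth test also tolerates
            # values without a length, on which len(v) would raise TypeError.
            if v:
                mlflow.log_params({k: v})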

    @hook_impl
    def before_node_run(self, node: Node) -> None:
        """Start a statsd timer just before the node runs."""
        node_timer = self._client.timer(node.name)
        node_timer.start()
        self._timers[node.short_name] = node_timer

    @hook_impl
    def after_node_run(
        self, node: Node, outputs: Dict[str, Any], inputs: Dict[str, Any]
    ) -> None:
        """Stop the node timer, report input sizes to statsd and log the
        node's parameters to MLflow.

        The original listing defined after_node_run twice, so the second
        definition replaced the first; the two bodies are merged here.
        """
        self._timers[node.short_name].stop()
        for dataset_name, dataset_value in inputs.items():
            self._client.gauge(dataset_name + "_size", sys.getsizeof(dataset_value))
            # Only Kedro parameters ("params:...") are logged as MLflow params;
            # datasets such as DataFrames are tracked as artifacts via catalog.yml.
            if dataset_name.startswith("params:") and dataset_value is not None:
                # Some MLflow versions reject ":" in param names, so use "." instead.
                mlflow.log_params({dataset_name.replace(":", "."): dataset_value})

    @hook_impl
    def after_pipeline_run(self) -> None:
        """Hook implementation to end the MLflow run
        after the Kedro pipeline finishes.
        """
        self._client.incr("run")
        mlflow.end_run()
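A Python detail worth keeping in mind when grouping several hooks in one class: if a class body defines the same method name twice, the later definition silently replaces the earlier one, so a hook class can only carry one effective `after_node_run`. The toy class below illustrates this; `DemoHooks` and its return strings are made up for the illustration and are not part of Kedro.

```python
class DemoHooks:
    """Toy class (hypothetical, not part of Kedro) with a duplicated method name."""

    def after_node_run(self):
        return "log params to MLflow"

    def after_node_run(self):  # same name: this definition replaces the one above
        return "stop the statsd timer"


hooks = DemoHooks()
# Only the last definition survives in the class namespace.
result = hooks.after_node_run()
print(result)
```

Any work that should happen after a node runs therefore has to live in a single `after_node_run` method, or in separate hook classes that are each registered in HOOKS.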

1.3 Completing settings.py

from .hooks import ModelTrackingHooks

HOOKS = ( ModelTrackingHooks(), )

1.4 Modifying catalog.yml

irir_data:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.CSVDataSet
        filepath: data/05_model_input/iris.csv


logistic_model_v1:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: kedro_mlflow.io.models.MlflowModelSaverDataSet
        flavor: mlflow.sklearn
        filepath: data/06_models/logistic_model_v1.pickle

X_train:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.ParquetDataSet
        filepath: data/05_model_input/X_train.parquet

X_test:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.ParquetDataSet
        filepath: data/05_model_input/X_test.parquet

y_train:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.ParquetDataSet
        filepath: data/05_model_input/y_train.parquet

y_test:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: pandas.ParquetDataSet
        filepath: data/05_model_input/y_test.parquet
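The idea behind wrapping each entry can be sketched in a few lines: the outer dataset lets the inner dataset write its local file and then reports that file as an artifact. The sketch below is a conceptual illustration only, not kedro-mlflow's implementation; `LocalTextDataSet` and `ArtifactLoggingDataSet` are hypothetical names, and a plain list stands in for the MLflow artifact store.

```python
import tempfile
from pathlib import Path


class LocalTextDataSet:
    """Hypothetical stand-in for an inner dataset such as pandas.CSVDataSet."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def save(self, data):
        self.filepath.parent.mkdir(parents=True, exist_ok=True)
        self.filepath.write_text(data)


class ArtifactLoggingDataSet:
    """Conceptual wrapper: save through the inner dataset, then report the file."""

    def __init__(self, data_set, log_artifact):
        self.data_set = data_set
        self.log_artifact = log_artifact  # in a real run this could be mlflow.log_artifact

    def save(self, data):
        self.data_set.save(data)                         # 1. write the local file
        self.log_artifact(str(self.data_set.filepath))   # 2. register it as an artifact


logged_paths = []  # stands in for the MLflow artifact store
with tempfile.TemporaryDirectory() as tmp_dir:
    inner = LocalTextDataSet(Path(tmp_dir) / "05_model_input" / "iris.csv")
    wrapped = ArtifactLoggingDataSet(inner, log_artifact=logged_paths.append)
    wrapped.save("a,b\n1,2\n")  # placeholder content, not real iris data
    file_was_written = inner.filepath.exists()
```

This is why the wrapped entries keep their original `type` and `filepath` under `data_set`: the local file is still produced as before, and the wrapper only adds the artifact-logging step.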

1.5 Running `mlflow ui`

On Linux you can simply start it in the background with nohup
On Windows, just open two terminals

mlflow ui --port 80 --host 127.0.0.1
# Open a new terminal and run the project from the project directory
kedro run
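If juggling nohup or two terminals is inconvenient, the same two commands can be driven from one Python script. This is a minimal sketch that assumes `mlflow` and `kedro` are on the PATH; the helper names `mlflow_ui_command` and `run_with_ui` are hypothetical.

```python
import subprocess


def mlflow_ui_command(host="127.0.0.1", port=80):
    """Build the same `mlflow ui` command line as used in this section."""
    return ["mlflow", "ui", "--port", str(port), "--host", host]


def run_with_ui(project_dir="."):
    """Start `mlflow ui` in the background, run the pipeline, keep the UI alive."""
    ui_process = subprocess.Popen(mlflow_ui_command(), cwd=project_dir)
    try:
        # Blocks until `kedro run` finishes; raises CalledProcessError on failure.
        subprocess.run(["kedro", "run"], cwd=project_dir, check=True)
        # Keep serving the UI until the script is interrupted with Ctrl+C.
        ui_process.wait()
    finally:
        ui_process.terminate()
```

Calling `run_with_ui()` from the project directory starts the UI, runs the pipeline once, and leaves the UI running for browsing the results.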


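II. Advanced model evaluation

2.1 Splitting training and evaluation

Move the evaluation module out into its own step
Build a separate metric pipeline (metric_pipline)

2.2 Adding an evaluation plot and JSON metrics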

model_metric.py

import mlflow
from sklearn.metrics import f1_score, precision_score, recall_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import logging

log_ = logging.getLogger(__name__)

def conf_heat_map(conf_matrix):
    fig, axes = plt.subplots(1, 1, figsize=(8, 8))
    sns.heatmap(conf_matrix, ax=axes, annot=True, vmin=conf_matrix.min()-1, 
            vmax=conf_matrix.max() + conf_matrix.min())
    axes.set_title('heatmap')
    return fig
    

def evaluate_model(estimator, X_test, y_test):
    """Score the estimator on the test split; return the metrics dict and the figures dict."""
    # Flatten once instead of repeating .values.ravel() for every metric.
    y_true = y_test.values.ravel()
    y_pred = estimator.predict(X_test).ravel()

    score = f1_score(y_true, y_pred, average='macro')
    conf_matrix = confusion_matrix(y_true, y_pred)
    fig = conf_heat_map(conf_matrix)
    log_.info(f"[ valid ] f1-score {score:.3f}")

    # Everything in metric_info is written to metric_info.json through the catalog.
    metric_info = {
        'f1_score': score,
        'precision_score': precision_score(y_true, y_pred, average='macro'),
        'recall_score': recall_score(y_true, y_pred, average='macro'),
        'classification_report': classification_report(y_true, y_pred),
    }

    # Also record the F1 score as a metric of the active MLflow run.
    mlflow.log_metric(key='f1-score', step=1, value=score)
    # Two node outputs: the metrics dict and a {filename: figure} dict for MatplotlibWriter.
    return [
        metric_info, {'heatmap.png': fig}
    ]
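To make `average='macro'` concrete, the sketch below recomputes macro precision, recall and F1 by hand from a confusion matrix, using the row = true class, column = predicted class layout that `confusion_matrix` returns. The helper `macro_scores` is hypothetical, and the 2-class matrix is a toy example chosen for illustration, not a result from the iris model.

```python
from statistics import mean


def macro_scores(conf_matrix):
    """Macro-averaged precision, recall and F1 from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    precisions, recalls, f1s = [], [], []
    for i in range(len(conf_matrix)):
        true_pos = conf_matrix[i][i]
        predicted_as_i = sum(row[i] for row in conf_matrix)  # column sum
        actually_i = sum(conf_matrix[i])                      # row sum
        precision = true_pos / predicted_as_i if predicted_as_i else 0.0
        recall = true_pos / actually_i if actually_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # "macro" means an unweighted mean over classes.
    return {
        "precision_score": mean(precisions),
        "recall_score": mean(recalls),
        "f1_score": mean(f1s),
    }


# Toy matrix for illustration only.
toy_matrix = [[2, 0],
              [1, 1]]
scores = macro_scores(toy_matrix)
```

Each class contributes equally to a macro average regardless of how many samples it has, which is the main difference from `average='weighted'`.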

Changes to catalog.yml

metric_info:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: json.JSONDataSet
        filepath: data/08_reporting/metric_info.json


metric_pics:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
    data_set:
        type: matplotlib.MatplotlibWriter
        filepath: data/08_reporting/metric_pics