(5) Automating MLOps Deployment to Kubernetes

Latest recommended article published on 2024-03-08 16:36:53

寒冰屋

Original article: https://www.codeproject.com/Articles/5302285/Automating-MLOps-Deployment-to-Kubernetes

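This article explains in detail how to use a Python script to interact with Jenkins and Kubernetes for a semi-automated MLOps production deployment. The script cleans up old jobs, copies the model to the production registry, and checks for and shuts down the prediction service pods, so that the service is upgraded with zero downtime.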

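Here we develop a semi-automated deployment to production for our CI/CD MLOps pipeline.

In the previous series of articles, we explained how to write the scripts to be executed in our group of Docker containers as part of a CI/CD MLOps pipeline. In this series, we set up a Google Kubernetes Engine (GKE) cluster to deploy these containers.

This article series assumes that you are familiar with deep learning, DevOps, Jenkins, and Kubernetes basics.

In the previous article of this series, we built four automated Jenkins workflows. In this article, the last one in the series, we develop a semi-automated deployment to production for our CI/CD MLOps pipeline. It is semi-automated because, as the product owner, you typically want to check the unit test results before deploying to production, to avoid service failures. Deployment to production could be done manually, but automation is required to achieve the Google MLOps Maturity Model goal.

The diagram below shows where we are in the project architecture.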

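Deploying to production consists of:

Copying the model file from the GCS testing registry to the production one once the unit test has ended
Cleaning up the completed Kubernetes jobs
Initiating a systematic shutdown of the prediction service pods (if the corresponding workflow has already been executed), which forces Kubernetes to start new ones that load the new model with zero service downtime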

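Developing the Python Script

We have been working with Jenkins and Kubernetes to build our CI/CD solution. The next script shows how to interact with Jenkins and Kubernetes from Python to automate the deploy-to-production task. Our Python script will run locally.

Let's dive into the code. First, we import the required libraries and define the variables: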

from kubernetes import client, config
from google.cloud import storage
import jenkins
import time
import os
 
bucket_name = 'automatictrainingcicd-aiplatform'
model_name = 'best_model.hdf5'

Next, we declare the function that cleans up the completed jobs in the cluster:

def clean_jobs():
    config.load_kube_config()
 
    api_instance=client.BatchV1Api()
    print("Listing jobs:")
    api_response = api_instance.list_job_for_all_namespaces()
    jobs = []
    print('job-name  job-namespace  active  succeeded  failed  start-time  completion-time')
    for i in api_response.items:
        jobs.append([i.metadata.name,i.metadata.namespace])
        print("%s  %s  %s  %s  %s  %s  %s" % (i.metadata.name,i.metadata.namespace,i.status.active,i.status.succeeded,i.status.failed,i.status.start_time,i.status.completion_time))
    print('Deleting jobs...')
    if len(jobs) > 0:
        for i in range(len(jobs)):
            api_instance.delete_namespaced_job(jobs[i][0],jobs[i][1])
        print("Jobs deleted.")
    else:
        print("No jobs found.")
    return
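
Note that clean_jobs, as written, deletes every job it lists, including any that are still running. The following is a minimal sketch of a filter that keeps only successfully completed jobs, assuming the active and succeeded counters printed above are enough to decide. The helper name is_completed is hypothetical, the SimpleNamespace objects merely stand in for the job status objects returned by the API, and failed jobs are deliberately left in place for inspection.

```python
from types import SimpleNamespace


def is_completed(status):
    """Return True when a job has no active pods and at least one success."""
    active = status.active or 0        # the counters may be None
    succeeded = status.succeeded or 0
    return active == 0 and succeeded > 0


# Stand-ins for i.status in the loop above (hypothetical sample values).
running = SimpleNamespace(active=1, succeeded=None, failed=None)
done = SimpleNamespace(active=None, succeeded=1, failed=None)
pending = SimpleNamespace(active=None, succeeded=None, failed=None)

results = [is_completed(s) for s in (running, done, pending)]
print(results)
```

Inside clean_jobs, the jobs.append(...) call could then be guarded with if is_completed(i.status): so that only finished jobs are queued for deletion.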

The function that copies the model from the GCS testing registry to the production registry is as follows:

def model_to_production():
    storage_client = storage.Client.from_service_account_json('AutomaticTrainingCICD-68f56bfa992e.json')
    bucket = storage_client.bucket(bucket_name)
    status = storage.Blob(bucket=bucket, name='{}/{}'.format('testing',model_name)).exists(storage_client)
    if status == True:
        print('Copying model...')
        source_blob = bucket.blob('{}/{}'.format('testing',model_name))
        destination_blob_name = '{}/{}'.format('production',model_name)
        blob_copy = bucket.copy_blob(source_blob, bucket, destination_blob_name)
        print('Model from testing registry has been copied to production registry.')
    else:
        print('No model found at testing registry.')
    return
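
Because model_to_production only prints its outcome, the rest of the script cannot tell whether a model was actually promoted. A minimal sketch of a variant that reports success as a boolean is shown below. The name promote_model and the FakeBucket and FakeBlob classes are hypothetical; the fakes exist only so the example runs without GCS access, and they mimic the blob, exists, and copy_blob calls used above.

```python
def promote_model(bucket, model_name, source='testing', target='production'):
    """Copy <source>/<model_name> to <target>/<model_name>; True if copied."""
    source_blob = bucket.blob('{}/{}'.format(source, model_name))
    if not source_blob.exists():
        return False
    bucket.copy_blob(source_blob, bucket, '{}/{}'.format(target, model_name))
    return True


class FakeBlob:
    """Hypothetical in-memory stand-in for a storage blob."""
    def __init__(self, name, names):
        self.name = name
        self._names = names

    def exists(self):
        return self.name in self._names


class FakeBucket:
    """Hypothetical in-memory stand-in for a storage bucket."""
    def __init__(self, names):
        self.names = set(names)

    def blob(self, name):
        return FakeBlob(name, self.names)

    def copy_blob(self, blob, destination_bucket, new_name):
        destination_bucket.names.add(new_name)


bucket = FakeBucket(['testing/best_model.hdf5'])
copied = promote_model(bucket, 'best_model.hdf5')
print(copied, sorted(bucket.names))
```

With a return value like this, main could skip the pod shutdown when there is no new model to load.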

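The next function checks whether the prediction service is active. If it is, a systematic shutdown of its pods is initiated; otherwise, the AutomaticTraining-PredictionAPI Jenkins workflow is triggered: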

def check_services():
    api_instance = client.CoreV1Api()
    api_response = api_instance.list_service_for_all_namespaces()
    print('Listing services:')
    print('service-namespace  service-name')
    services = []
    for i in api_response.items:
        print("%s  %s" % (i.metadata.namespace, i.metadata.name))
        services.append(i.metadata.name)
    if True in (t.startswith('gke-api') for t in services):
        print('gke-api service is active. Proceeding to systematically shutdown its pods...')
        shutdown_pods()
        return
    else:
        jenkins_build()
        return

If the prediction service is active, the following function takes care of the pod shutdown:

def shutdown_pods():
    config.load_kube_config()
    api_instance = client.CoreV1Api()
    print("Listing pods:")
    api_response = api_instance.list_pod_for_all_namespaces(watch=False)
    pods = []
    print('pod-ip-address  pod-namespace  pod-name')
    for i in api_response.items:
        print("%s  %s  %s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
        pods.append([i.metadata.name, i.metadata.namespace])
    print('Shutting down pods...')
    print('Deleting only gke-api pods...')
    if len(pods) > 0:
        for i in range(len(pods)):
            if pods[i][0].startswith('gke-api') == True:
                api_instance.delete_namespaced_pod(pods[i][0],pods[i][1])
                print("Pod '{}' shut down.".format(pods[i][0]))
                time.sleep(120)
        print("All pods have been shut down.")
    else:
        print("No pods found.")
    return
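
The fixed time.sleep(120) between deletions assumes that each replacement pod is ready within two minutes. A minimal sketch of a polling helper that waits for a condition instead is shown below. The name wait_until and its parameters are hypothetical, and the readiness check itself is left as a caller-supplied function, since how to decide that a replacement pod is ready is not covered in the article.

```python
import time


def wait_until(predicate, timeout=120, interval=5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demonstration with a stand-in check that succeeds on its third call.
calls = {'n': 0}


def ready_on_third_call():
    calls['n'] += 1
    return calls['n'] >= 3


ok = wait_until(ready_on_third_call, timeout=10, interval=0,
                sleep=lambda seconds: None)
print(ok, calls['n'])
```

The sleep and clock parameters are injectable only to keep the helper easy to test; in the script the defaults would be used.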

The following function is triggered if the prediction service is not active. It deploys the prediction service:

def jenkins_build():
    print('gke-api service is not active. Proceeding to build AutomaticTraining-PredictionAPI job at Jenkins.')
    server = jenkins.Jenkins('http://localhost:8080', username='your_username', password='your_password')
    server.build_job('AutomaticTraining-PredictionAPI')
    print('AutomaticTraining-PredictionAPI job has been triggered, check Jenkins logs for more information.')
    return
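
The call above hard-codes placeholder credentials, and the imported os module is otherwise unused. A minimal sketch of reading the connection settings from the environment instead is shown below. The variable names JENKINS_URL, JENKINS_USER, and JENKINS_PASSWORD and the helper jenkins_settings are assumptions made for illustration, not part of the original script.

```python
import os


def jenkins_settings(env=None):
    """Collect Jenkins connection settings from an environment-style mapping."""
    env = os.environ if env is None else env
    return {
        'url': env.get('JENKINS_URL', 'http://localhost:8080'),
        'username': env['JENKINS_USER'],      # KeyError if unset: fail early
        'password': env['JENKINS_PASSWORD'],
    }


# Demonstration with a plain dict standing in for os.environ.
settings = jenkins_settings({'JENKINS_USER': 'your_username',
                             'JENKINS_PASSWORD': 'your_password'})
print(settings['url'])
```

The resulting values could then be passed to the same constructor used in jenkins_build: jenkins.Jenkins(settings['url'], username=settings['username'], password=settings['password']).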

Finally, here is the main function; it runs the whole script in the required order:

def main():
    clean_jobs()
    model_to_production()
    check_services()
 
if __name__ == '__main__':
    main()

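Running the Script

Once you run the Python script file we have developed, you should get the following response:

All the old, completed jobs will have been deleted, the model copied to the production registry, and the pods successfully terminated. To double-check that you have obtained new pods, run kubectl get pods before and after the script execution. You should see different pod identifiers: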
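
The before/after comparison can also be done programmatically. The following is a minimal sketch, assuming you have captured the pod names (for example, from the first column of the kubectl get pods output) into two lists. The helper name replaced_pods and the sample pod names are hypothetical.

```python
def replaced_pods(before, after, prefix='gke-api'):
    """Return (terminated, created) pod names that start with prefix."""
    old = {name for name in before if name.startswith(prefix)}
    new = {name for name in after if name.startswith(prefix)}
    return sorted(old - new), sorted(new - old)


# Hypothetical pod names captured before and after running the script.
before = ['gke-api-7d9f-abcde', 'gke-api-7d9f-fghij']
after = ['gke-api-7d9f-klmno', 'gke-api-7d9f-pqrst']

terminated, created = replaced_pods(before, after)
print('terminated:', terminated)
print('created:', created)
```

If every prediction pod was replaced, the terminated list matches the full set of old gke-api pods and the created list has the same length.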

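To see what the final product looks like (including the interface "bonus"), check this out. The interface's public IP address is where we reach our service:

The service interface looks like this:

Finally, here is the prediction that is shown after you submit an image: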

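Conclusion

This concludes our series. We hope that you have enjoyed it and that you will benefit from the knowledge gained when tackling challenging ML tasks!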
