Continuous Delivery for Machine Learning


Original article: https://medium.com/@ODSC/continuous-delivery-for-machine-learning-d07f2d0f051

Why is bringing machine learning code into production hard?

Machine Learning applications are becoming popular in all industries. However, the process of developing, deploying, and continuously improving them is more complex than for more traditional software, such as a web service or a mobile application. They are subject to change along three dimensions: the code itself, the model parameters, and the data. Unlike with most software, improvement in a model is ambiguous: performance is not a matter of right or wrong but is judged through various quantitative measures and traded off against model complexity. In addition, real-world data changes continuously, which affects the performance of a machine learning application in production. This means that the application needs to be continuously monitored and the model retrained on new data.

What is “CD4ML”?

Continuous Delivery for Machine Learning (CD4ML) is the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications. The concept is derived from Continuous Delivery, an approach developed 25 years ago that uses automation, quality, and discipline to create a reliable and repeatable process for releasing software into production.

CD4ML builds upon this, allowing a cross-functional team to produce machine learning applications based on code, data, and models that progress in small, safe increments and can be reproduced and reliably released at any time. This is a great improvement over typical approaches, which involve many manual, difficult-to-reproduce steps and handoffs that result in errors, confusion, and occasionally disaster.

Figure 1: The overall process of CD4ML

Figure 1 shows the steps of the overall CD4ML process. It begins with the work of a Data Scientist, who uses easily discoverable and accessible data to build a model. The inputs to the training process include the model itself, the parameters, the source code, and the required training data. Note that the source code, executables, model, and parameters are carried through the entire CD4ML pipeline. The next step is Model Evaluation, which uses your testing data to ensure that the model's predictive accuracy is acceptable for your application.
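The Model Evaluation step can be pictured as a simple gate in the pipeline. The following is a minimal sketch, not the workshop's implementation: the `evaluate` function, the `EvaluationResult` type, the `toy_model` stand-in, and the accuracy thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    accuracy: float
    passed: bool


def evaluate(predict, test_rows, min_accuracy=0.8):
    """Score `predict` on held-out (features, label) pairs against a threshold."""
    if not test_rows:
        raise ValueError("test_rows must not be empty")
    correct = sum(1 for features, label in test_rows if predict(features) == label)
    accuracy = correct / len(test_rows)
    return EvaluationResult(accuracy=accuracy, passed=accuracy >= min_accuracy)


# Hypothetical stand-in model: predicts 1 when the single feature exceeds 0.5.
def toy_model(features):
    return 1 if features[0] > 0.5 else 0


# Hypothetical held-out test data: (features, expected label).
test_rows = [((0.9,), 1), ((0.1,), 0), ((0.7,), 1), ((0.4,), 1)]

result = evaluate(toy_model, test_rows, min_accuracy=0.7)
print(result)
```

A pipeline could stop at this point whenever `result.passed` is false, so that a model below the agreed threshold never reaches the later stages.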

Afterwards, we need to Productionize the model and perform Integration Testing on it. This might involve exposing the model to consumers through a RESTful endpoint, integrating it into a streaming pipeline for real-time predictions, or changing the implementation language, for example adapting the model to run on a big data framework like Apache Spark. If the model has to be adjusted during productionization, we want to ensure that the productionized implementation still matches the original one, so it is critical to write integration tests that check that predictive performance stays the same across implementation changes.
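One way to express such an integration test is to run both implementations on the same inputs and compare the outputs within a tolerance. The sketch below assumes two hypothetical functions, `original_predict` and `production_predict`, that stand in for the data scientist's model and its reimplementation; the sample inputs and the tolerance are also illustrative.

```python
import math


def original_predict(x):
    """Hypothetical reference model as written by the data scientist."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 1.0)))


def production_predict(x):
    """Hypothetical reimplementation using a numerically stable form."""
    z = 2.0 * x - 1.0
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def find_mismatches(reference, candidate, inputs, tolerance=1e-9):
    """Return the inputs on which the two implementations disagree."""
    return [
        x
        for x in inputs
        if not math.isclose(reference(x), candidate(x), rel_tol=0.0, abs_tol=tolerance)
    ]


sample_inputs = [-3.0, -0.5, 0.0, 0.5, 1.0, 4.0]
mismatches = find_mismatches(original_predict, production_predict, sample_inputs)
print("mismatching inputs:", mismatches)
```

In practice the sample inputs would come from real test data, and a non-empty list of mismatches would fail the build.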

Lastly, we move to deployment and monitoring. During these steps we monitor the model in production to ensure that metrics such as prediction accuracy, response time, and system load are acceptable. It is also important to capture what is being asked of the model and what the final outcome is. A key concept is that these processes should happen continuously, and the outputs from monitoring should feed the next iteration of development of our machine learning model. This live data should be used as input when iterating on the model, along with any source code and parameter adjustments. Finally, if there are any code adjustments, the CD4ML pipeline should trigger the machine learning process to start from the beginning, so that delivery is continuous and the latest code changes and models are available for customers to use.
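Capturing what is asked of the model and what it answers can be as simple as writing one structured log record per prediction. The following sketch is an assumption about how that might look, using JSON lines written to an in-memory stream; the `predict_and_log` helper, the record fields, and the `toy_model` function are hypothetical.

```python
import io
import json
import time


def predict_and_log(predict, features, log_stream, model_version):
    """Run one prediction and append a JSON record describing it."""
    started = time.perf_counter()
    prediction = predict(features)
    latency_ms = (time.perf_counter() - started) * 1000.0
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    log_stream.write(json.dumps(record) + "\n")
    return prediction


# Hypothetical stand-in model: true when the features sum to more than 1.0.
def toy_model(features):
    return sum(features) > 1.0


log_stream = io.StringIO()  # a file or socket in a real deployment
predict_and_log(toy_model, [0.4, 0.9], log_stream, model_version="v1")
predict_and_log(toy_model, [0.1, 0.2], log_stream, model_version="v1")

records = [json.loads(line) for line in log_stream.getvalue().splitlines()]
print(len(records), "predictions logged")
```

Records like these can later be joined with the observed outcomes and used as input to the next training iteration.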

How can you try CD4ML on your laptop?

At ODSC Europe this fall, we invite you to join our workshop on CD4ML. The workshop runs entirely in your local environment using open source data and technologies. You can browse the repository here. During the workshop, we will guide you through the steps of CD4ML using the tools in Figure 2, which communicate with each other as shown in Figure 3.

Figure 2: The different tools we will use and their categories.

One important aspect of CD4ML is that it is a software development approach that incorporates the entire data science and model development workflow. The specific tools are interchangeable as long as the six categories outlined in Figure 2 are covered. For instance, for “Continuous Delivery Orchestration to Combine Pipelines” you can use Jenkins or another tool like CircleCI. For “Model Monitoring and Observability” you can use tools like Prometheus or the monitoring tools offered by your cloud provider, such as AWS CloudWatch or Azure Monitor. This approach is preferred because it lets the development team collaborate to evaluate and choose the tools that best fit their development process.

Figure 3: The overall architecture of our scenario

As part of this workshop, we will complete the following real-world scenarios together. These scenarios represent the major steps and lessons in a team's software development process when implementing CD4ML:

Doing the plumbing: Set up the pipeline and see if it is working
Data Science: Develop the model and test the code with Test Driven Development
Machine Learning Engineering: Improve the model in several steps and monitor the results
Continuous Deployment: Set up a performance test of the model, which only allows automatic deployment if the model passes the test
Undo changes: Roll changes back in time, consistently, with all artifacts
Our app in the wild: Monitor your application in production with Fluentd, Elasticsearch, and Kibana
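Rolling back consistently with all artifacts presupposes that code, data, and parameters are versioned together. The sketch below is one possible way to picture this, not the tooling used in the workshop: the `fingerprint` function and the in-memory `ReleaseHistory` class are hypothetical, and the byte strings stand in for real source files and datasets.

```python
import hashlib
import json


def fingerprint(code, data, params):
    """Hash code, data, and parameters together into one release identifier."""
    digest = hashlib.sha256()
    encoded_params = json.dumps(params, sort_keys=True).encode("utf-8")
    for part in (code, data, encoded_params):
        # Length prefix keeps the boundaries between parts unambiguous.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()[:12]


class ReleaseHistory:
    """Hypothetical in-memory record of releases, newest last."""

    def __init__(self):
        self._releases = []

    def release(self, code, data, params):
        version = fingerprint(code, data, params)
        self._releases.append(
            {"version": version, "code": code, "data": data, "params": params}
        )
        return version

    def current(self):
        return self._releases[-1]

    def rollback(self):
        """Drop the newest release and return the one before it."""
        if len(self._releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._releases.pop()
        return self.current()


history = ReleaseHistory()
v1 = history.release(b"def train(): ...", b"a,b\n1,2\n", {"max_depth": 3})
v2 = history.release(b"def train(): ...", b"a,b\n1,2\n", {"max_depth": 5})
restored = history.rollback()
print(v1, v2, restored["version"])
```

Because the identifier depends on all three inputs, changing only a parameter yields a new version, and a rollback restores code, data, and parameters as one unit.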

As Machine Learning techniques continue to evolve and perform more complex tasks, so does our knowledge of how to manage such applications and deliver them to production. By adopting and extending the principles and practices of Continuous Delivery, we can better manage the risks of releasing changes to Machine Learning applications and do so in a safe and reliable way.

We look forward to seeing you at ODSC Europe 2020 at our talk, “Data Science Best Practices: Continuous Delivery for Machine Learning”!