机器学习经典算法实践_服务机器学习算法的系统设计-不同环境下管道的最佳实践-CSDN博客

机器学习经典算法实践

“Eureka”! While working on a persistently difficult-to-solve problem, you discovered a marketable and profitable solution. To simplify your tax reporting, you decided to file paperwork to start your LLC. You realize the importance of keeping your personal and business finances separate. If you fail to keep your finances separate, it can become very messy and cumbersome to file your taxes when the time comes. This concept is very similar to building pipelines for different environments, specifically the Development/Non-Production and Deployment/Production environments. What is the purpose of having these separate environments? We will cover the benefits of compartmentalizing these environments throughout this section.

“尤里卡”！在解决一个持续难以解决的问题时，您发现了一个可销售且有利可图的解决方案。为了简化您的纳税报告，您决定提交文书文件以启动您的LLC。您意识到保持个人财务和业务财务分开的重要性。如果您无法分开自己的财务状况，那么在时机成熟时提交税款可能会变得非常麻烦和麻烦。此概念与为不同环境(特别是开发/非生产和部署/生产)的环境构建管道非常相似。拥有这些独立环境的目的是什么？在本节中，我们将介绍分隔这些环境的好处。

Throughout constructing our Data Science solution, we develop a series of steps to perform the operations we need to become successful. We extract data from our source, engineer our features, train our various models, validate on a subset of our source data, and then upload our predictions. While it is straightforward to construct our solution using a script, we can quickly introduce complexities with our pipeline deployments. Why and how can we make that statement? Data we use in production can be different from the data we use in development.

在构建数据科学解决方案的整个过程中，我们制定了一系列步骤来执行成功所需的操作。我们从源中提取数据，设计功能，训练我们的各种模型，对源数据的一部分进行验证，然后上传我们的预测。尽管使用脚本构建解决方案很简单，但是我们可以通过管道部署快速引入复杂性。我们为什么以及如何发表这一声明？我们在生产中使用的数据可能与我们在开发中使用的数据不同。

For this reason alone, we need to have different considerations or additional steps to compensate for these differences. Once we start considering data syncing with a Data Lake, our Data Science solution must not synchronize predictions we are making in development with the Data Lake that our clients will see. In addition to the points above, stability and validation tests would be impacted by how we construct our pipelines. If our pipelines have coalesced into one, it will become challenging to create inspections to ensure expected functionality. By following these concepts and thoughts, we can save ourselves a measure of “pain and suffering” by restricting our focus on simplifying our tunnel of vision and compartmentalizing.

仅出于这个原因，我们需要有不同的考虑因素或其他步骤来弥补这些差异。一旦我们开始考虑与Data Lake进行数据同步，我们的Data Science解决方案就不能与客户将看到的与Data Lake同步开发中的预测。除了上述几点之外，稳定性和验证测试还将受我们构建管道的方式的影响。如果我们的管道合并为一个，那么进行检查以确保预期功能将变得具有挑战性。通过遵循这些概念和思想，我们可以将精力集中在简化视力通道和分隔上，从而节省一些“痛苦和痛苦”。

为什么要有单独的培训管道和预测管道？ (Why Have Separate Training Pipeline and Prediction Pipelines?)

In the previous subsection, we discussed the importance of having two different pipelines between running our Data Science solution in development and production. We increase our stability and testability of our pipeline. However, as mentioned in the previous section, it is easy to develop our pipelines such that we have integrated our Training and Prediction processes in the same script. Typically, what is common in these situations is that boolean values are passed into the software to denote whether we are predicting our Machine Learning model or training our Machine Learning Model. Melding the two fundamentally different processes together complicates the Data Science code repositories and decreases the software and the pipeline’s maintainability. There are software concepts and principles to increase maintainability and decrease cognitive overload. The Single Responsibility principle can be applied and used to help make the Training and Prediction processes easier to maintain and manage. If this principle is upheld and enforced, the prediction pipeline could realistically consist of five base operations and up to ten code lines. Following this principle reduces what each member of the deployment party needs to know about the training pipeline.

在上一部分中，我们讨论了在开发和生产中运行数据科学解决方案之间具有两个不同管道的重要性。我们提高了管道的稳定性和可测试性。但是，如前一节所述，很容易开发管道，以便我们将训练和预测过程集成在同一脚本中。通常，在这些情况下常见的是将布尔值传递到软件中以表示我们是在预测我们的机器学习模型还是在训练我们的机器学习模型。将这两个根本不同的过程融合在一起，会使数据科学代码存储库复杂化，并降低软件和管道的可维护性。有一些软件概念和原则可以提高可维护性并减少认知负担。可以应用和使用“ 单一职责”原则来帮助使培训和预测过程更易于维护和管理。如果坚持并执行该原则，则预测管道实际上可以包含五个基本操作和最多十个代码行。遵循此原则减少了部署团队的每个成员需要了解的培训渠道。

Thank you for reading thus far! This is part of a series of articles to come.

到目前为止，感谢您的阅读！这是后续系列文章的一部分。

Please stay tune!

请继续关注！

As always, #happycoding

与往常一样，＃happycoding