Machine Learning: Hidden Technical Debts and Solutions

As machine learning systems find widespread adoption and solve complex real-world problems across industries (automotive, BFSI, entertainment, medical, agriculture, …), improving and maintaining these systems over time is becoming more expensive and difficult than developing and deploying them. Long-term maintenance of ML systems is more involved than that of traditional software because of the additional challenges posed by data and other ML-specific issues [1]. In this article, we cover a few hidden technical debts of ML systems, summarized in the table below along with possible mitigation strategies.

[Image: summary table of hidden technical debts and mitigation strategies]

Abstraction is one of software engineering's best practices for building maintainable systems. Strict abstraction boundaries help express the invariants and logical consistency of inputs and outputs for a given software component. But it is difficult to enforce such strict abstraction boundaries in an ML system, because the intended behavior of the system is learned from data and there is little way to separate the abstraction from the quirks of the data [3].

Entanglement

To make this concrete, suppose we have an ML system that uses features f1, …, fn in a model. A change in the distribution of one feature, or the addition or deletion of a feature, changes the importance weights of all the other features as well as the target output. This is referred to as the Changing Anything Changes Everything (CACE) phenomenon, and it applies not only to input features but also to hyperparameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak [2].

To mitigate this, one possible solution is to detect prediction changes as they occur and diagnose their cause. One common cause of prediction change is a change in a feature's distribution (drift), which can be found using a tool like TensorFlow Data Validation, which supports schema skew, feature skew, and distribution skew detection. For example, as shown in the image below, visualizing the distribution of feature values lets us catch such problems.

[Image: TensorFlow Extended data visualization guide, showing a feature distribution]

Underutilized data dependencies

These dependencies are caused by legacy features, epsilon features, correlated features, and bundled features.

Legacy features are features included early in model development that are made redundant over time by newer features.

Bundled features arise when a group of features is evaluated together, found beneficial, and added as a whole, without investigating each feature in isolation for its metric improvement, usually because of time pressure or similar constraints.

Epsilon features are features included in the model even though they provide only very small gains in accuracy or other metrics.

Correlated features are features that are highly correlated with other features already in the model.

To mitigate these underutilized dependencies, one can rely on leave-one-feature-out evaluation, Principal Component Analysis, Lasso regularization, or autoencoders, use explainability tools like SHAP, or use boosted tree estimators from TensorFlow, which are relatively robust to some of these dependencies, such as correlated features.

For example, the SHAP summary plot can help identify feature importance along with the feature value distribution and its impact on the prediction. The image below shows the summary plot of a boosted regression trees model for the Boston housing price dataset.

1. LSTAT is the most important feature affecting model prediction; lower feature values contribute positively to the prediction, while higher feature values contribute negatively.

2. The CHAS and ZN features have almost no predictive power and can be safely removed from the model.

The SHAP Python package also supports a single-prediction explanation plot, a dependence plot, etc., which can also be used to diagnose prediction changes.

[Image: SHAP Git repository, summary plot for the XGBoost Boston model]

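Under the hood, SHAP approximates Shapley values from cooperative game theory; for a handful of features they can be computed exactly by enumerating all coalitions. A brute-force sketch on a toy hand-written model (the model and baseline are hypothetical, not SHAP's API):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline, n):
    """Exact Shapley values by enumerating all 2^n coalitions (fine for small n).
    Features outside a coalition are held at their baseline value."""
    def value(coalition):
        return predict([x[j] if j in coalition else baseline[j] for j in range(n)])

    phi = [0.0] * n
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for size in range(n):
            for s in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[j] += weight * (value(set(s) | {j}) - value(set(s)))
    return phi

# Toy "model" (hypothetical): linear in z0, z1 plus a z0*z2 interaction
predict = lambda z: 2 * z[0] + z[1] + 0.5 * z[0] * z[2]
phi = shapley_values(predict, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0], n=3)
print([round(p, 3) for p in phi])
```

The attributions sum exactly to the difference between the prediction and the baseline prediction — the property that makes SHAP values a faithful diagnostic for prediction changes. For real models, `shap.TreeExplainer` computes these efficiently for tree ensembles.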
Feedback Loops

This debt mostly exists in live ML systems which, if updated over time, often end up influencing their own behavior. These loops can take many different forms (direct or indirect) and are not easily detectable.

For example, a recommender system often keeps recommending items with a similar actor, genre, etc., according to the user's activity. This phenomenon, in which beliefs are amplified or reinforced, is known as an echo chamber. The common mitigation strategy is to make these closed-loop recommender systems open-loop by collecting explicit feedback, including search behavior, etc.

Anti-Patterns

The actual learning or inference code in an ML system is a small fraction of the total system code, as can be seen in the image below, and it is quite common to see many anti-patterns surface in an ML system; these should be avoided.

[Image: MLOps solution, showing ML code as a small fraction of the overall system]
  1. Glue code: This anti-pattern arises from supportive code written to fit data to a specific ML model or library. It can be mitigated by using a standard preprocessing library like TensorFlow Transform, which uses Apache Beam for distributed computation and Apache Arrow for vectorized NumPy operations, and supports format conversions, tokenizing and stemming text, numerical operations like normalization, etc.

  2. Pipeline jungles: As new sources are added incrementally to models, the code fills up with joins, sampling steps, etc. This can be mitigated by a clean-slate approach, i.e., developing the pipeline code from scratch once model development is frozen, or by thinking more holistically about data collection and feature extraction.

  3. Dead experimental codepaths: As a result of glue code or pipeline jungles, experiments are often performed by implementing experimental code paths; over time these code paths become difficult to maintain for backward compatibility. It is standard practice to periodically review and rip out these code paths to mitigate this debt.

Configuration Debt

This debt can accumulate because of the range of configurable options at each step of a machine learning system: which features are used, how data is selected, model-specific parameters, pre- or post-processing, model deployment, etc. Mistakes in configuration can be costly, leading to serious loss of time, wasted computing resources, or production issues.

To mitigate configuration debt, the common practice is to keep parameters related to the model, hyperparameter search, and pre- or post-processing in YAML files, modularized templates, Jinja2 templates, Infrastructure as Code, etc.

In this blog, we have seen a few ML aspects such as model entanglement and data (discovery, sourcing, management, versioning), and how these data- and ML-specific aspects cause more hidden technical debt than a traditional system, as well as how we can mitigate them using visualization, end-to-end machine learning frameworks like TensorFlow Extended, and best software engineering practices. Many more ML debts exist in ML systems, related to serving, monitoring, testing, reproducibility, process management, etc., which we shall cover in future articles along with their mitigation strategies.

Translated from: https://towardsdatascience.com/machine-learning-hidden-technical-debts-and-solutions-407724248e44
