When Are You Planning to Retrain Your Machine Learning Model?

You can find plenty of tutorials that help you build end-to-end machine learning pipelines. But generally, those tutorials do not say much about how to maintain the quality of the predictions an ML system generates once it is live.

Maintaining the predictive power of a deployed model is often harder than building the model from scratch, and that is our topic of discussion today.

But before getting into the details of “model retraining”, let’s start with a quick primer on “model training”:

  • Assuming sufficient historical data is available, model building starts by learning the dependencies between a set of independent features and the target variable.
  • The best-learnt dependency is selected on the basis of some evaluation metric that minimizes the prediction error on the validation dataset.
  • This best-learnt model is then deployed to production with the expectation that it will keep making accurate predictions on incoming unseen data for as long as possible.
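A minimal sketch of those three steps, using numpy’s least squares as a stand-in for the learning algorithm and mean squared error on a held-out validation set as the evaluation metric (the data here is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Historical data: features X and a target y with a known dependency
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=500)

# Hold out a validation set to estimate the prediction error
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

# Learn the dependency between features and target (least squares here)
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluate with the chosen metric on the validation set
val_mse = np.mean((X_val @ w - y_val) ** 2)
print(f"validation MSE: {val_mse:.4f}")

# "Deploy": the frozen weights w now score incoming unseen data
x_new = np.array([[1.0, 1.0, 1.0]])
prediction = (x_new @ w)[0]
```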

Now, let’s dig into what we mean by “as long as possible”.

It never happens that a model, once deployed, takes away the worries forever and keeps giving accurate predictions indefinitely.

Why is that? Let’s find out:

  1. Model Drift:

To understand this, recall one of the most critical assumptions in ML modelling: the training and test datasets should come from similar distributions. The model will only hold up if the new data is similar to the data it was trained on.

So, we understand that if the test data distribution deviates from that of the training data, the model will no longer hold good. But what could cause such deviation? It can be attributed to many factors depending on the business case: changes in consumer preferences, a fast-moving competitive space, geographic shifts, economic conditions, and so on.

Hence, drifting data distributions call for an ongoing process of periodically checking the validity of the old model. In short, it is critical to keep your machine learning model updated; the key question is when. We will discuss this and a lot more as we proceed, so stay tuned.

2. Robustness

People or entities affected by the outcome of an ML model may deliberately alter their behaviour in order to feed spurious input to the model and thereby escape the impact of its predictions. For example, fraud-detection and cyber-security models receive manipulated, distorted inputs that cause them to output misclassified predictions. This type of adversary also drives down model performance.

3. When ground truth is not available at the time of model training

In many machine learning problems, ground-truth labels are not available at the time the model is trained. For example, the target variable that captures the response of the end user is not yet known. In that case, your best bet may be to mock the user action with a set of rules derived from business understanding, or to leverage an open-source dataset to bootstrap model training. But this model will not necessarily represent the actual data, and hence it may not perform well until after a burn-in period in which it starts picking up (i.e. learning from) the true actions of the end user.

What falls under the scope of model retraining?

  • Updating the model parameters?
  • Reiterating over the hyper-parameter search space?
  • Re-running the model selection pipeline across the candidate pool of algorithms?
  • And if none of that lifts the model’s performance, do we need to introduce new features into the model, perhaps re-doing the feature engineering and selection pipeline?

Ideally, retraining simply means re-running the entire existing pipeline on new data, that’s it. It involves no code changes and no re-building of the pipeline.

However, if you end up exploring a new algorithm or a feature that was not available at the time of the previous training run, then incorporating it when you deploy the retrained model can further improve accuracy.

How to measure the decline in model performance?

Assuming the predictions are stored and can be mapped to their ground-truth values, the decline (or lack of it) can be calculated on a continuous basis to assess the drift.

But what if the prediction horizon lies farther in the future and we cannot wait for the ground-truth labels to arrive before assessing the model’s goodness? In that case, we can roughly estimate the retraining window from back-testing. This involves using the ground-truth labels and predictions from historical data to estimate the time frame over which accuracy begins to taper off.

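Back-testing for the retraining window can be sketched as follows; the per-week accuracies and the acceptable floor are hypothetical numbers for illustration:

```python
# Accuracy of a frozen model on historical ground truth, bucketed by
# the number of weeks elapsed since it was trained (hypothetical data).
accuracy_by_week = {1: 0.91, 2: 0.90, 3: 0.89, 4: 0.86, 5: 0.81, 6: 0.74}
acceptable_floor = 0.85  # business-defined minimum accuracy

def estimate_retraining_window(acc_by_period, floor):
    """Return the first period whose accuracy falls below the floor."""
    for period in sorted(acc_by_period):
        if acc_by_period[period] < floor:
            return period
    return None  # no decay observed within the back-test horizon

window = estimate_retraining_window(accuracy_by_week, acceptable_floor)
print(f"accuracy tapers off around week {window}")  # week 5 here
```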

Effectively, the whole exercise of finding model drift boils down to inferring whether the two datasets (training and test) come from the same distribution, or whether performance has fallen below an acceptable range.

Let’s look at some ways to assess distribution drift:

  • Histogram: A quick way to visualize the comparison is to plot the two histograms; the degree of overlap between them gives a measure of similarity.
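The degree of overlap can be quantified rather than just eyeballed. One simple measure is the histogram intersection: bin both samples over a shared range, normalize, and sum the per-bin minima (1.0 means identical histograms, 0.0 means disjoint). A sketch with synthetic data standing in for a training feature and its serving-time counterpart:

```python
import numpy as np

def histogram_overlap(a, b, bins=20):
    """Intersection of two normalized histograms over a shared range."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return np.minimum(ha / ha.sum(), hb / hb.sum()).sum()

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)
drifted = rng.normal(1.5, 1.0, size=5000)  # the mean has shifted

print(histogram_overlap(train_feature, same_dist))  # close to 1.0
print(histogram_overlap(train_feature, drifted))    # noticeably lower
```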

  • K-S statistic: A two-sample Kolmogorov-Smirnov test checks whether the incoming new data belongs to the same distribution as the training data.
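In practice you would likely reach for `scipy.stats.ks_2samp`, but the two-sample K-S statistic is easy to compute directly: it is the maximum vertical distance between the two empirical CDFs. A numpy sketch:

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = np.sort(a), np.sort(b)
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / len(a)
    cdf_b = np.searchsorted(b, points, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=2000)
incoming_ok = rng.normal(0.0, 1.0, size=2000)     # same distribution
incoming_drift = rng.normal(0.8, 1.0, size=2000)  # shifted distribution

print(ks_statistic(train, incoming_ok))     # small
print(ks_statistic(train, incoming_drift))  # large -> drift alarm
```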

  • Target distribution: One quick check on the model’s predictive power is to examine the distribution of the target variable. For example, if your training dataset is imbalanced, with 99% of the data in class 1 and the remaining 1% in class 0, and the predictions come out at roughly 90%-10%, that should be treated as an alert for further investigation.
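This check is easy to automate: compare the class proportions seen at training time with the proportions in recent predictions and alert when any class diverges beyond a tolerance. A sketch (the labels and the 5% tolerance are illustrative):

```python
from collections import Counter

def class_ratios(labels):
    """Map each class to its share of the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def ratio_alert(train_labels, predicted_labels, tolerance=0.05):
    """True if any class proportion drifts by more than `tolerance`."""
    train_r = class_ratios(train_labels)
    pred_r = class_ratios(predicted_labels)
    return any(abs(train_r.get(c, 0.0) - pred_r.get(c, 0.0)) > tolerance
               for c in set(train_r) | set(pred_r))

# Training was 99% class 1 / 1% class 0; predictions come out 90% / 10%.
train_labels = [1] * 990 + [0] * 10
predictions = [1] * 900 + [0] * 100
print(ratio_alert(train_labels, predictions))  # True -> investigate
```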

  • Correlation: Monitoring pairwise correlations between individual predictors helps surface the underlying drift.
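One way to operationalize this is to compare the feature correlation matrix computed on the training data with the one computed on recent data, and flag the largest pairwise change. A numpy sketch in which a strong train-time correlation disappears at serving time (synthetic data):

```python
import numpy as np

def max_correlation_shift(train_X, new_X):
    """Largest absolute change in any pairwise feature correlation."""
    return np.max(np.abs(np.corrcoef(train_X, rowvar=False)
                         - np.corrcoef(new_X, rowvar=False)))

rng = np.random.default_rng(7)
n = 4000
# Training data: feature 1 is strongly tied to feature 0
x0 = rng.normal(size=n)
train_X = np.column_stack(
    [x0, 0.9 * x0 + 0.1 * rng.normal(size=n), rng.normal(size=n)])
# New data: that relationship has broken down
new_X = np.column_stack(
    [rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])

shift = max_correlation_shift(train_X, new_X)
print(shift)  # close to 1.0: the correlation structure has drifted
```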

Retraining Strategy:

  1. Fixed periodic interval: retrain the model on a set schedule.
  2. Dynamic periodicity: if the incoming data changes frequently, model retraining can happen as often as daily.
  3. Auto-monitoring the performance metrics to decide the retraining trigger point; this is more effective than the fixed approaches above. You need to decide the threshold specifying the acceptable level of performance divergence that initiates retraining. The following factors need to be considered while deciding the threshold:
  • Too low a threshold will lead to frequent retraining, which increases overhead in terms of compute cost
  • Too high a threshold will let the model keep emitting “strayed predictions” for too long
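A minimal monitoring loop built on that idea might look like the following; the metric, its readings, and the 0.05 threshold are placeholders for whatever your business case dictates:

```python
def should_retrain(baseline_metric, live_metric, threshold=0.05):
    """Trigger retraining when live performance diverges from the
    deployment-time baseline by more than `threshold`."""
    return (baseline_metric - live_metric) > threshold

baseline_auc = 0.88  # validation score when the model was deployed
recent_auc_readings = [0.87, 0.86, 0.84, 0.81]

for auc in recent_auc_readings:
    if should_retrain(baseline_auc, auc):
        print(f"AUC {auc}: divergence exceeded, retraining triggered")
        break
    print(f"AUC {auc}: within tolerance")
```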

How much new data should be collected before retraining?

Should it be ’n’ rows in and ’n’ rows out? Or should we keep adding new data without removing older data? What is a good mix? Well, there is no one-size-fits-all answer, but it largely depends on the following factors:

  • If business experience suggests that the new data is highly dynamic, keep including the new data by replacing the older data.
  • But if data drift does not happen that frequently, wait until sufficient samples of new training data have been collected.
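Both policies can be expressed as a simple windowing decision over a chronologically ordered training set. A hypothetical sketch:

```python
def build_training_set(old_rows, new_rows, dynamic, window=1000):
    """Combine old and new data according to the drift profile.

    dynamic=True  -> sliding window: keep only the newest `window`
                     rows, so old data ages out ("n rows in, n rows out")
    dynamic=False -> accumulate: keep all history while new samples
                     are being collected
    """
    combined = old_rows + new_rows
    return combined[-window:] if dynamic else combined

old = list(range(900))        # 900 historical rows (stand-ins)
new = list(range(900, 1200))  # 300 freshly collected rows

fast_moving = build_training_set(old, new, dynamic=True)
stable = build_training_set(old, new, dynamic=False)
print(len(fast_moving))  # 1000: the 200 oldest rows were dropped
print(len(stable))       # 1200: nothing discarded yet
```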
[Figure: Retraining signal trigger. Source: author]

Translated from: https://towardsdatascience.com/when-are-you-planning-to-retrain-your-machine-learning-model-5349eb0c4706
