Our Data Science Robot Intern

Machine learning practitioners know how overwhelming the number of possibilities we have when building a model can be. It's like going to a restaurant with a menu the size of a book when we have never tried any of the dishes. Which models do we test? How do we configure their parameters? Which features do we use? Those who try to solve this problem with ad-hoc manual experiments end up having their time consumed by menial tasks and their work constantly interrupted to check results and launch new experiments. This has motivated the rise of the field of automated machine learning.

The main tasks of automated ML are hyperparameter optimization and feature selection, and in this article we will present Legiti's solution to these tasks (no, we didn't hire an intern to do that). We developed a simple algorithm that addresses both challenges and is designed for rapidly changing environments, such as the ones data scientists face when working on real-world business problems. We present some of its convenient properties and limitations, as well as possible extensions to the algorithm, and end by showing the results of applying it to one of our models.

Introduction and motivation to the problem

At Legiti, we build machine learning models to fight credit card transactional fraud; that is, to predict whether an online purchase was made by (or with the consent of) the person who owns the credit card. We only learn that fraud occurred some time later, when the owner of the credit card requests a refund, a process known as a chargeback. This information is translated into inputs for a supervised learning algorithm.

If you already have a good knowledge of common tools for feature selection and hyperparameter optimization, you can skip to the "An outline of (the first version of) our algorithm" section.

Machine learning models usually take many parameters that can be adjusted

In our case, we’re currently using XGBoost. XGBoost builds a tree ensemble (a set of decision trees that are averaged for prediction) with a technique called gradient boosting. This algorithm usually works quite well “out of the box”, but its performance can be maximized by tuning some characteristics of the model, known as “hyperparameters”. Some examples of hyperparameters are the depth of the trees of the ensemble and the number of them.

Hyperparameters are not exclusive to XGBoost. Other ensemble algorithms share a similar set of hyperparameters, and different ones exist for neural networks, such as the number of layers and the number of nodes in each layer. Even linear models can have hyperparameters; think of the L1 and L2 regularization parameters in Lasso/Ridge regression. Any applied machine learning practitioner who wants to maximize the performance of their model will, sooner or later, be engaged in the task of optimizing these parameters, usually by trial and error.

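For illustration, this is roughly what a hyperparameter configuration looks like in XGBoost's Python API; the values below are hypothetical placeholders for the sketch, not the ones we actually use.

    # Illustrative only: typical XGBoost hyperparameters with hypothetical values.
    import xgboost as xgb

    params = {
        "max_depth": 6,        # depth of each tree in the ensemble
        "n_estimators": 300,   # number of trees in the ensemble
        "learning_rate": 0.1,  # shrinkage applied at each boosting step
        "subsample": 0.8,      # fraction of rows sampled for each tree
    }

    model = xgb.XGBClassifier(**params)
    # model.fit(X_train, y_train) would then train the ensemble with this configuration.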

We also usually have many features at our disposal

In solutions that use a lot of structured data to make predictions (in contrast with, for example, image or text processing with deep learning), a lot of the work is not on the model itself but on its inputs; that is, on the features of the model. A problem similar to hyperparameter selection arises here: the problem of feature selection.

Feature selection consists of choosing a configuration of features that maximizes performance. Usually, performance is measured as predictive capability (some accuracy metric on out-of-time validation sets), but other factors can count as well, such as training speed and model interpretability. At Legiti, we have more than 1000 features at our disposal for one of our models. However, simply throwing all 1000 features at a model doesn't necessarily lead to better accuracy than, for example, using only 50 carefully selected features. On top of that, using many features that add little predictive firepower can make training time much longer.

[Image] The number of features we are choosing from is higher than the variety of products on this shelf. Photo by NeONBRAND on Unsplash.

No one wants data scientists wasting their time with manual tasks

Unfortunately, when training a model, the model won't tell you what the optimal hyperparameter and feature sets are. What data scientists do is observe out-of-sample performance to avoid the effect of overfitting (this can be done in multiple ways, but we won't go into details here). The simplest way to address this problem is to experiment manually with different sets of hyperparameters and features, eventually developing an intuitive understanding of which kinds of modifications are important. However, this is a very time-consuming practice, and we do not want to waste our data scientists' time with such manual tasks. What happens then is that most machine learning "power users" develop and/or use algorithms to automate that process.

Some existing algorithms

Several feature selection and hyperparameter optimization algorithms exist, and many of them are packaged in open-source libraries. Some basic examples of feature selection algorithms are backward selection and forward selection.

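As a flavor of what forward selection looks like in practice, here is a minimal sketch using scikit-learn's SequentialFeatureSelector on synthetic data; this is just an illustration of the idea, not the tooling we use.

    # Minimal sketch of forward feature selection with scikit-learn (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=30, random_state=0)
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=10,
        direction="forward",   # "backward" would give backward selection
        cv=5,
    )
    selector.fit(X, y)
    print(selector.get_support())  # boolean mask of the selected features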

Another alternative is using the Lasso, an L1-regularized linear regression model that has sparseness properties. That means that many features end up with weight 0. Some people use the Lasso only to select the features that were assigned nonzero weights and discard the actual regression model.

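A hedged sketch of this idea, using scikit-learn on synthetic data; the alpha value is arbitrary and would need tuning in practice.

    # Sketch of using the Lasso purely as a feature filter (synthetic data, arbitrary alpha).
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=500, n_features=100, n_informative=10, random_state=0)
    X = StandardScaler().fit_transform(X)   # the Lasso is sensitive to feature scale

    lasso = Lasso(alpha=1.0).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)  # features that received a nonzero weight
    print(f"kept {selected.size} of {X.shape[1]} features")
    # The regression model itself is then discarded; only the selected columns are reused.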

Hyperparameter optimization is usually done either with very simple algorithms (random or grid search) or with complex Bayesian optimization surrogate models, such as the ones implemented in the Hyperopt and Spearmint packages.

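To give a sense of what the Bayesian-optimization route looks like, here is a minimal Hyperopt sketch; the objective function below is a dummy stand-in for a real cross-validated training run.

    # Minimal Hyperopt example; the objective is a dummy stand-in for a real CV loss.
    from hyperopt import fmin, hp, tpe

    space = {
        "max_depth": hp.quniform("max_depth", 3, 10, 1),
        "learning_rate": hp.loguniform("learning_rate", -5, 0),
    }

    def objective(params):
        # In practice this would train a model and return a cross-validated loss.
        return (params["max_depth"] - 6) ** 2 + params["learning_rate"]

    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
    print(best)  # the hyperparameter values with the lowest observed loss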

However, despite all this available tooling, as far as we know there is no universal tool that works well for both hyperparameter optimization and feature selection. Another common difficulty is that the optimal hyperparameter set depends on the set of features being used, and vice-versa. Finally, most of these algorithms were designed for a laboratory environment, such as research programs or Kaggle competitions, where usually neither the data nor the problem changes over time.

Given all of that, we decided to build our own.

An outline of (the first version of) our algorithm

When we decided to build our own thing instead of relying on other methods, it wasn't because we thought we could do something "superior" to anything else available. In fact, it was more of a pragmatic decision. We didn't have anything at all in place (we were at the stage of manually tweaking features and hyperparameters) and we wanted to build a sort of minimum viable solution. None of the algorithms we found would work without significant adaptation, so we decided to code a very simple algorithm and replace it with more elaborate solutions later. However, this algorithm exceeded our expectations and we have been using it ever since. Some numbers are available later in the results section.

Description of the algorithm

The strategy we took was to use our best solution at the time as the initial anchor and apply small random modifications to it. And so, we built an algorithm that takes the following general steps:

1) Choose randomly between a change in features or hyperparameters

2.1) If we are changing a feature, choose a random feature from our feature pool and swap it in the model (that is, remove the feature if it is already in the model, or add it if it is not)

2.2) If we are changing a hyperparameter, choose a random hyperparameter out of the ones we’re using and then randomly increase it or decrease it by a certain configurable value

3) Test the model with these small changes in our cross-validation procedure. If the out-of-sample accuracy metric is higher, accept the change to the model; if it doesn’t improve, reject this change.

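A minimal sketch of these steps is below. The helpers evaluate, feature_pool and step_sizes are hypothetical stand-ins for our actual cross-validation procedure, feature inventory and hyperparameter step configuration.

    # Sketch of the loop above; `evaluate`, `feature_pool` and `step_sizes` are
    # hypothetical stand-ins, and the model is represented as a plain dict.
    import random

    def optimize_forever(model, feature_pool, step_sizes, evaluate):
        best_score = evaluate(model)
        while True:
            candidate = {"features": set(model["features"]),
                         "hyperparams": dict(model["hyperparams"])}
            # 1) choose randomly between a feature change and a hyperparameter change
            if random.random() < 0.5:
                # 2.1) swap a random feature in or out of the model
                feature = random.choice(feature_pool)
                if feature in candidate["features"]:
                    candidate["features"].remove(feature)
                else:
                    candidate["features"].add(feature)
            else:
                # 2.2) nudge a random hyperparameter up or down by a configurable step
                name, step = random.choice(list(step_sizes.items()))
                candidate["hyperparams"][name] += random.choice([-step, step])
            # 3) accept the change only if the out-of-sample metric improves
            score = evaluate(candidate)
            if score > best_score:
                model, best_score = candidate, score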

Perhaps some readers will note similarities between this and the Metropolis-Hastings algorithm or simulated annealing. We can actually see it as a simplification of simulated annealing in which there is no variable "temperature" and the random acceptance condition becomes a deterministic one.

Some big advantages

A few nice properties of this algorithm are:

  • It works for both hyperparameter and feature selection. There's no need to implement and maintain two different processes/algorithms, both from the mathematical/conceptual point of view (there are fewer concepts to think about) and from the technological point of view (there's less code to write and maintain, so fewer sources of bugs).

  • This method can quickly find a solution that is better than our current best. Unlike many algorithms, there's no need to execute a very long-running task (such as backward/forward selection). We can interrupt it at any time and relaunch it right afterward. This is particularly useful in our circumstances, where we are constantly receiving new data and changing our evaluation procedure to accommodate it. What we do then is simply interrupt the process, evaluate the current best model with the new data, and start running the process again. If we were using Hyperopt instead, for example, all the history of samples gathered to build the surrogate model would suddenly no longer be valuable.

  • It is a never-ending process. This is interesting for a practical reason: no idle time for our machines.

  • The setup is very easy. The only "state" needed in the system is the current best model, which we already track in version control. There's no need for a database or files to keep a history of runs, for example.

  • It is easy to understand what happens under the hood. So when things are not working very well, we don't need to think about Gaussian processes (the surrogate model used by most Bayesian hyperparameter optimization algorithms) or anything like that.

Limitations

Of course, this algorithm is not without challenges. However, for all of the most relevant difficulties, there are ways to overcome them with relatively small extensions to the algorithm. Some of the limitations we found are:

  • It can get stuck at local optima, since all tested alternatives are derived from the current best model. A solution is to make multiple random changes at the same time; that is, instead of swapping one feature, we swap two features, or we change a hyperparameter and swap a feature, and then test this new model. By doing this we increase the number of possible modifications the algorithm can make, therefore increasing its search space and decreasing the chances of getting stuck.

[Image] Image by peach turns pro on Wordpress, with modifications.

  • It can select highly correlated features. If interpretability is a big concern for you, this algorithm might not be the best bet. Eliminating features based on their correlation with other features might be a good choice even if it decreases out-of-sample accuracy. We, however, clearly choose accuracy in the accuracy vs. interpretability trade-off.

  • New features take too long to be tested if the number of features is too high. We are already running into this problem, since we have more than 1300 features in our feature pool. To be tested, a new feature has to wait, on average, hundreds of times the duration of one iteration of the algorithm, where one iteration consists of calculating the out-of-sample accuracy of the newly proposed model with, for example, a cross-validation procedure. To put that in numbers, if the cross-validation procedure takes 10 minutes and we have to wait 600 iterations of the algorithm before a feature is tested, we will wait roughly 100 hours, more than four days, to test a single feature. A solution here is to build a feature queue: for every feature, we store the last time it was swapped, and then, instead of choosing features randomly, we choose the feature that hasn't been tested for the longest period, or one that was never tested (a sketch of this queue appears after this list).

  • The running time is directly correlated with the model evaluation time. Therefore, if the evaluation process (usually cross-validation) takes a long time to run, this algorithm will be very slow as well. A possible solution here is to introduce a pre-selection step: for example, before fully testing a model, we first run only one of the cross-validation iterations and pass the model to the next step (the full test) only if this partial metric outperforms the current best model's partial metric.

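The feature queue mentioned above could look something like this; the names are hypothetical and this is only a sketch of the bookkeeping involved.

    # Hypothetical sketch of the feature queue: pick the feature that has waited
    # longest since its last swap; never-tested features come first.
    import time

    last_swapped = {}  # feature name -> timestamp of its last swap

    def next_feature(feature_pool):
        return min(feature_pool, key=lambda f: last_swapped.get(f, 0.0))

    def mark_swapped(feature):
        last_swapped[feature] = time.time()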

Results

Most importantly, none of this would matter if we couldn’t reap the benefits in terms of better fraud identification for our customers. Fortunately, Optimus Prime (as we decided to call our robot intern) is a results-improvement machine.

You can see some results below. In this specific case, Optimus Prime found multiple ways to improve our model, and we can see in the pull request the difference in our metrics from the old model to the new one.

[Image: the pull request opened with the new model, showing the difference in our metrics from the old model to the new one]

Note that this PR was not opened by a human. With the help of Python packages for Git and GitHub, we can automate this entire process. All we do now is wait until we are notified of a new pull request with an improvement from Optimus, check the results, and accept the pull request.

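As a rough idea of how such a pull request can be opened programmatically, here is a hedged sketch using GitPython and PyGithub; the repository name, file, branch and token below are placeholders, not our actual setup.

    # Hedged sketch: commit the improved model and open a PR automatically.
    # Repository, branch, file and token below are placeholders.
    import git
    from github import Github

    branch = "optimus/improved-model"
    repo = git.Repo(".")
    repo.git.checkout("-b", branch)
    repo.index.add(["model_config.json"])           # hypothetical file with the new model
    repo.index.commit("Improve model metrics")
    repo.remote("origin").push(refspec=f"{branch}:{branch}")

    gh = Github("<token>")
    gh_repo = gh.get_repo("<org>/<repo>")
    gh_repo.create_pull(
        title="Model improvement found by Optimus Prime",
        body="Metric comparison plus a randomly selected word of wisdom.",
        head=branch,
        base="main",
    )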

A very important feature of the GitHub API is that we can add text to pull request descriptions; without it, we wouldn’t be able to put randomly selected words of wisdom by Optimus Prime in them. This way we can always be reminded of our place in the universe and keep strengthening our company culture.

[Images] Art by Naihaan at DeviantArt.

Next steps

Some of our main challenges were described in the limitations section. The ones we think are the most relevant for us at the moment are:

1. our lack of certainty about whether the algorithm can get out of local optima, and

2. the long time it takes for our algorithm to test new features.

The first extension that we're testing is alternating between feature and hyperparameter optimization modes. The feature optimization mode uses exactly the same algorithm we described, while for hyperparameter optimization we're trying out Hyperopt, but with a limited number of trials.

Conclusion

In this article we gave a brief introduction to the field of automated machine learning, presenting some well-known tools as well as the solution we built at Legiti. As shown in the results section, we now get continuous improvements in predictive capability, and on top of that with almost no cumulative work from our data scientists, who can now focus more on developing new features and on research in general.

Even with the success we're having, we don't see our work on this front as finished. We now see this process as a core part of our product and, like any other part, it should undergo constant improvement.

If you have similar experiences trying to build an automated machine learning system, feel welcome to contribute to the discussion in the comments below!

Translated from: https://medium.com/legiti/our-data-science-robot-intern-cea29894d40
