Research Guidelines for Machine Learning Projects

weixin_26726011

Original link: https://towardsdatascience.com/research-guidelines-for-machine-learning-projects-3a137c008277

Machine Learning projects can be delivered in two stages. The first stage is named Research and is about answering the question: can we make a machine learning model out of this pile of data that serves the needs of the client? The deliverable is a proof of concept or a feasibility study. The second stage is named Development and it is about committing to deliver a machine learning product. The deliverable of this stage is a machine learning product.

During the Research stage, when project resources are limited and time is short, I consider two aspects especially relevant: first, defining a scope for your problem, and second, testing your model in short iterations.

Scope

One of the most difficult tasks in this stage is scoping your efforts. For example, think about how you access the data: you can work with live data (connecting to a live database) or simply read data from files. Working with a database means you need access to infrastructure, have to deal with authentication, and so on, which, depending on the company, can delay the start of the project. On the other hand, you can request a dump of the data and start working the next day.

The good (and bad) thing about the scoping exercise is that it narrows down all your possible actions. In some way, this is good because you strip down all the things that are not necessary for delivering a proof of concept:

Focus on a single problem. If the problem is too big, split it into many small problems (divide and conquer). For example, if your problem is worldwide, re-scope it to a single continent or just focus on a single country. You can also reduce complexity by narrowing the timespan of your problem (for example, focus on a specific date range: last year, last decade, etc.).
The scope is not set in stone, and you should be agile about it. As you delve into the business domain and gain more experience wrangling the data, there will be times when you shift your scope. This is one of the reasons I use Kanban instead of Scrum during research, as it is more flexible.
Until you have the zero model (explained below), you will not know whether you have enough data. But I will tell you one secret: in machine learning, you never have too much data, but you might have biased or unrepresentative data, which can distort your inferences. If you have too much data, you can always downsample your dataset; but if you have too few samples, upsampling (augmentation) techniques might add unwanted variance to your dataset. The bottom line: if you have several candidate scopes to choose from, go for the biggest one (size-wise).
It should be possible to handle the problem on a single computer. The exception is deep learning models, for which you will need to add one or more GPUs to the computer. You can use either your own machine or a virtual machine inside the client’s VNET (recommended when the data is sensitive). If you need a Spark cluster to transform your data, think again about the scope of your problem.
Agree on the acceptance criteria with the client. This point is about negotiating with the client what the project’s deliverable will be. Focus: this is not an ML product (yet) but a proof of concept. So, for example, do you really need a blue-green deployment? Would you rather deploy the model as a REST API using Flask? Or is it enough to save the model’s predictions in an Excel file so that the business expert can review them later?
The acceptance criteria are one of the drivers of your backlog (Jupyter Notebook or Python module, Docker or local deployment, project log, model artifact, model predictions, etc.). Eventually, the metric and/or condition criteria (how to evaluate the goodness of your solution) should also be addressed by the client (more on this in the Metrics section below).
The focus should favor simplicity, because simplicity reduces time and effort. The aim of the research stage should be to answer the question of the ML product’s feasibility as soon as possible. Keep this in mind when deciding the scope and your acceptance criteria.
The proof of concept is different from a throw-away prototype. All the effort invested in this stage should become part of a future product. Most of the time, the data scientist’s experience is what makes the difference in reusing not only the knowledge learned but also the source code and infrastructure. The machine learning engineer should help to address these issues (for example, by enforcing best practices).
In case it’s not clear enough, I repeated the word “focus” five times in this section (now this is the sixth). Does it make sense now?
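
The downsampling mentioned in the list above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the author's code; the `downsample` helper, the `records` data and the sizes are made-up assumptions. Note that uniform random sampling keeps whatever bias the full dataset already has.

```python
import random


def downsample(rows, n, seed=0):
    """Return n rows drawn uniformly at random, without replacement."""
    if n >= len(rows):
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(rows, n)


# Hypothetical dataset: 10,000 records, of which we keep 1,000.
records = [{"id": i, "value": i % 7} for i in range(10_000)]
subset = downsample(records, 1_000)
print(len(subset))
```

Fixing the seed makes the reduced dataset repeatable across experiments, which keeps later model comparisons fair.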

Model testing

A little disclaimer: most people think building a model is the most complex part of an ML project. It is not, because there is a good chance that the ML algorithm you need is already implemented in a library like sklearn, h2o or pycaret. And I’m not even going to mention AutoML techniques or the ML libraries available in languages like R or Julia.

Stick to the basics; these will solve 90% of your projects. Once you understand the basics, you can dare to jump ahead and use more complex ML/DL algorithms. It is easier to grasp the theory and fundamentals of the algorithms once you’ve used them, so do not think you need to fully understand them before using them. This is the learning technique Jeremy Howard uses in his courses, which I consider a fantastic way to learn for non-PhD people like myself.

So, if building the model is not the hardest part, what is? The rest of the pieces around the ML model, like feature engineering, serving the model, etc.

[Figure: Hidden Technical Debt in Machine Learning Systems]

Most of these other parts fall beyond the scope of this post (and some of them belong to the development stage that follows the research stage).

Iterative approach

During iteration zero of your model, my recommendation is to build your model and collect its predictions as soon as possible. Go for the quickest win: for example, keep feature engineering to a minimum (only convert non-numerical features to numerical ones) and build a simple model. The ideal result is that this zero model’s predictions are better than a random choice or a summary statistic (mean, median, etc.), which is the most basic prediction (no machine learning involved, just pure arithmetic).
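
To make the comparison concrete, here is a minimal sketch that pits a mean-only baseline against a very simple model. Everything in it is an assumption for illustration: the data is synthetic, and a one-feature least-squares line stands in for whatever quick model you would actually build.

```python
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


# Synthetic data (an assumption for illustration): y = 3x + 5 plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 5 + rng.gauss(0, 1) for x in xs]
x_train, y_train = xs[:150], ys[:150]
x_test, y_test = xs[150:], ys[150:]

# Baseline: always predict the training mean (pure arithmetic, no ML).
baseline_pred = [mean(y_train)] * len(y_test)

# Zero model: the simplest model that can be fitted quickly.
a, b = fit_line(x_train, y_train)
zero_pred = [a * x + b for x in x_test]

baseline_mae = mae(y_test, baseline_pred)
zero_mae = mae(y_test, zero_pred)
print(f"baseline MAE: {baseline_mae:.2f}, zero model MAE: {zero_mae:.2f}")
```

If the zero model cannot beat the baseline on held-out data, that is a signal to revisit the data or the scope before investing in anything more complex.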

PS: Do not feel down if you do not get it the first time you try. This is just the first step on a long trip.

I call the model you obtain during the first half of iteration zero the zero model. During the other half of iteration zero, you will build a better model by updating the training dataset (for example, pre-processing the data more aggressively) or by optimizing your model hyperparameters. I call this second model the null model.

During the following iteration(s), you will build the alternative model, which plays the counterpart to the null model. This procedure is similar to hypothesis testing: you find out whether you are on the right track by comparing models against each other. The way I work is that I try to complete at least two iterations, testing two different models. But as the saying goes, “the more, the merrier”, and you are only limited by the time at your disposal, that is, by your scope.
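
One way to keep such comparisons honest is to score every candidate on the same validation data with the same metric. The sketch below assumes that setup; the `compare` helper, the two lambda "models" and the data are hypothetical stand-ins, and this is a plain score comparison, not a statistical hypothesis test.

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def compare(models, x_val, y_val, metric):
    """Score every model on the same validation data; lower is better."""
    scores = {
        name: metric(y_val, [predict(x) for x in x_val])
        for name, predict in models.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical validation set that follows y = x**2.
x_val = [0, 1, 2, 3, 4]
y_val = [0, 1, 4, 9, 16]

models = {
    "null": lambda x: 4 * x - 2,     # stand-in for the tuned first model
    "alternative": lambda x: x * x,  # stand-in for the challenger
}
best, scores = compare(models, x_val, y_val, mean_squared_error)
print(best, scores)
```

Each new iteration then only has to add one more entry to `models`; the winner of one round becomes the model to beat in the next.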

Metrics

The metric you use depends on the type of ML problem. For example, for a regression problem you can use RMSE, and for a classification problem you can use accuracy. Among other things, metrics are used to track the progress of the ML models, so you can measure how a change in the data or the model hyperparameters affects the model’s predictions. Your experiments should always be metrics-driven. Choosing the right metric is as important as choosing the model itself, and you should know that “there is no such thing as a free lunch”: no single choice works best everywhere. You will need to experiment and check what works best for your problem.
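
As a small illustration of why the choice of metric matters, the sketch below implements RMSE and accuracy by hand and then shows accuracy looking good on imbalanced labels while recall exposes the problem. The `recall` helper and the toy labels are additions for illustration, not something the text above prescribes.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model found."""
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return found / sum(t == positive for t in y_true)


print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))

# Imbalanced toy labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "model" that never predicts the positive class

print(accuracy(y_true, always_negative))  # high, although the model is useless
print(recall(y_true, always_negative))    # zero: no positive was found
```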

Most of the time, the stakeholder will not know anything about RMSE (or any other ML metric), but you are responsible for showing them that the chosen metric aligns with the business goal. In this case, you have two options: a) explain the meaning of the ML metric and prove its importance, or b) develop a parallel business metric. The business metric is an indicator expressed in business domain units. For example, in the case of a recommender for an online clothing store, a business metric can evaluate how well the model recommends products with the same gender as the person they are recommended for. The business metric is easier for the stakeholders to understand, and eventually it will turn out to be an important driver of your model.
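
The clothing-store example could be turned into a business metric along these lines. This is only a sketch: the `gender_match_rate` name, the pair-based input format and the rule that "unisex" products match everyone are all assumptions made for illustration.

```python
def gender_match_rate(recommendations):
    """Share of recommended products whose target gender fits the customer.

    Each recommendation is a (customer_gender, product_gender) pair.
    Assumption: products labelled "unisex" count as a match for everyone.
    """
    if not recommendations:
        raise ValueError("no recommendations to evaluate")
    matches = sum(
        product in (customer, "unisex")
        for customer, product in recommendations
    )
    return matches / len(recommendations)


# Hypothetical model output for four recommendations.
recs = [
    ("female", "female"),
    ("male", "female"),
    ("male", "unisex"),
    ("female", "female"),
]
print(f"{gender_match_rate(recs):.0%} of recommendations match")
```

A figure such as "three out of four recommendations fit the customer" is something a stakeholder can judge directly, without knowing how the model was trained.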

Each iteration/experiment needs to be recorded. The bare minimum you need to record is the results of your training: the resulting metrics. Jupyter Notebooks can serve as a simple log for this purpose. From there, you can move to more elegant solutions that allow you to track not only the metrics but also the data used for training and the model generated (along with the hyperparameters used to obtain it): MLflow, Weights & Biases, …
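
Before adopting a dedicated tracking tool, the bare minimum can be as simple as appending one JSON line per run to a file. The sketch below assumes that approach; the `log_run` and `load_runs` helpers, the file name, the parameter names and the metric values are all hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_run(path, params, metrics):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "metrics": metrics,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def load_runs(path):
    """Read every logged run back as a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical runs; the values below are made up for the example.
log_path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
log_run(log_path, {"model": "zero", "max_depth": 3}, {"rmse": 4.2})
log_run(log_path, {"model": "null", "max_depth": 6}, {"rmse": 3.7})

best = min(load_runs(log_path), key=lambda run: run["metrics"]["rmse"])
print(best["params"])
```

Because each record stores the hyperparameters next to the metrics, the best run can be reproduced later instead of being reconstructed from memory.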

There are three possible outcomes of the Research stage:

Promising results that are not yet enough to meet the condition criteria: the best option is to extend the proof of concept. To improve your model results, you can try changing the data pre-processing and/or the machine learning model.
The model meets the condition criteria to move to development (the best result): the best course of action is to develop the actual product starting from the proof of concept and, in parallel, refine the current proof of concept.
The current model doesn’t meet the condition criteria (for example, because of a lack of data or the lack of a better machine learning algorithm to model the problem): you can put the case on hold until there is a change in the case’s context.

If you’re interested in these kinds of issues, I recommend reading Managing Machine Learning Projects, from Amazon’s Machine Learning University.

The next post will be more technical, and I will delve into some tools and techniques I use during the Research stage. Stay tuned.
