Machine Learning Testing: Test-First Machine Learning

weixin_26752765

Original article: https://medium.com/swlh/test-first-machine-learning-8d2cadc3ffe

Testing software is one of the most complex tasks in software engineering. Traditional software engineering has principles that define unambiguously how software should be tested, but the same does not hold for machine learning, where testing strategies are not always defined. In this post, I describe a testing approach that is strongly influenced by test-driven development, one of the most recognized testing strategies in software engineering. The approach also appears to be agnostic to the family of machine learning models under test, and it adapts well to the typical production environments behind today's large-scale AI/ML services.

After reading this post, you will know how to set up a testing strategy for machine learning models with production in mind. Production in mind means that the team you work in is heterogeneous: the project under test is developed together with other data scientists, data engineers, business customers, developers, and testers. The goals of a good testing strategy are to achieve production readiness and to improve code maintainability.

An appropriate name for the approach is Test-First machine learning, TFML for short, because everything starts from writing tests rather than models.

Steps of TFML

A characteristic of TFML is that it starts from writing tests instead of machine learning models. The approach is based on mocking whatever is not yet available, so that the different actors involved in the project can proceed with their tasks anyway. Data scientists and data engineers are known to work at different paces. Mocking a part of the system that is not yet available mitigates that difference and reduces blockers within larger teams, which in turn increases efficiency. Below are the five essential steps of a TFML approach.

1. Write a test

As the name suggests, Test-First in TFML means that everything starts with writing a test, even for a feature that does not yet exist. Such a test is usually very short and should stay that way. Larger and more complex tests should be broken down into their essential, testable components. A test can be written once the feature's specs and requirements are understood; these are usually discussed earlier, during requirement analysis (e.g. use cases and user stories).

2. Define why the test passes or fails

A working test fails or passes for the right reasons. This is the step in which those reasons are defined. Defining the happy path is essential to deciding what should be observed and what counts as a success.

3. Write the code

In this step, the code that leads to the happy path is actually written. This code causes the test to pass. No code beyond the test's happy path should be provided. For example, if a machine learning model is expected to return 42, one can simply return 42 and force the test to succeed. If timing constraints need to be represented, adding sleep(milliseconds) is also acceptable. Such mocked values give engineers visible constraints, so they can proceed with their tasks as if the model were complete and working.

4. Run tests

Adding new tests should never break the previous ones. Having tests that depend on each other is considered an anti-pattern in software engineering.

5. Add functionality (+ cleanup + refactor)

Once values are mocked, success conditions are defined, and tests are running, it is time to show that the ML model under test actually trains and makes predictions. For the example above, this step should answer questions such as:

Is the test breaking the constraints we set previously?
Is our ML model returning 84 rather than 42?
How about time constraints?
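One way to answer these questions is to run the same constraints against a real model once it replaces the mock. The sketch below uses a tiny one-feature least-squares regressor as a stand-in for a real model; the training data, the query point 21.0, and the 0.5 s budget are assumptions chosen so that the expected answer is 42.

```python
import math
import time


class LinearModel:
    """Tiny one-feature least-squares regressor, standing in for a real model."""

    def fit(self, xs, ys):
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        variance = sum((x - mean_x) ** 2 for x in xs)
        self.slope = covariance / variance
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        return self.slope * x + self.intercept


def check_constraints(model, budget_seconds=0.5):
    """Same constraints the mock satisfied: value near 42, within the budget."""
    start = time.perf_counter()
    prediction = model.predict(21.0)
    elapsed = time.perf_counter() - start
    assert math.isclose(prediction, 42.0, rel_tol=1e-6), prediction
    assert elapsed < budget_seconds, elapsed
    return prediction


# Assumed training data following y = 2x, so predict(21.0) should be about 42.
model = LinearModel().fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
prediction = check_constraints(model)
```

A model that returned 84, or one that blew the time budget, would fail `check_constraints` with the offending value in the assertion message.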

Traditionally, this is the step in which developers clean up, deduplicate, and refactor code (wherever it applies) to improve both readability and maintainability. ML developers should do the same.

Falling into the trap of alternative approaches is easier in machine learning, given its nature and the enthusiasm of data scientists who connect to, train on, and analyze data in no time.

The most common approach in the data science community is probably the Test-Last approach, a.k.a. "code now, test later". This approach can be extremely risky in ML model development: even a trivial linear regression may involve many more moving parts than traditional software (e.g. UIs, API calls, data streams, databases, preprocessing steps). The Test-First approach, by contrast, encourages and even forces developers to put the minimum amount of code into modules that depend on such moving parts (e.g. UIs and databases) and to implement the logic in the testable section of the codebase.
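As an illustration of keeping logic out of the modules that touch moving parts, the sketch below separates a pure preprocessing function from a thin data-access adapter. The function names and the min-max scaling step are assumptions picked for the example, not something prescribed by the original post.

```python
def scale_features(rows):
    """Pure preprocessing logic: min-max scale each column to [0, 1]."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = (high - low) or 1.0  # avoid division by zero on constant columns
        scaled_columns.append([(value - low) / span for value in column])
    return [list(row) for row in zip(*scaled_columns)]


def load_rows(fetch):
    """Thin adapter around a moving part (database, API, stream).

    `fetch` is any callable returning raw rows; only type conversion
    happens here, so there is very little left to test.
    """
    return [[float(value) for value in row] for row in fetch()]


def test_scale_features():
    rows = [[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]]
    assert scale_features(rows) == [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]


test_scale_features()
```

The logic in `scale_features` can be tested without any database or API, while `load_rows` can be fed a fake `fetch` in tests and the real one in production.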

One important pitfall to avoid is developer bias. Tests created in a Test-First environment are usually written by the same developer who writes the code being tested. This can be a problem if, for example, the developer does not think of checking certain input parameters: in that case, neither the test nor the code will verify them. There is a reason why, in traditional software development, testing engineers and developers are usually not the same individuals.

TFML anti-patterns

Below are some anti-patterns in TFML.

Test dependence

Tests should be standalone. Tests that depend on other tests can lead to cascading failures, or cascading successes, that are outside the developer's control.
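A small sketch of what standalone tests can look like: each test builds its own model instance, so the tests pass in any order. `MeanModel` is a hypothetical toy model used only for this example.

```python
import random


class MeanModel:
    """Hypothetical model that predicts the mean of the targets it has seen."""

    def __init__(self):
        self.targets = []

    def fit(self, targets):
        self.targets.extend(targets)
        return self

    def predict(self):
        return sum(self.targets) / len(self.targets)


def test_fit_on_small_values():
    model = MeanModel().fit([1.0, 2.0, 3.0])  # fresh instance, no shared state
    assert model.predict() == 2.0


def test_fit_on_large_values():
    model = MeanModel().fit([40.0, 42.0, 44.0])  # fresh instance again
    assert model.predict() == 42.0


# Standalone tests pass regardless of the order in which they run.
tests = [test_fit_on_small_values, test_fit_on_large_values]
random.Random(0).shuffle(tests)
for test in tests:
    test()
```

Had both tests shared one `MeanModel` instance, the second test to run would see the first test's data and its result would depend on execution order.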

Test model precisely

As in traditional software engineering, testing for exact execution behavior, timing, or performance can lead to spurious test failures. In machine learning it is even more important to use soft constraints, because models can be probabilistic. Moreover, the ranges of output variables and input data can change. Such dynamic and sometimes loosely defined behavior is the norm rather than the exception in ML.
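Soft constraints can be expressed as tolerances and ranges rather than exact equality. The sketch below assumes a hypothetical probabilistic model (`noisy_model`) whose output is a signal plus Gaussian noise; the tolerance and range values are illustrative choices.

```python
import random


def noisy_model(x, rng):
    """Hypothetical probabilistic model: the signal 2*x plus Gaussian noise."""
    return 2.0 * x + rng.gauss(0.0, 0.5)


def test_prediction_within_tolerance():
    rng = random.Random(0)  # fixed seed keeps the test reproducible
    predictions = [noisy_model(21.0, rng) for _ in range(1000)]
    mean_prediction = sum(predictions) / len(predictions)
    # Soft constraint: the average is near 42, not exactly 42.
    assert abs(mean_prediction - 42.0) < 0.2
    # Range constraint: every single prediction stays in a plausible band.
    assert all(38.0 < p < 46.0 for p in predictions)


test_prediction_within_tolerance()
```

An exact check such as `noisy_model(21.0, rng) == 42` would fail almost every time even though the model behaves as intended.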

Test model's mathematical details

Testing model implementation details, such as statistical and mathematical soundness, is not part of the TFML strategy. Such details should be tested separately and are specific to the family of the model under consideration.

Large testing unit

The testing surface should always be minimal for the functionality under test. Keeping the testing unit small gives more control to the developer. Larger testing units should be broken down into smaller tests, specialized in one particular aspect of the models to be tested.

Conclusion

The TFML approach forces developers to spend time up front defining the testing strategy for their models. This in turn makes it easier to integrate such models into the bigger picture of complex engineering systems where larger teams are involved. It has been observed that programmers who write more tests tend to be more productive. Testing code is as important as developing the core functionality of the software, and it should be produced and maintained with the same rigor as production code. In ML, all of this becomes even more critical because of the heterogeneity of the systems and of the people involved in ML projects.

Originally published at https://codingossip.github.io on August 4, 2020.