How to Improve Data Quality for Machine Learning?

Latest recommended article published on 2024-03-29 10:35:09

weixin_26713521

Tags: machine learning, artificial intelligence, python, java, data analysis

Original article: https://towardsdatascience.com/how-to-improve-data-preparation-for-machine-learning-dde107b60091

The ultimate goal of every data scientist or Machine Learning evangelist is to create a better model with higher predictive accuracy. However, in the pursuit of fine-tuning hyperparameters or improving modeling algorithms, it is easy to overlook that the data themselves might be the culprit. There is a famous Chinese saying, “工欲善其事，必先利其器”, which literally translates to: to do a good job, an artisan must first sharpen his tools. So if the data are of poor quality, then regardless of how good a Machine Learning model is, the results will be subpar at best.

Why is data preparation so important?

Photo by Austin Distel on Unsplash

It is no secret that data preparation in the process of data analytics is ‘an essential but unsexy’ task and more than half of data scientists regard cleaning and organizing data as the least enjoyable part of their work.

Multiple surveys of data scientists and experts have indeed confirmed the common 80/20 trope: about 80% of the time is mired in the mundane janitorial work of preparing data, from collecting and cleaning to finding insights in the data (data wrangling or munging), leaving only 20% for the actual analytic work of modeling and building algorithms.

Thus, the Achilles heel of a data analytics process is in fact the disproportionate amount of time spent on data preparation alone. For data scientists, this can be a big hurdle to productivity when building a meaningful model. For businesses, this can be a huge drain on resources, because only the remaining one-fifth of the investment in data analytics goes toward its original intent.

Heard of GIGO (garbage in, garbage out)? This is exactly what happens here. A data scientist arrives at a task with a given set of data and is expected to build the best model to fulfill the goal of the task. But halfway through the assignment, he realizes that no matter how good the model is, he cannot achieve better results. After going back and forth, he finds out that there are lapses in data quality and starts scrubbing through the data to make them “clean and usable”. By the time the data are finally fit for use, the deadline is creeping in and resources are draining away, and he is left with limited time to build and refine the actual model he was hired for.

This is akin to a product recall. When defects are discovered in products already on the market, it is often too late to remedy them, and the products have to be recalled to ensure consumer safety. In most cases, the defects result from negligent quality control of the components or ingredients used in the supply chain: for example, laptops recalled due to battery issues, or chocolates recalled due to contaminated dairy ingredients. Be it a physical or digital product, the striking similarity is that the raw material is usually to blame.

But if data quality is a problem, why not just improve it?

To answer this question, we first have to understand what data quality is.

The definition of data quality is twofold. First, independent quality: a measure of the agreement between the data views presented and the same data in the real world, based on inherent characteristics and features. Second, application-dependent quality: a measure of how well the data conform to user needs for the intended purpose.

Let’s say you are a university recruiter trying to recruit fresh grads for entry-level jobs. You have a pretty accurate contact list, but as you go through it you realize that most of the contacts are people over 50 years old, which makes them unsuitable for you to approach. Applying the definition, this scenario fulfills only the first half: the list is accurate and consists of good data. But it does not meet the second criterion: the data, no matter how accurate, are not suitable for the application.
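The two halves of the definition can be scored separately. The following is a minimal sketch of the recruiter scenario, not a standard method: the `Contact` fields, the `details_verified` flag, and the age cut-off of 25 used to approximate "fresh grad" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    age: int
    details_verified: bool  # hypothetical flag: details match the real world


def intrinsic_accuracy(contacts):
    """Independent quality: share of records that agree with reality."""
    return sum(c.details_verified for c in contacts) / len(contacts)


def fitness_for_use(contacts, is_relevant):
    """Application-dependent quality: share of records that suit the task."""
    return sum(is_relevant(c) for c in contacts) / len(contacts)


def is_fresh_grad(contact):
    # Assumed use-case rule: treat anyone aged 25 or under as a fresh grad.
    return contact.age <= 25


# Illustrative data: every record is correct, but most contacts are over 50.
contacts = [
    Contact("A", 58, True),
    Contact("B", 61, True),
    Contact("C", 23, True),
    Contact("D", 55, True),
]

accuracy = intrinsic_accuracy(contacts)
fitness = fitness_for_use(contacts, is_fresh_grad)
print(f"accuracy={accuracy:.2f}, fitness for use={fitness:.2f}")
```

With this made-up list the accuracy score is perfect while the fitness score is low, which mirrors the point of the example: accurate data can still be the wrong data for the task.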

In this example, accuracy is the dimension we use to assess the inherent quality of the data. There are many other dimensions. To give you an idea of which dimensions are commonly studied in peer-reviewed literature, here is a histogram showing the top 6 dimensions after studying 15 different data quality assessment methodologies involving 32 dimensions.

A systemic approach to Data Quality Assessment

If you fail to plan, you plan to fail. A good systemic approach cannot succeed without good planning. To have a good plan, you need a thorough understanding of the business, especially of the problems associated with data quality. In the previous example, one should be aware that the contact list, albeit correct, has a data quality problem: it is not applicable to the goal of the assigned task.

After the problems become clear, the data quality dimensions to be investigated should be defined. This can be done with an empirical approach, such as a survey among stakeholders, to find out which dimensions matter most with respect to the data quality problems.

A set of assessment steps should then follow. Design the implementation so that these steps map the assessment, based on the selected dimensions, onto the actual data. For instance, the following five requirements can serve as an example:

[1] Timeframe: Decide on the interval during which the data under investigation are collected.

[2] Definition: Define a standard for differentiating good data from bad data.

[3] Aggregation: Decide how to quantify the data for the assessment.

[4] Interpretability: Provide a mathematical expression to assess the data.

[5] Threshold: Select a cut-off point to evaluate the results.
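As an illustration, the five requirements above could be applied to a single dimension such as completeness. This is only a sketch under stated assumptions: the records, the required fields, the date window, the ratio used as the mathematical expression, and the 0.9 cut-off are all hypothetical and would in practice come from the planning and stakeholder input described earlier.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"collected": date(2024, 1, 5), "email": "a@example.com", "phone": "555-0101"},
    {"collected": date(2024, 1, 20), "email": None, "phone": "555-0102"},
    {"collected": date(2024, 2, 3), "email": "c@example.com", "phone": None},
    {"collected": date(2024, 3, 15), "email": "d@example.com", "phone": "555-0104"},
]

# [1] Timeframe: only assess records collected inside the chosen interval.
START, END = date(2024, 1, 1), date(2024, 2, 29)
in_window = [r for r in records if START <= r["collected"] <= END]

# [2] Definition: a record is "good" if every required field is present.
REQUIRED_FIELDS = ("email", "phone")


def is_good(record):
    return all(record[field] is not None for field in REQUIRED_FIELDS)


# [3] Aggregation: count good records and total records in the window.
good = sum(is_good(r) for r in in_window)
total = len(in_window)

# [4] Interpretability: completeness = good records / total records.
completeness = good / total if total else 0.0

# [5] Threshold: assumed cut-off for an acceptable score.
THRESHOLD = 0.9
passes = completeness >= THRESHOLD

print(f"completeness={completeness:.2f}, passes={passes}")
```

Keeping each requirement as its own named step makes it straightforward to swap in a different dimension, for example by changing only `is_good` and the threshold.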

Once the assessment methodologies are in place, it is time to get hands-on and carry out the actual assessment. After the assessment, a reporting mechanism can be set up to evaluate the results. If the data quality is satisfactory, the data are fit for further analysis. Otherwise, the data have to be revised and potentially collected again. An example can be seen in the following illustration.
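The assess, report, and decide flow can also be sketched in code. The function names, dimension names, scores, and thresholds below are assumptions chosen for illustration; a real reporting mechanism would depend on the dimensions and cut-offs selected for the use case.

```python
def build_report(scores, thresholds):
    """Compare each dimension's score with its cut-off point."""
    return {
        dim: {
            "score": score,
            "threshold": thresholds[dim],
            "ok": score >= thresholds[dim],
        }
        for dim, score in scores.items()
    }


def decide(report):
    """Return the next step and the list of dimensions that failed."""
    failed = sorted(dim for dim, row in report.items() if not row["ok"])
    if not failed:
        return "fit for analysis", failed
    return "revise or re-collect", failed


# Hypothetical assessment results and cut-offs.
scores = {"accuracy": 0.97, "completeness": 0.82}
thresholds = {"accuracy": 0.95, "completeness": 0.90}

report = build_report(scores, thresholds)
decision, failed = decide(report)
print(decision, failed)
```

Returning the failed dimensions alongside the decision tells whoever revises the data where to start.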

Conclusion

There is no one-size-fits-all solution for all data quality problems because, as the definition outlined above shows, half of data quality is highly subjective. However, in the process of data quality assessment, we can always use a systemic approach to evaluate data quality. While this approach is largely objective and relatively versatile, some domain knowledge is still required, for example in the selection of data quality dimensions. Data accuracy and completeness might be critical for use case A, but for use case B these dimensions might be less important.