The Cost of Poor Data Quality

Original article: https://towardsdatascience.com/the-cost-of-poor-data-quality-cd308722951f

It’s remarkable how most of us now understand that AI is the way to go for becoming a market leader, regardless of the vertical you’re in. But successfully developing and adopting AI solutions takes a path, and that path is not easy! Data is one of the most important factors (besides all the technical depth around an ML solution) dictating whether an AI project will succeed. But are we taking into account that we need quality data?

That said, I have two questions:

When can I say “I have enough data”?

What is quality data, after all?

Let’s dive into these questions! 🚀

Enough is enough!

This is the question that I guess everyone, Data Scientists included, would like answered! Although it sounds simple, it isn’t. “The more the merrier” is not exactly the ideal: you can have decades of data, but if you have been collecting it without a real purpose, the data probably won’t hold the answers to the questions your business has!

In reality, many aspects affect the amount of data needed, from the use case being explored to the complexity of the problem and even the chosen algorithm.

So there is no magic number, but it’s always dangerous to assume that there’s enough, or even plenty of, data!

💎 The “Crème de la crème” of data

Perfect data does not exist when it comes to records collected from real-life systems! Don’t assume it does, and don’t expect Data Science teams to agree with your assumptions; you’ll probably be wrong 🌝. What we can do is work towards getting the data as close to perfect as possible before feeding it into a model.

But first, let’s define what high-quality data is. Data quality can be defined as a measure of data based on factors such as accuracy, completeness, consistency, reliability, and, above all, whether it is up to date.

So does this mean that the same data will have the same quality for different use cases?

No. Nevertheless, it is possible to define baseline quality metrics that are independent of the use case and already give us a pretty good idea of how much work the data will require.
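To make the idea of use-case-independent metrics a bit more concrete, here is a minimal sketch of what such baseline checks could look like. The records, field names, allowed country codes, and the one-year freshness threshold are all hypothetical, chosen only for illustration; a real project would pick its own checks and thresholds.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ana@example.com", "country": "PT", "updated": date(2020, 6, 1)},
    {"id": 2, "email": None, "country": "PT", "updated": date(2020, 5, 20)},
    {"id": 3, "email": "joe@example.com", "country": "Portugal", "updated": date(2017, 1, 15)},
    {"id": 3, "email": "joe@example.com", "country": "Portugal", "updated": date(2017, 1, 15)},
]

def completeness(rows, field):
    """Share of rows where `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def duplicate_rate(rows, key):
    """Share of rows whose `key` value was already seen in an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        if r[key] in seen:
            dupes += 1
        seen.add(r[key])
    return dupes / len(rows)

def consistency(rows, field, allowed):
    """Share of non-missing values that belong to the allowed set."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(v in allowed for v in values) / len(values)

def freshness(rows, field, today, max_age_days):
    """Share of rows updated within the last `max_age_days` days."""
    return sum((today - r[field]).days <= max_age_days for r in rows) / len(rows)

# The reference date is fixed so the result is reproducible.
report = {
    "email_completeness": completeness(records, "email"),
    "id_duplicate_rate": duplicate_rate(records, "id"),
    "country_consistency": consistency(records, "country", {"PT", "ES", "US"}),
    "freshness_1y": freshness(records, "updated", date(2020, 7, 1), 365),
}
print(report)
```

None of these numbers says whether the data suits a particular use case, but together they hint at how much cleaning lies ahead.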

And what is the connection between data quality and Machine Learning?

By its nature, a machine learning model is very sensitive to the quality of its data; you’ve probably already heard the expression “Garbage in, garbage out”. Because of the huge volume of data required, even the smallest errors in the training data can lead to large-scale errors in the output. I highly recommend reading the article “High-quality datasets are essential for developing machine learning models.”
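A toy illustration of “Garbage in, garbage out”, under loud assumptions: the data is synthetic, the “model” is a plain least-squares line rather than anything resembling a production ML system, and the corrupted value is a made-up typo. It only shows how a single bad record can distort what a model learns.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = list(range(1, 11))
clean_ys = [2 * x for x in xs]  # synthetic ground truth: y = 2x

dirty_ys = clean_ys.copy()
dirty_ys[-1] = 200  # hypothetical data-entry typo: 20 recorded as 200

clean_slope, clean_intercept = fit_line(xs, clean_ys)
dirty_slope, dirty_intercept = fit_line(xs, dirty_ys)

print(f"clean fit: slope={clean_slope:.2f}, intercept={clean_intercept:.2f}")
print(f"dirty fit: slope={dirty_slope:.2f}, intercept={dirty_intercept:.2f}")
```

With one corrupted value out of ten, the fitted slope moves from 2 to roughly 11.8, so every prediction made with the “dirty” model inherits the error.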

Data quality is a must for anyone looking to start investing in Artificial Intelligence based solutions. Do you already have a strategy to tackle your data quality issues, or do you still think they don’t exist?

💰 How much are you willing to spend?

For starters, from a productivity perspective the situation appears bleak. Did you know that your Data Scientists spend 80% of their time finding, cleaning, and trying to organize data, leaving only 20% of their time for developing and analyzing ML solutions? That’s a lot of hours spent by highly paid professionals on work that could be partially automated. Let me put a price tag on it: the average salary of a Data Scientist in the US is around $120k, and you can do little to nothing with just one person (I’ll leave that for another discussion!). Don’t forget that Data Science jobs are highly qualified, and data preprocessing, besides being tedious, can lead to frustration and a lot of churn in your data teams.
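The price tag above can be turned into back-of-the-envelope arithmetic. In this sketch the salary and the 80% share come from the figures quoted in the text, while the team size and the fraction of data preparation that could be automated are hypothetical values picked only to show the calculation. It also counts salary alone and ignores overhead, so treat it as a rough lower bound.

```python
# Figures quoted in the article.
AVG_SALARY_USD = 120_000  # average US Data Scientist salary
DATA_PREP_SHARE = 0.80    # share of time spent finding, cleaning, organizing data

# Hypothetical inputs, chosen only for illustration.
team_size = 5
automation_share = 0.25   # assumed fraction of data prep that tooling could automate

def data_prep_cost(salary, prep_share, headcount):
    """Yearly salary cost attributable to data preparation."""
    return salary * prep_share * headcount

prep_cost = data_prep_cost(AVG_SALARY_USD, DATA_PREP_SHARE, team_size)
potential_saving = prep_cost * automation_share

print(f"Data prep cost per year: ${prep_cost:,.0f}")
print(f"Potential saving at {automation_share:.0%} automation: ${potential_saving:,.0f}")
```

Per person, that is about $96k of a $120k salary going to data preparation each year.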

On the other hand, using poor-quality data can also cause a lot of direct financial damage.

First, storing and keeping bad data is both time-consuming and expensive.
Second, according to Gartner, “the average financial impact of poor data quality on the organization is estimated to be $9.7 million per year.” IBM also found that, in the US alone, businesses lose $3.1 trillion annually due to poor data quality. Bad data, and the poor results that come from using it, can cost you the confidence of end-users and customers. In other words, customer churn caused by bad data is a reality.
And last but not least, and this one might be shocking: data inaccuracy and poor quality are inhibiting AI projects. AI projects are often kicked off with no idea of whether there’s enough data, or whether the data that exists suits the use case. Many assumptions are made without even looking at the data, which leads to massive investment in projects that are doomed from the beginning. Another fact: the majority of companies fail to integrate external information, either because it’s not accessible (due to privacy) or because doing so is very time-consuming, and this third-party data can tell you a lot more than you imagine about your own business.

Conclusion

Data quality is a pre-condition for AI, and not the other way around! In other words, if the quality of your data is bad, analytics and AI initiatives are not worth pursuing.

Poor data quality can cause analytics and AI projects to take longer than expected (around 40% longer), which means they will cost more, or even eventually fail to achieve the desired results (70% of AI projects). With more than 70% of organizations relying on data to drive their future business decisions, data problems will drain not only resources (financial and human) but also the ability to extract valuable new business insights. So if you are looking to invest in AI, first invest in defining, developing, and implementing the right tools for an excellent data quality strategy.

Fabiana Clemente is CDO at YData.

Improved data for AI

YData provides a data-centric development platform for Data Scientists to work with high-quality and synthetic data.
