Introduction
Data science, machine learning, artificial intelligence: these terms are all over the news. They get everyone excited with promises of automation, new savings or higher earnings, new features, markets or techniques. Some of those promises are well-founded, while others are still in their infancy or haven’t passed the proof-of-concept stage (another way to say they’re just at the wet-dream stage).
There have been major improvements in the techniques we use to extract, transform and load data. New and refined techniques such as PCA and hyperparameter optimization, and architectures such as neural networks, have improved outcomes. But there’s one aspect that doesn’t get enough attention: the ugly duckling. If you’re accustomed to working with data, you might have already guessed it. If not, you’ll find out next. Let’s dive in.
The Unloved One In Data Science
At the heart of everything in business and research, aside from money, is data. Data is the new oil or the new electricity, depending on who you ask. Because computers make it easy to collect, share and analyse, it is now a strategic asset.
But one aspect of data isn’t discussed enough: its quality. Quantity, whether Big Data or small data, doesn’t matter if the quality of the data is poor.
Garbage in, garbage out
No matter how good your data pipeline, your cleaning and your training/testing setup are, no matter your hypothesis or the complexity of your algorithm, nothing valuable will result from your work if your data is of poor quality. That’s the famous “garbage in, garbage out”. You can’t bake a good cake with spoiled ingredients.
This flow provides another way to look at data quality:
Data Quality → Information Quality → Decision Quality → Business Outcome
What We Talk About When We Talk About Poor Data
To find bad data, you need to know what to look for. The industry mostly defines six to seven dimensions to quantify data quality, although the number can vary depending on needs, industry and focus.
A handy acronym is ACCCUT. Let’s review it:
Accuracy. Every data point should have correct values. Example: names properly spelled, and events recorded as they really happened.
Completeness. Data records should contain all required information. Optional elements are… optional. Example: name, surname and email are required, but the physical address is optional.
Consistency. Any given data point should be the same across the organization and all its systems. Example: records from the client relationship management tool should match those in the financial tool.
Conformity. Data records should follow standards (format, size, type, …). Example: rules for name and date formats.
Uniqueness. No duplicates. This is key to preventing confusion and the use of outdated records. When facing two records for the same entity, which one should you select? Neither? This can be quite a dilemma and a waste of time.
Timeliness. Availability could also be used as a synonym. The data needs to be usable and available when the user needs it. Example: for online booking websites, users expect the data (e.g. free or busy rooms) to be updated in real time, so as to avoid confusion, frustration and an overall poor user experience.
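Some of these dimensions lend themselves to simple automated checks. The snippet below is a minimal sketch, not a reference implementation: the records, the field names, the email pattern and the assumed ISO date standard are all hypothetical, and it only covers completeness, conformity and uniqueness. Accuracy and consistency usually need an external source of truth to compare against.

```python
import re
from collections import Counter

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "name": "Ada", "surname": "Lovelace", "email": "ada@example.com", "signup": "2021-03-01"},
    {"id": 2, "name": "Alan", "surname": "", "email": "alan@example.com", "signup": "2021-03-02"},
    {"id": 3, "name": "Grace", "surname": "Hopper", "email": "grace@example", "signup": "03/04/2021"},
    {"id": 4, "name": "Ada", "surname": "Lovelace", "email": "ADA@example.com", "signup": "2021-03-01"},
]

REQUIRED = ("name", "surname", "email")                # completeness rule (assumed)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # deliberately simple conformity rule
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")           # assumed house standard: ISO dates


def incomplete(rec):
    """Completeness: list the required fields that are empty or absent."""
    return [f for f in REQUIRED if not rec.get(f)]


def nonconforming(rec):
    """Conformity: list the fields whose value breaks the format rules."""
    issues = []
    if rec.get("email") and not EMAIL_RE.match(rec["email"]):
        issues.append("email")
    if rec.get("signup") and not DATE_RE.match(rec["signup"]):
        issues.append("signup")
    return issues


def duplicate_ids(recs):
    """Uniqueness: ids of records sharing the same case-insensitive identity."""
    def key(r):
        return (r["name"].lower(), r["surname"].lower(), r["email"].lower())
    counts = Counter(key(r) for r in recs)
    return [r["id"] for r in recs if counts[key(r)] > 1]


report = {
    "incomplete": {r["id"]: incomplete(r) for r in records if incomplete(r)},
    "nonconforming": {r["id"]: nonconforming(r) for r in records if nonconforming(r)},
    "duplicates": duplicate_ids(records),
}
print(report)
```

Which rules are worth encoding depends on the dataset; the point is only that each dimension can be turned into an explicit, testable rule.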
How Did This Happen?
Bad data is one thing, but finding the causes is another. Culprits can be divided into two main categories:
1. Systems
Because we use multiple systems and software tools to track and update our records, there is room for improper integration. Over time, this can lead to incomplete records, duplicates and a lack of consistency. Migrating data between platforms can also cause it to deteriorate: think degraded or lost records.
While platforms might be properly integrated at inception, time and updates might lead them to drift apart and cause quality concerns. Humans might play an unfortunate role in this.
2. Humans
Typos when completing an input form (easy fix: use a drop-down list where possible), failure to follow guidelines, or a new entry created for an existing record. Misreported data can also lead to poor quality. This is often caused by a lack of understanding of what is to be reported, so state clearly what is expected, ideally with examples. These errors can come from your co-workers as well as your customers.
3. Bonus: Data degrading over time
The data might be of good quality at first, but if some changes occur in the background, it might be rendered useless. Think of a customer who changes a phone number, email or address without informing you. It can also be a change in the methodology used to compute a given metric.
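Drift between systems, as described under Systems above, can often be caught by reconciling exports from both sides. Here is a rough sketch under stated assumptions: the two dictionaries stand in for exports from a CRM and a financial tool, and the customer ids and field names are made up for illustration.

```python
# Hypothetical exports from two systems, keyed by customer id.
crm = {
    "C001": {"email": "ada@example.com", "phone": "555-0100"},
    "C002": {"email": "alan@example.com", "phone": "555-0101"},
    "C003": {"email": "grace@example.com", "phone": "555-0102"},
}
finance = {
    "C001": {"email": "ada@example.com", "phone": "555-0100"},
    "C002": {"email": "alan@old-domain.example", "phone": "555-0101"},
    "C004": {"email": "linus@example.com", "phone": "555-0103"},
}


def reconcile(a, b, fields):
    """Compare two keyed exports.

    Returns (keys only in a, keys only in b, {shared key: fields that differ}).
    """
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    mismatches = {}
    for key in sorted(a.keys() & b.keys()):
        diff = [f for f in fields if a[key].get(f) != b[key].get(f)]
        if diff:
            mismatches[key] = diff
    return only_a, only_b, mismatches


only_crm, only_finance, mismatches = reconcile(crm, finance, ["email", "phone"])
print("only in CRM:", only_crm)
print("only in finance:", only_finance)
print("field mismatches:", mismatches)
```

Run on a schedule, a comparison like this flags records that exist on one side only and fields that have diverged, before the gap grows.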
What Should I Do, Doc?
Finding out, aka the unpleasant discovery: first things first, find out about the poor quality. It seems trivial, but if you aren’t aware of the state of your data, you are already wasting valuable time and resources. You can be informed by your data provider (e.g. another department in your company, or your client), have initial suspicions based on hearsay, or discover it while doing your homework: by quickly eyeballing the data and/or through your exploratory data analysis (EDA).
Define & Report: define the extent of the damage, try to quantify it meaningfully by category, and summarize your findings so they can be reported. Defining also means finding the sources of poor data quality. More on this in the next part. When putting your report or deck together, don’t forget to present the ‘not-so-bad’ parts as well, to keep spirits high.
Inform: inform your stakeholders, but make sure your team has been informed as well, to prevent information asymmetries internally and keep future interactions with stakeholders from getting awkward.
Get Feedback: based on your report on the data quality gaps, you’ll hopefully get some concrete feedback. You might, for instance, get a green light to move on with what you currently have. Surprising? Maybe a bit. If the gaps affect 1–5% of the data, or chunks that do not matter that much, the affected records could simply be tagged for removal from the dataset. On the other hand, if it is recognized that the state of the data is not sufficient, there’s some work ahead.
Return To Data: now that a decision has been taken, you can return to the data matrix. If the decision was to go ahead, then that’s the end of this article. Alternatively, if there’s a need to ‘fix things’, that’s where the next part kicks in.
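For the Define & Report step, quantifying the damage can start with something as simple as the share of missing values per column and the share of rows with at least one gap. The following is only a sketch: the rows are invented, and `None` is assumed to be the marker for a missing value.

```python
# Hypothetical dataset; None marks a missing value.
rows = [
    {"name": "Ada", "email": "ada@example.com", "city": "London"},
    {"name": "Alan", "email": None, "city": "Wilmslow"},
    {"name": "Grace", "email": "grace@example.com", "city": None},
    {"name": "Linus", "email": "linus@example.com", "city": "Helsinki"},
    {"name": None, "email": None, "city": "Paris"},
]


def missing_share_by_column(rows):
    """Fraction of rows with a missing value, per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


def affected_row_share(rows):
    """Fraction of rows with at least one missing value."""
    return sum(any(v is None for v in r.values()) for r in rows) / len(rows)


by_column = missing_share_by_column(rows)
overall = affected_row_share(rows)

# Worst columns first, ready to paste into a report or deck.
for column, share in sorted(by_column.items(), key=lambda kv: -kv[1]):
    print(f"{column:<6} {share:6.1%} missing")
print(f"rows with at least one gap: {overall:.1%}")
```

Figures like these make the feedback conversation concrete: a gap affecting a few percent of rows reads very differently from one affecting most of them.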
Solving It
This is an imaginary dialogue, but one you might encounter in one form or another.
“Can it be solved?”
“…”
“Can it be?”
“Well, if you insist, let’s go for it”.
The question is ‘how to go for it, how to solve it?’, and there are multiple answers. The focus will be on the following three.
1. The first solution is a bit brute-force and probably not always suitable for business environments. It consists of two main pillars:
Drop incomplete records
Either drop the rows with data quality concerns in a limited set of columns, or drop every row with even a single issue or missing value in any field.
Complete missing fields
Think statistical imputation. There exist multiple methods, each with its pros and cons. You can fill or fix values using the most frequent value, the mean or median, zero or some other constant. Alternatively, you can (programmatically) complete a record’s missing fields with the help of similar records: that is kNN imputation, which uses the nearest neighbours of the incomplete record.
2. Leverage data quality tools
Depending on your needs, there exist multiple solutions. Problems such as duplicates or inconsistent formats, languages or units can easily be flagged by software, then corrected.
3. Correct faulty records
Faulty records do not have to stay that way. If data quality tools cannot help, humans can always step in. It’s neither fancy nor pleasant, but human input can help correct what human (or system) input messed up in the first place. This should be approached on a case-by-case basis, with a cost-benefit analysis to ensure it makes sense.
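The two pillars of the first solution, dropping and completing, can be sketched in a few lines. This is an illustration under assumptions rather than a recommended implementation: the rows and column names are invented, `None` marks a missing value, and the kNN variant measures distance on a single numeric feature to stay short, whereas real kNN imputation normally uses several scaled features.

```python
from statistics import median, mode

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 30000, "segment": "A"},
    {"age": 32, "income": None, "segment": "B"},
    {"age": 47, "income": 52000, "segment": "B"},
    {"age": None, "income": 41000, "segment": None},
    {"age": 51, "income": 58000, "segment": "B"},
]


def drop_incomplete(rows, columns=None):
    """Keep rows with no missing value in `columns` (all columns if None)."""
    return [r for r in rows
            if all(r[c] is not None for c in (columns or r.keys()))]


def impute_simple(rows, column, strategy):
    """Fill gaps with the median ("median") or the most frequent value (anything else)."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def impute_knn(rows, target, feature, k=2):
    """Fill `target` with the mean of the k donor rows closest on `feature`."""
    donors = [r for r in rows if r[target] is not None and r[feature] is not None]
    out = []
    for r in rows:
        if r[target] is None and r[feature] is not None:
            nearest = sorted(donors, key=lambda d: abs(d[feature] - r[feature]))[:k]
            r = {**r, target: sum(d[target] for d in nearest) / len(nearest)}
        out.append(r)
    return out


print(len(drop_incomplete(rows)), "complete rows out of", len(rows))
print(impute_simple(rows, "age", "median")[3])
print(impute_simple(rows, "segment", "most_frequent")[3])
print(impute_knn(rows, "income", "age", k=2)[1])
```

Dropping is the cheapest option but shrinks the dataset; imputation keeps every row at the price of introducing estimated values, which is worth flagging in any report built on them.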
Sustainability
You’ve attained a decent level of data quality. This was quite a ride. What comes next?
By this point, you might be feeling that, despite all the fun, dealing with such quality issues is something you’d gladly put aside for some time. To ensure this stays a distant memory, a few safeguards can be implemented with the help of data governance best practices:
Track your data quality over time
You have a clear idea of what is critical. Monitor quality with the help of KPIs. It should be improving over time; if it is not, something is awry and should be dealt with: search for the root causes. What are examples of new poor records? Where do they come from, and how were they generated? There might have been rule changes upstream, or a reintroduction of human input that led to incorrect values.
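One way to picture such a KPI is the share of records passing all quality rules, computed per period and compared with the previous one. The sketch below assumes that definition and uses invented monthly figures; the tolerance parameter is likewise an illustrative choice.

```python
# Hypothetical monthly snapshots: (period, total records, records failing any quality rule).
snapshots = [
    ("2024-01", 1000, 120),
    ("2024-02", 1100, 99),
    ("2024-03", 1200, 96),
    ("2024-04", 1250, 125),
]


def quality_kpi(total, failing):
    """Share of records passing all quality rules."""
    return 1 - failing / total


def regressions(snapshots, tolerance=0.0):
    """Periods whose KPI fell versus the previous period by more than `tolerance`."""
    flagged = []
    previous = None
    for period, total, failing in snapshots:
        kpi = quality_kpi(total, failing)
        if previous is not None and kpi < previous - tolerance:
            flagged.append(period)
        previous = kpi
    return flagged


for period, total, failing in snapshots:
    print(period, f"{quality_kpi(total, failing):.1%}")
print("periods to investigate:", regressions(snapshots))
```

A flagged period is the cue to look at the new poor records and trace where they came from.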
Monitor newly added data sources or new fields
New data sources should be inspected to ensure that data quality rules exist for them. These rules should also match the standards set previously.
New fields should limit the range of errors on the human side, for example by offering a restricted set of options (such as drop-down lists) and only accepting a record if it is complete.
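On the input side, the two safeguards just mentioned (restricted options and refusing incomplete records) could look roughly like this. The field names, the allowed country codes and the in-memory `store` list are assumptions made for the example.

```python
ALLOWED_COUNTRIES = {"FR", "DE", "US"}           # hypothetical drop-down options
REQUIRED_FIELDS = ("name", "surname", "email")   # assumed required fields


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry is acceptable."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not str(entry.get(field) or "").strip():
            errors.append(f"missing {field}")
    country = entry.get("country")
    if country is not None and country not in ALLOWED_COUNTRIES:
        errors.append(f"country not in allowed options: {country!r}")
    return errors


def accept(entry, store):
    """Append the entry to `store` only if it passes validation."""
    errors = validate_entry(entry)
    if not errors:
        store.append(entry)
    return errors


store = []
ok = accept({"name": "Ada", "surname": "Lovelace",
             "email": "ada@example.com", "country": "FR"}, store)
rejected = accept({"name": "Alan", "surname": " ",
                   "email": "alan@example.com", "country": "UK"}, store)
print(ok, rejected, len(store))
```

Rejecting at entry time is cheaper than cleaning later, since the person who knows the right value is still in front of the form.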
Audit your systems & teams
If the systems are faulty, or the teams misunderstand certain dimensions or characteristics of the data, the poor data quality will be perpetuated.
Create a data quality team
Because everyone’s too busy, having a team dedicated to data quality ensures this key asset is well maintained. The team would set data governance principles, and focus on the aforementioned points.
The Case Against Disregarding Poor Data Quality
If all of the above seems too much of a burden on both resources and time, here is a list of consequences of poor data quality:
Mistrust
If there is evidence that a chunk of data cannot be trusted, then any record from that dataset or tool will be looked at very sceptically. Ultimately, this mistrust could spread to other datasets or systems throughout the organization, or cause your customers to doubt everything else.
Reputation
Errors happen, but if they are blatant and left unaddressed, they do not show your team or organization in a good light. Your reputation will suffer.
Productivity
Your team, your customers and you will waste time and resources on poor data. They might have to cross-check with other sources or call other departments for confirmation, which can have a domino effect on many others.
Decision-Making
Data can work as eyes. With poor eyesight, it’s difficult to find your way around. An organization with poor data is navigating blindly, or with a handicap that hinders its decision-making and strategy. To leave data at the door is to trust gut feelings, personal bias or personal agendas. At your own risk.
Conclusion
Data is a key asset, as long as it is of decent quality. Most organizations will have to deal with data quality issues, but these do not have to handicap the business forever. Frequent monitoring using KPIs, limited use of free-text input fields filled in by humans, and regular testing of system integrations can prevent unpleasant surprises. If the problem exists, it will not go away on its own, especially not if it is swept under the carpet. Prevention is better than cure, so that a minor discomfort does not turn into a massive migraine.