Data Quality in the Era of Artificial Intelligence

Data quality is of critical importance, especially in the era of Artificial Intelligence and automated decisions. Do you have a strategy?



Data-intensive projects have a single point of failure: data quality

As the director of 'datamine decision support systems', I’ve delivered more than 80 data-intensive projects across several industries and high-profile corporations. These include data warehousing, data integration, business intelligence, content performance, and predictive models. In most cases, data quality proved to be a critical factor for the success of the project.

The obvious challenge in every single case was to effectively query heterogeneous data sources, then extract and transform data towards one or more data models.

The non-obvious challenge was the early identification of data issues, which — in most cases — were unknown to the data owners as well.

We strategically started every project with a data-quality assessment phase — which in many cases led to project scope modifications and even additional data cleansing initiatives and projects.

Data Quality Defined

There are many aspects to data quality, including consistency, integrity, accuracy, and completeness. According to Wikipedia, data is generally considered high quality if it is “fit for [its] intended uses in operations, decision making and planning and data is deemed of high quality if it correctly represents the real-world construct to which it refers.”

I define data quality as the level of compliance of a data set with contextual normality.

This normality is set by user-defined rules and/or statistically derived ones. It is contextual in the sense that the rules reflect the logic of particular business processes, corporate knowledge, environmental, social, or other conditions. For example, a property of the same entity could have different validation rules in different companies, markets, languages, or currencies.
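
To make the two kinds of rules concrete, here is a minimal sketch in Python; the age property, the thresholds, and the reference sample are all invented for illustration:

```python
import statistics

# User-defined rule: business logic says a customer's age must fall in [18, 100].
def age_is_valid(age: float) -> bool:
    return 18 <= age <= 100

# Statistically derived rule: accept values within k standard deviations
# of the mean of a trusted reference sample.
def derived_bounds(sample: list[float], k: float = 3.0) -> tuple[float, float]:
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)
    return mu - k * sigma, mu + k * sigma

reference_ages = [23, 31, 45, 38, 52, 29, 61, 47, 35, 41]  # invented sample
low, high = derived_bounds(reference_ages)

for value in (25, 17, 250):
    print(value, "business-rule ok:", age_is_valid(value),
          "| statistical ok:", low <= value <= high)
```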

Modern systems need to become aware of the quality of their data I/O. They must instantly identify potential issues and avoid exposing dirty, inaccurate, or incomplete data to connected production components/clients.

This implies that, even if there is a sudden problematic situation resulting in poor-quality data entries, the system will be able to handle the quality issue and proactively notify the right users. Depending on how critical the issues are, it might also deny serving data to its clients — or serve data while raising an alert/flagging the potential issues.

The Importance of Data Quality

Data quality is of critical importance, especially in the era of automated decisions, AI, and continuous process optimization. Corporations need to be data-driven, and data quality is a critical precondition for achieving this.

Confusion, limited trust, poor decisions

In most cases, data quality issues explain corporate users’ limited trust in the data, wasted resources, or even poor decisions.

Consider a team of analysts trying to figure out if an outlier is a critical business discovery or an unknown/ poorly handled data issue. Even worse, consider real-time decisions being made by a system not able to identify and handle poor data which accidentally — or even intentionally — had been fed into the process.

Failures due to low data quality

I’ve seen great Business Intelligence, data warehousing, and similar initiatives fail due to low engagement by key users and stakeholders. In most cases, limited engagement was the direct result of a lack of trust in the data. Users need to trust the data — if they don’t, they will gradually abandon the system, impacting its major KPIs and success criteria.

Whenever you think you’ve done some major data discovery, cross-check for quality issues first!

Types and symptoms

Data quality issues can take many forms, for example (a short detection sketch follows this list):

  • particular properties in a specific object have invalid or missing values
  • a value coming in an unexpected or corrupted format
  • duplicate instances
  • inconsistent references or units of measure
  • incomplete cases
  • broken URLs
  • corrupted binary data
  • missing packages of data
  • gaps in the feeds
  • incorrectly mapped properties
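
Several of these symptoms can be detected mechanically. The following is a small sketch using pandas; the column names, formats, and regular expressions are assumptions for the example, and real checks would come from the validation rules discussed later:

```python
import pandas as pd

# Invented sample batch; column names and expected formats are hypothetical.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", "not-an-email", None, "d@example.com"],
    "image_url": ["https://x.io/a.png", "htp:/broken", None, "https://x.io/d.png"],
})

issues = {
    # invalid or missing values in a particular property
    "missing_email": int(df["email"].isna().sum()),
    # duplicate instances (same customer_id)
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
    # non-missing values in an unexpected format
    "bad_email_format": int((~df["email"].dropna()
                             .str.match(r"[^@\s]+@[^@\s]+\.\w+")).sum()),
    # syntactically broken URLs (a real check might also probe them)
    "broken_urls": int((~df["image_url"].dropna()
                        .str.match(r"https?://\S+")).sum()),
}
print(issues)
```
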
The root cause

Data quality issues are typically the result of:

  • poor software implementations: bugs or improper handling of particular cases
  • system-level issues: failures in certain processes
  • changes in data formats, impacting the source and/or target data stores

Modern systems should be designed assuming that at some point there will be problematic data feeds and unexpected quality issues.

The validity of the data properties can be evaluated against [a] known, predefined rules and [b] dynamically derived rules and patterns based on statistical processing.

A Strategy for Data Quality

A modern data-intensive project typically involves data streams, complex ETL processes, post-processing logic, and a range of analytical or cognitive components.

The key deliverable in such scenarios is a high-performance data processing pipeline, feeding and maintaining at least one data store. This defines a “data environment,” which then empowers advanced analytical models, real-time decision making, knowledge extraction and possibly AI applications. The following describes a strategy for ensuring data quality throughout this process.

Identify, understand, and document the data sources

You need to identify your data sources and, for each one, briefly document the following (a minimal metadata-record sketch follows this list):

1. Type of data contained — for example, customer records, web traffic, user documents, activity from a connected device (in an IoT context).

2. Type of storage — for instance, is it a flat file, a relational database, a document store or a stream of events?

3. Time frames — how long do we have data for?

4. Frequency and types of updates — are you getting deltas, events, updates, aggregated data? All these can significantly impact the design of the pipeline and the ability to identify and handle data quality issues.

5. The source of data and involved systems — is data coming from another system? Is it a continuous feed of events or a batch process pulled from another integrated system? Is there manual data entry/validation involved?

6. Known data issues and limitations can help speed up the initial data examination phase — if provided upfront.

7. The data models involved in the particular data source — for example, an ER model representing customers, a flat file structure, an object, a star schema.

8. Stakeholders involved — this is very important in order to interpret issues and edge cases and also to validate the overall state of the data, with those having the deepest understanding of the data, the business and the related processes.
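
One lightweight way to keep this documentation consistent across sources is a small structured record per source. The sketch below is an assumed shape, not a standard; every field name is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class DataSourceProfile:
    """Minimal per-source documentation record; field names are illustrative."""
    name: str                      # e.g. "crm_customers"
    data_type: str                 # customer records, web traffic, IoT events, ...
    storage: str                   # flat file, relational DB, document store, stream
    time_frame: str                # how far back the data goes
    update_mode: str               # deltas, events, full updates, aggregates
    origin: str                    # upstream system, batch pull, manual entry, ...
    known_issues: list[str] = field(default_factory=list)
    data_model: str = ""           # ER model, flat structure, star schema, ...
    stakeholders: list[str] = field(default_factory=list)

crm = DataSourceProfile(
    name="crm_customers",
    data_type="customer records",
    storage="relational database",
    time_frame="since 2015",
    update_mode="nightly deltas",
    origin="CRM system, batch export",
    known_issues=["duplicate accounts before 2018"],
    data_model="ER model of customers and contracts",
    stakeholders=["CRM owner", "sales ops analyst"],
)
```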

Start with Data Profiling

Data profiling is the process of describing the data by performing basic descriptive statistical analysis and summarization. The key is to briefly document the findings thus creating a baseline — a reference point to be used for data validation throughout the process.

Data profiling depends on the type of the underlying data and the business context, but in a general scenario you should consider the following:

1. Identify the key entities, such as customer, user, product, the events involved, such as register, login, purchase, the time frame, the geography, and other key dimensions of your data.

2. Select the typical time frame to use for your analysis. This could be a day, week, month, and so forth depending on the business.

3. Analyze high-level trends involving the entities and events identified. Generate time series against the major events and the key entities. Identify trends, seasonality, peaks, and try to interpret them in the context of the particular business. Consult the owner of the data and capture/ document these “data stories.”

4. Analyze the data. For each of the properties of your key entities, perform statistical summarization to capture the shape of the data. For numerical values, you could start with the basics — min, average, max, standard deviation, quartiles — and then possibly visualize the distribution of the data. Having done so, examine the shape of the distribution and figure out if it makes sense for the business. For categorical values, you could summarize the distinct number of values by frequency and, for example, document the top x values explaining z% of the cases. (A short profiling sketch follows this list.)

5. Review a few outliers. Having the distribution of the values for a particular property — let’s say, the age of the customer — try to figure out “suspicious” values in the context of the particular business. Select a few of them and retrieve the actual instances of the entities. Then review their profile and activity — of the particular users, in this example — and try to interpret the suspicious values. Consult the owner of the data for advice on these findings.

6. Document your results. Create a compact document or report with a clear structure to act as your baseline and data reference. You should append the findings of each of the data sources to this single document — with the same structure, time references, and metadata to ensure easier interpretation.

7. Review, interpret, validate. This is the phase where you need input from the data owner to provide an overall interpretation of the data and to explain edge cases, outliers, or other unexpected data patterns. The outcome of the process could be to confirm the state of the data, explain known issues, and register new ones. This is where possible solutions to known data issues can be discussed and/or decided. Also, validation rules can be documented.
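
As a rough illustration of steps 4 and 5, the sketch below profiles one numeric and one categorical property with pandas; the data and column names are invented, and real profiling would run per entity, per property, and per time frame:

```python
import pandas as pd

# Invented sample; in practice this would be a key entity pulled from a source.
customers = pd.DataFrame({
    "age": [23, 31, 45, 38, 52, 29, 61, 47, 35, 250],   # 250 is a planted outlier
    "country": ["DE", "DE", "FR", "DE", "IT", "FR", "DE", "ES", "DE", "FR"],
})

# Numeric property: min / mean / max / std / quartiles.
print(customers["age"].describe())

# Flag suspicious values outside 1.5 * IQR, a common heuristic.
q1, q3 = customers["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = customers[(customers["age"] < q1 - 1.5 * iqr) |
                     (customers["age"] > q3 + 1.5 * iqr)]
print("suspicious rows:\n", outliers)

# Categorical property: top values and the share of cases they explain.
freq = customers["country"].value_counts(normalize=True)
print(freq.head(3), "\ntop-3 explain", round(freq.head(3).sum() * 100), "% of cases")
```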

In an ideal scenario, the data profiling process should be automated. There are several tools allowing quick data profiling by just connecting your data source and going through a quick wizard-like configuration. The output of the process in such scenarios is typically an interactive report enabling easy analysis of the data and sharing of the knowledge with the team.

Establish a Data Quality Reference Store

The purposes of the data quality reference (DQR) store are to capture and maintain metadata and validity rules about your data and to make them available to external processes.

It could be a highly sophisticated system that automatically derives rules about the validity of your data and continuously assesses the incoming (batches of) cases, with the capability to identify time-related and other patterns in your data. It could be a manually maintained set of rules allowing quick validation of the incoming data. Or it could be a hybrid setup.

In any case, the ETL process should be able to query the DQR store and load the data validation rules and patterns, along with fixing directives. Data validation rules should be dynamic instead of a fixed set of rules or hard-coded pieces of logic.
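
A minimal sketch of what such a store and its query interface might look like follows; the rule schema, the fixing directives, and the API are assumptions for illustration, not a reference design:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ValidationRule:
    """A rule plus a fixing directive, as the ETL would load it from the DQR store."""
    entity: str                       # e.g. "customer"
    prop: str                         # e.g. "age"
    check: Callable[[Any], bool]      # predicate: True means the value is valid
    on_fail: str                      # fixing directive: "flag", "drop", or "default"
    default: Any = None

class DQRStore:
    """In-memory stand-in for the reference store; a real one would be a service/DB."""
    def __init__(self) -> None:
        self._rules: list[ValidationRule] = []

    def register(self, rule: ValidationRule) -> None:
        self._rules.append(rule)

    def rules_for(self, entity: str) -> list[ValidationRule]:
        # The ETL queries the latest rules at run time instead of hard-coding them.
        return [r for r in self._rules if r.entity == entity]

store = DQRStore()
store.register(ValidationRule("customer", "age",
                              lambda v: v is not None and 18 <= v <= 100,
                              on_fail="flag"))
store.register(ValidationRule("customer", "country",
                              lambda v: v in {"DE", "FR", "IT", "ES"},
                              on_fail="default", default="UNKNOWN"))
```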

The DQR store should also be accessible via interactive reporting and standardized dashboards — to empower process owners and data analysts to understand the data, the process, trends, and issues.

Implement Smart Data Validation

Enable your data processing pipeline to load data validation rules from the DQR store described above. The DQR store could be designed as an internal ETL subsystem or as a service external to the ETL. In any case, the logic to validate data, along with the suggested action, should be dynamic to your ETL process.

The data processing pipeline should be continuously validating incoming (batches of) cases based on the latest version of the validation rules.

The system should be able to flag and possibly enrich the original incoming data with the outcome of the validation and related metadata, and feed the results back to the DQR store. The original data is stored, with proper flagging by the ETL, unless otherwise directed by the current validation policy.
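
Continuing the hypothetical DQRStore sketch above, a validation pass over a batch could enrich each record with the outcome and apply the fixing directive, keeping (and flagging) the original data rather than silently dropping it:

```python
def validate_batch(batch: list[dict], entity: str, store: DQRStore) -> list[dict]:
    """Flag and enrich records against the latest rules; directives decide actions."""
    rules = store.rules_for(entity)   # always load current rules, never hard-code
    out = []
    for record in batch:
        record = dict(record)         # keep the upstream original untouched
        record["_quality_flags"] = []
        for rule in rules:
            if rule.check(record.get(rule.prop)):
                continue
            record["_quality_flags"].append(f"{rule.prop}:invalid")
            if rule.on_fail == "default":
                record[rule.prop] = rule.default
            # "flag" keeps the value as-is; a "drop" directive could skip the record
        out.append(record)
    return out

batch = [{"age": 34, "country": "DE"}, {"age": 250, "country": "XX"}]
for rec in validate_batch(batch, "customer", store):
    print(rec)   # enriched records; flags could also be fed back to the DQR store
```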

With this approach, data quality can be measured and analyzed over time, for example by data source or by processing pipeline. Interactive reporting can help to easily explore the overall state of the ETL process and quickly identify and explore data quality concerns or specific issues.

The system could also support an overall “Index of Data Quality”. This can consider multiple aspects of quality, and assign more importance to specific entities and events. For example, an erroneous transaction record could be far more important than a broken hyperlink to an image.

The Index of Data Quality could also have specific elasticity — different by entity and event. For example, this could allow delays in receiving data for a particular entity while not for another.

Having an overall Index of Data Quality can help the corporation measure data quality over time and across key dimensions of the business. It can also help to set goals and quantify the impact of potential improvements of the ETL strategy.
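
One simple way to realize such an index is a weighted pass-rate per batch, where the weights encode how critical each entity or event is; the checks and weights below are invented for illustration:

```python
# Hypothetical per-check results for one batch: (check name, pass rate, weight).
# Weights express business importance: transaction errors outweigh broken links.
checks = [
    ("transaction_records_valid", 0.993, 10.0),
    ("customer_profiles_complete", 0.970, 5.0),
    ("image_links_resolvable", 0.90, 1.0),
]

def data_quality_index(checks: list[tuple[str, float, float]]) -> float:
    """Weighted average pass rate in [0, 1]; higher means cleaner data."""
    total_weight = sum(w for _, _, w in checks)
    return sum(rate * w for _, rate, w in checks) / total_weight

print(round(data_quality_index(checks), 4))  # tracked over time, per source/pipeline
```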

A smart Notification Layer

The overall process should be aware of any quality issues, trends, and sudden changes. Moreover, the system needs to know the importance — how critical an issue is. Based on this awareness and a smart configuration layer, the system knows when to notify whom and through which particular channel.
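
A toy version of such severity-aware routing is sketched below; the severity levels, channels, and recipients are placeholders, and a production layer would sit on top of real alerting integrations:

```python
# Hypothetical routing table: who gets notified, and through which channel,
# depends on how critical the detected issue is.
ROUTES = {
    "critical": [("pagerduty", "on-call data engineer"), ("email", "data owner")],
    "warning":  [("slack", "#data-quality")],
    "info":     [("dashboard", "quality report")],
}

def notify(issue: str, severity: str) -> None:
    for channel, recipient in ROUTES.get(severity, ROUTES["info"]):
        # Stand-in for a real integration (email/Slack/pager API call).
        print(f"[{severity}] via {channel} -> {recipient}: {issue}")

notify("Transaction feed gap detected for the last 2 hours", "critical")
notify("Country code outside reference set in 0.4% of records", "warning")
```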

Modern systems must be aware of the quality of the incoming data and capable of identifying, reporting and handling erroneous cases accordingly.



Read more on AI

Artificial Intelligence: Risks & Concerns

What’s next on AI, AR, VR, NUI, Robotics, Data & Visualization, Blockchain

Photo by Stephen Dawson on Unsplash

Translated from: https://www.freecodecamp.org/news/data-quality-era-of-ai/
