Large-scale data quality verification in .NET, pt. 1

The quality testing of large data-sets plays an essential role in the reliability of data-intensive applications. The business decisions of companies rely on machine learning models and data analysis; for this reason, data quality has gained a lot of importance. A few months ago, the awslabs/deequ library caught my attention.

The library helps to define unit tests for data, and it uses Apache Spark to support large data-sets. I started to dig into the implementation, and I’m working on porting the library into the .NET ecosystem: samueleresca/deequ.NET.

Why is data quality important?

One thing I noticed when I jumped into the machine learning world is that ordinary software engineering practices are not enough to guarantee the stability of the codebase. One of the main reasons is already well described in the Hidden Technical Debt in Machine Learning Systems paper.

In traditional software projects, the established tools and techniques for code analysis, unit testing, integration testing, and monitoring address the common pitfalls arising from code dependency debt. Although these tools and techniques are still valid in a machine learning project, they are not enough. In a machine learning project, the ecosystem of components and technologies is broader:

[Figure: the machine learning code is only a small box at the center of a much larger system made of configuration, data collection, feature extraction, data verification, serving, and monitoring components]

The machine learning code is a minimal part of the whole project. A lot of components are dedicated to the pre-processing/preparation/validation phases, such as the feature extraction part, the data collection, and the data verification. One of the main assertions made by the research paper mentioned above is that data dependencies cost more than code dependencies. Therefore, versioning the upstream data-sets and testing their quality requires a considerable effort, and it plays an essential role in the reliability of the machine learning project.

Implementation details

The Automating Large-Scale Data Quality Verification research paper that inspired the deequ library describes the common pitfalls behind data quality verification and provides a pattern for testing large-scale data-sets. It highlights three data quality dimensions: the completeness, the consistency, and the accuracy of the data.

The completeness represents the degree to which an entity has all the values needed to describe a real-world object. For example, in the case of relational storage, it is the presence or absence of null values.

The consistency refers to the semantic rules of the data; more in detail, to all the rules that relate to a data type, a numeric interval, or a categorical column. The consistency dimension also covers rules that involve multiple columns. For example, if the category value of a record is t-shirt, then the size should be in the range {S, M, L}.

On the other side, the accuracy focuses on the syntactic correctness of the data based on the definition domain. For example, a color field should not have the value XL. Deequ uses these dimensions as the main reference to understand the data quality of a data-set.

The next sections go through the main components that the original deequ library uses and show the corresponding implementations in the deequ.NET library.

Check and constraint declaration

The library uses a declarative syntax for defining the list of checks and the related constraints that are used to assert the data quality of a data-set. Every constraint is identified by a type that describes its purpose and by a set of arguments:

[Image: the declaration of a constraint, with its type and its set of arguments]

The declarative approach of the library asserts the quality of the data-set in the following way:

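As a concrete illustration, here is a minimal sketch of what such a declarative check might look like in deequ.NET. It loosely mirrors the fluent API of the original Scala library; the constraint method names and the CheckLevel type may differ slightly in the actual port, and the input file and column names are purely illustrative.

```csharp
using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().GetOrCreate();

// Hypothetical input data-set with id, category, and price columns.
DataFrame data = spark.Read().Json("reviews.json");

var verificationResult = new VerificationSuite()
    .OnData(data)
    .AddCheck(
        new Check(CheckLevel.Error, "unit testing my data")
            .HasSize(size => size >= 5)                             // at least 5 rows
            .IsComplete("id")                                       // no null values in "id"
            .IsUnique("id")                                         // no duplicated ids
            .IsContainedIn("category", new[] { "t-shirt", "shoes" })
            .IsNonNegative("price"))                                // no negative prices
    .Run();
```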

Verification output

As mentioned above, the verification output is represented by a VerificationResult type. Concretely, this is the core shape of the VerificationResult in C#:

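A simplified sketch of that shape, reconstructed from the corresponding Scala definition; the property names and generic parameters are assumptions and may differ in the actual deequ.NET code.

```csharp
using System.Collections.Generic;

public class VerificationResult
{
    // Overall outcome of the whole verification run.
    public CheckStatus Status { get; }

    // One CheckResult per executed Check.
    public Dictionary<Check, CheckResult> CheckResults { get; }

    // The metrics computed by the underlying analyzers.
    public Dictionary<IAnalyzer<IMetric>, IMetric> Metrics { get; }

    public VerificationResult(
        CheckStatus status,
        Dictionary<Check, CheckResult> checkResults,
        Dictionary<IAnalyzer<IMetric>, IMetric> metrics)
    {
        Status = status;
        CheckResults = checkResults;
        Metrics = metrics;
    }
}
```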

The code above introduces the concept of the CheckResult type. The CheckResult class describes the result derived from a check, and it has the following implementation:

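Again, a hedged sketch of the class, mirroring the Scala original; the field names are assumptions.

```csharp
using System.Collections.Generic;

public class CheckResult
{
    // The check that was executed.
    public Check Check { get; }

    // Success, Warning, or Error.
    public CheckStatus Status { get; }

    // One ConstraintResult per constraint declared in the check.
    public IEnumerable<ConstraintResult> ConstraintResults { get; }

    public CheckResult(Check check, CheckStatus status,
        IEnumerable<ConstraintResult> constraintResults)
    {
        Check = check;
        Status = status;
        ConstraintResults = constraintResults;
    }
}
```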

For each executed Check, there is an associated CheckResult that contains the Status of the check and a list of ConstraintResults bound with that check. Therefore, once the VerificationSuite has been executed, it is possible to access the actual results of the checks:

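A hedged example of how the results can be inspected; the property names (Status, ConstraintResults, Constraint, Message) follow the Scala API and the descriptions in this post, and may differ slightly in deequ.NET.

```csharp
if (verificationResult.Status != CheckStatus.Success)
{
    Console.WriteLine("Some checks failed:");

    foreach (CheckResult checkResult in verificationResult.CheckResults.Values)
    {
        foreach (ConstraintResult constraintResult in checkResult.ConstraintResults)
        {
            if (constraintResult.Status != ConstraintStatus.Success)
            {
                // Print the failing constraint and the actual reason for the failure.
                Console.WriteLine(
                    $"{constraintResult.Constraint}: {constraintResult.Message}");
            }
        }
    }
}
```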

The Status field represents the overall status of the VerificationResult. In case of failure, it is possible to iterate over every single CheckResult instance and extract the list of ConstraintResults. Furthermore, we can print out a message for every constraint that is failing and the actual reason for the failure.

At the foundation of each constraint execution, there is an analyzer that interfaces with the Apache Spark APIs. In the deequ.NET implementation, the Spark APIs are provided by the dotnet/spark library. In the following section, we will see how the analyzer classes are abstracted from the rest of the layers of the library.

Analyzers

Analyzers are the foundation of deequ. They are the implementation of the operators that compute the metrics used by the constraint instances. For each metric, the library has multiple analyzer implementations that refer to the Apache Spark operators. Therefore, all the logic for communicating with Spark is encapsulated in the analyzers layer. More in detail, the library uses the following interface to define a generic analyzer:

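A sketch of that generic analyzer interface, reconstructed from the lifecycle described right below; the generic parameters, the Option type, and the exact signatures are assumptions.

```csharp
public interface IAnalyzer<out M> where M : IMetric
{
    // Compute the internal state of the analyzer from the raw DataFrame.
    Option<IState> ComputeStateFrom(DataFrame dataFrame);

    // Turn a (possibly loaded) state into the resulting metric.
    M ComputeMetricFrom(Option<IState> state);

    // Assertions that the schema of the DataFrame must satisfy before running.
    IEnumerable<Action<StructType>> Preconditions();

    // Run the preconditions, compute the metric, and optionally load/persist the state.
    M Calculate(DataFrame dataFrame,
        IStateLoader aggregateWith = null,
        IStatePersister saveStatesWith = null);
}
```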

The interface declares a set of operations that are part of each analyzer's lifecycle:

  • ComputeStateFrom executes the computation of the state based on the DataFrame;

  • ComputeMetricFrom computes and returns the IMetric depending on the state you pass in;

  • Preconditions returns a set of assertions that must be satisfied by the schema of the DataFrame;

  • Calculate runs the Preconditions, calculates, and returns an IMetric instance with the result of the computation. In addition to that, it optionally accepts an IStateLoader and an IStatePersister instance that can be used to load/persist the state into storage.

Every analyzer implements the IAnalyzer interface to provide the core functionalities needed to run the operations in a distributed manner using the underlying Spark implementation. In addition to IAnalyzer, the library also defines three additional interfaces: IGroupingAnalyzer, IScanShareableAnalyzer, and IFilterableAnalyzer.

The IScanShareableAnalyzer interface identifies an analyzer that runs a set of aggregation functions over the data and that shares scans over the data. The IScanShareableAnalyzer enriches the analyzer with the AggregationFunctions method, used to retrieve the list of the aggregation functions, and the FromAggregationResult method, used to return the state calculated from the execution of the aggregation functions.

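A minimal sketch of that contract; the generic parameters and the Row/offset signature are assumptions based on the Scala original.

```csharp
public interface IScanShareableAnalyzer<S, out M> : IAnalyzer<M>
    where S : IState
    where M : IMetric
{
    // The Spark aggregation functions this analyzer contributes to a shared scan.
    IEnumerable<Column> AggregationFunctions();

    // Rebuild the analyzer state from the row produced by the shared aggregation.
    Option<S> FromAggregationResult(Row result, int offset);
}
```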

The IGroupingAnalyzer interface identifies the analyzers that group the data by a specific set of columns. It adds the GroupingColumns method to the analyzer to retrieve the list of grouping columns.

The IFilterableAnalyzer describes the analyzer that implements a filter condition on the fields, and it enriches each implementation with the FilterCondition method.

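Minimal sketches of these last two contracts; the method names come from the descriptions above, everything else is an assumption.

```csharp
public interface IGroupingAnalyzer<out M> : IAnalyzer<M> where M : IMetric
{
    // The columns this analyzer groups the data by.
    IEnumerable<string> GroupingColumns();
}

public interface IFilterableAnalyzer
{
    // An optional filter condition applied to the rows before the analysis runs.
    Option<string> FilterCondition();
}
```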

Let’s continue with an example of the implementation of the MaxLength analyzer. As the name suggests, the purpose of this analyzer is to verify the max length of a column in the data-set:

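A hedged sketch of the MaxLength analyzer, reconstructed from the description that follows; the base class, the Preconditions helpers, and the exact signatures are assumptions, while Functions.Max, Functions.Length, and Functions.Col come from the dotnet/spark API.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

public class MaxLength : ScanShareableAnalyzer<MaxState, DoubleMetric>, IFilterableAnalyzer
{
    public string Column { get; }
    public Option<string> Where { get; }

    public MaxLength(string column, Option<string> where)
    {
        Column = column;
        Where = where;
    }

    // The optional Where condition is exposed through FilterCondition.
    public Option<string> FilterCondition() => Where;

    // max(length(column)) computed through the Spark SQL functions.
    public override IEnumerable<Column> AggregationFunctions() =>
        new[] { Functions.Max(Functions.Length(Functions.Col(Column))) };

    // The column must exist in the data-set and must be of type string.
    public override IEnumerable<Action<StructType>> AdditionalPreconditions() =>
        new[] { Preconditions.HasColumn(Column), Preconditions.IsString(Column) };
}
```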

The class defines two properties: the string Column and the Option&lt;string&gt; Where condition of the analyzer. The Where condition is returned as the value of the FilterCondition method. The AggregationFunctions method calculates the Length of the field specified by the Column attribute, and it applies the Max function to the length of the specified Column. The Spark API exposes both the Length and the Max functions used in the AggregationFunctions method. Also, the class implements the AdditionalPreconditions method, which checks if the Column property of the class is present in the data-set and if the field is of type string. Finally, the analyzer instance will then be executed by the ComputeStateFrom method implemented in the ScanShareableAnalyzer parent class:

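A sketch of how the parent class might run the shared aggregation; the exact deequ.NET implementation may differ, but DataFrame.Agg and Collect are part of the dotnet/spark API.

```csharp
// Inside the ScanShareableAnalyzer parent class (sketch).
public override Option<S> ComputeStateFrom(DataFrame dataFrame)
{
    Column[] aggregations = AggregationFunctions().ToArray();

    // Run a single aggregation pass over the data-set...
    Row result = dataFrame
        .Agg(aggregations.First(), aggregations.Skip(1).ToArray())
        .Collect()
        .First();

    // ...and turn the resulting row into the analyzer state.
    return FromAggregationResult(result, 0);
}
```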

The IState resulting from the execution of the above method is then eventually combined with the previous states persisted in memory and converted into a resulting IMetric instance by the Analyzer.CalculateMetric method implementation.

Incremental computation of metrics

In a real-world scenario, ETLs usually import batches of data, and the data-sets continuously grow in size with new data. Therefore, it is essential to support situations where the resulting metrics of the analyzers can be stored and calculated using an incremental approach. The research paper that inspired deequ describes the incremental computation of the metrics in the following way:

[Figure: batch computation recomputes the metric over the whole data-set every time it grows by ΔD, while incremental computation combines the new batch ΔD with the persisted state S of the previous computation]

On the left, you have the batch computation that is repeated every time the input data-set grows (ΔD). This approach needs access to the previous data-sets, and it results in more computational effort. On the right side, the data-set growth (ΔD) is combined with the state (S) of the previous computation. Therefore, the system only needs to process the new batch of data, rather than the whole data-set, every time the metric is recomputed.

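As a toy illustration (not deequ code) of why this works: a metric such as completeness can be derived from a small, mergeable state, so the state computed on ΔD can simply be merged with the persisted state of D.

```csharp
// A hypothetical, self-contained example of a mergeable analyzer state.
public readonly record struct NumMatchedAndCount(long NumMatched, long Count)
{
    // Combine the state of the previous data-set D with the state of the new batch ΔD.
    public NumMatchedAndCount Merge(NumMatchedAndCount other) =>
        new(NumMatched + other.NumMatched, Count + other.Count);

    // The metric is derived from the merged state, without re-scanning D.
    public double Completeness() => Count == 0 ? 1.0 : (double)NumMatched / Count;
}
```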

The incremental computation method we described is achievable using the APIs exposed by deequ.

The following example demonstrates how to implement the incremental computation using the following sample:

and the snippet of code defined below:

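The snippet is reproduced here as a hedged sketch based on the description that follows; LoadData, the column names, and the exact AnalysisRunner signatures are assumptions, while the overall flow mirrors the aggregated-states API of the original Scala library.

```csharp
// Define a check with the three constraint methods used in this example;
// the column names are hypothetical.
var check = new Check(CheckLevel.Error, "incremental data checks")
    .IsComplete("productName")
    .ContainsURL("productUrl")
    .IsContainedIn("countryCode", new[] { "DE", "US", "CN" });

// LoadData (hypothetical helper) loads the sample into three data-sets,
// partitioned by countryCode.
(DataFrame dataDE, DataFrame dataUS, DataFrame dataCN) = LoadData(spark);

// The analyzers required by the check are grouped into an Analysis instance.
var analysis = new Analysis(check.RequiredAnalyzers());

// One in-memory state per partition.
var dataSetDE = new InMemoryStateProvider();
var dataSetUS = new InMemoryStateProvider();
var dataSetCN = new InMemoryStateProvider();

AnalysisRunner.Run(dataDE, analysis, saveStatesWith: dataSetDE);
AnalysisRunner.Run(dataUS, analysis, saveStatesWith: dataSetUS);
AnalysisRunner.Run(dataCN, analysis, saveStatesWith: dataSetCN);

// Merge the three in-memory states into a single table of metrics,
// without re-computing anything over the data sample.
AnalyzerContext metrics = AnalysisRunner.RunOnAggregatedStates(
    dataDE.Schema(), analysis, new[] { dataSetDE, dataSetUS, dataSetCN });
```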

The LoadData method loads the data schema defined in the table above into three different data-sets, using countryCode as a partition key. Also, the code defines a new check using the following constraint methods: IsComplete, ContainsURL, IsContainedIn. The resulting analyzers (obtained by calling the RequiredAnalyzers() method) are then passed into a new instance of the Analysis class. The code also defines three different InMemoryStateProvider instances, and it executes the AnalysisRunner.Run method for each country code (DE, US, CN) by passing the corresponding InMemoryStateProvider.

The mechanism of aggregated states (the AnalysisRunner.RunOnAggregatedStates method) provides a way to merge the three in-memory states, dataSetDE, dataSetUS, and dataSetCN, into a single table of metrics. It is important to notice that the operation does not trigger any re-computation over the data sample.

Once we have a single table of metrics, it is also possible to increment only one partition of the data. For example, let's assume that the US partition changes and its data grows: the system only recomputes the state of the changed partition to update the metrics for the whole table:

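A hedged sketch of that refresh, reusing the names from the snippet above; LoadUpdatedUsPartition is a hypothetical helper, and the AnalysisRunner signatures are assumptions.

```csharp
// Reload only the partition that has changed.
DataFrame updatedUsData = LoadUpdatedUsPartition(spark);

// Recompute only the state of the changed US partition...
var updatedDataSetUS = new InMemoryStateProvider();
AnalysisRunner.Run(updatedUsData, analysis, saveStatesWith: updatedDataSetUS);

// ...and merge it with the untouched DE and CN states to refresh the metrics
// for the whole table.
AnalyzerContext updatedMetrics = AnalysisRunner.RunOnAggregatedStates(
    updatedUsData.Schema(), analysis,
    new[] { dataSetDE, updatedDataSetUS, dataSetCN });
```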

It is essential to notice that the schema of the data must be the same for every data-set state you need to aggregate. This approach results in a lighter computational effort when you have to refresh the metrics of a single partition of your data-set.

Handle the Scala functional approach

The official awslabs/deequ implementation is written in Scala, which is also the official language of Apache Spark. The strong object-oriented nature of C# adds more difficulties in replicating some of the concepts used by the original Scala deequ library. An example is the widespread use of the Try and Option monads. Fortunately, it is not the first time that someone has ported a Scala library to C#/.NET: Akka.NET (the port of Akka) has a handy guide that gives some conversion suggestions for doing that. The Akka.NET repository also provides some implementation utilities, such as the Try&lt;T&gt; and Option&lt;T&gt; monads for C#, which are also used by the deequ.NET code.

Final thoughts

This post described the initial work that I did to port the deequ library into the .NET ecosystem. We have seen an introduction to some of the components that are part of the architecture of the library, such as the checks part, the constraint API, the analyzers layer, and the batch vs. incremental computation approach.

I’m going to cover the rest of the core topics of the library in a future post, such as the metrics history, the anomaly detectors, and the deployment part.

In the meantime, this is the repository where you can find the actual library implementation, samueleresca/deequ.NET, and this is the original awslabs/deequ library.

Originally published at https://samueleresca.net.

最初发布在 https://samueleresca.net

翻译自: https://medium.com/@samueleresca/large-scale-data-quality-verification-in-net-pt-1-23f51d72bc18

pt timing验证

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值