10.Validation

10.1.Purpose

Validate the data copied by an import or an export by comparing the row counts of the source and the target after the copy.

10.2.Introduction

There are 3 basic interfaces:

ValidationThreshold - Determines whether the error margin between the source and the target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from the source and the target are the same.

ValidationFailureHandler - Responsible for handling failures: log an error/warning, abort, etc. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.

Validator - Drives the validation logic by delegating the decision to ValidationThreshold and delegating failure handling to ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts from the source and the target.
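The division of labor between the three roles can be illustrated with a small shell sketch. This is only a conceptual model: Sqoop implements these roles as Java classes, and the function names below (absolute_threshold, log_on_failure, validate_row_counts) are hypothetical names chosen for this illustration, not part of Sqoop.

```shell
# Conceptual sketch only; not Sqoop's Java implementation.

# Threshold role: succeed only when the two row counts are identical.
absolute_threshold() {
  [ "$1" -eq "$2" ]
}

# Failure-handler role: log a warning to stderr and carry on.
log_on_failure() {
  echo "WARN: validation failed: source=$1 target=$2" >&2
}

# Validator role: delegate the decision to the threshold and
# delegate failure handling to the handler.
# $1 = source row count, $2 = target row count
validate_row_counts() {
  if absolute_threshold "$1" "$2"; then
    return 0
  fi
  log_on_failure "$1" "$2"
  return 1
}
```

Calling validate_row_counts 100 100 succeeds silently, while validate_row_counts 100 99 writes a warning to stderr and returns a non-zero status. Swapping in a different threshold or handler function changes the behavior without touching the validator, which is the point of the three-interface split.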

10.3.Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)

Validation arguments are part of the import and export arguments.

10.4.Configuration

The validation framework is extensible and pluggable. It comes with default implementations, but the interfaces can be extended to allow custom implementations, which are passed as part of the command line arguments as described below.

Validator.

Property:         validator
Description:      Driver for validation,
                  must implement org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.RowCountValidator

Validation Threshold.

Property:         validation-threshold
Description:      Drives the decision based on the validation meeting the
                  threshold or not. Must implement
                  org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.AbsoluteValidationThreshold

Validation Failure Handler.

Property:         validation-failurehandler
Description:      Responsible for handling failures, must implement
                  org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.LogOnFailureHandler
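As a rough sketch of how the three properties map onto command line arguments, the helper below prints the validation flags and falls back to the documented defaults. The function name validation_args and the environment variables VALIDATOR, THRESHOLD and FAILURE_HANDLER are assumptions made for this example; only the flags and the default class names come from the properties above.

```shell
# Print the validation flags, one token per line.
# VALIDATOR, THRESHOLD and FAILURE_HANDLER are hypothetical variable names
# used by this sketch; any that are unset fall back to the documented defaults.
validation_args() {
  printf '%s\n' \
    --validate \
    --validator "${VALIDATOR:-org.apache.sqoop.validation.RowCountValidator}" \
    --validation-threshold "${THRESHOLD:-org.apache.sqoop.validation.AbsoluteValidationThreshold}" \
    --validation-failurehandler "${FAILURE_HANDLER:-org.apache.sqoop.validation.LogOnFailureHandler}"
}
```

The output can be spliced into an invocation with command substitution, for example sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES $(validation_args). Setting THRESHOLD to the fully qualified name of a custom class (which must be on Sqoop's classpath) would replace only that one argument.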

10.5.Limitations

Validation currently only validates data copied from a single table into HDFS. It cannot be used in the following cases, which are limitations of the current implementation:

  • all-tables option

  • free-form query option

  • Data imported into Hive or HBase

  • table import with --where argument

  • incremental imports
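A wrapper script could check for these cases before adding --validate. The sketch below is one possible reading of the list above, mapped onto the Sqoop arguments import-all-tables, --query (or -e), --hive-import, --hbase-table, --where and --incremental. The function name can_validate is hypothetical, and the check is deliberately rough: it only looks at argument names, so an option value that happens to equal one of these names would be misreported.

```shell
# Return 0 when none of the listed limitations applies to the given Sqoop
# arguments; otherwise print the reason on stderr and return 1.
can_validate() {
  for arg in "$@"; do
    case "$arg" in
      import-all-tables)           reason="all-tables option" ;;
      --query|-e)                  reason="free-form query option" ;;
      --hive-import|--hbase-table) reason="import into Hive or HBase" ;;
      --where)                     reason="--where argument" ;;
      --incremental)               reason="incremental import" ;;
      *) continue ;;               # not a limiting argument, check the next one
    esac
    echo "validation not supported: $reason ($arg)" >&2
    return 1
  done
  return 0
}
```

For example, can_validate import --table EMPLOYEES succeeds, while can_validate import --table EMPLOYEES --where "id > 10" fails and names the --where argument as the reason.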

10.6.Example Invocations

A basic import of a table named EMPLOYEES in the corp database that uses validation to validate the row counts:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp  \
    --table EMPLOYEES --validate

A basic export to populate a table named bar with validation enabled:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data --validate

Another example that overrides the validation arguments:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold \
          org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler \
          org.apache.sqoop.validation.LogOnFailureHandler

11.Saved Jobs

Imports and exports can be performed repeatedly by issuing the same command multiple times. This is an expected scenario, especially when using the incremental import capability.

Sqoop allows you to define saved jobs, which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.

By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.