

Validate the data copied, either import or export, by comparing the row counts from the source and the target after the copy.



There are 3 basic interfaces:

ValidationThreshold - Determines whether the error margin between the source and target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures the row counts from source and target are the same.



ValidationFailureHandler - Responsible for handling failures: log an error/warning, abort, etc. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.


Validator - Drives the validation logic by delegating the decision to ValidationThreshold and delegating failure handling to ValidationFailureHandler. The default implementation is RowCountValidator which validates the row counts from source and the target.
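The delegation described above can be sketched in a few lines of shell. This is only an illustration of the pattern, not Sqoop's implementation (which is written in Java): the function names validate_row_counts, absolute_threshold, log_on_failure and abort_on_failure are hypothetical, and the row counts are passed in as plain arguments.

```shell
# Hypothetical names throughout; a sketch of the delegation pattern only.

# Threshold: accept only an exact match of the two counts.
absolute_threshold() { [ "$1" -eq "$2" ]; }

# Failure handlers: one warns and carries on, the other aborts
# by returning a non-zero status.
log_on_failure() {
  echo "WARN: row counts differ: source=$1 target=$2" >&2
  return 0
}
abort_on_failure() {
  echo "ERROR: row counts differ: source=$1 target=$2" >&2
  return 1
}

# Validator: asks the threshold for a verdict and hands any failure
# to the handler.
# Usage: validate_row_counts SOURCE TARGET [THRESHOLD_FN] [HANDLER_FN]
validate_row_counts() {
  threshold=${3:-absolute_threshold}
  handler=${4:-log_on_failure}
  if "$threshold" "$1" "$2"; then
    return 0
  fi
  "$handler" "$1" "$2"
}
```

With the defaults, `validate_row_counts 100 99` prints a warning and still returns success, whereas passing abort_on_failure as the fourth argument makes the same call return a non-zero status.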



$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)

Validation arguments are part of import and export arguments.



The validation framework is extensible and pluggable. It comes with default implementations but the interfaces can be extended to allow custom implementations by passing them as part of the command line arguments as described below.



Validator.

Property:         validator
Description:      Driver for validation,
                  must implement org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.RowCountValidator

Validation Threshold.

Property:         validation-threshold
Description:      Drives the decision based on the validation meeting the
                  threshold or not. Must implement
                  org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.AbsoluteValidationThreshold

Validation Failure Handler.

Property:         validation-failurehandler
Description:      Responsible for handling failures, must implement
                  org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.LogOnFailureHandler
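
To give a feel for what a custom validation-threshold might decide, the following shell sketch shows a percentage-tolerant check. It illustrates the decision logic only: a real plug-in would be a Java class implementing the ValidationThreshold interface, and the function name and its arguments here are assumptions made for illustration.

```shell
# Hypothetical helper, not part of Sqoop.
# Usage: within_tolerance SOURCE_COUNT TARGET_COUNT PERCENT
# Succeeds when the counts differ by at most PERCENT percent of the
# source count. PERCENT=0 behaves like an absolute (exact match) threshold.
within_tolerance() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  # Compare diff/source <= percent/100 using integer arithmetic only.
  [ $(( diff * 100 )) -le $(( $1 * $3 )) ]
}
```

For instance, `within_tolerance 1000 990 1` succeeds (a difference of 10 rows is exactly 1% of 1000), while `within_tolerance 1000 989 1` fails.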


Validation currently only validates data copied from a single table into HDFS. The following are the limitations in the current implementation:


  • all-tables option
  • free-form query option
  • Data imported into Hive or HBase
  • table import with --where argument
  • incremental imports

10.6. Example Invocations

A basic import of a table named EMPLOYEES in the corp database that uses validation to validate the row counts:


$ sqoop import --connect jdbc:mysql://db.foo.com/corp  \
    --table EMPLOYEES --validate

A basic export to populate a table named bar with validation enabled:


$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data --validate

Another example that overrides the validation args:


$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold \
          org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler \
          org.apache.sqoop.validation.AbortOnFailureHandler

11. Saved Jobs

Imports and exports can be repeatedly performed by issuing the same command multiple times. Especially when using the incremental import capability, this is an expected scenario.


Sqoop allows you to define saved jobs which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.


By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
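
As a brief sketch of the workflow, the sqoop-job tool can record an import under a name and run it again later. The job name employees_import is hypothetical, the connection string and table are borrowed from the earlier examples, and the commands are wrapped in shell functions only so that each step can be invoked separately.

```shell
# Hypothetical job name; connection string and table reuse earlier examples.
JOB_NAME=employees_import

# Record the import definition under JOB_NAME (nothing is imported yet).
create_job() {
  sqoop job --create "$JOB_NAME" -- import \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
}

# Show the names of all saved jobs.
list_jobs() { sqoop job --list; }

# Execute the saved definition.
run_job() { sqoop job --exec "$JOB_NAME"; }
```

Calling create_job once and then run_job whenever the import should be repeated avoids retyping the full command line each time.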
