12. sqoop-job

12.1. Purpose

The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.

If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.

12.2. Syntax

$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Although the Hadoop generic arguments must precede any job arguments, the job arguments can be entered in any order with respect to one another.

Table 24. Job management options:

Argument             Description
--create <job-id>    Define a new saved job with the specified job-id (name). A second Sqoop command line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>    Delete a saved job.
--exec <job-id>      Given a job defined with --create, run the saved job.
--show <job-id>      Show the parameters for a saved job.
--list               List all saved jobs.


Creating saved jobs is done with the --create action. This operation requires a -- followed by a tool name and its arguments. The tool and its arguments will form the basis of the saved job. Consider:

$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable

This creates a job named myjob which can be executed later. The job is not run. This job is now available in the list of saved jobs:

$ sqoop job --list
Available jobs:
  myjob

We can inspect the configuration of a job with the show action:

 $ sqoop job --show myjob
 Job: myjob
 Tool: import
 Options:
 ----------------------------
 direct.import = false
 codegen.input.delimiters.record = 0
 hdfs.append.dir = false
 db.table = mytable
 ...

And if we are satisfied with it, we can run the job with exec:

$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...

The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:

$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
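
A saved job that is no longer needed can be removed with the --delete action; for example, to drop the myjob definition created above and confirm it is gone:

$ sqoop job --delete myjob
$ sqoop job --list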

Table 25. Metastore connection options:

Argument                    Description
--meta-connect <jdbc-uri>   Specifies the JDBC connect string used to connect to the metastore.

By default, a private metastore is instantiated in $HOME/.sqoop. If you have configured a hosted metastore with the sqoop-metastore tool, you can connect to it by specifying the --meta-connect argument. This is a JDBC connect string just like the ones used to connect to databases for import.

In conf/sqoop-site.xml, you can configure sqoop.metastore.client.autoconnect.url with this address, so you do not have to supply --meta-connect to use a remote metastore. This parameter can also be modified to move the private metastore to a location on your filesystem other than your home directory.

If you configure sqoop.metastore.client.enable.autoconnect with the value false, then you must explicitly supply --meta-connect.
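
For example, a job can be stored in and listed from a hosted metastore by pointing --meta-connect at it (the server address below is the example value used again in the sqoop-metastore section):

$ sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop \
    --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
$ sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop \
    --list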

Table 26. Common options:

Argument     Description
--help       Print usage instructions
--verbose    Print more information while working

12.3. Saved jobs and passwords

The Sqoop metastore is not a secure resource. Multiple users can access its contents. For this reason, Sqoop does not store passwords in the metastore. If you create a job that requires a password, you will be prompted for that password each time you execute the job.

You can enable passwords in the metastore by setting sqoop.metastore.client.record.password to true in the configuration.

Note that you have to set sqoop.metastore.client.record.password to true if you are executing saved jobs via Oozie, because Sqoop cannot prompt the user to enter passwords while being executed as Oozie tasks.
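
As a hedged sketch only: the documented route is to set this property in conf/sqoop-site.xml, but since it is an ordinary configuration property it may also be accepted as a Hadoop generic argument with -D (whether that override is honored can vary by Sqoop version):

$ sqoop job -D sqoop.metastore.client.record.password=true \
    --exec myjob -- --username someuser -P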

12.4. Saved jobs and incremental imports

Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
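
A minimal sketch of such a saved job, reusing the connection, table, column name, and starting value from the examples above:

$ sqoop job --create someIncrementalJob -- import \
    --connect jdbc:mysql://example.com/db --table mytable \
    --incremental append --check-column id --last-value 100
$ sqoop job --exec someIncrementalJob

Each successful run updates the stored last-value, so the next --exec imports only rows added since the previous run.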

13. sqoop-metastore

13.1. Purpose

The metastore tool configures Sqoop to host a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.

Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.

13.2. Syntax

$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)

Although the Hadoop generic arguments must precede any metastore arguments, the metastore arguments can be entered in any order with respect to one another.

Table 27. Metastore management options:

Argument     Description
--shutdown   Shuts down a running metastore instance on the same machine.

Running sqoop-metastore launches a shared HSQLDB database instance on the current machine. Clients can connect to this metastore and create jobs which can be shared between users for execution.

The location of the metastore’s files on disk is controlled by the sqoop.metastore.server.location property in conf/sqoop-site.xml. This should point to a directory on the local filesystem.

The metastore is available over TCP/IP. The port is controlled by the sqoop.metastore.server.port configuration parameter, and defaults to 16000.

Clients should connect to the metastore by specifying sqoop.metastore.client.autoconnect.url or --meta-connect with the value jdbc:hsqldb:hsql://<server-name>:<port>/sqoop. For example, jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.

This metastore may be hosted on a machine within the Hadoop cluster, or elsewhere on the network.
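
Putting the pieces together, a rough sketch of the lifecycle (the & backgrounding is just for illustration; clients on other machines use the connect string shown above):

$ sqoop metastore &
$ sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop \
    --show myjob
$ sqoop metastore --shutdown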

Reference: http://myeyeofjava.iteye.com/blog/1704644

14. sqoop-merge

14.1. Purpose

The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Note: the merge tool applies after incremental imports, most often in last-modified mode. For example, if the first import uses --target-dir old and a second import in last-modified mode uses --target-dir new, rows in the second dataset that share a primary key with rows in the first carry updated values; the merge tool can then combine the two datasets into one. A worked example:

http://blog.csdn.net/coldplay/article/details/7619065

14.2. Syntax

$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)

Although the Hadoop generic arguments must precede any merge arguments, the job arguments can be entered in any order with respect to one another.

Table 28. Merge options:

Argument               Description
--class-name <class>   Specify the name of the record-specific class to use during the merge job.
--jar-file <file>      Specify the name of the jar to load the record class from.
--merge-key <col>      Specify the name of a column to use as the merge key.
--new-data <path>      Specify the path of the newer dataset.
--onto <path>          Specify the path of the older dataset.
--target-dir <path>    Specify the target path for the output of the merge job.

The merge tool runs a MapReduce job that takes two directories as input: a newer dataset, and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the directory in HDFS specified by --target-dir.

When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.

To parse the dataset and extract the key column, the auto-generated class from a previous import must be used. You should specify the class name and jar file with --class-name and --jar-file. If this is not available, you can recreate the class using the codegen tool.
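
For instance, a hedged sketch of regenerating the class for mytable so it can be handed to the merge tool (the class name Foo matches the merge example below; the output directories are arbitrary):

$ sqoop codegen --connect jdbc:mysql://example.com/db --table mytable \
    --class-name Foo --outdir /tmp/sqoop-codegen --bindir /tmp/sqoop-codegen

The jar compiled into the --bindir directory and the class name can then be passed to the merge tool's --jar-file and --class-name options.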

The merge tool is typically run after an incremental import with the date-last-modified mode (sqoop import --incremental lastmodified …).

Supposing two incremental imports were performed, where some older data is in an HDFS directory named older and newer data is in an HDFS directory named newer, these could be merged like so:

$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id

This would run a MapReduce job where the value in the id column of each row is used to join rows; rows in the newer dataset will be used in preference to rows in the older dataset.

This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same.
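
As a hedged end-to-end sketch, the older and newer directories used above might have been produced by two imports like the following (the check column last_update_date and its starting value are invented placeholders), after which the merge command shown earlier combines them:

$ sqoop import --connect jdbc:mysql://example.com/db --table mytable \
    --target-dir older
$ sqoop import --connect jdbc:mysql://example.com/db --table mytable \
    --incremental lastmodified --check-column last_update_date \
    --last-value "2012-01-01 00:00:00" --target-dir newer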

15. sqoop-codegen

15.1. Purpose

The codegen tool generates Java classes which encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. For example, if Java source is lost, it can be recreated. New versions of a class can be created which use different delimiters between fields, and so on.

15.2. Syntax

$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)

Although the Hadoop generic arguments must precede any codegen arguments, the codegen arguments can be entered in any order with respect to one another.

Table 29. Common arguments

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

Table 30. Code generation arguments:

Argument                Description
--bindir <dir>          Output directory for compiled objects
--class-name <name>     Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>       Disable code generation; use specified jar
--outdir <dir>          Output directory for generated code
--package-name <name>   Put auto-generated classes in this package
--map-column-java <m>   Override default mapping from SQL type to Java type for configured columns

Table 31. Output line formatting arguments:

Argument                          Description
--enclosed-by <char>              Sets a required field enclosing character
--escaped-by <char>               Sets the escape character
--fields-terminated-by <char>     Sets the field separator character
--lines-terminated-by <char>      Sets the end-of-line character
--mysql-delimiters                Uses MySQL’s default delimiter set: fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '
--optionally-enclosed-by <char>   Sets a field enclosing character

Table 32. Input parsing arguments:

Argument                                Description
--input-enclosed-by <char>              Sets a required field encloser
--input-escaped-by <char>               Sets the input escape character
--input-fields-terminated-by <char>     Sets the input field separator
--input-lines-terminated-by <char>      Sets the input end-of-line character
--input-optionally-enclosed-by <char>   Sets a field enclosing character

Table 33. Hive arguments:

Argument                     Description
--hive-home <dir>            Override $HIVE_HOME
--hive-import                Import tables into Hive (uses Hive’s default delimiters if none are set).
--hive-overwrite             Overwrite existing data in the Hive table.
--create-hive-table          If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>    Sets the table name to use when importing to Hive.
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replace \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key         Name of a Hive field to partition on.
--hive-partition-value <v>   String value that serves as the partition key value for this import into Hive.
--map-column-hive <map>      Override default mapping from SQL type to Hive type for configured columns.

If Hive arguments are provided to the code generation tool, Sqoop generates a file containing the HQL statements to create a table and load data.

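As an assumption-labeled sketch (which Hive argument triggers the HQL file can vary by version; --hive-import and --hive-table are used here), this might look like:

$ sqoop codegen --connect jdbc:mysql://db.example.com/corp --table employees \
    --hive-import --hive-table employees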

15.3. Example Invocations

Recreate the record interpretation code for the employees table of a corporate database:

$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees
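
A further hedged variation, combining options from the tables above to control the generated class name, output locations, and field delimiter (the names and paths are arbitrary placeholders):

$ sqoop codegen --connect jdbc:mysql://db.example.com/corp --table employees \
    --class-name com.example.Employees --outdir /tmp/sqoop-codegen \
    --bindir /tmp/sqoop-codegen --fields-terminated-by '\t'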