7. sqoop-import

7.1. Purpose

The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

7.2. Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.

Note

In this document, arguments are grouped into collections organized by function. Some collections are present in several tools (for example, the "common" arguments). An extended description of their functionality is given only on the first presentation in this document.

Table 1. Common arguments (covered earlier):

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

7.2.1. Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It’s important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.

You might need to authenticate against the database before you can access it. You can use the --username and --password or -P parameters to supply a username and a password to the database. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345
Warning

The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps. The -P argument will read a password from a console prompt, and is the preferred method of entering credentials. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means.

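For example, a minimal sketch (reusing the hypothetical employees database and aaron user from above) that prompts for the password on the console instead of passing it on the command line:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron -P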

Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path.

Then run Sqoop. For example:

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...

When connecting to a database using JDBC, you can optionally specify extra JDBC parameters via a property file using the option --connection-param-file. The contents of this file are parsed as standard Java properties and passed into the driver while creating a connection.

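For illustration, a minimal sketch: the file name mysql-connect.properties is a hypothetical example, and the keys shown are standard MySQL Connector/J driver properties. Such a file might contain:

    # mysql-connect.properties (hypothetical example file)
    useUnicode=true
    characterEncoding=UTF-8

and be passed on the command line as:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --connection-param-file mysql-connect.properties ...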

Note

The parameters specified via the optional property file are only applicable to JDBC connections. Any fastpath connectors that use connections other than JDBC will ignore these parameters.

Table 2. Validation arguments:

Argument                                    Description
--validate                                  Enable validation of data copied, supports single table copy only
--validator <class-name>                    Specify validator class to use
--validation-threshold <class-name>         Specify validation threshold class to use
--validation-failurehandler <class-name>    Specify validation failure handler class to use
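
As a quick sketch (the connect string and table name are placeholders), validation of the copied data can be enabled on a single-table import like this:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --validate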


Table 3. Import control arguments:

Argument                            Description
--append                            Append data to an existing dataset in HDFS
--as-avrodatafile                   Imports data to Avro Data Files
--as-sequencefile                   Imports data to SequenceFiles
--as-textfile                       Imports data as plain text (default)
--boundary-query <statement>        Boundary query to use for creating splits
--columns <col,col,col…>            Columns to import from table
--direct                            Use direct import fast path
--direct-split-size <n>             Split the input stream every n bytes when importing in direct mode
--fetch-size <n>                    Number of entries to read from database at once
--inline-lob-limit <n>              Set the maximum size for an inline LOB
-m,--num-mappers <n>                Use n map tasks to import in parallel
-e,--query <statement>              Import the results of statement
--split-by <column-name>            Column of the table used to split work units
--table <table-name>                Table to read
--target-dir <dir>                  HDFS destination dir
--warehouse-dir <dir>               HDFS parent for table destination
--where <where clause>              WHERE clause to use during import
-z,--compress                       Enable compression
--compression-codec <c>             Use Hadoop codec (default gzip)
--null-string <null-string>         The string to be written for a null value for string columns
--null-non-string <null-string>     The string to be written for a null value for non-string columns

The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.

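For example, a hedged sketch (placeholder connect string and table; \N is a common choice when the data will later be read by Hive):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --null-string '\\N' --null-non-string '\\N'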

7.2.2. Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.

By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order;" that is, a table containing columns A, B, and C results in an import of data such as:

A1,B1,C1
A2,B2,C2
...

You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle".

You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.

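Combining the two, a minimal sketch (placeholder connect string; the table and column names reuse the examples above):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --columns "name,employee_id,jobtitle" \
    --where "id > 400"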

By default sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.

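For instance, a hedged sketch (placeholder connect string and table; the boundary query itself is only an illustration of returning the two boundary values):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id > 400"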

7.2.3. Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory with --target-dir.

(Note:

    1. The destination of an import is HDFS; likewise, an export starts from HDFS.

    2. The degree of parallelism is simply the number of map tasks; each parallel process runs one map task.)

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

For example:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults
Note

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

Note

The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.

7.2.4. Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

(This passage only says that the degree of parallelism should not exceed what your MapReduce cluster or your database can reasonably support; it gives no concrete numbers or recipe, so you will need to experiment.)
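
As a small sketch (placeholder connect string and table; 8 is just an illustrative value), the degree of parallelism is set like this:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees -m 8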

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

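A brief sketch (placeholder connect string and table) that chooses an explicit splitting column alongside the parallelism setting:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by employee_id -m 8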

7.2.5. Controlling the Import Process

By default, the import process will use JDBC which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC. Currently, direct mode does not support imports of large object columns (BLOB or CLOB).

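A minimal sketch (placeholder MySQL connect string and table) requesting the direct channel:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --direct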

When importing from PostgreSQL in conjunction with direct mode, you can split the import into separate files after individual files reach a certain size. This size limit is controlled with the --direct-split-size argument.

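For illustration, a hedged sketch (hypothetical PostgreSQL connect string and table; 67108864 bytes is simply 64 MB):

$ sqoop import --connect jdbc:postgresql://database.example.com/corp \
    --table employees --direct --direct-split-size 67108864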

By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...

This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
    ...

This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.

When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1

By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.

(In other words, with --append the import still works even when the specified target directory already exists; exactly what the resulting file names look like is something you will have to try out.)
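
A short sketch (placeholder connect string and table; /dest reuses the directory from the example above) appending into an existing target directory:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --target-dir /dest --append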

Note

When using the direct mode of import, certain database client utilities are expected to be present in the shell path of the task process. For MySQL the utilities mysqldump and mysqlimport are required, whereas for PostgreSQL the utility psql is required.


7.2.6. Controlling type mapping

Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing mapping to Java) or --map-column-hive (for changing Hive mapping).

Table 4. Parameters for overriding mapping:

Argument                        Description
--map-column-java <mapping>     Override mapping from SQL to Java type for configured columns
--map-column-hive <mapping>     Override mapping from SQL to Hive type for configured columns

Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Sqoop will raise an exception if a configured mapping is not used.

7.2.7. Incremental Imports

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Table 5. Incremental import arguments:

Argument                  Description
--check-column (col)      Specifies the column to be examined when determining which rows to import
--incremental (mode)      Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified
--last-value (value)      Specifies the maximum value of the check column from the previous import

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

In append mode, incremental import is driven by a monotonically increasing column, for example:
--incremental append --check-column num_iid --last-value 0

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

In lastmodified mode, incremental import is driven by a timestamp column, for example:
--incremental lastmodified --check-column created --last-value '2012-02-01 11:0:00'
This imports only rows whose created value is newer than '2012-02-01 11:0:00'.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

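Putting the pieces together, a hedged sketch of a complete append-mode command (the connect string and table are placeholders; num_iid reuses the example column above):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees \
    --incremental append --check-column num_iid --last-value 0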
