Sqoop Official Documentation Translation 0

Original link: https://my.oschina.net/qiangzigege/blog/552173

Translation of the official documentation:

1. Introduction

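Sqoop is a tool for transferring data between relational databases and Hadoop.
You can use Sqoop to import data from an RDBMS (such as MySQL or Oracle) into the Hadoop environment,
transform the data with MapReduce (MR), and then export the data back into the database.

Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported.
Sqoop uses MR to import and export the data, which provides parallel operation and fault tolerance.
This document describes how to use Sqoop to move data, and covers the command-line tool suite.

Intended audience:
1) System and application programmers
2) System administrators
3) Database administrators
4) Data analysts
5) Data engineers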

2. Supported Releases

This document applies to Sqoop v1.4.4.

3. Sqoop Releases

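Sqoop is open source software from the Apache Software Foundation.
Official site: http://sqoop.apache.org , where you can obtain:
New releases of Sqoop and its most recent source code
An issue tracker
A wiki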

4. Prerequisites

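Required skills:

Basic computer skills and knowledge

Familiarity with command-line tools

Relational databases

A basic understanding of Hadoop

Sqoop currently supports 4 major Hadoop releases: 0.20, 0.23, 1.0 and 2.0.

This document assumes you are using a Linux environment.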

5. Basic Usage

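With Sqoop, you can import data from a relational database into HDFS.
The input is a database table, which Sqoop reads row by row into HDFS. The output is a set of files containing a copy of the table.
Because the import runs in parallel, the output is spread across multiple files.
These files may be text files, or binary Avro or SequenceFiles.
A by-product of the import is a generated Java class, which is used during the import itself.
The Java source for this class is also provided, so it can be used in subsequent MR processing.
The class is quite capable and lets you develop MR programs quickly. You are also free to parse the delimited records yourself.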

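After processing the records, you may want to write the data back to a relational database.
Sqoop's export feature reads a set of HDFS files in parallel, parses them into records, and inserts them into a relational database for later use.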

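Sqoop includes other commands that let you inspect the database.
For example, you can list the available database schemas (with the sqoop-list-databases tool)
or tables (with the sqoop-list-tables tool).
Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).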

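Import, code generation, and export can all be customized.
You can control the row range or the specific columns imported, and you can specify particular delimiters and so on.
You can also control the class and package names used in the generated code.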

6. Sqoop Tools

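Sqoop is a collection of related tools.
To use Sqoop, you specify the tool you want to use and the arguments that control it.

If Sqoop is compiled from source, you can run it without a formal installation by running the bin/sqoop program.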
$ sqoop tool-name [tool-arguments]
Sqoop ships with a help tool.

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
codegen            Generate code to interact with database records
create-hive-table  Import a table definition into Hive
eval               Evaluate a SQL statement and display the results
export             Export an HDFS directory to a database table
help               List available commands
import             Import a table from a database to HDFS
import-all-tables  Import tables from a database to HDFS
list-databases     List available databases on a server
list-tables        List available tables in a database
version            Display version information

Run 'sqoop help COMMAND' to see help for a specific command, for example sqoop help import.
You can also use sqoop import --help.

6.1. Using Command Aliases

Instead of the sqoop (toolname) form, you can also use the alias scripts, such as sqoop-import and sqoop-export.

6.2. Controlling the Hadoop Installation

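The sqoop command-line program invokes the bin/hadoop script.
If you have multiple Hadoop installations, you can select one of them by setting the $HADOOP_COMMON_HOME and $HADOOP_MAPRED_HOME environment variables.

For example: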

$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
sqoop import --arguments...
or:
$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
$ sqoop import --arguments...

If either of these variables is not set, Sqoop falls back to $HADOOP_HOME. If that is not set either, Sqoop uses the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set.
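
The fallback order for the Hadoop locations can be sketched in plain shell. This is only an illustration of the order described above, not Sqoop's actual startup script; the helper name resolve_home is hypothetical.

```shell
# Default Apache Bigtop locations, as stated in the text above.
DEFAULT_COMMON_HOME=/usr/lib/hadoop
DEFAULT_MAPRED_HOME=/usr/lib/hadoop-mapreduce

# resolve_home EXPLICIT HADOOP_HOME_VALUE DEFAULT  (hypothetical helper)
# Prints the first non-empty argument, mirroring the fallback order.
resolve_home() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

common_home=$(resolve_home "${HADOOP_COMMON_HOME:-}" "${HADOOP_HOME:-}" "$DEFAULT_COMMON_HOME")
mapred_home=$(resolve_home "${HADOOP_MAPRED_HOME:-}" "${HADOOP_HOME:-}" "$DEFAULT_MAPRED_HOME")

echo "common: $common_home"
echo "mapred: $mapred_home"
```

Running this with none of the three variables set prints the two Bigtop defaults; exporting HADOOP_COMMON_HOME changes only the first line.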

6.3. Using Generic and Specific Arguments

Each tool is controlled through generic and tool-specific arguments.
For example:
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>     Specify JDBC connect string
   --connect-manager <jdbc-uri>     Specify connection manager class to use
   --driver <class-name>    Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>+      Override $HADOOP_MAPRED_HOME
   --help                   Print usage instructions
   --password-file          Set path for file containing authentication password
   -P                       Read password from console
   --password <password>    Set authentication password
   --username <username>    Set authentication username
   --verbose                Print more information while working
   --hadoop-home <dir>+     Deprecated. Override $HADOOP_HOME

[...]

Generic Hadoop command-line arguments:
(must precede any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command-line syntax is:
bin/hadoop command [genericOptions] [commandOptions]
You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.

The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MR job that Sqoop launches. If it is not specified, the name defaults to the jar name for the job, which is derived from the name of the table used.

The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
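
As a sketch of the argument ordering described above, the snippet below assembles an import command with one generic argument and prints it instead of running it, so the ordering can be checked without a cluster. The job name is a made-up value; the host and table are the example names used elsewhere in this document.

```shell
# Hypothetical job name; adjust to your environment.
JOB_NAME=nightly_employees_import

# The generic Hadoop argument (-D, single dash) comes right after the tool
# name and before the tool-specific arguments (--connect, --table).
cmd="sqoop import -D mapred.job.name=$JOB_NAME --connect jdbc:mysql://database.example.com/employees --table employees"

# Print the command rather than executing it.
echo "$cmd"
```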

6.4. Using Options Files to Pass Arguments

(Omitted.)

6.5. Using Tools

The following sections describe the operation of each tool in detail.

7. `sqoop-import`

7.1. Purpose

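The import tool imports a single table into HDFS. Each row of the table becomes a separate record in HDFS.
Records can be stored as text files (one record per line) or in binary form as Avro or SequenceFiles.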

7.2. Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
The Hadoop generic arguments must precede any import arguments; the import arguments themselves can be given in any order.

Table 1. Common arguments

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
--password-file                       Set path for a file containing the authentication password
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

7.2.1. Connecting to a Database Server

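Sqoop is designed to import tables from a database into HDFS.
To do so, you must specify a connect string that describes how to connect to the database.
The connect string is similar to a URL and is passed with the --connect argument.
It identifies the server and database to connect to, and may also specify the port.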
$ sqoop import --connect jdbc:mysql://database.example.com/employees

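This string connects to a MySQL database named employees on the host database.example.com.
The connect string is used by the TaskTracker nodes during the MR job, so use the full hostname or IP address of the database host rather than localhost.

You may also need to authenticate. Use --username to supply the username; the password can be supplied in several ways.

To supply the password securely, save it in a file in your home directory with 400 permissions,
and then specify the path to that file with --password-file.
Sqoop reads the password from the file and passes it on to the MR job.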

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username venkatesh --password-file ${user.home}/.password
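
Creating the password file described above might look like the sketch below. A temporary directory stands in for the home directory and the password is a placeholder, both assumptions made so the snippet is safe to run. Writing the value without a trailing newline is a precaution, since a newline in the file may otherwise be read as part of the password.

```shell
# Illustration only: a temp directory stands in for $HOME,
# and 'example-password' is a placeholder value.
demo_home=$(mktemp -d)
pwfile="$demo_home/.password"

# Create the file with a restrictive umask; printf '%s' adds no trailing newline.
( umask 077; printf '%s' 'example-password' > "$pwfile" )

# Owner read-only, as the text recommends.
chmod 400 "$pwfile"

# Show the resulting permission bits (first 10 characters of ls -l).
perms=$(ls -l "$pwfile" | cut -c1-10)
echo "$pwfile: $perms"
```

The path in pwfile is what would then be passed to --password-file.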

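Another way to supply the password is the -P argument, which reads it from a console prompt.

The password can also be given directly with --password, as in the command below, but this is insecure: other users may be able to read it from the command-line arguments, for example in the output of ps.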

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username aaron --password 12345

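Sqoop automatically supports several databases, including MySQL.
Connect strings beginning with jdbc:mysql:// are handled automatically by Sqoop.

To use another JDBC-compliant database, download the appropriate JDBC driver and install its .jar file in the $SQOOP_HOME/lib directory.
Each driver .jar file has a corresponding driver class.
For example, MySQL's Connector/J library has the driver class com.mysql.jdbc.Driver.
Specify the driver class with the --driver argument.

For example, to connect to a SQL Server database, download the driver and install it in Sqoop's lib directory.

Then run Sqoop, for example: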
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
--connect <connect-string> ...
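When connecting to a database using JDBC, you can optionally supply extra JDBC parameters in a properties file,
specified with the --connection-param-file option.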

Table 2. Validation arguments

Argument                                    Description
--validate                                  Enable validation of data copied, supports single table copy only.
--validator <class-name>                    Specify validator class to use.
--validation-threshold <class-name>         Specify validation threshold class to use.
--validation-failurehandler <class-name>    Specify validation failure handler class to use.

Table 3. Import control arguments:

Argument                           Description
--append                           Append data to an existing dataset in HDFS
--as-avrodatafile                  Imports data to Avro Data Files
--as-sequencefile                  Imports data to SequenceFiles
--as-textfile                      Imports data as plain text (default)
--boundary-query <statement>       Boundary query to use for creating splits
--columns <col,col,col…>           Columns to import from table
--delete-target-dir                Delete the import target directory if it exists
--direct                           Use direct import fast path
--direct-split-size <n>            Split the input stream every n bytes when importing in direct mode
--fetch-size <n>                   Number of entries to read from database at once.
--inline-lob-limit <n>             Set the maximum size for an inline LOB
-m,--num-mappers <n>               Use n map tasks to import in parallel
-e,--query <statement>             Import the results of statement.
--split-by <column-name>           Column of the table used to split work units
--table <table-name>               Table to read
--target-dir <dir>                 HDFS destination dir
--warehouse-dir <dir>              HDFS parent for table destination
--where <where clause>             WHERE clause to use during import
-z,--compress                      Enable compression
--compression-codec <c>            Use Hadoop codec (default gzip)
--null-string <null-string>        The string to be written for a null value for string columns
--null-non-string <null-string>    The string to be written for a null value for non-string columns

The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.

7.2.2. Selecting the Data to Import

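Sqoop typically imports data one table at a time.
Use the --table argument to select the table to import.
For example: --table employees. This argument can also name a view or another table-like entity.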

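By default, all columns are imported, and the imported data is written to HDFS in its natural order.
For example, for a table containing columns A, B, and C,
the imported data looks like:
A1,B1,C1
A2,B2,C2
...
You can select a subset of the columns and control their order with the --columns argument.
Separate the columns with commas, for example --columns "name,employee_id,jobtitle".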

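You can control which rows are imported by adding a SQL WHERE clause.
By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>.
Add a WHERE clause with the --where argument.
For example: --where "id > 400".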
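
To see how --table, --columns and --where shape the statement, the sketch below assembles a SELECT from the three values. The helper build_select is hypothetical and only approximates the statement form given above; it is not the exact SQL that Sqoop issues.

```shell
# build_select TABLE [COLUMNS] [WHERE]  (hypothetical helper)
# Prints a statement of the form SELECT <column list> FROM <table name>,
# with a WHERE clause appended when one is given.
build_select() {
    table=$1
    columns=${2:-*}      # all columns when --columns is not used
    where=${3:-}
    stmt="SELECT $columns FROM $table"
    if [ -n "$where" ]; then
        stmt="$stmt WHERE $where"
    fi
    echo "$stmt"
}

build_select employees
# prints: SELECT * FROM employees

build_select employees "name,employee_id,jobtitle" "id > 400"
# prints: SELECT name,employee_id,jobtitle FROM employees WHERE id > 400
```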

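By default, Sqoop uses the query select min(<split-by>), max(<split-by>) from <table name> to find the boundaries for creating splits.
In some cases this query is not optimal, so you can instead specify any arbitrary query that returns two numeric columns,
using the --boundary-query argument.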
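
Once the minimum and maximum of the split column are known, the range is divided among the map tasks. The sketch below shows one simple way to cut an integer range into even pieces; it is a simplified illustration under that assumption, not Sqoop's actual splitting code, and the helper split_ranges is hypothetical.

```shell
# split_ranges MIN MAX NUM_MAPPERS  (hypothetical helper)
# Prints one "lo hi" pair per map task, dividing [MIN, MAX] evenly
# using integer arithmetic.
split_ranges() {
    min=$1
    max=$2
    n=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        lo=$(( min + (max - min) * i / n ))
        hi=$(( min + (max - min) * (i + 1) / n ))
        echo "$lo $hi"
        i=$(( i + 1 ))
    done
}

split_ranges 0 1000 4
# prints:
# 0 250
# 250 500
# 500 750
# 750 1000
```

Each pair would become a range condition on the split column, for example id >= 250 AND id < 500; the last range has to include the maximum value itself.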

7.2.3. Free-form Query Imports

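Instead of using the --table, --columns and --where arguments, Sqoop can import the result set of an arbitrary SQL query.
Specify the SQL statement with the --query argument.
When importing a free-form query, you must specify a destination directory with the --target-dir argument.

To import the results in parallel, each map task executes its own copy of the query, with the results partitioned by bounding conditions.

The query must include the token $CONDITIONS, which each Sqoop process replaces with a unique condition expression.
You must also select a splitting column with the --split-by argument.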

For example:
$ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults

Alternatively, the query can be executed once and imported serially, by specifying a single map task with -m 1:

$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
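
To make the $CONDITIONS replacement concrete, the sketch below substitutes two example range predicates into the query text. The helper substitute and the predicates are assumptions for illustration; Sqoop performs the replacement internally. The query is single-quoted so that the shell does not expand $CONDITIONS itself; inside double quotes it would need to be written as \$CONDITIONS.

```shell
# Single quotes keep the shell from expanding $CONDITIONS.
query='SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS'

# substitute QUERY PREDICATE  (hypothetical helper)
# Replaces the $CONDITIONS token with the given predicate.
# The predicate must not contain '/' or '&', which are special to sed here.
substitute() {
    printf '%s\n' "$1" | sed 's/[$]CONDITIONS/'"$2"'/'
}

# Each map task would run the query with its own range predicate.
substitute "$query" "a.id >= 0 AND a.id < 250"
substitute "$query" "a.id >= 250 AND a.id < 500"
```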
