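Translator's preface

This is my first time translating a technical document, so there are bound to be mistakes and awkward passages. Please point them out and I will fix them right away.

I think reading an English document involves three stages: (1) understanding the English, (2) understanding what the text is actually describing, and (3) putting it into practice. This translation aims for the second stage.

The translation is only about 90% complete, but the main features are covered clearly, and it should be enough for most use cases.

Once you understand this document, you will not need to read other Sqoop material.

Note: text in red marks places that are uncertain or that need to be flagged.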

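Sqoop official documentation
1. Introduction
    Sqoop is a tool designed to transfer data between Hadoop and relational databases. Sqoop can import data from an RDBMS (such as MySQL or Oracle) into HDFS, transform the data with Hadoop MapReduce, and then export the data back into the RDBMS.
    Sqoop automates most of this process, relying on the database schema to describe the types of the data being imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
    This document describes how to use Sqoop to move data between databases and Hadoop, and provides reference information for the Sqoop command-line tools.
    This document is intended for:
           1. System and application programmers
           2. System administrators
           3. Database administrators
           4. Data analysts
           5. Data engineers
2. Supported version: Sqoop v1.4.2
3. Sqoop releases
   Sqoop is compatible with Hadoop 0.21 and Cloudera's Distribution of Hadoop version 3.
4. Prerequisites
   a) Basic computer technology and terminology
   b) Familiarity with command-line interfaces such as bash
   c) Relational database management systems
   d) Basic familiarity with the purpose and operation of Hadoop
   Environment: Hadoop must be installed and configured.
   This document assumes you are using Linux or a Linux-like environment. If you are using Windows, you may be able to use cygwin to accomplish most of the tasks. If you are using Mac OS X, you may run into some compatibility errors. Sqoop is predominantly operated and tested on Linux.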
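5. Basic usage
   Sqoop can import data from a relational database into HDFS. The import works as follows: the input is a database table, and Sqoop reads the table row by row into HDFS. The output of the import is a set of files containing a copy of the imported table. Because the import runs in parallel, the output is spread over multiple files. These files may be delimited text files (for example, with commas or tabs separating the fields), or binary Avro files or SequenceFiles containing serialized records.
   A by-product of the import is a generated Java class that encapsulates one row of the imported table. Sqoop itself uses this class during the import, and the Java source code is also provided to you for use in subsequent MapReduce processing of the data. The class can serialize and deserialize data to and from the SequenceFile format, and it can also parse the delimited-text form of a record. These abilities let you quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited text yourself with any other tools you prefer.
   After processing the imported data (for example, with MapReduce or Hive), you can export the results back to the relational database. Sqoop's export process reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them into the target database (Oracle or MySQL, for example), where external applications or users can consume them.
   Sqoop also includes commands for inspecting the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and the tables within a schema (with the sqoop-list-tables tool). Sqoop also provides a primitive SQL execution shell (the sqoop-eval tool).
   Most aspects of import, code generation, and export can be customized. You can control the range of rows imported and which columns are imported. For the file-based representation of the data, you can specify particular delimiters and escape characters, as well as the file format. In code generation you can also control the class and package names. The following sections describe how to set these and other Sqoop arguments.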
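
The import, process, export round trip described in this section can be sketched as a short shell script. This is only an illustration under stated assumptions: the host db.example.com, the database corp, the tables EMPLOYEES and EMPLOYEE_STATS, and the HDFS directories are all made up, and the run helper is a hypothetical wrapper that prints each command instead of executing it unless DRY_RUN=0 is set. The flags --connect, --table, --target-dir, and --export-dir are regular Sqoop import/export arguments.

```shell
#!/bin/sh
# Dry-run sketch: print the Sqoop commands instead of executing them.
# Host, database, table, and directory names below are hypothetical.

DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

CONNECT="jdbc:mysql://db.example.com/corp"

# 1. Import one table from the database into HDFS.
run sqoop import --connect "$CONNECT" --table EMPLOYEES --target-dir /data/employees

# 2. Process the imported files with MapReduce or Hive (not shown).

# 3. Export the result files from HDFS back into a database table.
run sqoop export --connect "$CONNECT" --table EMPLOYEE_STATS --export-dir /results/employee_stats
```

Keeping the commands behind a dry-run wrapper makes it easy to review the exact arguments before touching a real database.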
Original post: http://www.dl234.com/blog/?p=33

5.Basic Usage

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table;
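That is, the import also produces a Java class that wraps a single row of the imported table.
boundary query: the query used to determine the boundaries for creating splits.
--verbose: prints more information while working, including debug logs.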

6.Sqoop Tools

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.

In other words, Sqoop is a set of related tools; you pick a tool and pass it arguments to control it.

If Sqoop is compiled from its own source, you can run Sqoop without a formal installation process by running the bin/sqoop program. Users of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop) will see this program installed as /usr/bin/sqoop. The remainder of this documentation will refer to this program as sqoop. For example:

So Sqoop is invoked as bin/sqoop (from a source build) or /usr/bin/sqoop (from a packaged install).

$ sqoop tool-name [tool-arguments]
Note: the $ character represents the terminal prompt; it is not part of the command you type.

Sqoop ships with a help tool. To display a list of all available tools, type the following command:

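That is, Sqoop comes with a help tool; enter the command below to list the available tools.

(Vocabulary: "type" here means to enter at the keyboard; "ships with" means "comes with".)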

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

You can display help for a specific tool by entering: sqoop help (tool-name); for example, sqoop help import.

You can also add the --help argument to any command: sqoop import --help.

6.1.Using Command Aliases

In addition to typing the sqoop (toolname) syntax, you can use alias scripts that specify the sqoop-(toolname) syntax. For example, the scripts sqoop-import, sqoop-export, etc. each select a specific tool.

That is, you can use the alias form, such as sqoop-import or sqoop-export.

6.2.Controlling the Hadoop Installation

You invoke Sqoop through the program launch capability provided by Hadoop. The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. If you have multiple installations of Hadoop present on your machine, you can select the Hadoop installation by setting the $HADOOP_COMMON_HOME and $HADOOP_MAPRED_HOME environment variables.

For example:

$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
  HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
  sqoop import --arguments...

or:

$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
$ sqoop import --arguments...

If either of these variables is not set, Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set.
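
The fallback order for the two home directories can be sketched as a pair of shell functions. This is a reading aid that mirrors the rule stated above, not Sqoop's actual launcher script; the function names resolve_hadoop_common_home and resolve_hadoop_mapred_home are made up for this example.

```shell
#!/bin/sh
# Sketch of the documented fallback order (not Sqoop's real launcher code).
# Function names are hypothetical.

# $HADOOP_COMMON_HOME, else $HADOOP_HOME, else the Bigtop default.
resolve_hadoop_common_home() {
  if [ -n "${HADOOP_COMMON_HOME:-}" ]; then
    echo "$HADOOP_COMMON_HOME"
  elif [ -n "${HADOOP_HOME:-}" ]; then
    echo "$HADOOP_HOME"
  else
    echo "/usr/lib/hadoop"
  fi
}

# $HADOOP_MAPRED_HOME, else $HADOOP_HOME, else the Bigtop default.
resolve_hadoop_mapred_home() {
  if [ -n "${HADOOP_MAPRED_HOME:-}" ]; then
    echo "$HADOOP_MAPRED_HOME"
  elif [ -n "${HADOOP_HOME:-}" ]; then
    echo "$HADOOP_HOME"
  else
    echo "/usr/lib/hadoop-mapreduce"
  fi
}

# Show which directories would be picked in the current environment.
echo "common home: $(resolve_hadoop_common_home)"
echo "mapred home: $(resolve_hadoop_mapred_home)"
```

Running this before a Sqoop job is a quick way to see which Hadoop installation the documented rule would select on a machine with several installs.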

6.3.Using Generic and Specific Arguments

To control the operation of each Sqoop tool, you use generic and specific arguments.

In other words, every Sqoop tool accepts both generic arguments and tool-specific arguments.

For example:

$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>     Specify JDBC connect string
   --connect-manager <jdbc-uri>     Specify connection manager class to use
   --driver <class-name>    Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>+      Override $HADOOP_MAPRED_HOME
   --help                   Print usage instructions
   -P                       Read password from console
   --password <password>    Set authentication password
   --username <username>    Set authentication username
   --verbose                Print more information while working
   --hadoop-home <dir>+     Deprecated. Override $HADOOP_HOME

[...]

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

(Translator's note: these are Hadoop's own options, described in the Hadoop command documentation; they can be used with the Sqoop tools.)

You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.

In short: generic arguments come first and take one dash (-); tool-specific arguments follow and take two dashes (--), except single-character ones such as -P.

The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MR job that Sqoop launches; if it is not specified, the name defaults to the jar name for the job, which is derived from the table name used.

In short: -conf, -D, -fs and -jt control the Hadoop configuration and server settings. (For what each one does in detail, consult the Hadoop documentation.)

The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop's internal argument-parsing system.

That is, -files, -libjars, and -archives are rarely needed with Sqoop; they appear only because they belong to Hadoop's own argument parser.
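
The ordering rule above can be checked mechanically. The sketch below is a rough illustration, not anything Sqoop ships: generic_args_first is a made-up helper that only looks at the argument names listed in the help output above, and the connect string, table, and job name are hypothetical.

```shell
#!/bin/sh
# Sketch: verify that generic Hadoop arguments precede tool-specific ones.
# generic_args_first is a hypothetical helper, not part of Sqoop.

# Returns 0 if every generic argument appears before the first
# double-dash (tool-specific) argument, 1 otherwise.
generic_args_first() {
  seen_tool_arg=0
  for arg in "$@"; do
    case "$arg" in
      --*)
        seen_tool_arg=1 ;;
      -conf|-D|-fs|-jt|-files|-libjars|-archives)
        # A generic argument after a tool-specific one breaks the rule.
        [ "$seen_tool_arg" -eq 0 ] || return 1 ;;
    esac
  done
  return 0
}

# Correct order: tool name, then generic -D, then tool-specific arguments.
if generic_args_first import -D mapred.job.name=nightly_import \
     --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES -P; then
  echo "order ok"
fi

# Wrong order: -D placed after --connect.
if ! generic_args_first import --connect jdbc:mysql://db.example.com/corp \
     -D mapred.job.name=nightly_import; then
  echo "generic arguments must come before tool-specific ones"
fi
```

The check is deliberately simple: it does not try to tell option names from option values, so a value that happens to look like -D would confuse it.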


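6.4.Using Options Files to Pass Arguments

// The original text is long-winded. The point is that you can pass arguments through a file; examples follow, and trying them is the quickest way to understand.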

When using Sqoop, the command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option in the order that it appears otherwise on the command line. Option files allow specifying a single option on multiple lines by using the back-slash character at the end of intermediate lines. Also supported are comments within option files that begin with the hash character. Comments must be specified on a new line and may not be mixed with option text. All comments and empty lines are ignored when option files are expanded. Unless options appear as quoted strings, any leading or trailing spaces are ignored. Quoted strings if used must not extend beyond the line on which they are specified.

Option files can be specified anywhere in the command line as long as the options within them follow the otherwise prescribed rules of options ordering. For instance, regardless of where the options are loaded from, they must follow the ordering such that generic options appear first, tool specific options next, finally followed by options that are intended to be passed to child programs.

To specify an options file, simply create an options file in a convenient location and pass it to the command line via the --options-file argument.

Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one options file within the same invocation if needed.

For example, the following Sqoop invocation for import can be specified alternatively as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

$ sqoop --options-file /users/homer/work/import.txt --table TEST

where the options file /users/homer/work/import.txt contains the following:

import
--connect
jdbc:mysql://localhost/db
--username
foo

The options file can have empty lines and comments for readability purposes. So the above example would work exactly the same if the options file /users/homer/work/import.txt contained the following:

#
# Options file for Sqoop import
#

# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo

#
# Remaining options should be specified in the command line.
#
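
To make the expansion rules concrete, here is a rough shell sketch of how such a file turns into command-line arguments. It is not Sqoop's parser: expand_options_file is a made-up function that only skips comments and empty lines and trims surrounding spaces; quoted strings and back-slash continuation lines are not handled. The temporary file stands in for import.txt.

```shell
#!/bin/sh
# Sketch of options-file expansion (simplified; not Sqoop's real parser).
# expand_options_file is a hypothetical name.

# Print one argument per line, dropping comments and empty lines
# and trimming leading/trailing whitespace.
expand_options_file() {
  while IFS= read -r line || [ -n "$line" ]; do
    line="$(printf '%s' "$line" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
    case "$line" in
      ''|'#'*) continue ;;   # skip empty lines and comment lines
    esac
    printf '%s\n' "$line"
  done < "$1"
}

# Write a small options file to a temporary location.
opts_file="$(mktemp)"
cat > "$opts_file" <<'EOF'
# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo
EOF

# Join the expanded arguments with spaces to show the equivalent command line.
echo "sqoop $(expand_options_file "$opts_file" | tr '\n' ' ')--table TEST"
rm -f "$opts_file"
```

The printed line matches the long-form command shown earlier in this section, which is the whole point of the options file: the stable arguments live in the file, and only --table TEST stays on the command line.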

6.5.Using Tools

The following sections will describe each tool’s operation. The tools are listed in the most likely order you will find them useful.