7.2.8. File Formats

You can import data in one of three file formats: delimited text, SequenceFiles, or Avro data files.

Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line formatting arguments.") The following is the result of an example text-based import:

1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12

Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.

SequenceFiles are a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principally manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).

Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.

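For example, a table could be imported as Avro data files with the --as-avrodatafile argument; the connection string and table name below are placeholders:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --as-avrodatafile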

By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.

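As an illustration, the following sketch names a specific codec class with --compression-codec (assuming that codec is available in your Hadoop installation; connection details are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec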

7.2.9. Large Objects

Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.

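For instance, a hypothetical import that keeps only LOBs smaller than 4 MB (4194304 bytes) inline, spilling anything larger into the _lobs subdirectory, might look like this (connection details are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --inline-lob-limit 4194304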

Table 6. Output line formatting arguments:

--enclosed-by <char>: Sets a required field enclosing character
--escaped-by <char>: Sets the escape character
--fields-terminated-by <char>: Sets the field separator character
--lines-terminated-by <char>: Sets the end-of-line character
--mysql-delimiters: Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>: Sets a field enclosing character

When importing to delimited files, the choice of delimiter is important. Delimiters which appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes. For example, the string "Hello, pleased to meet you" should not be imported with the end-of-field delimiter set to a comma.

Delimiters may be specified as:

  • a character (--fields-terminated-by X)

  • an escape character (--fields-terminated-by \t). Supported escape characters are:

    • \b (backspace)

    • \n (newline)

    • \r (carriage return)

    • \t (tab)

    • \" (double-quote)

    • \' (single-quote)

    • \\ (backslash)

    • \0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.

  • The octal representation of a UTF-8 character’s code point. This should be of the form \0ooo, where ooo is the octal value. For example, --fields-terminated-by \001 would yield the ^A character.

  • The hexadecimal representation of a UTF-8 character’s code point. This should be of the form \0xhhh, where hhh is the hex value. For example, --fields-terminated-by \0x10 would yield the carriage return character.

The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character. Note that this can lead to ambiguous/unparsable records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both enclosing and escaping must be enabled, for example via --mysql-delimiters.

If unambiguous delimiters cannot be presented, then use enclosing and escaping characters. The combination of (optional) enclosing and escaping characters will allow unambiguous parsing of lines. For example, suppose one column of a dataset contained the following values:

Some string, with a comma.
Another "string with quotes"

The following arguments would provide delimiters which can be unambiguously parsed:

$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...

(Note that to prevent the shell from mangling the enclosing character, we have enclosed that argument itself in single-quotes.)

The result of the above arguments applied to the above dataset would be:

"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...

Here the imported strings are shown in the context of additional columns ("1","2","3", etc.) to demonstrate the full effect of enclosing and escaping. The enclosing character is only strictly necessary when delimiter characters appear in the imported text. The enclosing character can therefore be specified as optional:

$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...

Which would result in the following import:

"Some string, with a comma.",1,2,3...
"Another \"string with quotes\"",4,5,6...

Note:

Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities.

The --mysql-delimiters argument is a shorthand argument which uses the default delimiters for the mysqldump program. If you use the mysqldump delimiters in conjunction with a direct-mode import (with --direct), very fast imports can be achieved.

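A sketch of such an invocation, with placeholder connection details, might be:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct --mysql-delimiters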

While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class's toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.

Table 7. Input parsing arguments:

--input-enclosed-by <char>: Sets a required field encloser
--input-escaped-by <char>: Sets the input escape character
--input-fields-terminated-by <char>: Sets the input field separator
--input-lines-terminated-by <char>: Sets the input end-of-line character
--input-optionally-enclosed-by <char>: Sets a field enclosing character

When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.

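The input parsing arguments are typically exercised when the files written by an import are later read back, for example by a sqoop export job. A hypothetical sketch, with placeholder paths and connection details, might be:

$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /user/someuser/EMPLOYEES \
    --input-fields-terminated-by '\t'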

Table 8. Hive arguments:

--hive-home <dir>: Override $HIVE_HOME
--hive-import: Import tables into Hive (uses Hive’s default delimiters if none are set)
--hive-overwrite: Overwrite existing data in the Hive table
--create-hive-table: If set, the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>: Sets the table name to use when importing to Hive
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement: Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key: Name of the Hive field on which imported data is partitioned
--hive-partition-value <v>: String value that serves as the partition key for the data imported into Hive in this job
--map-column-hive <map>: Override default mapping from SQL type to Hive type for configured columns

7.2.10. Importing Data Into Hive

Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. After your data is imported into HDFS or this step is omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.

The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here.

Note:

This function is incompatible with --as-avrodatafile and --as-sequencefile.

Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.

Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data. These options should only be used if you use Hive’s default delimiters and should not be used if different delimiters are specified.

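For example, an import that strips Hive’s default delimiter characters from string columns before loading them into Hive might look like the following sketch (connection details are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --hive-drop-import-delims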

Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.

Sqoop will by default import NULL values as the string null. Hive is however using the string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append the parameters --null-string and --null-non-string in the case of an import job, or --input-null-string and --input-null-non-string in the case of an export job, if you wish to properly preserve NULL values. Because Sqoop is using those parameters in generated code, you need to properly escape the value \N to \\N:

$ sqoop import  ... --null-string '\\N' --null-non-string '\\N'

The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.

Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.

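A sketch of a partitioned Hive import, with a hypothetical partition key and value (connection details are placeholders), might be:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --hive-partition-key dept --hive-partition-value 'engineering'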

You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.

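Assuming the hadoop-lzo libraries are installed, a compressed Hive import using the lzop codec could look like this sketch (connection details are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --compress \
    --compression-codec com.hadoop.compression.lzo.LzopCodec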

Table 9. HBase arguments:

--column-family <family>: Sets the target column family for the import
--hbase-create-table: If specified, create missing HBase tables
--hbase-row-key <col>: Specifies which input column to use as the row key
--hbase-table <table-name>: Specifies an HBase table to use as the target instead of HDFS

7.2.11. Importing Data Into HBase

Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.

By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.

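For illustration, the following hypothetical command imports EMPLOYEES into an HBase table named employees, placing all columns in the column family info and using the employee_id column as the row key (the table is created if it does not already exist; all names and the connection string are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hbase-table employees --column-family info \
    --hbase-row-key employee_id --hbase-create-table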

Note:

This function is incompatible with direct import (parameter --direct).

If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.

Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string in the target cell. Sqoop will skip all rows containing null values in all columns except the row key column.

Table 10. Code generation arguments:

--bindir <dir>: Output directory for compiled objects
--class-name <name>: Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>: Disable code generation; use specified jar
--outdir <dir>: Output directory for generated code
--package-name <name>: Put auto-generated classes in this package
--map-column-java <m>: Override default mapping from SQL type to Java type for configured columns

As mentioned earlier, a byproduct of importing a table to HDFS is a class which can manipulate the imported data. If the data is stored in SequenceFiles, this class will be used for the data’s serialization container. Therefore, you should use this class in your subsequent MapReduce processing of the data.

The class is typically named after the table; a table named foo will generate a class named foo. You may want to override this class name. For example, if your table is named EMPLOYEES, you may want to specify --class-name Employee instead. Similarly, you can specify just the package name with --package-name. The following import generates a class named com.foocorp.SomeTable:

$ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp

The .java source file for your class will be written to the current working directory when you run sqoop. You can control the output directory with --outdir. For example, --outdir src/generated/.

The import process compiles the source into .class and .jar files; these are ordinarily stored under /tmp. You can select an alternate target directory with --bindir. For example, --bindir /scratch.

If you already have a compiled class that can be used to perform the import and want to suppress the code-generation aspect of the import process, you can use an existing jar and class by providing the --jar-file and --class-name options. For example:

$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
    --class-name SomeTableType

This command will load the SomeTableType class out of mydatatypes.jar.

7.2.12. Additional Import Configuration Properties

There are some additional properties which can be configured by modifying conf/sqoop-site.xml. Properties can be specified the same as in Hadoop configuration files, for example:

  <property>
    <name>property.name</name>
    <value>property.value</value>
  </property>

They can also be specified on the command line in the generic arguments, for example:

sqoop import -D property.name=property.value ...

Table 11. Additional import configuration properties:

sqoop.bigdecimal.format.string: Controls how BigDecimal columns will be formatted when stored as a String. A value of true (default) will use toPlainString to store them without an exponent component (0.0000001), while a value of false will use toString, which may include an exponent (1E-7).
sqoop.hbase.add.row.key: When set to false (default), Sqoop will not add the column used as a row key into the row data in HBase. When set to true, the column used as a row key will be added to the row data in HBase.
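
For example, the row key column could be kept in the HBase row data by setting this property as a generic argument; the rest of the command below is a placeholder sketch:

sqoop import -D sqoop.hbase.add.row.key=true \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hbase-table employees --column-family info --hbase-row-key employee_id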

7.3. Example Invocations

The following examples illustrate how to use the import tool in a variety of situations.

A basic import of a table named EMPLOYEES in the corp database:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

A basic import requiring a login:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --username SomeUser -P
Enter password: (hidden)

Selecting specific columns from the EMPLOYEES table:

EMPLOYEES表中选择指定的列:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"

Controlling the import parallelism (using 8 parallel tasks):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8

Enabling the MySQL "direct mode" fast path:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct

Storing data in SequenceFiles, and setting the generated class name to com.foocorp.Employee:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --class-name com.foocorp.Employee --as-sequencefile

Specifying the delimiters to use in a text-mode import:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'

Importing the data to Hive:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import

Importing only new employees:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"

Changing the splitting column from the default:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by dept_id

Verifying that an import was successful:

$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x   - someuser somegrp          0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r--r--   1 someuser somegrp    2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r--r--   1 someuser somegrp    1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r--r--   1 someuser somegrp    7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r--r--   1 someuser somegrp    7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003

$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
...

Performing an incremental import of new data, after having already imported the first 100,000 rows of a table:

$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
    --where "id > 100000" --target-dir /incremental_dataset --append

An import of a table named EMPLOYEES in the corp database that uses validation to validate the import using the table row count and the number of rows copied into HDFS:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --table EMPLOYEES --validate

8. sqoop-import-all-tables

8.1. Purpose

The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.

For the import-all-tables tool to be useful, the following conditions must be met:

  • Each table must have a single-column primary key.

  • You must intend to import all columns of each table.

  • You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.

8.2. Syntax

$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)

Although the Hadoop generic arguments must precede any import arguments, the import arguments can be entered in any order with respect to one another.

Table 12. Common arguments:

--connect <jdbc-uri>: Specify JDBC connect string
--connection-manager <class-name>: Specify connection manager class to use
--driver <class-name>: Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>: Override $HADOOP_MAPRED_HOME
--help: Print usage instructions
-P: Read password from console
--password <password>: Set authentication password
--username <username>: Set authentication username
--verbose: Print more information while working
--connection-param-file <filename>: Optional properties file that provides connection parameters

Table 13. Import control arguments:

--as-avrodatafile: Imports data to Avro data files
--as-sequencefile: Imports data to SequenceFiles
--as-textfile: Imports data as plain text (default)
--direct: Use direct import fast path
--direct-split-size <n>: Split the input stream every n bytes when importing in direct mode
--inline-lob-limit <n>: Set the maximum size for an inline LOB
-m,--num-mappers <n>: Use n map tasks to import in parallel
--warehouse-dir <dir>: HDFS parent for table destination
-z,--compress: Enable compression
--compression-codec <c>: Use Hadoop codec (default gzip)

These arguments behave in the same manner as they do when used for the sqoop-import tool, but the --table, --split-by, --columns, and --where arguments are invalid for sqoop-import-all-tables.

Table 14. Output line formatting arguments:

--enclosed-by <char>: Sets a required field enclosing character
--escaped-by <char>: Sets the escape character
--fields-terminated-by <char>: Sets the field separator character
--lines-terminated-by <char>: Sets the end-of-line character
--mysql-delimiters: Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>: Sets a field enclosing character

Table 15. Input parsing arguments:

--input-enclosed-by <char>: Sets a required field encloser
--input-escaped-by <char>: Sets the input escape character
--input-fields-terminated-by <char>: Sets the input field separator
--input-lines-terminated-by <char>: Sets the input end-of-line character
--input-optionally-enclosed-by <char>: Sets a field enclosing character

Table 16. Hive arguments:

--hive-home <dir>: Override $HIVE_HOME
--hive-import: Import tables into Hive (uses Hive’s default delimiters if none are set)
--hive-overwrite: Overwrite existing data in the Hive table
--create-hive-table: If set, the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>: Sets the table name to use when importing to Hive
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement: Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key: Name of the Hive field on which imported data is partitioned
--hive-partition-value <v>: String value that serves as the partition key for the data imported into Hive in this job
--map-column-hive <map>: Override default mapping from SQL type to Hive type for configured columns

Table 17. Code generation arguments:

--bindir <dir>: Output directory for compiled objects
--jar-file <file>: Disable code generation; use specified jar
--outdir <dir>: Output directory for generated code
--package-name <name>: Put auto-generated classes in this package

The import-all-tables tool does not support the --class-name argument. You may, however, specify a package with --package-name in which all generated classes will be placed.

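For example, a sketch that places every generated class into a hypothetical com.foocorp package (connection string is a placeholder):

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp \
    --package-name com.foocorp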

8.3. Example Invocations

Import all tables from the corp database:

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp

Verifying that it worked:

$ hadoop fs -ls
Found 4 items
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES