Importing MySQL Data into HDFS with Sqoop 1.4.4, and a Summary of Common Issues

This article walks through importing MySQL data into HDFS with Sqoop 1.4.4, covering free-form query imports with the --query parameter, the role of --split-by, and how to handle double quotes around the SQL statement. Sqoop writes delimited text files by default; the format can be changed with the --as-sequencefile parameter. The article also discusses controlling the import process and overriding type mappings.


Guiding questions:

1. Which parameter lets Sqoop import data using a SQL statement?

2. When importing with --query, which three parameters must also be supplied?

3. What does the --split-by parameter do?

4. What is the default number of map tasks for a Sqoop import?

5. What is the difference between double quotes and single quotes around the SQL statement after --query, and how do you handle it?

6. Which two data file formats does Sqoop support for imports, and which is the default?


1. Free-form Query Imports

Sqoop can also import the result set of an arbitrary query. Instead of --table, --columns, and --where, you supply a SQL statement with the --query parameter. In that case you must also specify --target-dir (the output directory) and --split-by (the split column), and the WHERE clause must contain the token $CONDITIONS, which Sqoop replaces with a unique condition expression for each task. For example:

[mw_shl_code=bash,true][hadoopUser@secondmgt conf]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query  'select * from users where  id<60 and $CONDITIONS' --split-by id -m 1 --target-dir /output/query/

Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.

Please set $HCAT_HOME to the root of your HCatalog installation.

15/01/18 14:30:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

15/01/18 14:30:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

15/01/18 14:30:10 INFO tool.CodeGenTool: Beginning code generation

15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)

15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)

15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)

15/01/18 14:30:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0

Note: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

15/01/18 14:30:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.jar

15/01/18 14:30:12 INFO mapreduce.ImportJobBase: Beginning query import.

15/01/18 14:30:12 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

15/01/18 14:30:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

15/01/18 14:30:13 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

15/01/18 14:30:13 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032

15/01/18 14:30:14 INFO mapreduce.JobSubmitter: number of splits:1

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files

15/01/18 14:30:14 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name

15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir

15/01/18 14:30:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0016

15/01/18 14:30:15 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0016 to ResourceManager at secondmgt/192.168.2.133:8032

15/01/18 14:30:15 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0016/

15/01/18 14:30:15 INFO mapreduce.Job: Running job: job_1421373857783_0016

15/01/18 14:30:27 INFO mapreduce.Job: Job job_1421373857783_0016 running in uber mode : false

15/01/18 14:30:27 INFO mapreduce.Job:  map 0% reduce 0%

15/01/18 14:30:38 INFO mapreduce.Job:  map 100% reduce 0%

15/01/18 14:30:38 INFO mapreduce.Job: Job job_1421373857783_0016 completed successfully

15/01/18 14:30:38 INFO mapreduce.Job: Counters: 27

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=91814

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=123

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=33944

Total time spent by all reduces in occupied slots (ms)=0

Map-Reduce Framework

Map input records=3

Map output records=3

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=44

CPU time spent (ms)=2440

Physical memory (bytes) snapshot=164503552

Virtual memory (bytes) snapshot=888926208

Total committed heap usage (bytes)=83886080

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=123

15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Transferred 123 bytes in 25.6853 seconds (4.7887 bytes/sec)

15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Retrieved 3 records.[/mw_shl_code]

Sqoop uses the column given by --split-by to divide the work across tasks. By default it uses the table's primary key as the split column; in the import command above the split column is id.

Sqoop imports data from most sources in parallel. The -m parameter controls the number of map tasks; the default is 4, and here we reduced it to 1. The map tasks operate on evenly sized ranges of the split column. For example, if a table's primary key id ranges from 0 to 1000 and the default of 4 map tasks is used, Sqoop runs 4 processes, each executing a query of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks. A sketch of this is shown below.
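
To make the split mechanics concrete, here is a rough sketch; the table name sometable and the 0-1000 id range are just the illustrative values from the paragraph above, not a table from this article's database.

[mw_shl_code=bash,true]# Hypothetical import of a table whose primary key id spans 0-1000,
# with the default of 4 map tasks. Sqoop first runs something like
#   SELECT MIN(id), MAX(id) FROM sometable
# to find the boundaries, then each mapper issues one range query:
#   mapper 1: SELECT * FROM sometable WHERE id >= 0   AND id < 250
#   mapper 2: SELECT * FROM sometable WHERE id >= 250 AND id < 500
#   mapper 3: SELECT * FROM sometable WHERE id >= 500 AND id < 750
#   mapper 4: SELECT * FROM sometable WHERE id >= 750 AND id < 1001
sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --table sometable --split-by id -m 4 \
  --target-dir /output/sometable/[/mw_shl_code]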

Note 1: If the values of the split column are not evenly distributed over their range, the tasks may end up unbalanced. In that case you should explicitly choose a different column with --split-by. Sqoop currently cannot split on a multi-column index, so if the table has no index column or has a composite key, you must pick a split column manually, as sketched below.
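
A minimal sketch of switching to a different split column; the column age is only an assumption for illustration and should be replaced by a column that actually exists and is reasonably evenly distributed in your table.

[mw_shl_code=bash,true]# Hypothetical: the primary key is skewed, so split on the (assumed)
# numeric column "age" instead to get more balanced map tasks.
sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --query 'select * from users where id<60 and $CONDITIONS' \
  --split-by age -m 2 \
  --target-dir /output/query_by_age/[/mw_shl_code]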

Note 2: If the SQL statement is wrapped in double quotes (""), you must write \$CONDITIONS instead of $CONDITIONS so that the shell does not interpret it as one of its own variables. For example:

Incorrect:

[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where $CONDITIONS" --split-by id  --target-dir /output/query/

Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.

Please set $HCAT_HOME to the root of your HCatalog installation.

15/01/18 15:17:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

15/01/18 15:17:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

15/01/18 15:17:50 INFO tool.CodeGenTool: Beginning code generation

15/01/18 15:17:50 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [select * from users where ] must contain '$CONDITIONS' in WHERE clause.

at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:352)

at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277)

at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089)

at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)

at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)

at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)

at org.apache.sqoop.Sqoop.run(Sqoop.java:145)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)

at org.apache.sqoop.Sqoop.main(Sqoop.java:238)[/mw_shl_code]

Correct:

[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where \$CONDITIONS" --split-by id  --target-dir /output/query/[/mw_shl_code]

With the backslash, the shell passes the literal string $CONDITIONS through to Sqoop; with single quotes, as in the first example, no escaping is needed.

Note 3: In the current version of Sqoop, free-form query imports are limited to fairly simple queries; complex queries with OR conditions in the WHERE clause are not supported.

2. Viewing the Results

[hadoopUser@secondmgt ~]$ hadoop fs -cat /output/query/*

56,hua,hanyun,男,开通,2013-12-02,0,1

58,feng,123456,男,开通,2013-11-22,0,0

59,test,123456,男,开通,2014-03-05,58,0

3. Controlling the Import Process

Some databases provide faster, more efficient native tools for moving table data into other systems. In those cases you can add the --direct parameter: for MySQL, Sqoop calls mysqldump and mysqlimport; for PostgreSQL it uses psql. A sketch follows.
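
A minimal sketch of a direct-mode import, assuming the mysqldump client is installed on the nodes that run the map tasks; the table name users is taken from the earlier examples and the output path is an assumption.

[mw_shl_code=bash,true]# Direct mode: Sqoop shells out to mysqldump for the data transfer
# instead of reading rows over JDBC, which is usually faster for MySQL.
# Direct mode is used with --table, not with --query.
sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --table users --direct -m 1 \
  --target-dir /output/users_direct/[/mw_shl_code]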

4. Controlling Type Mapping

Sqoop comes preconfigured with mappings from most SQL types to their typical Java and Hive equivalents, but the default mappings do not always suit your application. The two parameters below let you override the mapping for specific columns (a sketch follows the table):

[mw_shl_code=bash,true]Argument         Description

--map-column-java          Override mapping from SQL to Java type for configured columns.

--map-column-hive          Override mapping from SQL to Hive type for configured columns.[/mw_shl_code]
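
A minimal sketch of overriding a column's type mapping; forcing the id column to a Java String is only an illustration, not something the earlier examples require.

[mw_shl_code=bash,true]# Override the Java type generated for the "id" column (normally an
# integer type), forcing it to String in the generated record class.
# --map-column-hive works the same way (e.g. --map-column-hive id=STRING)
# when importing into Hive; the mapping is a comma-separated list of
# column=Type pairs.
sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --table users --split-by id -m 1 \
  --map-column-java id=String \
  --target-dir /output/users_mapped/[/mw_shl_code]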

5. File Formats

Sqoop supports two file formats for imports: delimited text and SequenceFiles. Delimited text is the default, and as the query results above show, fields are comma-separated by default. You can change the import format with the --as-sequencefile parameter (or state the default explicitly with --as-textfile).

Delimited text is suitable for most non-binary data types, and it is easy to process further with other tools such as Hive.

SequenceFiles are a binary format that stores individual records using custom record-specific data types. See the sketch below.
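
A minimal sketch of selecting the file format explicitly; the output paths are assumptions for illustration.

[mw_shl_code=bash,true]# Default behaviour made explicit: comma-delimited text output.
sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --table users --split-by id -m 1 \
  --as-textfile --fields-terminated-by ',' \
  --target-dir /output/users_text/

# SequenceFile output instead of delimited text.
sqoop import --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --table users --split-by id -m 1 \
  --as-sequencefile \
  --target-dir /output/users_seq/[/mw_shl_code]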
