Does MySQL data need reformatting for HDFS? Importing MySQL Data into HDFS with Sqoop 1.4.4, and a Summary of Issues

Latest recommended article published 2022-12-02 15:16:48

阿莱克西斯

Article link: https://blog.csdn.net/weixin_28689193/article/details/113207999

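This article describes importing MySQL data into HDFS with Sqoop 1.4.4, including free-form query imports with the --query parameter, the role of --split-by, and how to handle double quotes around the SQL statement. Sqoop writes delimited text files by default, which can also be requested explicitly with the --as-textfile parameter. The article also discusses how to control the import process and type mapping.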

Last edited by pig2 on 2015-10-23 17:53

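Guiding questions:

1. Which parameter lets Sqoop import data using a SQL statement?

2. When importing with the --query parameter, which three things must also be supplied?

3. What does the --split-by parameter do?

4. What is the default number of map tasks when Sqoop runs an import?

5. What is the difference between double quotes and single quotes around the SQL statement after --query, and how is the problem solved?

6. Which two data file formats can Sqoop write during an import, and which is the default?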

I. Free-Form Query Import

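Sqoop can also import the result set of an arbitrary query. Instead of --table, --columns and --where, pass a SQL statement with the --query parameter to run a free-form query import. You must specify the destination directory with --target-dir, and the query must contain a WHERE clause that includes the $CONDITIONS token, which each Sqoop process replaces with a unique condition expression so that it reads only its own share of the rows. You must also name a splitting column with --split-by; it is only strictly needed when the import runs with more than one map task, since with -m 1 the query is executed once and imported serially. For example: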

[mw_shl_code=bash,true][hadoopUser@secondmgt conf]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query 'select * from users where id<60 and $CONDITIONS' --split-by id -m 1 --target-dir /output/query/

Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.

Please set $HCAT_HOME to the root of your HCatalog installation.

15/01/18 14:30:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

15/01/18 14:30:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

15/01/18 14:30:10 INFO tool.CodeGenTool: Beginning code generation

15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where id<60 and (1 = 0)

15/01/18 14:30:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0

Note: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

15/01/18 14:30:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.jar

15/01/18 14:30:12 INFO mapreduce.ImportJobBase: Beginning query import.

15/01/18 14:30:12 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

15/01/18 14:30:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

15/01/18 14:30:13 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

15/01/18 14:30:13 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032

15/01/18 14:30:14 INFO mapreduce.JobSubmitter: number of splits:1

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files

15/01/18 14:30:14 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name

15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class

15/01/18 14:30:14 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir

15/01/18 14:30:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0016

15/01/18 14:30:15 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0016 to ResourceManager at secondmgt/192.168.2.133:8032

15/01/18 14:30:15 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0016/

15/01/18 14:30:15 INFO mapreduce.Job: Running job: job_1421373857783_0016

15/01/18 14:30:27 INFO mapreduce.Job: Job job_1421373857783_0016 running in uber mode : false

15/01/18 14:30:27 INFO mapreduce.Job: map 0% reduce 0%

15/01/18 14:30:38 INFO mapreduce.Job: map 100% reduce 0%

15/01/18 14:30:38 INFO mapreduce.Job: Job job_1421373857783_0016 completed successfully

15/01/18 14:30:38 INFO mapreduce.Job: Counters: 27

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=91814

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=87

HDFS: Number of bytes written=123

HDFS: Number of read operations=4

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Other local map tasks=1

Total time spent by all maps in occupied slots (ms)=33944

Total time spent by all reduces in occupied slots (ms)=0

Map-Reduce Framework

Map input records=3

Map output records=3

Input split bytes=87

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=44

CPU time spent (ms)=2440

Physical memory (bytes) snapshot=164503552

Virtual memory (bytes) snapshot=888926208

Total committed heap usage (bytes)=83886080

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=123

15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Transferred 123 bytes in 25.6853 seconds (4.7887 bytes/sec)

15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Retrieved 3 records.[/mw_shl_code]
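The log above shows Sqoop executing the query with `(1 = 0)` in place of `$CONDITIONS` during code generation, which returns no rows and is enough to read the column metadata. The following is a minimal sketch of that token substitution in plain bash. The function name `substitute_conditions` and the range condition in the second call are assumptions made for illustration; this is not Sqoop's own code.

```bash
#!/usr/bin/env bash

# Hypothetical helper: replace the literal token $CONDITIONS in a query
# with the condition text passed as the second argument.
substitute_conditions() {
  local query=$1 condition=$2
  local token='$CONDITIONS'          # single quotes keep the token literal
  local result=${query//"$token"/$condition}
  printf '%s\n' "$result"
}

query='select * from users where id<60 and $CONDITIONS'

# Metadata probe, as seen in the log: the condition is always false.
substitute_conditions "$query" '(1 = 0)'

# Illustrative per-task range condition (assumed format).
substitute_conditions "$query" '( id >= 0 ) AND ( id < 30 )'
```

The first call prints the same statement that appears in the `manager.SqlManager` log line, which is a quick way to see why the token must reach Sqoop unmodified.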

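Sqoop uses the column named by --split-by to divide the workload. By default, Sqoop uses the table's primary key as the splitting column; in the import command above, we chose "id" as the splitting column.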

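Sqoop imports data in parallel from most data sources. The -m parameter controls the number of map tasks; the default is 4, and here we set it to 1. Sqoop divides the full range of the splitting column evenly among the map tasks. For example, if a table has a primary key id ranging from 0 to 1000 and the default of 4 map tasks is used, Sqoop runs 4 processes, each executing a statement of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.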
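The arithmetic behind the 0 to 1000 example can be sketched in bash as follows. This is only an illustration of an even split over an integer range; the function name `split_ranges` is hypothetical, and Sqoop's real splitter may round or handle edge cases differently.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print one WHERE fragment per map task for an
# integer column whose values run from min to max (both inclusive).
split_ranges() {
  local min=$1 max=$2 tasks=$3
  local step=$(( (max - min) / tasks ))   # integer division
  local i lo hi
  for (( i = 0; i < tasks; i++ )); do
    lo=$(( min + i * step ))
    if (( i == tasks - 1 )); then
      hi=$(( max + 1 ))                   # last range must include max itself
    else
      hi=$(( lo + step ))
    fi
    echo "id >= $lo AND id < $hi"
  done
}

split_ranges 0 1000 4
```

With 4 tasks the step is 250, so the four printed ranges are (0, 250), (250, 500), (500, 750) and (750, 1001). If most rows fall inside one of those ranges, that task does most of the work, which is the imbalance to watch for when choosing a splitting column.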

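Note 1: If the values of the primary key are not uniformly distributed across its range, the tasks can end up unbalanced. In that case, explicitly choose a different column with the --split-by argument. Sqoop cannot currently split on multi-column indices, so if a table has no index column or has a multi-column key, you must choose a splitting column manually.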

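Note 2: If the SQL statement is enclosed in double quotes (""), you must write \$CONDITIONS instead of $CONDITIONS so that your shell does not treat it as a shell variable of its own. For example: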

Incorrect:

[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where $CONDITIONS" --split-by id --target-dir /output/query/

Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.

Please set $HCAT_HOME to the root of your HCatalog installation.

15/01/18 15:17:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

15/01/18 15:17:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

15/01/18 15:17:50 INFO tool.CodeGenTool: Beginning code generation

15/01/18 15:17:50 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [select * from users where ] must contain '$CONDITIONS' in WHERE clause.

at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:352)

at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277)

at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089)

at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)

at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)

at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)

at org.apache.sqoop.Sqoop.run(Sqoop.java:145)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)

at org.apache.sqoop.Sqoop.main(Sqoop.java:238)[/mw_shl_code]
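The error message above reports the query as `select * from users where `, with the token already gone before Sqoop ever saw it. The effect can be reproduced in bash without running Sqoop. This is a small sketch; the variable names are chosen for illustration, and it assumes the shell has no variable named CONDITIONS, as in the failing run.

```bash
#!/usr/bin/env bash

# Match the failing run: no shell variable called CONDITIONS is defined.
unset CONDITIONS

# Single quotes: the shell performs no expansion, the token survives.
single='select * from users where $CONDITIONS'

# Double quotes: the shell expands $CONDITIONS to an empty string.
double="select * from users where $CONDITIONS"

# Double quotes with a backslash: the token survives as literal text.
escaped="select * from users where \$CONDITIONS"

printf 'single : [%s]\n' "$single"
printf 'double : [%s]\n' "$double"
printf 'escaped: [%s]\n' "$escaped"
```

The `double` line prints `[select * from users where ]`, the same text as in the IOException, while the other two keep `$CONDITIONS` intact for Sqoop to replace.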