本帖最后由 pig2 于 2015-10-23 17:53 编辑
问题导读:
1、Sqoop使用SQL语句实现数据导入使用哪个参数?
2、使用--query参数执行数据导入,三个必须加上的参数是?
3、--split-by参数的作用?
4、Sqoop执行数据导入时,Map tasks的默认个数是?
5、--query后SQL语句双引号和单引号的区别?该怎么解决?
6、Sqoop执行数据导入有哪两种数据文件格式?默认的是哪个文件格式?
一、自由查询形式导入
Sqoop还支持将任意的查询结果集导入,不使用--table、--columns和--where,使用SQL语句--query参数执行自由查询导入,但是必须指定--target-dir目录,必须指定--split-by 分隔列,同时必须使用where且在其后加个$CONDITIONS,使Sqoop进程替代为一个唯一的条件表达式达到条件查询效果。如下:
[mw_shl_code=bash,true][hadoopUser@secondmgt conf]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query 'select * from users where id<60 and $CONDITIONS' --split-by id -m 1 --target-dir /output/query/
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 14:30:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 14:30:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 14:30:10 INFO tool.CodeGenTool: Beginning code generation
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where id<60 and (1 = 0)
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where id<60 and (1 = 0)
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where id<60 and (1 = 0)
15/01/18 14:30:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
Note: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/18 14:30:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.jar
15/01/18 14:30:12 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/18 14:30:12 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 14:30:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/18 14:30:13 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/01/18 14:30:13 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 14:30:14 INFO mapreduce.JobSubmitter: number of splits:1
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
15/01/18 14:30:14 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/01/18 14:30:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0016
15/01/18 14:30:15 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0016 to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 14:30:15 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0016/
15/01/18 14:30:15 INFO mapreduce.Job: Running job: job_1421373857783_0016
15/01/18 14:30:27 INFO mapreduce.Job: Job job_1421373857783_0016 running in uber mode : false
15/01/18 14:30:27 INFO mapreduce.Job: map 0% reduce 0%
15/01/18 14:30:38 INFO mapreduce.Job: map 100% reduce 0%
15/01/18 14:30:38 INFO mapreduce.Job: Job job_1421373857783_0016 completed successfully
15/01/18 14:30:38 INFO mapreduce.Job: Counters: 27
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=91814
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=123
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=33944
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=44
CPU time spent (ms)=2440
Physical memory (bytes) snapshot=164503552
Virtual memory (bytes) snapshot=888926208
Total committed heap usage (bytes)=83886080
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=123
15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Transferred 123 bytes in 25.6853 seconds (4.7887 bytes/sec)
15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Retrieved 3 records.[/mw_shl_code]
Sqoop使用--split-by 列名,根据此分隔工作量,默认的Sqoop将表中的关键字作为分隔列,由上导入命令可知,此处我们是以“id”作为分隔列。
Sqoop从大部分的数据源并行的导入数据,我们可以使用-m参数控制Map tasks的数目,默认是4个,此处我们改成了1个Map task。Map task,根据整个范围的均衡大小进行操作。例如,你有一张表,关键字id范围是0-1000,默认Map tasks 是4个,Sqoop将会执行4个进程,每个进程以如下格式执行SELECT * FROM sometable WHERE id >= lo AND id < hi其中(lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) 在不同的任务中。
注意一: 如果你的表中关键字不是根据其范围均匀的分布,就可能导致不平衡的任务。这个时候你需要明确的选择一个不同的列使用--split-by指定分隔参数。目前,Sqoop,还不支持对各个列索引进行分隔,如果一个表没有索引列或者含有多个关键字列,你必须手动的指定一个分隔列。
注意二:如果SQL语句中使用双引号(“”),则必须使用$CONDITIONS代替$CONDITIONS,使你的shell不将其识别为shell自身的变量。如下示例:
错误方式:
[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where $CONDITIONS" --split-by id --target-dir /output/query/
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 15:17:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 15:17:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 15:17:50 INFO tool.CodeGenTool: Beginning code generation
15/01/18 15:17:50 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [select * from users where ] must contain '$CONDITIONS' in WHERE clause.
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:352)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)[/mw_shl_code]
正确如下:
[mw_shl_code=bash,true][hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where $CONDITIONS" --split-by id --target-dir /output/query/[/mw_shl_code]
注意三:目前版本的Sqoop中,使用自由形式查询导入,只提供简单的查询,没有复杂的和“OR”条件查询在where子句中。
二、查看结果
[hadoopUser@secondmgt ~]$ hadoop fs -cat /output/query/*
56,hua,hanyun,男,开通,2013-12-02,0,1
58,feng,123456,男,开通,2013-11-22,0,0
59,test,123456,男,开通,2014-03-05,58,0
三、控制导入进程
有些数据库提供更加快捷、高效的方式用来将数据库表中的数据导入到其他的系统中,这个时候可以--direct 参数。如:mysql会调用 mysqldump和mysqlimport ,PostgreSQL 为psql。
四、控制映射类型
Sqoop预配置了Java和Hive典型的大部分SQL类型,然而,默认的类型有时候不一定完全适合用户需求。可以使用下面两个参数根据自己的应用修改映射类型
[mw_shl_code=bash,true]Argument Description
--map-column-java Override mapping from SQL to Java type for configured columns.
--map-column-hive Override mapping from SQL to Hive type for configured columns.[/mw_shl_code]
五、文件格式
Sqoop支持两种类型的文件格式导入:分隔符文本和序列文件(delimited text or SequenceFiles)。默认的是采用分隔符文本,由上面导入后查询的结果可知,默认采用逗号分隔的。可以使用--as-textfile参数修改默认的文件导入格式。
delimited text 是适合大多数非二进制数据类型。它也很容易支持进一步操纵其他工具,如Hive。
SequenceFiles是二进制格式以自定义记录特有的数据类型来存储个人记录的。