When building a data warehouse, the most critical, most important job is importing the data!
And when importing the data, the most critical, most important job is unifying the data rules:
1. Null values
2. Blanks hidden inside string values
3. Dirty data
4. Unifying field names
5. Unifying field types
...
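Rule 2 is sneakier than it sounds. A quick shell illustration (the sample values are made up) of how strings that render as blanks can still contain whitespace:

```shell
# Four sample values: a real one, a single space, a single tab,
# and a genuinely empty string.
# The grep class matches values that are non-empty yet all-whitespace --
# exactly the "blank" strings rule 2 is about. Two of the four match.
printf 'a\n \n\t\n\n' | grep -c '^[[:space:]]\{1,\}$'
```

These are the values that pass a naive `!= ''` check and then quietly break joins and group-bys downstream.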
Two of these rules hid pitfalls for me:
Pitfall one is fairly common and falls out of rule 5, unifying field types. 1) When a MySQL column is tinyint(1), the JDBC driver maps it to a boolean by default, so the Sqoop connection string needs tinyInt1isBit=false. 2) When a MySQL datetime column carries the zero default 0000-00-00 00:00:00, the connection string needs zeroDateTimeBehavior=convertToNull, so the driver converts those values to NULL instead of throwing an exception. 3) And of course there are others, such as forcing the character encoding to utf-8.
The final Sqoop connection option then looks like this:
--connect 'jdbc:mysql://ip:3306/db?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false'
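One detail worth spelling out: on an actual command line the URL must be quoted, because ? and & are shell metacharacters. A small sketch (host, port, and database are placeholders):

```shell
# Placeholder connection details -- substitute your own.
JDBC_URL='jdbc:mysql://ip:3306/db?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false'

# Single quotes keep & away from the shell; unquoted, everything after &
# would be cut off and run as a separate backgrounded command.
echo sqoop import --connect "$JDBC_URL" --table some_table
```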
Pitfall two was the real trap, and it came out of rule 2, blanks inside strings. MySQL tables sometimes have columns named after reserved keywords, such as type, order, or desc. In a Sqoop --query these must be escaped with backticks, like so: --query "select \`type\`,\`desc\` from table where \$CONDITIONS". That part is not the trap. The trap: these string columns named after keywords turned out to contain literal '\t' tab characters in the data, which on the surface just look like blanks. Since Sqoop's --hive-drop-import-delims did not remove them (it targets \n, \r, and \001, not tabs), the only option left was stripping them on the MySQL side with REPLACE(col, CHAR(9), ''), where CHAR(9) is the tab character. So I naturally wrote:
--query "select REPLACE(\`type\`,CHAR(9),''), REPLACE(\`desc\`,CHAR(9),'') from table where \$CONDITIONS"
But it kept failing. The MapReduce stage always finished perfectly, and then load data produced this bug:
19/07/12 15:40:01 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/jars/hive-common-1.1.0-cdh5.13.3.jar!/hive-log4j.properties
NoViableAltException(86@[])
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:39101)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:38866)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:38566)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:36789)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5057)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2557)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1589)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1065)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:201)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:522)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1356)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1473)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1285)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1275)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:226)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:324)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:422)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:438)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:732)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:331)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:241)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:530)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
FAILED: ParseException line 1:140 cannot recognize input near 'desc' ',CHAR(9),'')' 'STRING' in column type
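(An aside on why those embedded tabs had to go at all: the value prints like a blank, and if the target layout is tab-delimited, a common choice via --fields-terminated-by '\t', a stray tab inside a value silently splits one column into two. A minimal shell illustration with made-up data:)

```shell
# Made-up record: two fields, but the second contains an embedded tab.
record=$(printf 'id1\tva\tlue')

# On screen the bad field just looks like blank padding:
echo "$record"

# A tab-delimited reader, however, now sees 3 fields instead of 2,
# shifting every later column:
echo "$record" | awk -F'\t' '{print NF}'
```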
Agony. I could not find a solution anywhere. In the very end I finally discovered the true definition of a 'pit': the simplest fix is often not online, not somewhere in the ocean of knowledge, but in a casual experiment!!
I tried writing it like this instead:
--query "select REPLACE(\`type\`,CHAR(9),'') as \`type\`, REPLACE(\`desc\`,CHAR(9),'') as \`desc\` from table where \$CONDITIONS"
Perfect! It loaded into Hive. In hindsight the cause is clear: Sqoop builds the Hive CREATE TABLE statement from the result set's column labels, and without an alias the label for the second column is the whole expression REPLACE(\`desc\`,CHAR(9),''), whose comma and quotes the Hive parser cannot swallow (hence the ParseException near 'desc' above). Aliasing each expression back to its column name gives Hive a legal identifier.
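Putting it all together, a sketch of the full import (connection options, keyword escaping, server-side tab stripping, and the aliases) might look like the following. Every name here, host, database, table, user, and target path, is a placeholder, not the author's environment:

```shell
# Hypothetical end-to-end invocation combining both fixes.
# With --query, Sqoop also requires --target-dir and either --split-by
# or a single mapper (-m 1).
sqoop import \
  --connect 'jdbc:mysql://ip:3306/db?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false' \
  --username user -P \
  --query "select REPLACE(\`type\`,CHAR(9),'') as \`type\`, REPLACE(\`desc\`,CHAR(9),'') as \`desc\` from table where \$CONDITIONS" \
  --hive-import --hive-table db.table \
  --hive-drop-import-delims \
  --target-dir /tmp/sqoop/table \
  -m 1
```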