Problem Overview
This morning I added a new MySQL-to-Hive import task in our data heterogeneity project. The task ran to completion and the web UI log showed no errors, yet no corresponding table appeared in Hive. Initial suspicion: the underlying script failed, but the error was never propagated back to the UI.
Environment (test cluster)
OS: CentOS 7.2
CDH version: cdh5.9.0
Hadoop version: 2.6.0+cdh5.9.0
Hive version: 1.1.0+cdh5.9.0
Sqoop version: 1.4.6+cdh5.9.0
Note: variables containing sensitive information are replaced with ${xxxx}.
Locating the Problem
The Sqoop import command was as follows:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect ${url}/${db}?tinyInt1isBit=false --username ${username} --password ${password} --table seewo_system_info --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text --hive-drop-import-delims --autoreset-to-one-mapper --verbose
Running it by hand, sure enough, it failed:
18/04/16 15:50:01 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`resourceid`), MAX(`resourceid`) FROM `seewo_system_info`
18/04/16 15:50:01 WARN db.TextSplitter: Generating splits for a textual index column.
18/04/16 15:50:01 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
18/04/16 15:50:01 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= '001'' and upper bound '`resourceid` < '=耼䀽''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= '=耼䀽'' and upper bound '`resourceid` < 'KH聉耳1''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= 'KH聉耳1'' and upper bound '`resourceid` < 'X联쁕'쁌聉耧&''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= 'X联쁕'쁌聉耧&'' and upper bound '`resourceid` < 'faa5fb43''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= 'faa5fb43'' and upper bound '`resourceid` <= 'faa5fb43c0a349f29ca4889d30e5f1d6''
18/04/16 15:50:03 INFO mapreduce.JobSubmitter: number of splits:5
18/04/16 15:50:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1522230336758_65472
18/04/16 15:50:04 INFO impl.YarnClientImpl: Submitted application application_1522230336758_65472
18/04/16 15:50:04 INFO mapreduce.Job: The url to track the job: http://psd-hadoop008:28088/proxy/application_1522230336758_65472/
18/04/16 15:50:04 INFO mapreduce.Job: Running job: job_1522230336758_65472
18/04/16 15:50:11 INFO mapreduce.Job: Job job_1522230336758_65472 running in uber mode : false
18/04/16 15:50:11 INFO mapreduce.Job: map 0% reduce 0%
18/04/16 15:50:20 INFO mapreduce.Job: Task Id : attempt_1522230336758_65472_m_000002_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '쁌聉耧&' )' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
at com.mysql.jdbc.Util.getInstance(Util.java:360)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:978)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2530)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1907)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2030)
at org.apache.sqoop.mapreduce.db.DBRecordReader.executeQuery(DBRecordReader.java:111)
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)
... 12 more
Seeing inexplicable mojibake like `resourceid >= 'KH聉耳1'` and `resourceid < 'X联쁕'쁌聉耧&'` in the split bounds, I was quietly pleased: surely this was a MySQL connection encoding problem. So I specified the encoding explicitly on the JDBC connection:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "${url}/${db}?tinyInt1isBit=false&useUnicode=true&characterEncoding=utf-8" --username ${username} --password ${password} --table seewo_system_info --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text --hive-drop-import-delims --autoreset-to-one-mapper --verbose
Re-ran it: the exact same error. So it was not an encoding problem after all. Slightly unsettling...
Next guess: maybe the data in the source database was already garbled? I inspected the original MySQL table: resourceid is a varchar column, the table holds only 76 rows, and there is no mojibake anywhere. So where on earth did the garbage come from?
With no leads, I tried Baidu, Google, and the Cloudera community, and found no similar issue. Now I was genuinely worried...
Time to change tack: could Sqoop's parallel splitting be at fault? I retried with an explicit -m 1:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect ${url}/${db}?tinyInt1isBit=false --username ${username} --password ${password} --table seewo_system_info --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text --hive-drop-import-delims --autoreset-to-one-mapper --verbose -m 1
Amazingly, the import succeeded.
So why does the parallel version fail?
With that question in mind, I went back over the error log:
18/04/16 15:50:01 WARN db.TextSplitter: Generating splits for a textual index column.
18/04/16 15:50:01 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
18/04/16 15:50:01 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
The warnings spell it out: splits generated over a textual index column can cause a partial import or duplicate records, and we are strongly encouraged to choose an integral split column. But duplicates are one thing; they have nothing whatsoever to do with mojibake.
Moreover, the split boundary values are not any of the actual resourceid values. Could the TextSplitter splitting strategy itself be buggy?
After some inner struggle, it was clear I would have to read the source. We run Sqoop 1.4.6, so off to the GitHub source of that version.
The class in question is TextSplitter, with the hierarchy TextSplitter extends BigDecimalSplitter implements DBSplitter.
But what made Sqoop choose TextSplitter to compute the split ranges in the first place?
DataDrivenDBInputFormat is the troublemaker:
protected DBSplitter getSplitter(int sqlDataType) {
  switch (sqlDataType) {
  case Types.NUMERIC:
  case Types.DECIMAL:
    return new BigDecimalSplitter();
  case Types.BIT:
  case Types.BOOLEAN:
    return new BooleanSplitter();
  case Types.INTEGER:
  case Types.TINYINT:
  case Types.SMALLINT:
  case Types.BIGINT:
    return new IntegerSplitter();
  case Types.REAL:
  case Types.FLOAT:
  case Types.DOUBLE:
    return new FloatSplitter();
  case Types.NVARCHAR:
  case Types.NCHAR:
    return new NTextSplitter();
  case Types.CHAR:
  case Types.VARCHAR:
  case Types.LONGVARCHAR:
    return new TextSplitter();
  case Types.DATE:
  case Types.TIME:
  case Types.TIMESTAMP:
    return new DateSplitter();
  default:
    // TODO: Support BINARY, VARBINARY, LONGVARBINARY, DISTINCT, CLOB,
    // BLOB, ARRAY, STRUCT, REF, DATALINK, and JAVA_OBJECT.
    return null;
  }
}
A different Splitter is chosen depending on the JDBC SQL type of the split column. Our resourceid is a varchar, so it maps to Types.VARCHAR and falls into the TextSplitter branch.
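If you want to verify the mapping for yourself, you can ask JDBC which java.sql.Types constant it reports for the split column. A minimal sketch (the connection URL and credentials are placeholders; a MySQL JDBC driver is assumed on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Types;

public class SplitColumnTypeCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder connection parameters -- substitute your own.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://host:3306/db", "user", "pass");
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT resourceid FROM seewo_system_info LIMIT 1")) {
      int sqlType = rs.getMetaData().getColumnType(1);
      // Prints true for a varchar column (Types.VARCHAR == 12),
      // which getSplitter above routes to TextSplitter.
      System.out.println(sqlType == Types.VARCHAR);
    }
  }
}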
So what exactly makes TextSplitter spit out a pile of mojibake?
Time to read TextSplitter's code carefully.
The split entry point:
public List<InputSplit> split(Configuration conf, ResultSet results,
    String colName) throws SQLException {
  // Parts of the method are omitted for brevity.
  // 1. Get the minimum and maximum strings of the split range.
  String minString = results.getString(1);
  String maxString = results.getString(2);
  // 2. Get the number of map tasks to split across.
  int numSplits = ConfigurationHelper.getConfNumMaps(conf);
  // 3. The shared-prefix scan can cover at most the shorter string.
  int maxPrefixLen = Math.min(minString.length(), maxString.length());
  int sharedLen;
  for (sharedLen = 0; sharedLen < maxPrefixLen; sharedLen++) {
    if (minString.charAt(sharedLen) != maxString.charAt(sharedLen)) {
      break;
    }
  }
  // 4. Take the common prefix shared by min and max.
  String commonPrefix = minString.substring(0, sharedLen);
  // 5. Split the remainder into sub-ranges.
  List<String> splitStrings = split(numSplits, minString, maxString,
      commonPrefix);
  // ...
}
public List<String> split(int numSplits, String minString,
    String maxString, String commonPrefix) throws SQLException {
  // 6. Convert each string into a BigDecimal that preserves the
  //    lexicographic order of the strings.
  BigDecimal minVal = stringToBigDecimal(minString);
  BigDecimal maxVal = stringToBigDecimal(maxString);
  // 7. Split the numeric interval [minVal, maxVal] into numSplits ranges.
  List<BigDecimal> splitPoints = split(new BigDecimal(numSplits), minVal, maxVal);
  List<String> splitStrings = new ArrayList<String>();
  // 8. Convert each BigDecimal split point back into a string.
  for (BigDecimal bd : splitPoints) {
    splitStrings.add(commonPrefix + bigDecimalToString(bd));
  }
  // ... (boundary adjustments omitted)
  return splitStrings;
}
// 9. Here is where the cursed mojibake gets generated:
/**
 * Return the string encoded in a BigDecimal.
 * Repeatedly multiply the input value by 65536; the integer part after
 * each multiplication represents a single character in base 65536.
 * Convert it back to a char, and repeat until we have rebuilt a string.
 */
public String bigDecimalToString(BigDecimal bd) {
  BigDecimal cur = bd.stripTrailingZeros();
  StringBuilder sb = new StringBuilder();
  for (int numConverted = 0; numConverted < MAX_CHARS; numConverted++) {
    // ONE_PLACE is 65536: shift the next base-65536 "digit" left of the point.
    cur = cur.multiply(ONE_PLACE);
    int curCodePoint = cur.intValue();
    if (0 == curCodePoint) {
      break;
    }
    cur = cur.subtract(new BigDecimal(curCodePoint));
    // The integer part is used as a raw Unicode code point, so arbitrary
    // characters (CJK, quotes, symbols) can come out here.
    sb.append(Character.toChars(curCodePoint));
  }
  return sb.toString();
}
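To watch the round trip produce garbage end to end, here is a small standalone sketch that re-implements both conversions with the same constants (base 65536, at most 8 characters). It is a simplified approximation of Sqoop's TextSplitter, not the exact source; the MIN/MAX strings mimic the resourceid bounds from the BoundingValsQuery in the log.

import java.math.BigDecimal;
import java.math.RoundingMode;

public class TextSplitDemo {
  // Same constants as TextSplitter: base-65536 digits, max 8 chars.
  private static final BigDecimal ONE_PLACE = new BigDecimal(65536);
  private static final int MAX_CHARS = 8;

  // Step 6 (simplified): encode a string as a fraction in [0, 1);
  // each character is one base-65536 "digit" after the point.
  static BigDecimal stringToBigDecimal(String str) {
    BigDecimal result = BigDecimal.ZERO;
    BigDecimal divisor = ONE_PLACE;
    int len = Math.min(str.length(), MAX_CHARS);
    for (int i = 0; i < len; i++) {
      result = result.add(new BigDecimal(str.codePointAt(i))
          .divide(divisor, 40, RoundingMode.HALF_UP));
      divisor = divisor.multiply(ONE_PLACE);
    }
    return result;
  }

  // Step 9: decode by repeatedly multiplying by 65536 and treating
  // each integer part as a raw code point -- the source of the mojibake.
  static String bigDecimalToString(BigDecimal bd) {
    BigDecimal cur = bd.stripTrailingZeros();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < MAX_CHARS; i++) {
      cur = cur.multiply(ONE_PLACE);
      int codePoint = cur.intValue();
      if (codePoint == 0) {
        break;
      }
      cur = cur.subtract(new BigDecimal(codePoint));
      sb.append(Character.toChars(codePoint));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Hypothetical MIN/MAX of resourceid, mimicking the BoundingValsQuery.
    BigDecimal min = stringToBigDecimal("001");
    BigDecimal max = stringToBigDecimal("faa5fb43c0a349f2");
    int numSplits = 5;
    BigDecimal step = max.subtract(min)
        .divide(new BigDecimal(numSplits), 40, RoundingMode.HALF_UP);
    // Steps 7-8: the interior split points decode to strings that are
    // not any actual resourceid value.
    for (int i = 1; i < numSplits; i++) {
      BigDecimal point = min.add(step.multiply(new BigDecimal(i)));
      System.out.println("split point " + i + ": '"
          + bigDecimalToString(point) + "'");
    }
  }
}

The printed boundaries are essentially arbitrary code-point sequences rather than real resourceid values; whenever one of them happens to contain a ' , the WHERE clause Sqoop generates from it becomes syntactically invalid, which is precisely the MySQLSyntaxErrorException above.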
And with that, the cause is found.
Root Cause
1. When importing a MySQL table without an explicit map count, Sqoop imports in parallel.
2. A parallel import picks a Splitter according to the type of the table's primary key; see DataDrivenDBInputFormat's getSplitter method.
3. TextSplitter converts the string bounds to BigDecimal and back to strings, and this round trip produces mojibake and special characters.
4. The job failed because the boundary produced in step 3, 'X联쁕'쁌聉耧&', contains a single quote (yes, ' , that one right there), which breaks the generated SQL.
Solution:
1. Do not split map tasks on a String-typed primary key; pick an integral split column instead, as in the sketch below.
2. When importing small tables, specify -m 1: a single map both saves resources and avoids the string-key splitting problem altogether.
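For illustration, assuming the table had an integral column to split on (the id column below is hypothetical; this table's real key is the varchar resourceid), the split column can be forced with --split-by:

sqoop import --connect ${url}/${db}?tinyInt1isBit=false --username ${username} --password ${password} --table seewo_system_info --split-by id --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text --hive-drop-import-delims --verbose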