Sqoop Series: A TextSplitter Pitfall

Problem Overview

This morning I added a new MySQL-to-Hive import task to our data heterogeneity project. The task ran to completion and the web UI log showed no errors, but the expected table never appeared in Hive. Initial read: the underlying script failed, but the error was never propagated back to the UI.

Environment: test cluster
OS version: CentOS 7.2
CDH version: cdh5.9.0
Hadoop version: 2.6.0+cdh5.9.0
Hive version: 1.1.0+cdh5.9.0
Sqoop version: 1.4.6+cdh5.9.0
Note: variables containing sensitive information are replaced with ${xxxx}.


Locating the Problem

The Sqoop import command:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect ${url}/${db}?tinyInt1isBit=false --username ${username} --password ${password} --table seewo_system_info  --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text  --hive-drop-import-delims --autoreset-to-one-mapper --verbose

Running it by hand: sure enough, it failed.

18/04/16 15:50:01 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`resourceid`), MAX(`resourceid`) FROM `seewo_system_info`
18/04/16 15:50:01 WARN db.TextSplitter: Generating splits for a textual index column.
18/04/16 15:50:01 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
18/04/16 15:50:01 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= '001'' and upper bound '`resourceid` < '=耼䀽''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= '=耼䀽'' and upper bound '`resourceid` < 'KH聉耳1''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= 'KH聉耳1'' and upper bound '`resourceid` < 'X联쁕'쁌聉耧&''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= 'X联쁕'쁌聉耧&'' and upper bound '`resourceid` < 'faa5fb43''
18/04/16 15:50:01 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`resourceid` >= 'faa5fb43'' and upper bound '`resourceid` <= 'faa5fb43c0a349f29ca4889d30e5f1d6''
18/04/16 15:50:03 INFO mapreduce.JobSubmitter: number of splits:5
18/04/16 15:50:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1522230336758_65472
18/04/16 15:50:04 INFO impl.YarnClientImpl: Submitted application application_1522230336758_65472
18/04/16 15:50:04 INFO mapreduce.Job: The url to track the job: http://psd-hadoop008:28088/proxy/application_1522230336758_65472/
18/04/16 15:50:04 INFO mapreduce.Job: Running job: job_1522230336758_65472
18/04/16 15:50:11 INFO mapreduce.Job: Job job_1522230336758_65472 running in uber mode : false
18/04/16 15:50:11 INFO mapreduce.Job: map 0% reduce 0%
18/04/16 15:50:20 INFO mapreduce.Job: Task Id : attempt_1522230336758_65472_m_000002_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '쁌聉耧&' )' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
at com.mysql.jdbc.Util.getInstance(Util.java:360)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:978)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2530)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1907)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2030)
at org.apache.sqoop.mapreduce.db.DBRecordReader.executeQuery(DBRecordReader.java:111)
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)
... 12 more

Seeing inexplicable garbage like resourceid >= 'KH聉耳1' and resourceid < 'X联쁕'쁌聉耧&' in the split bounds, I smugly diagnosed a MySQL connection encoding problem and pinned the encoding on the connection string:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "${url}/${db}?tinyInt1isBit=false&useUnicode=true&characterEncoding=utf-8" --username ${username} --password ${password} --table seewo_system_info  --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text  --hive-drop-import-delims --autoreset-to-one-mapper --verbose

Ran it again: exactly the same error. So it is not an encoding problem!! Slightly panicked now...

A guess: maybe the data in the database is garbled to begin with? I checked the source MySQL table: resourceid is a varchar column, the table has only 76 rows, and none of them contain any garbage at all. So where is this garbage coming from???
No leads. Baidu, Google, the Cloudera community — nobody has reported a similar problem. Now I was genuinely worried...
Change of approach: could Sqoop's parallel split be the problem? Force a single mapper with -m 1:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect ${url}/${db}?tinyInt1isBit=false --username ${username} --password ${password} --table seewo_system_info  --hive-import --hive-database seewo --hive-overwrite --bindir . --delete-target-dir --hive-table seewo_class_seewo_system_info_text  --hive-drop-import-delims --autoreset-to-one-mapper --verbose -m 1

Magic — the import succeeded.

So why does the parallel import fail???
With that question in mind, I re-read the error log:

18/04/16 15:50:01 WARN db.TextSplitter: Generating splits for a textual index column.
18/04/16 15:50:01 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
18/04/16 15:50:01 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

So the warning is about parallel tasks split on a text column possibly producing a partial import or duplicate records (under a case-insensitive collation, the split ranges Sqoop computes in Java can overlap or leave gaps when the database evaluates them), and it strongly encourages an integral split column. Duplicates, though — that has nothing whatsoever to do with garbled characters.
And the computed split boundaries are not any actual resourceid value. Could the TextSplitter split strategy itself be buggy??

After some internal struggle (this was clearly going to force me to read the source), I confirmed we were running Sqoop 1.4.6 and pulled up the source on GitHub.

Found the TextSplitter class; its hierarchy is TextSplitter extends BigDecimalSplitter implements DBSplitter.

First question: what made Sqoop choose TextSplitter to split the range in the first place?
DataDrivenDBInputFormat is the one to blame:

protected DBSplitter getSplitter(int sqlDataType) {
    switch (sqlDataType) {
    case Types.NUMERIC:
    case Types.DECIMAL:
        return new BigDecimalSplitter();

    case Types.BIT:
    case Types.BOOLEAN:
        return new BooleanSplitter();

    case Types.INTEGER:
    case Types.TINYINT:
    case Types.SMALLINT:
    case Types.BIGINT:
        return new IntegerSplitter();

    case Types.REAL:
    case Types.FLOAT:
    case Types.DOUBLE:
        return new FloatSplitter();

    case Types.NVARCHAR:
    case Types.NCHAR:
        return new NTextSplitter();

    case Types.CHAR:
    case Types.VARCHAR:
    case Types.LONGVARCHAR:
        return new TextSplitter();

    case Types.DATE:
    case Types.TIME:
    case Types.TIMESTAMP:
        return new DateSplitter();

    default:
        // TODO: Support BINARY, VARBINARY, LONGVARBINARY, DISTINCT, CLOB,
        // BLOB, ARRAY, STRUCT, REF, DATALINK, and JAVA_OBJECT.
        return null;
    }
}

A different Splitter is chosen depending on the split column's JDBC type; our varchar resourceid maps to Types.VARCHAR, which is why TextSplitter was selected.
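If you want to predict which branch your split column will hit, the column's JDBC type is all that matters. A minimal sketch using the standard DatabaseMetaData API (the connection settings are placeholders, not our real ones):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;

public class CheckSplitColumnType {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection settings -- substitute your real ${url}/${db}, user, password.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://host:3306/db", "user", "pass");
             ResultSet rs = conn.getMetaData().getColumns(
                 null, null, "seewo_system_info", "resourceid")) {
            while (rs.next()) {
                int jdbcType = rs.getInt("DATA_TYPE");
                // Types.VARCHAR (12) lands in the TextSplitter branch above.
                System.out.println(rs.getString("COLUMN_NAME") + " -> JDBC type " + jdbcType
                    + (jdbcType == Types.VARCHAR ? " (VARCHAR -> TextSplitter)" : ""));
            }
        }
    }
}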
Back to the main question: what exactly makes TextSplitter spit out a pile of garbage?
Time to read TextSplitter's code carefully. The split entry point:

public List<InputSplit> split(Configuration conf, ResultSet results, String colName) throws SQLException {

    // (parts of the method omitted for brevity)

    // 1. Get the minimum and maximum strings to split between.
    String minString = results.getString(1);
    String maxString = results.getString(2);

    // 2. Get the number of map tasks to split into.
    int numSplits = ConfigurationHelper.getConfNumMaps(conf);

    // 3. Get the length of the shorter of the two strings.
    int maxPrefixLen = Math.min(minString.length(), maxString.length());

    // 4. Take the prefix the two strings share (sharedLen is computed in the omitted code).
    String commonPrefix = minString.substring(0, sharedLen);

    // 5. Split the range into interval boundary strings.
    List<String> splitStrings = split(numSplits, minString, maxString, commonPrefix);

}

public List<String> split(int numSplits, String minString,
        String maxString, String commonPrefix) throws SQLException {

    // 6. Convert each string into a BigDecimal that preserves the strings' ordering.
    BigDecimal minVal = stringToBigDecimal(minString);
    BigDecimal maxVal = stringToBigDecimal(maxString);

    // 7. Split the range [minVal, maxVal] into numSplits pieces.
    List<BigDecimal> splitPoints = split(new BigDecimal(numSplits), minVal, maxVal);

    // 8. Convert each BigDecimal split point back into a string.
    for (BigDecimal bd : splitPoints) {
        splitStrings.add(commonPrefix + bigDecimalToString(bd));
    }

}
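Step 7 is inherited from BigDecimalSplitter and is essentially linear interpolation: evenly spaced points between minVal and maxVal. A simplified sketch of the idea (the real method uses a tryDivide helper rather than this fixed scale):

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;

// Simplified interpolation: numSplits evenly spaced intervals between minVal and maxVal.
static List<BigDecimal> interpolate(BigDecimal minVal, BigDecimal maxVal, int numSplits) {
    List<BigDecimal> points = new ArrayList<>();
    BigDecimal delta = maxVal.subtract(minVal)
        .divide(new BigDecimal(numSplits), 40, RoundingMode.HALF_UP);
    for (BigDecimal cur = minVal; cur.compareTo(maxVal) < 0; cur = cur.add(delta)) {
        points.add(cur);
    }
    points.add(maxVal);
    return points;
}

The interpolated points almost never coincide with the encoding of any real resourceid, which is why the boundaries decoded in step 8 look like nothing in the table.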

// 9. And this is where the infamous garbage is generated:

/**
 * Return the string encoded in a BigDecimal.
 * Repeatedly multiply the input value by 65536; the integer portion after
 * each multiplication represents a single character in base 65536.
 * Convert it back into a char and append, one character at a time, until
 * the string is rebuilt.
 */

public String bigDecimalToString(BigDecimal bd) {
    BigDecimal cur = bd.stripTrailingZeros();
    StringBuilder sb = new StringBuilder();

    for (int numConverted = 0; numConverted < MAX_CHARS; numConverted++) {
        cur = cur.multiply(ONE_PLACE);       // ONE_PLACE = 65536
        int curCodePoint = cur.intValue();   // integer part = the next base-65536 "digit"
        if (0 == curCodePoint) {
            break;
        }

        cur = cur.subtract(new BigDecimal(curCodePoint));
        // The digit is interpreted directly as a Unicode code point. Interpolated
        // split points have digits that never occurred in the input, so this can
        // emit arbitrary characters: CJK, invisible control chars, even a quote.
        sb.append(Character.toChars(curCodePoint));
    }

    return sb.toString();
}
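Putting steps 6 through 9 together: here is a minimal, self-contained sketch of the round trip, simplified from the 1.4.6 source (fixed-scale division instead of Sqoop's tryDivide, constants inlined). It reproduces the same class of garbage seen in the job log:

import java.math.BigDecimal;
import java.math.RoundingMode;

public class TextSplitterGarbageDemo {
    static final BigDecimal ONE_PLACE = new BigDecimal(65536);
    static final int MAX_CHARS = 8;

    // Step 6: encode each char as one base-65536 "digit" behind the decimal point.
    static BigDecimal stringToBigDecimal(String str) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal curPlace = ONE_PLACE;
        int len = Math.min(str.length(), MAX_CHARS);
        for (int i = 0; i < len; i++) {
            result = result.add(new BigDecimal(str.codePointAt(i))
                .divide(curPlace, 40, RoundingMode.HALF_UP));
            curPlace = curPlace.multiply(ONE_PLACE);
        }
        return result;
    }

    // Step 9: decode by repeatedly multiplying by 65536 and reading off code points.
    static String bigDecimalToString(BigDecimal bd) {
        BigDecimal cur = bd.stripTrailingZeros();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < MAX_CHARS; i++) {
            cur = cur.multiply(ONE_PLACE);
            int codePoint = cur.intValue();
            if (codePoint == 0) {
                break;
            }
            cur = cur.subtract(new BigDecimal(codePoint));
            sb.append(Character.toChars(codePoint));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The min/max from the BoundingValsQuery (max truncated to MAX_CHARS).
        BigDecimal min = stringToBigDecimal("001");
        BigDecimal max = stringToBigDecimal("faa5fb43");
        // Step 7: the midpoint, like the middle split point in the log.
        BigDecimal mid = min.add(max)
            .divide(new BigDecimal(2), 40, RoundingMode.HALF_UP);
        // Step 8: decode it. Averaging the "digits" leaves fractional halves that
        // carry 32768 into lower places, yielding code points like 0x8049 (聉) and
        // 0x8033 (耳) interleaved with invisible control chars -- the visible output
        // is roughly "KH聉耳1", matching the garbage in the DEBUG log above.
        System.out.println(bigDecimalToString(mid));
    }
}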

And there it is: the cause.


Root Cause

1. When importing a MySQL table without an explicit map count, Sqoop imports in parallel.
2. For a parallel import, Sqoop picks a Splitter based on the split column's type; see DataDrivenDBInputFormat's getSplitter method above.
3. TextSplitter first converts each boundary string to a BigDecimal and then back to a string; this round trip manufactures garbled and special characters.
4. The job failed because the string 'X联쁕'쁌聉耧&' produced in step 3 contains a single quote — yes, that ' right in the middle — which breaks the syntax of the generated SQL (see the sketch below).
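To make step 4 concrete: the split boundaries end up spliced into the WHERE clause as string literals without escaping. A hypothetical reconstruction (not Sqoop's actual string-building code, but the effect is the same):

// The decoded boundary value, with an embedded single quote:
String lowerBound = "X联쁕'쁌聉耧&";
String clause = "( `resourceid` >= '" + lowerBound + "' )";
System.out.println(clause);
// ( `resourceid` >= 'X联쁕'쁌聉耧&' )
// MySQL ends the literal at the embedded ' and chokes on the remainder:
// "You have an error in your SQL syntax ... near '쁌聉耧&' )'"

which matches the MySQLSyntaxErrorException in the log exactly.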


Solution

Don't split the map tasks on a String-typed primary key; pick an integral split column instead (example below).
When importing small tables, specify -m 1: a single map both saves resources and sidesteps the parallel string-split problem entirely.
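For example, if the table also had an integral column (say an auto-increment id — hypothetical here; our table may not have one), Sqoop's --split-by option steers the splitter away from the varchar key:

sqoop import --connect ${url}/${db}?tinyInt1isBit=false --username ${username} --password ${password} --table seewo_system_info --split-by id --hive-import --hive-database seewo --hive-overwrite --hive-table seewo_class_seewo_system_info_text --hive-drop-import-delims --verbose

With an integral --split-by column, getSplitter returns IntegerSplitter and the garbled-boundary problem never arises.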
