Symptom
Today a user hit the following errors while updating statistics on a table in a Trafodion database:
*** ERROR[9214] Object TRAFODION.BIGDATA_REPORT_TEST.TRAF_SAMPLE_339036475483133524_1508207122_572837 could not be created. [2017-10-17 10:26:29]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::create() returned error HBASE_CREATE_ERROR(701). Cause: java.io.IOException: createTable exception. Unable to create table TRAFODION.BIGDATA_REPORT_TEST.TRAF_SAMPLE_339036475483133524_1508207122_572837
org.apache.hadoop.hbase.client.transactional.RMInterface.createTable(RMInterface.java:594)
org.trafodion.sql.HBaseClient.createk(HBaseClient.java:503). [2017-10-17 10:26:29]
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.BIGDATA_REPORT_TEST.ST_CONTENTVIEW_EVENTS encountered an error (8609) from statement Process_Query. [2017-10-17 10:26:29]
*** ERROR[8609] Waited rollback performed without starting a transaction. [2017-10-17 10:26:29]
*** ERROR[9201] Unable to DROP object TRAFODION.BIGDATA_REPORT_TEST.TRAF_SAMPLE_339036475483133524_1508207122_572837. [2017-10-17 10:26:29]
*** ERROR[1389] Object TRAFODION.BIGDATA_REPORT_TEST.TRAF_SAMPLE_339036475483133524_1508207122_572837 does not exist in Trafodion. [2017-10-17 10:26:29]
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.BIGDATA_REPORT_TEST.ST_CONTENTVIEW_EVENTS encountered an error (8609) from statement Process_Query. [2017-10-17 10:26:29]
*** ERROR[8609] Waited rollback performed without starting a transaction. [2017-10-17 10:26:29]
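Analysis
The errors above show that the problem is a failure to create the SAMPLE table. sqcheck reported that the database was healthy, and the HBase checks were normal as well. As for table-creation failures, an earlier post on this blog mentioned one possible cause: clocks that are out of sync across nodes.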
Solution
With that in mind, we checked the tm_xxx.log log and did find the following errors:
2017-10-17 10:43:48,960, ERROR, TM, Node: 0 Pid: 61878 Name: $TM0 TransId: 10450048 Event: 103005311 Message: Error at CHbaseTM::createTable() caused by exception java.io.IOException: createTable call error
org.trafodion.dtm.HBaseTxClient.callCreateTable(HBaseTxClient.java:1814) Caused by
java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: java.io.IOException: pushOnlineEpoch -- Error: current onlineEpoch 1508208219670 is less than new onlineEpoch 1508208221684, transId: 10450048 in region: TRAFODION.BIGDATA_REPORT_TEST.TRAF_SAMPLE_339036475483133524_1508208194_356302,,1508208218191.28ffe67ad3fa9e36660693043136e719.
org.apache.hadoop.hbase.client.transactional.TransactionManager.pushRegionEpoch(TransactionManager.java:2131)
org.apache.hadoop.hbase.client.transactional.TransactionManager.createTable(TransactionManager.java:2785)
org.trafodion.dtm.HBaseTxClient.callCreateTable(HBaseTxClient.java:1809) Caused by
java.util.concurrent.ExecutionException: java.io.IOException: java.io.IOException: pushOnlineEpoch -- Error: current onlineEpoch 1508208219670 is less than new onlineEpoch 1508208221684, transId: 10450048 in region: TRAFODION.BIGDATA_REPORT_TEST.TRAF_SAMPLE_339036475483133524_1508208194_356302,,1508208218191.28ffe67ad3fa9e36660693043136e719.
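The two onlineEpoch values in the message look like millisecond timestamps, so the gap between them gives a rough idea of how far apart the two sides are. The following is a minimal sketch, assuming the message keeps the wording shown in the log above; the function name epoch_gap_ms is hypothetical and is not part of Trafodion.

```python
import re

# Pattern taken from the pushOnlineEpoch message quoted in the log above.
EPOCH_RE = re.compile(
    r"current onlineEpoch (\d+) is less than new onlineEpoch (\d+)"
)

def epoch_gap_ms(log_line):
    """Return (current, new, new - current), or None if the line does not match.

    Assumption: both values are milliseconds since the Unix epoch.
    """
    match = EPOCH_RE.search(log_line)
    if match is None:
        return None
    current, new = int(match.group(1)), int(match.group(2))
    return current, new, new - current

line = ("pushOnlineEpoch -- Error: current onlineEpoch 1508208219670 "
        "is less than new onlineEpoch 1508208221684, transId: 10450048")
current, new, gap = epoch_gap_ms(line)
print(f"current={current} new={new} gap={gap} ms")
```

For the message above, the gap is 2014 ms, that is, about two seconds.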
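We then checked the ntp service on each node and found it running normally everywhere. The current time on each node, however, was as follows, which shows that the clocks across the nodes were not actually synchronized: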
[trafodion@tc2 logs]$ pdsh $MY_NODES date
tc3: Tue Oct 17 10:44:28 CST 2017
tc4: Tue Oct 17 10:44:44 CST 2017
tc2: Tue Oct 17 10:44:49 CST 2017
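The skew can also be computed from this output instead of being read off by eye. The following is a minimal sketch, assuming every line has the `node: <date output>` shape shown above and that all nodes use the same time zone; the helper names parse_pdsh_date and max_skew_seconds are hypothetical. Because pdsh runs the command on all nodes in parallel, the result is only an approximation of the real offset.

```python
from datetime import datetime

def parse_pdsh_date(line):
    """Split 'tc3: Tue Oct 17 10:44:28 CST 2017' into (node, datetime).

    The time zone field is dropped on purpose: strptime's %Z only accepts
    a limited set of zone names, and all nodes are assumed to share one zone.
    """
    node, _, stamp = line.partition(": ")
    parts = stamp.split()  # [weekday, month, day, clock, zone, year]
    text = f"{parts[1]} {parts[2]} {parts[3]} {parts[5]}"
    return node, datetime.strptime(text, "%b %d %H:%M:%S %Y")

def max_skew_seconds(lines):
    """Return the gap in seconds between the earliest and latest node clock."""
    times = dict(parse_pdsh_date(l) for l in lines if l.strip())
    return (max(times.values()) - min(times.values())).total_seconds()

output = [
    "tc3: Tue Oct 17 10:44:28 CST 2017",
    "tc4: Tue Oct 17 10:44:44 CST 2017",
    "tc2: Tue Oct 17 10:44:49 CST 2017",
]
print(max_skew_seconds(output))
```

For the output above, tc3 is 21 seconds behind tc2.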
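From this we can conclude that the problem is indeed caused by unsynchronized node clocks. For the fix, see the previous post: http://blog.csdn.net/post_yuan/article/details/74199704
The fix in that post is only temporary. The clocks may drift apart again after a while, possibly because of the network between the cluster and the remote time server. To remove that dependency, we can set up a local time server: make one node the local time server and have the remaining nodes synchronize against it. For the detailed steps, see: http://blog.csdn.net/post_yuan/article/details/76906986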