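1. Problem Description
While HDFS access was being switched to the viewfs scheme, Kylin was modified to support viewfs. After the change, the vast majority of cube build jobs ran successfully, but the following cube build failed.
Exception: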
2019-04-15 04:12:41,178 ERROR [Job 9f0f5715-f7ce-4b2f-be07-b47231883ce1-238] common.HadoopShellExecutable:65 : error execute HadoopShellExecutable{id=9f0f5715-f7ce-4b2f-be07-b47231883ce1-03, name=Build Dimension Dictionary, state=RUNNING}
java.lang.RuntimeException: Failed to create dictionary on BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD.ROBOTSESSIONID
at org.apache.kylin.dict.DictionaryManager.buildDictFromReadableTable(DictionaryManager.java:308)
at org.apache.kylin.dict.DictionaryManager.buildDictionary(DictionaryManager.java:292)
at org.apache.kylin.cube.CubeManager.buildDictionary(CubeManager.java:223)
at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:71)
at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:54)
at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(CreateDictionaryJob.java:66)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.kylin.engine.mr.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:64)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:142)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2256)
at com.google.common.cache.LocalCache.get(LocalCache.java:3985)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3989)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4873)
at org.apache.kylin.dict.DictionaryManager.getDictionaryInfo(DictionaryManager.java:119)
at org.apache.kylin.dict.DictionaryManager.getDictionary(DictionaryManager.java:113)
at org.apache.kylin.dict.AppendTrieDictionary$Builder.createNewBuilder(AppendTrieDictionary.java:873)
at org.apache.kylin.dict.AppendTrieDictionary$Builder.getInstance(AppendTrieDictionary.java:833)
at org.apache.kylin.dict.AppendTrieDictionary$Builder.getInstance(AppendTrieDictionary.java:827)
at org.apache.kylin.dict.GlobalDictionaryBuilder.init(GlobalDictionaryBuilder.java:39)
at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:73)
at org.apache.kylin.dict.DictionaryManager.buildDictFromReadableTable(DictionaryManager.java:305)
... 15 more
Caused by: java.lang.RuntimeException: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID
at org.apache.kylin.common.util.HadoopUtil.getFileSystem(HadoopUtil.java:90)
at org.apache.kylin.dict.CachedTreeMap.openLatestIndexInput(CachedTreeMap.java:451)
at org.apache.kylin.dict.AppendTrieDictionary.readFields(AppendTrieDictionary.java:1197)
at org.apache.kylin.dict.DictionaryInfoSerializer.deserialize(DictionaryInfoSerializer.java:74)
at org.apache.kylin.dict.DictionaryInfoSerializer.deserialize(DictionaryInfoSerializer.java:34)
at org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:154)
at org.apache.kylin.dict.DictionaryManager.load(DictionaryManager.java:445)
at org.apache.kylin.dict.DictionaryManager$1.load(DictionaryManager.java:102)
at org.apache.kylin.dict.DictionaryManager$1.load(DictionaryManager.java:99)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3584)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2372)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2335)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2250)
... 26 more
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:141)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.kylin.common.util.HadoopUtil.getFileSystem(HadoopUtil.java:88)
... 38 more
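The innermost exception complains that the URI has no host. A minimal Python sketch of what that means: the stored path hdfs:///... has an empty authority, whereas a viewfs://test/... path carries the mount table name as its authority. The helper name authority is hypothetical, and the paths are shortened for illustration. That Hadoop can no longer fill in a default HDFS host once fs.defaultFS points at viewfs is an assumption, not something the log itself states.

```python
from urllib.parse import urlsplit


def authority(uri: str) -> str:
    """Return the authority (host) part of a URI, or '' if it has none."""
    return urlsplit(uri).netloc


# Shortened versions of the paths in this post (illustration only).
old_uri = "hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict"
new_uri = "viewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict"

print(repr(authority(old_uri)))  # empty: nothing between "hdfs://" and "/kylin"
print(repr(authority(new_uri)))  # the viewfs mount table name
```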
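2. Locating the Problem
When this cube built its dictionary, it did not pick up the viewfs scheme correctly.
Following the stack trace through the source code, and using Arthas to inspect the arguments of some of the methods, revealed the following.
The column ROBOTSESSIONID uses a global dictionary. The data of a global dictionary is not stored in HBase; HBase only holds a reference to the HDFS path where the global dictionary lives.
The details are as follows.
The dictionary info is stored in HBase under the rowkey: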
/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/2ee6dc87-7842-4bfe-a1fe-cb88a8f91b3b.dict
The corresponding value is:
^B_{
"uuid" : "2ee6dc87-7842-4bfe-a1fe-cb88a8f91b3b",
"last_modified" : 0,
"version" : "2.0.0",
"source_table" : "BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD",
"source_column" : "ROBOTSESSIONID",
"source_column_index" : 6,
"data_type" : "bigint",
"input" : {
"path" : "hdfs:///kylin/kylin_metadata/kylin-6e31d660-8f35-472c-abce-afb2c2bb19e3/IM_Track_Basic_Modeal_Cube/fact_distinct_columns/V_CHATTRACKENTRYRECORD.ROBOTSESSIONID",
"size" : 40424626,
"last_modified_time" : 1539833412761
},
"dictionary_class" : "org.apache.kylin.dict.AppendTrieDictionary",
"cardinality" : 2223751
}^@mhdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/
So the path under which the global dictionary is stored is:
hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/
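This path does not use the viewfs scheme. However, simply changing hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/ to viewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/ does not work.
The path is written with DataOutputStream.writeUTF, which prefixes the data with two bytes recording its length. hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/ is 109 bytes long, whereas viewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/ is 115 bytes long.
With the old length prefix still in place, only the first 109 bytes of the new path are read back, giving viewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESS, which fails with a path-not-found exception.
Decoding ^@m programmatically shows that it represents the value 109 (bytes 0x00 0x6D). It therefore has to be replaced with the two-byte binary encoding of 115, which is displayed as ^@s (bytes 0x00 0x73).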
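The length arithmetic above can be checked with a short Python sketch. It is not Kylin code: write_utf and read_utf are hypothetical helpers that mimic only the ASCII case of Java's DataOutputStream.writeUTF and DataInputStream.readUTF (a 2-byte big-endian length followed by the bytes), which is sufficient for these paths.

```python
import struct

OLD_PATH = ("hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/"
            "BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/")
NEW_PATH = ("viewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict/"
            "BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/")


def write_utf(text: str) -> bytes:
    """Mimic writeUTF for ASCII text: 2-byte big-endian length, then the bytes."""
    payload = text.encode("ascii")
    return struct.pack(">H", len(payload)) + payload


def read_utf(record: bytes) -> str:
    """Mimic readUTF for ASCII text: read the length, then that many bytes."""
    (length,) = struct.unpack(">H", record[:2])
    return record[2:2 + length].decode("ascii")


# "^@m" is how an editor shows the bytes 0x00 0x6D.
print(struct.unpack(">H", b"\x00m")[0])  # 109
print(write_utf(NEW_PATH)[:2])           # b'\x00s', shown as ^@s

# Changing only the text but keeping the old prefix truncates the path on read.
stale = write_utf(OLD_PATH)[:2] + NEW_PATH.encode("ascii")
print(read_utf(stale))
```

The last line prints the new path cut off after 109 bytes, which is the non-existent path described above.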
3. Fix
The steps finally carried out were:
# Kylin metadata backup directory
cd /BigData/run/kylin/meta_backups
# Create the directory for the metadata that needs to be modified
mkdir -p meta_repair_2019_04_15/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/
cd meta_repair_2019_04_15/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/
cp /BigData/run/kylin/meta_backups/meta_2019_04_15_00_00_06/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/2ee6dc87-7842-4bfe-a1fe-cb88a8f91b3b.dict .
Edit the file 2ee6dc87-7842-4bfe-a1fe-cb88a8f91b3b.dict so that its content becomes:
^B_{
"uuid" : "2ee6dc87-7842-4bfe-a1fe-cb88a8f91b3b",
"last_modified" : 0,
"version" : "2.0.0",
"source_table" : "BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD",
"source_column" : "ROBOTSESSIONID",
"source_column_index" : 6,
"data_type" : "bigint",
"input" : {
"path" : "hdfs:///kylin/kylin_metadata/kylin-6e31d660-8f35-472c-abce-afb2c2bb19e3/IM_Track_Basic_Modeal_Cube/fact_distinct_columns/V_CHATTRACKENTRYRECORD.ROBOTSESSIONID",
"size" : 40424626,
"last_modified_time" : 1539833412761
},
"dictionary_class" : "org.apache.kylin.dict.AppendTrieDictionary",
"cardinality" : 2223751
}^@sviewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict/BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/
Save and exit.
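Editing control bytes by hand in a text editor is easy to get wrong, so the same change could also be scripted. The following Python sketch is an alternative to the manual edit, not what was actually run here. It assumes the path record appears exactly once in the file and is pure ASCII; utf_record and patch_dict_bytes are hypothetical helper names.

```python
import struct

OLD_PATH = ("hdfs:///kylin/kylin_metadata/resources/GlobalDict/dict/"
            "BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/")
NEW_PATH = ("viewfs://test/kylin/kylin_metadata/resources/GlobalDict/dict/"
            "BASE_TCLIVECHAT.V_CHATTRACKENTRYRECORD/ROBOTSESSIONID/")


def utf_record(text: str) -> bytes:
    """2-byte big-endian length prefix plus ASCII bytes (writeUTF layout)."""
    payload = text.encode("ascii")
    return struct.pack(">H", len(payload)) + payload


def patch_dict_bytes(data: bytes, old_path: str, new_path: str) -> bytes:
    """Replace the length-prefixed old path with the length-prefixed new path."""
    old, new = utf_record(old_path), utf_record(new_path)
    if data.count(old) != 1:
        raise ValueError("expected exactly one occurrence of the old path record")
    return data.replace(old, new)


# Demo on synthetic bytes: a stand-in for the JSON part, then the path record.
sample = b"{ ...json... }" + utf_record(OLD_PATH)
patched = patch_dict_bytes(sample, OLD_PATH, NEW_PATH)
print(patched[-117:-115])  # the new 2-byte length prefix

# Possible use on the copied file (run inside the repair directory):
# from pathlib import Path
# p = Path("2ee6dc87-7842-4bfe-a1fe-cb88a8f91b3b.dict")
# p.write_bytes(patch_dict_bytes(p.read_bytes(), OLD_PATH, NEW_PATH))
```

Because the search pattern includes the length prefix and the full GlobalDict path, the hdfs:/// value inside the JSON "input" field is left untouched, as in the manual edit.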
Then run the metadata restore command:
cd /BigData/run/kylin
metastore.sh restore meta_backups/meta_repair_2019_04_15
After the restore succeeded, the Kylin build job was triggered again and completed successfully.