Using Spark as the Kylin Build Engine (HDP 2.6.5.0)

1. Environment

  • Kylin version: 2.6.4
  • HDP version: 2.6.5.0
  • Spark version: 2.3.2

 

2. Configuration

1) Configure HADOOP_CONF_DIR

export HADOOP_CONF_DIR=/usr/hdp/2.6.5.0-292/hadoop/conf

2) Configure SPARK_HOME

# spark
export SPARK_HOME=/usr/hdp/2.6.5.0-292/spark2
export PATH=$SPARK_HOME/bin:$PATH

3) Configure KYLIN_HOME

export KYLIN_HOME=/opt/kylin
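
To make these three variables persistent for the account that starts Kylin, a common approach is to append them to that user's shell profile. A minimal sketch (the profile path is an assumption about your setup):

# Append the variables above to the Kylin user's profile (illustrative)
cat >> ~/.bashrc <<'EOF'
export HADOOP_CONF_DIR=/usr/hdp/2.6.5.0-292/hadoop/conf
export SPARK_HOME=/usr/hdp/2.6.5.0-292/spark2
export PATH=$SPARK_HOME/bin:$PATH
export KYLIN_HOME=/opt/kylin
EOF
source ~/.bashrc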

4) Configure Kylin (recommended) — the properties below go in $KYLIN_HOME/conf/kylin.properties

# Dynamically allocate Spark resources
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300

# Spark-related settings
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=10
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.yarn.archive=hdfs://192.168.2.101:8020/kylin/spark/spark-libs.jar

## uncomment for HDP
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=2.6.5.0-292
kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.5.0-292
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=2.6.5.0-292
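
Note that spark.dynamicAllocation.enabled together with spark.shuffle.service.enabled=true only works if the Spark external shuffle service is registered on every YARN NodeManager. On HDP this is normally done through Ambari, but the underlying yarn-site.xml entries look roughly like this (the aux-service name spark2_shuffle is an HDP convention; treat this as a sketch):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark2_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark2_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>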

5) Download the latest Spark release (2.4.4) — see Pitfall 2 below for why the bundled Spark is replaced

# Download
cd $KYLIN_HOME
wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

# Extract
tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz -C .
mv spark-2.4.4-bin-hadoop2.7 spark

# Package the jars into spark-libs.jar (the 0 flag stores them uncompressed)
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .

# Upload to HDFS
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/
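
An optional sanity check that the archive landed where kylin.engine.spark-conf.spark.yarn.archive points:

# Verify the upload (optional)
hadoop fs -ls /kylin/spark/spark-libs.jar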

 

3. Usage

1) Switch the build engine

In the cube designer, open the "Advanced Setting" page, change "Cube Engine" from "MapReduce" to "Spark", then save the cube.

2) Adjust settings as needed (optional)

The sample cube has two memory-hungry measures, "COUNT DISTINCT" and "TOPN(100)". When the source data is small, their size estimation is inaccurate: the estimated size comes out much larger than the real size, which causes many more RDD partitions to be split and slows the build down. 500 is a more reasonable number here; per the Kylin documentation, it is applied as a cube-level "Configuration Overwrites" property, shown below.
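
The override itself, as given in the Kylin "Build Cube with Spark" guide (added on the cube's "Configuration Overwrites" page):

# Cube-level override: raise the RDD partition cut size so fewer,
# larger partitions are created when size estimates run high
kylin.engine.spark.rdd-partition-cut-mb=500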


Pitfall 1:

Caused by: java.lang.RuntimeException: Could not create  interface org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the hadoop compatibility jar on the classpath?
	at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:73)
	at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:31)
	at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
	... 15 more
Caused by: java.util.NoSuchElementException
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:59)
	... 17 more

Solution:

Copy hbase-hadoop2-compat-*.jar and hbase-hadoop-compat-*.jar into $KYLIN_HOME/spark/jars (both jars can be found in HBase's lib directory). If you have already built the Spark assembly jar and uploaded it to HDFS, you need to repackage it and upload it again. After that, resume the failed cube job and it should succeed. The related JIRA issue is KYLIN-3607, which has since been fixed.
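
A sketch of those steps in shell, assuming the HDP HBase lib directory below (adjust the paths to your installation):

# Copy the HBase compatibility jars into the Spark bundled with Kylin
cp /usr/hdp/2.6.5.0-292/hbase/lib/hbase-hadoop2-compat-*.jar $KYLIN_HOME/spark/jars/
cp /usr/hdp/2.6.5.0-292/hbase/lib/hbase-hadoop-compat-*.jar $KYLIN_HOME/spark/jars/

# Repackage the assembly and overwrite the copy on HDFS so YARN
# containers pick up the new jars
cd $KYLIN_HOME
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
hadoop fs -put -f spark-libs.jar /kylin/spark/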

 

Pitfall 2:

19/10/14 16:48:25 INFO Client: 
	 client token: N/A
	 diagnostics: User class threw exception: java.lang.RuntimeException: error execute org.apache.kylin.storage.hbase.steps.SparkCubeHFile. Root cause: Job aborted.
	at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:42)
	at org.apache.kylin.common.util.SparkEntry.main(SparkEntry.java:44)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
	at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831)
	at org.apache.kylin.storage.hbase.steps.SparkCubeHFile.execute(SparkCubeHFile.java:238)
	at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:37)
	... 6 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, server3.tuzhanai.com, executor 39): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/Counter
	at org.apache.hadoop.metrics2.lib.MutableHistogram.<init>(MutableHistogram.java:42)
	at org.apache.hadoop.metrics2.lib.MutableRangeHistogram.<init>(MutableRangeHistogram.java:41)
	at org.apache.hadoop.metrics2.lib.MutableTimeHistogram.<init>(MutableTimeHistogram.java:42)
	at org.apache.hadoop.metrics2.lib.MutableTimeHistogram.<init>(MutableTimeHistogram.java:38)
	at org.apache.hadoop.metrics2.lib.DynamicMetricsRegistry.newTimeHistogram(DynamicMetricsRegistry.java:262)
	at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:49)
	at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:36)
	at org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactoryImpl.createIO(MetricsRegionServerSourceFactoryImpl.java:89)
	at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:32)
	at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.getNewWriter(HFileOutputFormat2.java:247)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:194)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
	... 8 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.util.Counter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 26 more

Solution:

The cause is that both the Spark 2.3.0 bundled with HDP and the Spark 2.3.2 bundled with Kylin hit this bug (the org.apache.hadoop.hbase.util.Counter class in the trace above cannot be loaded). Download Spark 2.4.4, package its jars into spark-libs.jar, upload it to the /kylin/spark/ directory on HDFS, and point the configuration at it, exactly as described in step 5 of the Configuration section above.
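
Before repackaging, you can check which jar, if any, actually provides the class the failed task could not load (an illustrative loop, assuming unzip is installed):

# Look for the missing class from the stack trace in the packaged jars
for j in $KYLIN_HOME/spark/jars/*.jar; do
  if unzip -l "$j" 2>/dev/null | grep -q 'org/apache/hadoop/hbase/util/Counter\.class'; then
    echo "found in $j"
  fi
done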

A note on kylin.query.spark-conf.spark.executor.memoryOverhead=4g: this sets the maximum off-heap memory allowance for each Spark executor that Kylin launches (the kylin.query.spark-conf.* prefix targets the query engine, just as kylin.engine.spark-conf.* targets the build engine). The overhead covers memory an executor uses outside its JVM heap, such as VM overheads, interned strings, and direct buffers. Setting it to 4g gives each executor 4 GB of off-heap headroom; with too little, executors that produce a lot of off-heap or temporary data risk being killed by YARN for exceeding their container's memory limit, destabilizing queries, while an unnecessarily large value wastes cluster memory. Size it against the cluster's available memory and the actual workload: raise it when tasks generate a lot of off-heap data and memory is plentiful, lower it when memory is tight or tasks are light.
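
In kylin.properties this looks like the following (the 4g value is workload-dependent; the build-engine analogue from the Configuration section is shown for comparison):

# Query engine: per-executor off-heap allowance
kylin.query.spark-conf.spark.executor.memoryOverhead=4g
# Build-engine analogue used earlier in this post (legacy key, value in MB)
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024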