1. Compilation
Versions: Spark 3.1.2, Hadoop 3.0.0, CDH 6.0.1.
Spark's default Hive version is 2.3.7; using Hive 2.1.1 requires patching the Spark source (not covered here).
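The original does not show the actual build command. A minimal sketch, assuming a stock Spark 3.1.2 source tree and the CDH6 Maven version string `3.0.0-cdh6.0.1` (both assumptions to verify):

```shell
# Hedged sketch: build a Spark 3.1.2 distribution against CDH6's Hadoop.
# -Phive/-Phive-thriftserver/-Pyarn are standard Spark build profiles;
# the hadoop.version string is an assumption for CDH 6.0.1.
if [ -d spark-3.1.2 ]; then          # skip quietly when sources are absent
  cd spark-3.1.2
  ./dev/make-distribution.sh --name cdh6.0.1 --tgz \
    -Phive -Phive-thriftserver -Pyarn \
    -Dhadoop.version=3.0.0-cdh6.0.1
fi
```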
2. Spark configuration
2.1 Edit the Spark config files
cd /data12/spark3/conf
# Symlink the Hive/HDFS-related config files into Spark's conf dir
ln -s /etc/hive/conf/hive-site.xml hive-site.xml
ln -s /etc/hive/conf/hdfs-site.xml hdfs-site.xml
ln -s /etc/hive/conf/core-site.xml core-site.xml
ln -s /etc/hadoop/conf/yarn-site.xml yarn-site.xml
mv log4j.properties.template log4j.properties
mv spark-defaults.conf.template spark-defaults.conf
mv spark-env.sh.template spark-env.sh
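The link/rename steps above can be wrapped in a small idempotent script (the source and destination paths are the ones used in this doc; adjust them to your layout):

```shell
# Sketch of the conf setup above: symlink cluster configs, activate templates.
link_confs() {   # $1 = Spark conf dir
  local dest=$1 f t
  for f in /etc/hive/conf/hive-site.xml /etc/hive/conf/hdfs-site.xml \
           /etc/hive/conf/core-site.xml /etc/hadoop/conf/yarn-site.xml; do
    if [ -e "$f" ]; then ln -sfn "$f" "$dest/$(basename "$f")"; fi
  done
  for t in "$dest"/*.template; do    # rename *.template -> active config
    if [ -e "$t" ]; then mv -n "$t" "${t%.template}"; fi
  done
  true
}
link_confs /data12/spark3/conf
```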
spark-defaults.conf
vim spark-defaults.conf
# Point Spark at the cluster's Hive metastore version and jar path.
# Note that this configuration differs between Spark 3.1.x and Spark 3.0.x.
spark.sql.hive.metastore.version=2.1.1
spark.sql.hive.metastore.jars=path
spark.sql.hive.metastore.jars.path=file:///opt/cloudera/parcels/CDH/lib/hive/lib/*
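For reference, the same metastore setting in the two branches side by side (the 3.0.x form is reconstructed from Spark's documented behavior, so treat it as a sketch):

```
# Spark 3.0.x: a single property holding a JVM classpath
spark.sql.hive.metastore.jars=/opt/cloudera/parcels/CDH/lib/hive/lib/*
# Spark 3.1.x: "path" mode plus a separate property with file: URLs
spark.sql.hive.metastore.jars=path
spark.sql.hive.metastore.jars.path=file:///opt/cloudera/parcels/CDH/lib/hive/lib/*
```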
spark-env.sh
export SPARK_HOME=/data12/spark3
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/java/jdk1.8.0_162
SPARK_DIST_CLASSPATH="/data12/spark3/jars/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/opt/cloudera/parcels/CDH/lib/hive/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*"
export SPARK_DIST_CLASSPATH
2.2 Test spark-shell and spark-sql
Reference commands:
export SPARK_HOME=/opt/spark-3.1.2
# Run spark-shell locally
sh /data12/spark3/bin/spark-shell
# Run spark-shell in yarn-client mode
sh /data12/spark3/bin/spark-shell --master yarn --deploy-mode client --executor-memory 1G --num-executors 2
# Run spark-sql in yarn-client mode (the spark-sql CLI does not support yarn-cluster deploy mode)
sh /data12/spark3/bin/spark-sql --master yarn --deploy-mode client --executor-memory 1G --num-executors 2
# Run some Spark code against Hive tables; if that works, the build is good
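As a concrete smoke test, a guarded one-liner against the metastore (the query is an arbitrary example; the guard makes this a no-op on machines without this Spark install):

```shell
# Hedged smoke test: verify Spark can reach the Hive metastore.
SPARK_BIN=/data12/spark3/bin
if [ -x "$SPARK_BIN/spark-sql" ]; then
  "$SPARK_BIN/spark-sql" --master yarn --deploy-mode client \
    -e 'SHOW DATABASES;'
else
  echo "spark-sql not found under $SPARK_BIN; adjust the path"
fi
```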
3. Kyuubi configuration
3.1 Download Kyuubi 1.2.0
It integrates with CDH directly, with no compilation required; download from the releases page of apache/incubator-kyuubi on GitHub.
3.2 Edit the Kyuubi config files
kyuubi-defaults.conf
vim kyuubi-defaults.conf
# Run the launched Spark engines in yarn-client mode
spark.master=yarn
spark.submit.deployMode=client
spark.driver.memory=20g
spark.hadoop.fs.hdfs.impl.disable.cache=true
spark.executor.heartbeatInterval=30s
spark.yarn.jars=hdfs://nameservice3/user/spark3_1_2/*.jar
spark.shuffle.useOldFetchProtocol=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
#spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=1000
spark.dynamicAllocation.initialExecutors=1
spark.dynamicAllocation.schedulerBacklogTimeout=1s
spark.dynamicAllocation.executorIdleTimeout=60s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s
spark.driver.maxResultSize=5g
spark.executor.cores=6
spark.executor.memory=12G
spark.driver.memoryOverhead=1228m
spark.executor.memoryOverhead=3088m
spark.network.maxRemoteBlockSizeFetchToMem=2147483135
spark.ui.enabled=true
spark.ui.killEnabled=true
# Spark engine share level; USER means all connections from the same user share one engine
#kyuubi.engine.share.level=CONNECTION
kyuubi.engine.share.level=USER
kyuubi.session.engine.idle.timeout=PT1H
# HA (disabled here); to enable, set kyuubi.ha.enabled=true and uncomment the ZooKeeper quorum below
kyuubi.ha.enabled=false
#kyuubi.ha.zookeeper.quorum=134.84.68.201:2181,134.84.68.202:2181,134.84.68.203:2181
kyuubi.ha.zookeeper.client.port=2181
# Enable Kerberos authentication
kyuubi.authentication=KERBEROS
kyuubi.kinit.keytab=/opt/spark/hive.keytab
kyuubi.kinit.principal=hive/hadoop-134-84-68-201.anhuitelecom.com@DW1.ANHUITELECOM.COM
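The explicit memoryOverhead values above can be checked against Spark's default, max(384 MiB, 10% of the heap). A small helper (the formula is Spark's documented default; the function name is ours):

```shell
# Spark's default memoryOverhead: max(384 MiB, 10% of executor/driver memory).
default_overhead_mb() {   # $1 = heap size in MiB
  local tenth=$(( $1 / 10 ))
  if [ "$tenth" -lt 384 ]; then echo 384; else echo "$tenth"; fi
}
default_overhead_mb 12288   # 12 GiB executor heap -> prints 1228
```

For the 12 G executors above this yields 1228 MiB, which the config raises to 3088m; note that the driver override (1228m for a 20 g heap) is actually below the 2048 MiB default.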
kyuubi-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_162
export SPARK_HOME=/data12/spark3
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export KYUUBI_MAX_LOG_FILES=10
4. Startup script
restart_kyuubi_brd_gzfx.sh
#!/bin/bash
/opt/spark/kyuubi-1.2.0-bin-without-spark/bin/kyuubi stop
# Known issue: this command stops the KyuubiServer process, but the SparkSubmit processes it launched are not stopped (a bug)
sleep 3
/opt/spark/kyuubi-1.2.0-bin-without-spark/bin/kyuubi start
Note: because a local ZooKeeper is used, the first restart may fail with a port-in-use error, so run the script two or more times, or simply kill the kyuubi process directly.
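A hedged workaround sketch for the leftover-SparkSubmit problem: after `kyuubi stop`, kill surviving engine JVMs by their main class before starting again. The class name `org.apache.kyuubi.engine.spark.SparkSQLEngine` and the install path are assumptions to verify against your deployment:

```shell
# Guarded restart that also reaps leftover engine processes.
KYUUBI_HOME=${KYUUBI_HOME:-/opt/spark/kyuubi-1.2.0-bin-without-spark}
if [ -x "$KYUUBI_HOME/bin/kyuubi" ]; then
  "$KYUUBI_HOME/bin/kyuubi" stop
  sleep 3
  # kill engines that survived the stop; the pattern is an assumption
  pkill -f 'org.apache.kyuubi.engine.spark.SparkSQLEngine' || true
  "$KYUUBI_HOME/bin/kyuubi" start
fi
```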
5. Issues
1. If Spark 3 is installed under /opt, the spark-sql execution log page cannot collect logs.
2. Spark 3 changed the shuffle protocol; when talking to the Spark 2.4 external shuffle service shipped with CDH, set spark.shuffle.useOldFetchProtocol=true, otherwise you may hit [SPARK-29435] "Spark 3 doesn't work with older shuffle service": IllegalArgumentException: Unexpected message type: <number>.
3. Before Kyuubi 1.3.2, deploying Kyuubi in cluster (HA) mode requires a ZooKeeper ensemble without Kerberos authentication.
6. Parameter notes
kyuubi.engine.share.level=CONNECTION|USER|SERVER
# CONNECTION is a special case: the driver is never reused, so engine.idle.timeout is meaningless in that mode; the driver exits as soon as the connection closes.
kyuubi.session.engine.idle.timeout=PT1H
# Engine TTL: how long a driver may sit idle before it is released.
7. Bug
The Kerberos ticket lifetime is 7 days, and Kyuubi does not renew the ticket, so after 7 days Spark jobs fail with an error like: token ... can't be found in cache.
This bug is fixed in Kyuubi 1.4.0 + Spark 3.2.0; for that deployment see "Kyuubi 1.4.0 cluster-mode deployment on Spark 3.2.0" (《kyuubi1.4.0基于spark3.2.0集群模式部署》).
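Until upgrading, one common stopgap (an assumption, not from the original doc) is to refresh the TGT periodically from cron using the keytab and principal configured in the Kyuubi section:

```shell
# Guarded TGT refresh; schedule it from cron, e.g. every 6 hours.
# Keytab path and principal are the ones used earlier in this doc.
KEYTAB=/opt/spark/hive.keytab
PRINCIPAL='hive/hadoop-134-84-68-201.anhuitelecom.com@DW1.ANHUITELECOM.COM'
if command -v kinit >/dev/null 2>&1 && [ -f "$KEYTAB" ]; then
  kinit -kt "$KEYTAB" "$PRINCIPAL"
fi
```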
8. References
Kyuubi实践 | 编译Spark3.1以适配CDH5并集成Kyuubi
Apache Kyuubi on Spark 在CDH上的深度实践 (网易数帆, OSCHINA)