Hive on Spark environment setup (official source-compilation approach)

Previously I had already set up Hive on Spark, or more accurately Spark on Hive: I could happily work with Hive from inside Spark, which matched my needs at the time (see my earlier post: Hive on Spark cluster environment setup).

However, when connecting through the Hive client and letting Hive use the Spark engine, I hit an error I could not resolve (see: troubleshooting the Hive on Spark exception "Failed to create Spark client for Spark session").

So I had to rebuild Hive on Spark following the official guide: Hive on Spark: Getting Started.

The official docs say you need a Spark build that does not include Hive, whereas the Spark binaries downloaded from the official site generally do include Hive. So I compiled Spark myself:

Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile.
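A quick way to check whatever Spark build you end up using (a sketch, assuming SPARK_HOME points at that build): the jars directory should contain no Hive jars.

# should print nothing for a build made without the Hive profile
ls $SPARK_HOME/jars | grep -i hive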

Prerequisites: environment variables, hosts/hostname, JDK, passwordless SSH, firewall disabled; see my earlier posts.

1. Hadoop environment setup

core-site.xml

<property><name>hadoop.tmp.dir</name><value>file:/opt/hadoop/data/hadoop/tmp</value></property>
<property><name>io.file.buffer.size</name><value>131072</value></property>
<property><name>fs.defaultFS</name><value>hdfs://master:9000</value></property>
<property><name>fs.trash.interval</name><value>10080</value></property>
<property><name>fs.trash.checkpoint.interval</name><value>60</value></property>
<property><name>hadoop.proxyuser.root.hosts</name><value>*</value></property>
<property><name>hadoop.proxyuser.root.groups</name><value>*</value></property>

hadoop-env.sh

export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
hdfs-site.xml

<property><name>dfs.namenode.secondary.http-address</name><value>master:9001</value></property>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.namenode.name.dir</name><value>file:/opt/hadoop/data/hadoop/namenode</value></property>
<property><name>dfs.datanode.data.dir</name><value>file:/opt/hadoop/data/hadoop/datanode</value></property>
<property><name>dfs.permissions.enabled</name><value>false</value></property>
<property><name>dfs.datanode.du.reserved</name><value>21474836480</value></property>

mapred-site.xml

<property><name>mapreduce.framework.name</name><value>yarn</value></property>
<property><name>mapreduce.jobhistory.address</name><value>master:10020</value></property>
<property><name>mapreduce.jobhistory.webapp.address</name><value>master:19888</value></property>

slaves

localhost
yarn-site.xml

<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
<property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
<property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value></property>
<property><description>Whether to enable log aggregation</description><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.resourcemanager.hostname</name><value>master</value></property>
<property><name>yarn.resourcemanager.address</name><value>master:8032</value></property>
<property><name>yarn.resourcemanager.scheduler.address</name><value>master:8030</value></property>
<property><name>yarn.resourcemanager.resource-tracker.address</name><value>master:8035</value></property>
<property><name>yarn.resourcemanager.admin.address</name><value>master:8033</value></property>
<property><name>yarn.nodemanager.pmem-check-enabled</name><value>false</value></property>
<property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
<property><name>yarn.log.server.url</name><value>http://master:19888/jobhistory/job</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>86400</value></property>

Before starting, format the NameNode, otherwise you are likely to hit errors:

hadoop namenode -format

Start Hadoop:

./sbin/start-all.sh

Visit ports 50070 and 8088 to confirm everything started successfully.

master:50070

(screenshot: HDFS NameNode web UI)

master:8088

(screenshot: YARN ResourceManager web UI)
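Besides the web UIs, you can check the running daemons from the shell with jps (shipped with the JDK):

jps
# expect to see NameNode, SecondaryNameNode, DataNode, ResourceManager and NodeManager among the output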

2. Download and compile Spark

The official guide recommends matching Hive and Spark versions as follows:

Hive version    Spark version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0
2.1 Following the official recommendation, I chose Hive 3.0.0 with Spark 2.3.0

wget http://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0.tgz
2.2 Extract the Spark source and download the Maven version specified in its pom

wget http://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xzvf apache-maven-3.3.9-bin.tar.gz
2.3 Add the Maven environment variable

export PATH=/opt/apache-maven-3.3.9/bin:${PATH}
source /etc/profile
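To confirm the new PATH is picked up, a quick check:

mvn -version
# should report Apache Maven 3.3.9 and the JDK under /opt/hadoop/jdk1.8.0_77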
2.4 Compile Spark

cd /opt/hadoop/spark/spark-2.3.0
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
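If the build fails with Maven running out of memory, the Spark build documentation suggests raising MAVEN_OPTS before rerunning, for example:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"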
2.5 After about 40 minutes the build finished

It produced spark-2.3.0-bin-hadoop2-without-hive.tgz.

Extract it to /opt/hadoop/spark-2.3.0-bin-hadoop2.
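The unpack step might look like this (a sketch: the directory name inside the tarball is assumed to be spark-2.3.0-bin-hadoop2-without-hive, renamed to match the paths used below):

tar -xzvf spark-2.3.0-bin-hadoop2-without-hive.tgz -C /opt/hadoop/
# rename to the directory name used throughout the rest of this post (assumed layout)
mv /opt/hadoop/spark-2.3.0-bin-hadoop2-without-hive /opt/hadoop/spark-2.3.0-bin-hadoop2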

2.6 Download Scala

According to the Spark source pom, Scala 2.11.8 is required.

wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
Extract it to /opt/hadoop/scala-2.11.8.

2.7 Add the Scala and Spark environment variables

#Java
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
#hadoop
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export SCALA_HOME=/opt/hadoop/scala-2.11.8
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin:${HIVE_HOME}/bin
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
#Maven
export PATH=/opt/apache-maven-3.3.9/bin:${PATH}

3. Configure Spark

slaves

cd /opt/hadoop/spark-2.3.0-bin-hadoop2/conf
cp slaves.template slaves
localhost
spark-defaults.conf

spark.master yarn
#spark.submit.deployMode cluster
spark.executor.cores 5
spark.executor.instances 5
spark.eventLog.enabled true
spark.eventLog.compress true
spark.eventLog.dir hdfs://master:9000/tmp/logs/root/logs
spark.history.fs.logDirectory hdfs://master:9000/tmp/logs/root/logs
spark.yarn.historyServer.address http://master:18080
spark.sql.parquet.writeLegacyFormat true
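Since spark.eventLog.dir and spark.history.fs.logDirectory point at an HDFS path, that directory should exist before jobs start writing event logs; a minimal sketch:

# create the event-log directory on HDFS (path taken from the config above)
hdfs dfs -mkdir -p /tmp/logs/root/logs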

spark-env.sh

export SCALA_HOME=/opt/hadoop/scala-2.11.8
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=master
export SPARK_YARN_USER_ENV=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_WORKER_DIR=/opt/hadoop/data/spark/work/

4. Download and configure Hive

4.1 Install MySQL

Install MySQL on CentOS with yum (see my earlier post).
Create a metastore database named "hive".
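Creating the metastore database might look like this (a sketch: the root account, prompted password, and latin1 character set are assumptions about your MySQL setup):

# create an empty database for the Hive metastore
mysql -uroot -p -e "CREATE DATABASE hive DEFAULT CHARACTER SET latin1;"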

4.2 Download Hive and extract it to /opt/hadoop/apache-hive-3.0.0-bin

wget http://archive.apache.org/dist/hive/hive-3.0.0/apache-hive-3.0.0-bin.tar.gz

4.3 Configure Hive

hive-site.xml

Find every value that references a variable path of the form ${system:java.io.tmpdir}/${system:user.name}

and replace each occurrence with /opt/hadoop/data/hive/iotmp.
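One way to do the replacement in bulk is a sed one-liner (a sketch; run it in the Hive conf directory and back up hive-site.xml first):

# create the target directory and rewrite every temp-dir reference in place
mkdir -p /opt/hadoop/data/hive/iotmp
sed -i 's#${system:java.io.tmpdir}/${system:user.name}#/opt/hadoop/data/hive/iotmp#g' hive-site.xml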

Then append the following:

<property><name>hive.server2.thrift.bind.host</name><value>10.10.22.133</value><description>Bind host on which to run the HiveServer2 Thrift service.</description></property>
<property><name>hive.metastore.uris</name><value>thrift://master:9083</value></property>
<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://master:3306/hive</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>root</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>123456</value></property>
<property><name>hive.execution.engine</name><value>spark</value></property>
<property><name>spark.home</name><value>/opt/hadoop/spark-2.3.0-bin-hadoop2</value></property>
<property><name>spark.serializer</name><value>org.apache.spark.serializer.KryoSerializer</value></property>
<property><name>spark.master</name><value>yarn</value></property>
<property><name>spark.sql.parquet.writeLegacyFormat</name><value>true</value></property>
<property><name>hive.metastore.event.db.notification.api.auth</name><value>false</value></property>
<property><name>hive.server2.active.passive.ha.enable</name><value>true</value></property>

Note: also copy hive-site.xml into spark/conf.
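Copying the file over can be as simple as (using the HIVE_HOME and SPARK_HOME values set earlier):

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/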

hive-env.sh

export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export HIVE_CONF_DIR=/opt/hadoop/apache-hive-3.0.0-bin/conf

4.4 Initialize the Hive metastore schema

./schematool -dbType mysql -initSchema

5. Start and test

5.1 Start Spark

/opt/hadoop/spark-2.3.0-bin-hadoop2.7/sbin/start-all.sh

But it failed with an error:

starting org.apache.spark.deploy.master.Master, logging to /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
failed to launch: nice -n 0 /opt/hadoop/spark-2.3.0-bin-hadoop2.7/bin/spark-class org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
Spark Command: /opt/hadoop/jdk1.8.0_77/bin/java -cp /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/opt/hadoop/spark-2.3.0-bin-hadoop2.7/jars/*:/opt/hadoop/hadoop-2.7.7/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080

full log in /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
I downloaded the official prebuilt Spark of the same version (with Hadoop) and copied all of its jars into the jars directory of my own build, which fixed the problem. But that also copied in the Hive jars; I decided to deal with any resulting issues later. At this point it occurred to me that the only difference between my own build and the official download might be the contents of jars, so perhaps I could just take the official download and delete its Hive jars instead of compiling at all. I planned to try that later.
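Roughly what I did, as a sketch (the unpack location of the official spark-2.3.0-bin-hadoop2.7 download is a placeholder path):

# copy every jar from the official binary distribution into the compiled Spark's jars directory
cp /opt/hadoop/downloads/spark-2.3.0-bin-hadoop2.7/jars/* $SPARK_HOME/jars/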

5.2 Start Hive

nohup hive --service metastore &

nohup hive --service hiveserver2 &
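A quick way to confirm both services are up is to check their listening ports (9083 from hive.metastore.uris; 10000 is the HiveServer2 default):

netstat -nltp | grep -E '9083|10000'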

5.3 Test Hive

The Spark engine is already configured in hive-site.xml, so we can test directly:

./hive
hive>create table test(ts BIGINT,line STRING);
hive>select count(*) from test;
And here comes the problem, an error:

Failed to execute spark task, with exception ‘org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf)’
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf
Go into spark/jars and delete every jar with "hive" in its name; these are the ones I copied in during step 5.1.
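Deleting them might look like this (a sketch against the jars directory of the Spark that Hive uses):

cd $SPARK_HOME/jars
rm -f hive-*.jar spark-hive*.jar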

After deleting them, try again:

hive>create table test(ts BIGINT,line STRING);
hive>select count(*) from test;
Query ID = root_20190118163315_8a679820-288e-46f7-b464-f8b7fceb6abd
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1547172099098_0075
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0075
Hive on Spark Session Web UI URL: http://slave2:49196
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING

      STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  

Stage-0 ........        0       FINISHED      2          2        0        0       0
Stage-1 ........        0       FINISHED      1          1        0        0       0

STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 10.20 s

Spark job[0] finished successfully in 10.20 second(s)
OK
591285
Time taken: 39.154 seconds, Fetched: 1 row(s)
At this point it clicked: my own build was missing a lot of jars, and copying the official build's jars over brought Hive along with them. The bottom line is that the Spark used by Hive must not contain the Hive jars.

With that, the Hive on Spark setup is complete.

=======================================================================

But I like to poke at things, so I wanted to verify: if I simply download the official prebuilt Spark (which generally includes Hive) and delete the Hive jars inside it, is that enough?

Using the officially downloaded Spark:

[root@master jars]# rm -rf spark-hive*
[root@master jars]# rm -rf hive-*
./sbin/start-all.sh
hive> select count(1) from subject_total_score;
Query ID = root_20190118164946_88709ec2-a5e1-4099-88eb-f98d24de6e88
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1547172099098_0076
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0076
Hive on Spark Session Web UI URL: http://master:40695
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING

      STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  

Stage-0 ........        0       FINISHED      2          2        0        0       0
Stage-1 ........        0       FINISHED      1          1        0        0       0

STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 9.16 s

Spark job[0] finished successfully in 9.16 second(s)
OK
591285
Time taken: 39.291 seconds, Fetched: 1 row(s)

Sure enough: all the earlier work was unnecessary. Hive on Spark works immediately once you delete every Hive-related jar; there is no need to compile Spark yourself.

However:

spark-shell --master yarn

scala> spark.sql("show tables").show

The Hive tables are no longer visible from Spark. My earlier Spark programs were all built on this Hive warehouse, but once you set up Hive on Spark you lose Spark on Hive, so you would need two separate Spark installations.

Either Hive drives Spark (Hive on Spark), or the other way around, Spark operates on Hive directly (Spark on Hive); you cannot have it both ways with a single Spark.

So the right approach should be: do not use the Hive Thrift server (HiveServer2); instead expose Hive to external clients through the Spark Thrift Server.
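A sketch of that idea, using the Thrift server script that ships with Spark and connecting with beeline (port 10000 is the default and is an assumption here; it follows hive.server2.thrift.port):

# start the Spark Thrift Server on YARN, then connect over JDBC
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn
beeline -u jdbc:hive2://master:10000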
