Previously I had already set up Hive on Spark (or, more precisely, Spark on Hive) and could happily use Hive from inside Spark, which matched my needs at the time; see my earlier post on building the Hive on Spark cluster environment.
However, when connecting through the Hive client with Hive using the Spark engine, it threw an error I could not resolve; see my post on troubleshooting the Hive on Spark exception "Failed to create Spark client for Spark session".
So I had to rebuild from scratch following the official guide: Hive on Spark: Getting Started.
The official docs say you must build a Spark yourself that does not include Hive, while the officially downloaded Spark generally does include it. So: build Spark by hand.
Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile.
Environment preparation: environment variables, /etc/hosts and hostname, JDK, passwordless SSH, firewall disabled; see my earlier posts.
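For completeness, a minimal sketch of the passwordless-SSH and firewall steps, assuming CentOS 7 and the root user (hostnames are examples, adjust to your cluster):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # generate a key pair if one does not exist yet
ssh-copy-id root@master                    # authorize the key; repeat for each slave node
systemctl stop firewalld                   # CentOS 7 uses firewalld
systemctl disable firewalld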
1. Set up the Hadoop environment
core-site.xml
hadoop.tmp.dir                 file:/opt/hadoop/data/hadoop/tmp
io.file.buffer.size            131072
fs.defaultFS                   hdfs://master:9000
fs.trash.interval              10080
fs.trash.checkpoint.interval   60
hadoop.proxyuser.root.hosts    *
hadoop.proxyuser.root.groups   *
hadoop-env.sh
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
hdfs-site.xml
localhost
yarn-site.xml
Before the first start, format (initialize) the NameNode, otherwise errors are likely:
hadoop namenode -format
Start Hadoop
./sbin/start-all.sh
Visit ports 50070 and 8088 to check whether everything started successfully.
master:50070
master:8088
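Besides the web UIs, a quick shell-level sanity check (a sketch; the daemon names are what a default start-all.sh launches):
jps                      # expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
hdfs dfsadmin -report    # confirm the DataNodes registered with the NameNode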
2. Download and build Spark
Hive version    Spark version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0
2.1 Following the official compatibility table above, I chose Hive 3.0.0 with Spark 2.3.0
wget http://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0.tgz
2.2 Extract Spark, then download the Maven version required by its pom
wget http://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xzvf apache-maven-3.3.9-bin.tar.gz
2.3 Add Maven to the environment variables
export PATH=/opt/apache-maven-3.3.9/bin:${PATH}
source /etc/profile
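To make the Maven path persistent and confirm it took effect, one sketch (assuming you keep these settings in /etc/profile, as above):
echo 'export PATH=/opt/apache-maven-3.3.9/bin:${PATH}' >> /etc/profile
source /etc/profile
mvn -version    # should report Apache Maven 3.3.9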
2.4 Build Spark
cd /opt/hadoop/spark/spark-2.3.0
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
2.5 After roughly 40 minutes the build finished
It produced spark-2.3.0-bin-hadoop2-without-hive.tgz
Extract it to /opt/hadoop/spark-2.3.0-bin-hadoop2
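The extraction is just a tar plus a rename; a sketch assuming the archive unpacks to a spark-2.3.0-bin-hadoop2-without-hive directory:
tar xzvf spark-2.3.0-bin-hadoop2-without-hive.tgz -C /opt/hadoop/
mv /opt/hadoop/spark-2.3.0-bin-hadoop2-without-hive /opt/hadoop/spark-2.3.0-bin-hadoop2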
2.6 Download Scala
The Spark source pom shows Scala 2.11.8 is required
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
Extract it to /opt/hadoop/scala-2.11.8
2.7 Add the Scala and Spark environment variables
#Java
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
#hadoop
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export SCALA_HOME=/opt/hadoop/scala-2.11.8
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin:${HIVE_HOME}/bin
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
#Maven
export PATH=/opt/apache-maven-3.3.9/bin:${PATH}
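After editing /etc/profile, a quick check that everything actually took effect (a sketch):
source /etc/profile
echo $SPARK_HOME $SCALA_HOME $HIVE_HOME
java -version
hadoop version
scala -version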
3. Configure Spark
slaves
cd /opt/hadoop/spark-2.3.0-bin-hadoop2/conf
cp slaves.template slaves
localhost
spark-defaults.conf
spark.master yarn
#spark.submit.deployMode cluster
spark.executor.cores 5
spark.num.executors 5
spark.eventLog.enabled true
spark.eventLog.compress true
spark.eventLog.dir hdfs://master:9000/tmp/logs/root/logs
spark.history.fs.logDirectory hdfs://master:9000/tmp/logs/root/logs
spark.yarn.historyServer.address http://master:18080
spark.sql.parquet.writeLegacyFormat true
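spark.eventLog.dir and spark.history.fs.logDirectory point at HDFS, so that directory should exist before the first job runs; a minimal sketch:
hdfs dfs -mkdir -p /tmp/logs/root/logs
hdfs dfs -ls /tmp/logs/root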
spark-env.sh
export SCALA_HOME=/opt/hadoop/scala-2.11.8
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=master
#export SPARK_LO…
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_YARN_USER_ENV=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_WORKER_DIR=/opt/hadoop/data/spark/work/
4. Download and configure Hive
4.1 Install MySQL
Install MySQL on Linux (CentOS) via yum
Create a new Hive metastore database named "hive"
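A minimal sketch of creating that database from the MySQL client; the root/123456 credentials match what hive-site.xml uses below, adjust to your own (GRANT ... IDENTIFIED BY is MySQL 5.x syntax):
mysql -uroot -p123456 -e "CREATE DATABASE hive DEFAULT CHARACTER SET latin1;"
mysql -uroot -p123456 -e "GRANT ALL PRIVILEGES ON hive.* TO 'root'@'%' IDENTIFIED BY '123456'; FLUSH PRIVILEGES;"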
4.2 Download Hive and extract it to /opt/hadoop/apache-hive-3.0.0-bin
wget http://archive.apache.org/dist/hive/hive-3.0.0/apache-hive-3.0.0-bin.tar.gz
4.3 Configure Hive
hive-site.xml
Find every path in hive-site.xml that references ${system:java.io.tmpdir}/${system:user.name}
and replace them all with /opt/hadoop/data/hive/iotmp
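One way to do the replacement in bulk with GNU sed (a sketch; back up hive-site.xml first):
mkdir -p /opt/hadoop/data/hive/iotmp
sed -i 's#${system:java.io.tmpdir}/${system:user.name}#/opt/hadoop/data/hive/iotmp#g' /opt/hadoop/apache-hive-3.0.0-bin/conf/hive-site.xml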
Append the following settings
hive.server2.thrift.bind.host                   10.10.22.133   (bind host for the HiveServer2 Thrift service)
hive.metastore.uris                             thrift://master:9083
javax.jdo.option.ConnectionURL                  jdbc:mysql://master:3306/hive
javax.jdo.option.ConnectionDriverName           com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName             root
javax.jdo.option.ConnectionPassword             123456
hive.execution.engine                           spark
spark.home                                      /opt/hadoop/spark-2.3.0-bin-hadoop2
spark.serializer                                org.apache.spark.serializer.KryoSerializer
spark.master                                    yarn
spark.sql.parquet.writeLegacyFormat             true
hive.metastore.event.db.notification.api.auth   false
hive.server2.active.passive.ha.enable           true
Note: also copy hive-site.xml into spark/conf.
hive-env.sh
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export HIVE_CONF_DIR=/opt/hadoop/apache-hive-3.0.0-bin/conf
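As noted above, hive-site.xml also has to be visible to Spark, and the MySQL JDBC driver has to be on Hive's classpath for the metastore connection; a sketch (the connector jar name depends on the version you downloaded):
cp /opt/hadoop/apache-hive-3.0.0-bin/conf/hive-site.xml /opt/hadoop/spark-2.3.0-bin-hadoop2/conf/
cp mysql-connector-java-*.jar /opt/hadoop/apache-hive-3.0.0-bin/lib/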
4.4 Initialize the Hive metastore database
./schematool -dbType mysql -initSchema
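If the initialization succeeds, the schema version can be checked with the same tool (a quick sketch):
./schematool -dbType mysql -info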
5. Start up and test
5.1 Start Spark
/opt/hadoop/spark-2.3.0-bin-hadoop2.7/sbin/start-all.sh
But it failed with:
starting org.apache.spark.deploy.master.Master, logging to /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
failed to launch: nice -n 0 /opt/hadoop/spark-2.3.0-bin-hadoop2.7/bin/spark-class org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
Spark Command: /opt/hadoop/jdk1.8.0_77/bin/java -cp /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/opt/hadoop/spark-2.3.0-bin-hadoop2.7/jars/*:/opt/hadoop/hadoop-2.7.7/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
full log in /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
I downloaded the same version of Spark with Hadoop bundled from the official site and copied all of its jars into the jars directory of my own build, and the problem went away. But that also copied in the Hive jars; I decided to deal with any resulting problems later. It also made me wonder: if the only difference between my own build and the official download is the set of jars, couldn't I just download the official release and delete its Hive jars instead of compiling myself? I planned to try that shortly.
5.2 Start Hive
nohup hive --service metastore &
nohup hive --service hiveserver2 &
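A quick check that both services actually came up before testing (9083 and 10000 are the metastore and HiveServer2 defaults; a sketch):
netstat -nltp | grep -E '9083|10000'
beeline -u jdbc:hive2://master:10000 -n root -e "show databases;"    # once HiveServer2 has finished starting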
5.3 Test Hive
hive-site.xml already sets the Spark execution engine, so we can test directly
./hive
hive>create table test(ts BIGINT,line STRING);
hive>select count(*) from test;
And then the problem appeared:
Failed to execute spark task, with exception ‘org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf)’
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf
Go into spark/jars and delete every Hive-related jar; these are the ones I copied in during step 5.1.
Try again after deleting them
hive>create table test(ts BIGINT,line STRING);
hive>select count(*) from test;
Query ID = root_20190118163315_8a679820-288e-46f7-b464-f8b7fceb6abd
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=&lt;number&gt;
In order to set a constant number of reducers:
set mapreduce.job.reduces=&lt;number&gt;
Running with YARN Application = application_1547172099098_0075
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0075
Hive on Spark Session Web UI URL: http://slave2:49196
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
Stage-0 … 0 FINISHED 2 2 0 0 0
Stage-1 … 0 FINISHED 1 1 0 0 0
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 10.20 s
Spark job[0] finished successfully in 10.20 second(s)
OK
591285
Time taken: 39.154 seconds, Fetched: 1 row(s)
At this point I understood: the Spark I compiled was missing many jars, and copying in the officially built Spark jars brought Hive along with them; in short, Spark must not contain the Hive jars.
With that, the Hive on Spark setup was complete.
=======================================================================
But I like to poke at things, so I wanted to verify whether simply downloading the official prebuilt Spark (which normally includes Hive) and deleting the Hive jars inside it is enough.
Using the officially downloaded Spark:
[root@master jars]# rm -rf spark-hive*
[root@master jars]# rm -rf hive-*
./sbin/start-all.sh
hive> select count(1) from subject_total_score;
Query ID = root_20190118164946_88709ec2-a5e1-4099-88eb-f98d24de6e88
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=&lt;number&gt;
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=&lt;number&gt;
In order to set a constant number of reducers:
set mapreduce.job.reduces=&lt;number&gt;
Running with YARN Application = application_1547172099098_0076
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0076
Hive on Spark Session Web UI URL: http://master:40695
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
Stage-0 … 0 FINISHED 2 2 0 0 0
Stage-1 … 0 FINISHED 1 1 0 0 0
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 9.16 s
Spark job[0] finished successfully in 9.16 second(s)
OK
591285
Time taken: 39.291 seconds, Fetched: 1 row(s)
Sure enough: all the earlier work was unnecessary. Hive on Spark works as long as you delete every Hive-related jar from Spark; there is no need to build Spark yourself.
However:
spark-shell --master yarn
scala>spark.sql("show tables").show
The Hive tables can no longer be queried from spark-shell. My earlier Spark programs were all built on this Hive warehouse, but once Hive on Spark is set up this way, the same Spark can no longer do Spark on Hive; you would need two separate Spark installations.
Either Hive drives Spark (Hive on Spark), or the other way round, Spark operates directly on Hive (Spark on Hive); you cannot have it both ways with one installation.
So the approach should be: do not use the Hive ThriftServer, but use the Spark ThriftServer to expose Hive to external clients, as sketched below.
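For reference, a minimal sketch of that alternative: run Spark's own Thrift server (shipped in sbin/ of a Hive-enabled Spark build, i.e. the stock download) against the same metastore and connect with beeline; the port here is illustrative.
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.port=10015
beeline -u jdbc:hive2://master:10015 -n root -e "show tables;"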