I first tried reading Hive data directly with PyHive, but a few of its dependencies refused to install. After several days of trial and error, I finally got Hive data to read successfully through Spark's built-in SQL support. There are plenty of pitfalls along the way and quite a few configuration files to write, so I'm recording the process here.
Step 1: global environment variables

vim ~/.bash_profile (macOS)
vim ~/.bashrc (Linux)

Set the environment variables for Hadoop, Spark, Hive, and Java, plus the default Python interpreter for pyspark and the ipython path for pyspark-shell; otherwise pyspark will throw errors on startup.
## Homebrew mirror
export HOMEBREW_BOTTLE_DOMAIN=https://mirrors.ustc.edu.cn/homebrew-bottles

## hadoop path
export PATH=$PATH:/usr/local/Cellar/hadoop/3.2.1_1/libexec/sbin:/usr/local/Cellar/hadoop/3.2.1_1/libexec
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.2.1_1/libexec

## scala home
export SCALA_HOME=/usr/local/Cellar/scala/2.13.2
export PATH=$PATH:$SCALA_HOME/bin

## spark home and path
export PATH=$PATH:/usr/local/Cellar/spark/bin
export SPARK_HOME=/usr/local/Cellar/spark

## flink path and home (not needed for this setup)

## kafka path and home (not needed for this setup)

## zookeeper path and home (not needed for this setup)

## hive path and home
export HIVE_HOME=/usr/local/Cellar/hive/3.1.2_1
export PATH="$HIVE_HOME/bin:$PATH"

## JAVA_HOME
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip:$PYTHONPATH

export PYSPARK_PYTHON=/usr/bin/python3
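After editing the profile, reload it and sanity-check that every export took effect before moving on. A quick verification sketch (the variable names match the exports above; `pyspark --version` assumes Spark's bin directory is now on PATH):

```shell
# Reload the profile so the new exports take effect in the current shell.
source ~/.bash_profile

# Each of these should print a real path; an empty value means the
# corresponding export above did not take.
echo "HADOOP_HOME=$HADOOP_HOME"
echo "SPARK_HOME=$SPARK_HOME"
echo "HIVE_HOME=$HIVE_HOME"
echo "JAVA_HOME=$JAVA_HOME"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"

# And pyspark itself should now resolve and launch:
pyspark --version
```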
Hadoop and Hive must already be installed, and Hive's metastore database (MySQL) must be configured. The metastore connection settings go in hive-site.xml:

vim $HIVE_HOME/libexec/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>mysql-username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mysql-password</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver