1. Spark Usage Notes
spark-sql on YARN
1. Running the spark-sql command line on YARN
The principle is the same as spark-shell on YARN: the packages Hive needs must be added to Spark's environment variables.
- Copy hive-site.xml to $SPARK_HOME/conf
- Add export HIVE_HOME=/usr/local/apache-hive-0.13.1-bin to spark-env.sh
- Add the following jars to Spark's environment variables:
  datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar, mysql-connector-java-5.1.15-bin.jar
2. Alternatively, these can be added directly to the SPARK_CLASSPATH variable in spark-env.sh:
cp hive-site.xml ../../spark-2.3.4-bin-hadoop2.6/conf/
cp mysql-connector-java-5.1.34-bin.jar ../../spark-2.3.4-bin-hadoop2.6/jars/
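Once hive-site.xml and the MySQL connector jar are in place, the setup can be smoke-tested from Python. This is only a sketch (not one of the original steps); it assumes pyspark is available and the Hive metastore is reachable:

from pyspark.sql import SparkSession

# Sketch: check that Spark can reach the Hive metastore.
# Assumes hive-site.xml is in $SPARK_HOME/conf and the MySQL connector
# jar has been copied to $SPARK_HOME/jars/ as shown above.
spark = (SparkSession.builder
         .appName("hive-classpath-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()
spark.stop()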
2. Integrating SparkSQL with Hive
1. Modify the Hive configuration
Reference: SparkSQL (1): Integrating SparkSQL with Hive
# 1. Edit $HIVE_HOME/conf/hive-site.xml and add the following:
vim $HIVE_HOME/conf/hive-site.xml
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hostname:9083</value>
</property>
# On this cluster the value is:
#   <value>thrift://master:9083</value>   (i.e. thrift://127.0.0.1:9083)
# 2. Copy the Hive configuration file to Spark
# 3. Copy the MySQL JDBC driver jar to Spark
# cp $HIVE_HOME/lib/mysql-connector-java-5.1.12.jar $SPARK_HOME/jars/
# Start the Hive metastore service
cd $HIVE_HOME/bin
./hive --service metastore &
# Start the interactive Spark shell
cd /usr/local/src/spark-2.3.4-bin-hadoop2.6/
./bin/spark-shell --master spark://hostname:7077
# ./bin/spark-shell --master spark://master:7077
# The following configuration has not been tested yet:
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>master</value>
</property>
Which approach to use depends on the value of the hive.metastore.uris parameter in the Hive configuration file:
- If the parameter is not set (the default), adding the driver for Hive's metadata database (i.e. the MySQL connector jar) to Spark's classpath is enough to complete the Spark/Hive integration.
- If it is set to the node where the metastore service runs (a non-empty value; in practice this is the more common setup), Spark connects to that metastore service (see the sketch below).
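For the second case, the metastore address can also be set explicitly when building the session. A minimal PySpark sketch, assuming the thrift://master:9083 value from the configuration above (normally this is picked up from hive-site.xml instead):

from pyspark.sql import SparkSession

# Sketch: point Spark at a remote Hive metastore service explicitly.
# In practice the value usually comes from hive-site.xml; setting it
# in code here is only for illustration.
spark = (SparkSession.builder
         .appName("hive-metastore-check")
         .config("hive.metastore.uris", "thrift://master:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show tables").show()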
2. Test the SparkSQL and Hive integration
# Create a table in Hive and insert some test data
cd /usr/local/src/spark-2.3.4-bin-hadoop2.6/
# Two ways to test
# 1. Test with spark-sql
./bin/spark-sql
# Run a SQL statement
# spark-sql (default)> select * from tableName;
# 2. Test with spark-shell
./bin/spark-shell
# Run Scala
# scala>spark.sqlContext
# scala>spark.sqlContext.sql("select * from tableName").show()
3. Preparation
# Start Hadoop:
cd /usr/local/src/hadoop-2.6.1/sbin
./start-all.sh
# Check status:
hadoop fs -ls /
http://192.168.93.10:50070/dfshealth.html#tab-overview
# Start ZooKeeper:
cd /usr/local/src/zookeeper-3.4.14/bin/
./zkServer.sh start
# Start HBase:
cd /usr/local/src/hbase-1.3.6/bin
./start-hbase.sh
# Start Spark:
cd /usr/local/src/spark-2.3.4-bin-hadoop2.6/bin
./spark-shell --master yarn
spark.sql("show tables").show;
spark.sql("select * from badou.employee").show;
# Note: .show returns Unit, so to keep the DataFrame for later use, assign it before calling show:
val data = spark.sql("select * from badou.employee")
data.show()
# Start-up order:
Hadoop => ZooKeeper => HBase => Spark
# Shutdown order:
Spark => HBase => ZooKeeper => Hadoop
# View data (web UI):
http://192.168.93.10:8040
4. Working with Hive data from SparkSQL
Reference: Common SparkSQL operations
create database badou;
use badou;
create table news_seg(sentence string);
load data local inpath '/database/badou/allfiles.txt' into table news_seg;
create table news_noseg as
  select split(regexp_replace(sentence, ' ', ''), '##@@##')[0] as sentence,
         split(regexp_replace(sentence, ' ', ''), '##@@##')[1] as label
  from news_seg;
select * from news_noseg limit 2;
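The tables created above can then be queried from Spark as DataFrames. A small PySpark sketch, assuming the badou database and the news_noseg table exist and Hive support is enabled as in section 2:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("news-query")
         .enableHiveSupport()
         .getOrCreate())

# Run the same HiveQL through Spark SQL; the result comes back as a DataFrame.
df = spark.sql("select sentence, label from badou.news_noseg limit 2")
df.show(truncate=False)

# Quick sanity check on the label distribution.
spark.sql("select label, count(*) as cnt from badou.news_noseg group by label").show()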
5. Configuring and running pyspark
# Create run.sh
touch run.sh
#!/bin/sh
cd $SPARK_HOME
./bin/spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYSPARK_PYTHON \
--master yarn-cluster \
--files $HIVE_HOME/conf/hive-site.xml \
--archives /work/spark/code/sparker.zip#ANACONDA \
/work/spark/code/NB_test.py
# Run run.sh
sh run.sh
Notes:
$PYSPARK_PYTHON is an environment variable configured in ~/.bashrc
# export PYSPARK_PYTHON=/root/anaconda3/bin/python
sparker.zip is the packaged conda environment created with Anaconda
# conda create --name sparker python=3.7 jieba numpy pandas
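For reference, a hypothetical minimal skeleton of a job like NB_test.py (the actual /work/spark/code/NB_test.py is not shown in these notes; the table name and the use of jieba below are assumptions for illustration only):

# Hypothetical skeleton of the pyspark job submitted by run.sh.
# Assumes Hive support and the badou.news_noseg table from section 4;
# the real NB_test.py may differ.
import jieba
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("NB_test")
         .enableHiveSupport()
         .getOrCreate())

# Segment each news sentence with jieba (shipped inside sparker.zip).
segment = udf(lambda s: " ".join(jieba.cut(s or "")), StringType())

news = spark.sql("select sentence, label from badou.news_noseg")
news.withColumn("words", segment("sentence")).show(5, truncate=False)

spark.stop()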