Spark and Hive are relatively simple to configure. To make it easier to use and test data with Spark, MySQL and Hive are set up alongside the Spark on YARN deployment, and Hive support is configured for Spark so that Spark can work with the data the same way Hive does.
Prerequisites
scala-2.11.11.tgz
spark-2.1.1-bin-hadoop2.7.tgz
hive-1.2.1.tar.gz
mysql-connector-java-5.1.43-bin.jar
Installing MySQL
Install MySQL via yum
Since MySQL is only used to store Hive's metadata, it only needs to be installed on one node.
1. Download the MySQL repo package
wget http://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
2. Install the MySQL repo
yum localinstall mysql57-community-release-el7-11.noarch.rpm
3. Check that the repo was installed successfully
yum repolist enabled | grep "mysql.*-community.*"
4. Install MySQL
yum install mysql-community-server
5. Start MySQL
systemctl start mysqld
6. Check the MySQL status
systemctl status mysqld
If active (running) is shown, the service started successfully.
7. Enable MySQL to start at boot
systemctl enable mysqld
systemctl daemon-reload
8. Change the root password for local login
//Get the generated default password, then log in and change it
grep 'temporary password' /var/log/mysqld.log
mysql -uroot -p
//Adjust a global parameter so the password can be changed
//Check whether the validate_password plugin is installed
SHOW VARIABLES LIKE 'validate_password%';
//Change the value of the validate_password_policy parameter
set global validate_password_policy=0;
//Set the password for the root account
set password for 'root'@'localhost'=password('rootroot');
9. Add a user for remote login
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'rootroot' WITH GRANT OPTION;
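To confirm the grant works, a quick check from another node might look like this (it assumes master is the node where MySQL was installed and that port 3306 is reachable, so open it in the firewall if necessary):
//Run from any other node in the cluster
mysql -h master -uroot -p -e "SELECT VERSION();"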
10. Set the default encoding to UTF-8
//Edit /etc/my.cnf and add the encoding settings under [mysqld]
character_set_server=utf8
init_connect='SET NAMES utf8'
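For these settings to take effect, MySQL has to be restarted; a quick check afterwards might look like this:
systemctl restart mysqld
mysql -uroot -p -e "SHOW VARIABLES LIKE 'character_set_server';"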
Installing Hive
On the master1 node
1. Create the HDFS directories and grant permissions
These steps are required; otherwise an error occurs later when the Hive metastore database is specified.
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod 777 /user/hive/warehouse
hdfs dfs -chmod 777 /user/hive/tmp
hdfs dfs -chmod 777 /user/hive/log
Add the environment variables
export HIVE_HOME=/usr/local/hive-1.2.1
export HIVE_CONF_DIR=$HIVE_HOME/conf
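For example, these can be appended to ~/.bashrc together with a PATH entry so the hive command is available from anywhere, and then reloaded (the choice of file is a matter of preference; the Spark variables later also go into ~/.bashrc):
//Append to ~/.bashrc
export PATH=$PATH:$HIVE_HOME/bin
//Reload and verify
source ~/.bashrc
echo $HIVE_HOME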
2. Create the MySQL database and designate it as the metastore
//Log in to MySQL and create a database named hive
create database hive;
//Create a hive user and grant it all privileges
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'rootroot';
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'rootroot' WITH GRANT OPTION;
//Copy the MySQL JDBC driver jar into the lib directory of the Hive installation
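For example, assuming the driver jar from the prerequisites is in the current directory:
cp mysql-connector-java-5.1.43-bin.jar /usr/local/hive-1.2.1/lib/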
3. Server-side configuration for remote mode (master node)
Edit the hive-site.xml configuration
vim /usr/local/hive-1.2.1/conf/hive-site.xml
//The full configuration is as follows
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>rootroot</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/usr/local/hive-1.2.1/iotmp/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/usr/local/hive-1.2.1/iotmp</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/usr/local/hive-1.2.1/iotmp</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/usr/local/hive-1.2.1/iotmp</value>
<description>Location of Hive run time structured log file</description>
</property>
</configuration>
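The configuration above points several local scratch and log paths at /usr/local/hive-1.2.1/iotmp; creating that directory up front avoids possible permission errors, and the metastore schema can optionally be initialized with schematool (both steps are precautionary; Hive can usually handle them on first start):
mkdir -p /usr/local/hive-1.2.1/iotmp/operation_logs
//Optional: initialize the metastore schema in MySQL
/usr/local/hive-1.2.1/bin/schematool -dbType mysql -initSchema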
4. Configure the other nodes as clients (master1/slave1/slave2/slave3)
Edit the hive-site.xml configuration
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
</property>
</configuration>
At this point, Hive's remote mode is fully configured.
Test whether Hive starts correctly
//Start the Hive metastore service on the master node
hive --service metastore &
//Start Hive on the master1 node
hive
Hive can query and display the data, MySQL stores the Hive metadata, and the corresponding table data is stored on HDFS. Hive is working correctly.
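For reference, a minimal smoke test might look like the following (the table name test_tbl is just an example):
//On master1, create a throwaway table and list the tables
hive -e "CREATE TABLE IF NOT EXISTS test_tbl (id INT, name STRING);"
hive -e "SHOW TABLES;"
//The table directory should appear under the warehouse path on HDFS
hdfs dfs -ls /user/hive/warehouse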
Spark on YARN Configuration
1. Unpack the Spark package
//Unpack to /usr/local/spark
tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
2. Add environment variables
vim ~/.bashrc
//Add
export SPARK_HOME=/usr/local/spark
//Append the following to PATH
$SPARK_HOME/bin:$SPARK_HOME/sbin
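Put together, the additions to ~/.bashrc might look like this (assuming PATH is exported in the same file), followed by reloading the file:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
source ~/.bashrc
//Quick check
which spark-submit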
3. Edit the spark-env.sh configuration file
//Add the following settings
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export JAVA_HOME=/usr/local/jdk1.8.0_144
export SPARK_HOME=/usr/local/spark
export SPARK_EXECUTOR_MEMORY=1G
export SPARK_EXECUTOR_CORES=1
export SPARK_WORKER_CORES=1
export SCALA_HOME=/usr/local/scala
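Note that spark-env.sh does not exist in a fresh Spark distribution; it is usually created from the bundled template before adding the settings above:
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh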
Test Spark on YARN
Use the SparkPi example that ships with Spark to run a test, specifying yarn as the master
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --num-executors 2 /usr/local/spark/examples/jars/spark-examples_2.11-2.1.1.jar 5
The application that YARN allocates to Spark can also be seen in the YARN web UI.
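Besides the web UI, the application can also be checked with the standard YARN command line; in the default client deploy mode the SparkPi result ("Pi is roughly ...") is printed to the console that submitted the job:
yarn application -list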
Spark SQL Access to Hive Data
1. Copy the hive-site.xml configuration file from the master node into the spark/conf directory
Contents of hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>rootroot</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>
2. Edit the spark-defaults.conf file
//Add the following setting to the configuration file
spark.sql.warehouse.dir /user/spark/warehouse
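If the directory configured above does not exist on HDFS yet, it can be created up front (the permissions here simply mirror the Hive directories created earlier; adjust to your own security requirements):
hdfs dfs -mkdir -p /user/spark/warehouse
hdfs dfs -chmod 777 /user/spark/warehouse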
3. Send the hive-site.xml and spark-defaults.conf configuration files to the other nodes
scp hive-site.xml hadoop@master1:/usr/local/spark/conf
scp hive-site.xml hadoop@slave1:/usr/local/spark/conf
scp hive-site.xml hadoop@slave2:/usr/local/spark/conf
scp hive-site.xml hadoop@slave3:/usr/local/spark/conf
scp spark-defaults.conf hadoop@master1:/usr/local/spark/conf
scp spark-defaults.conf hadoop@slave1:/usr/local/spark/conf
scp spark-defaults.conf hadoop@slave2:/usr/local/spark/conf
scp spark-defaults.conf hadoop@slave3:/usr/local/spark/conf
4. Put the MySQL driver jar into spark/jars
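For example, assuming the driver jar is still in the current directory and the same hostnames as above, the copy and distribution might look like this:
cp mysql-connector-java-5.1.43-bin.jar /usr/local/spark/jars/
scp /usr/local/spark/jars/mysql-connector-java-5.1.43-bin.jar hadoop@master1:/usr/local/spark/jars/
scp /usr/local/spark/jars/mysql-connector-java-5.1.43-bin.jar hadoop@slave1:/usr/local/spark/jars/
scp /usr/local/spark/jars/mysql-connector-java-5.1.43-bin.jar hadoop@slave2:/usr/local/spark/jars/
scp /usr/local/spark/jars/mysql-connector-java-5.1.43-bin.jar hadoop@slave3:/usr/local/spark/jars/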
With these settings in place, Spark SQL can operate on the Hive databases.
Test Spark SQL operations against Hive
Spark can access the data through SQL statements, and everything works correctly.
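For reference, one way to reproduce this check from the command line is the spark-sql CLI that ships with the Spark distribution (make sure the Hive metastore service started earlier is still running):
/usr/local/spark/bin/spark-sql --master yarn -e "show databases;"
/usr/local/spark/bin/spark-sql --master yarn -e "show tables;"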
If you have any comments or suggestions, please contact me. Thank you.