This Spark cluster setup builds on the Hadoop cluster from the previous post,
so the Spark version to download must match that Hadoop version.
1.Unpack Spark and add its home directory (SPARK_HOME) to /etc/profile (see the sketch below).
2.Install Scala and configure SCALA_HOME.
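For reference, a minimal sketch of the /etc/profile additions, assuming Spark is unpacked to /root/sparkapp/spark-2.3.0-bin-hadoop2.7 and Scala to /root/scala/scala-2.11.8 (the paths used later in this guide):
export SPARK_HOME=/root/sparkapp/spark-2.3.0-bin-hadoop2.7
export SCALA_HOME=/root/scala/scala-2.11.8
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin
Run source /etc/profile afterwards so the current shell picks up the new variables.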
3.Edit the spark-env.sh file in Spark's conf directory and add the following settings:
export JAVA_HOME=/root/java/jdk1.8.0_181
export HADOOP_HOME=/root/hadoop/hadoop-2.7.6
export HADOOP_CONF_DIR=/root/hadoop/hadoop-2.7.6/etc/hadoop
export SCALA_HOME=/root/scala/scala-2.11.8
export SPARK_MASTER_IP=192.168.124.132
4.Edit the slaves file in Spark's conf directory:
centos01
centos02
centos03
5.Distribute Spark to the other nodes, together with /etc/profile, for example to centos02 (repeat for centos03, or use the loop sketched below):
scp -r /etc/profile root@centos02:/etc/
scp -r ~/spark root@centos02:/root/
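A small loop (sketch, assuming passwordless SSH is already set up between the nodes) that covers both workers:
for host in centos02 centos03; do
  scp -r /etc/profile root@$host:/etc/
  scp -r ~/spark root@$host:/root/
done
Remember to run source /etc/profile on each node afterwards.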
6.Test submitting to YARN; the Spark examples directory contains a Pi-calculation jar:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster spark-examples_2.11-2.3.0.jar 10
Testing with spark-shell --master yarn throws the following exception:
2018-08-27 01:01:30 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
Open http://192.168.124.132:8088/cluster/app/application_1535337995441_0003
and check Diagnostics; it shows that the virtual memory limit was exceeded:
is running beyond virtual memory limits.
Current usage: 40.9 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Add the following to yarn-site.xml:
<!-- The two properties below were added to fix the spark-shell failure above in yarn client mode; spark-submit presumably hits the same problem. Configuring either one of them is enough, although setting both also works. -->
<!-- Whether the virtual memory check is enforced. If the actual virtual memory used exceeds the limit, Spark running in client mode may fail with "Yarn application has already ended! It might have been killed or unable to launch application master." -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<!-- Ratio of virtual memory to physical memory. The default is 2.1, and since physical memory per container defaults to 1 GB, that allows only 2.1 GB of virtual memory. -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
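The updated yarn-site.xml has to reach every NodeManager before it takes effect; a sketch, using the Hadoop paths from step 3:
for host in centos02 centos03; do
  scp /root/hadoop/hadoop-2.7.6/etc/hadoop/yarn-site.xml root@$host:/root/hadoop/hadoop-2.7.6/etc/hadoop/
done
/root/hadoop/hadoop-2.7.6/sbin/stop-yarn.sh
/root/hadoop/hadoop-2.7.6/sbin/start-yarn.sh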
7.Configure the Spark SQL metastore to be managed in MySQL.
Check whether MySQL is already installed on the machine: rpm -qa|grep -i mysql
wget http://repo.mysql.com/mysql57-community-release-el7-8.noarch.rpm
rpm -ivh mysql57-community-release-el7-8.noarch.rpm
yum install mysql-community-server
After the installation finishes, restart MySQL: service mysqld restart
Then look up the temporary root password: grep "password" /var/log/mysqld.log
Log in to MySQL and change the password: alter user 'root'@'localhost' identified by 'xxxx2018@spark';
Reload the privileges: flush privileges;
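Optionally, a quick sketch to make MySQL start on boot and to confirm the new password works (CentOS 7 service name mysqld):
systemctl enable mysqld
systemctl status mysqld
mysql -uroot -p -e "select version();"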
8.Copy the MySQL JDBC driver jar into Spark's jars directory (note: using the 5.x driver avoids the time-zone problem).
Download the MySQL JDBC driver; make sure the driver version matches the installed MySQL version:
https://dev.mysql.com/downloads/connector/j/
Unpack the rpm: rpm2cpio mysql-connector-java-8.0.12-1.el7.noarch.rpm | cpio -div
Copy the JDBC driver into Spark's jars directory: cp mysql-connector-java-8.0.12.jar ~/sparkapp/spark-2.3.0-bin-hadoop2.7/jars/
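A quick check (sketch) that the driver ended up where Spark will find it; note that the 8.0 connector's canonical driver class is com.mysql.cj.jdbc.Driver, while com.mysql.jdbc.Driver still works but logs a deprecation warning:
ls ~/sparkapp/spark-2.3.0-bin-hadoop2.7/jars/ | grep -i mysql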
9.Create a hive-site.xml file in Spark's conf directory:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hiveMetastore?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=false</value>
<description>hiveMetastore: the Hive metastore database stored in MySQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>mysql account</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>xxxx2018@spark</value>
<description>mysql password</description>
</property>
</configuration>
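To confirm the wiring (sketch): run spark-sql once, then check MySQL; the hiveMetastore schema and its tables should have been created on first use.
spark-sql -e "show databases;"
mysql -uroot -p -e "show tables from hiveMetastore;"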
10.In MySQL, create a database for the Spark metadata and the related views.
In the mysql client:
create database sparkmetastore;
use sparkmetastore;
create view databases_v
as select DBS.*
from hiveMetastore.DBS;
create view tables_v
as select TBLS.*, DBS.NAME
from hiveMetastore.DBS, hiveMetastore.TBLS
where TBLS.DB_ID=DBS.DB_ID AND TBLS.TBL_TYPE!='VIRTUAL_VIEW';
create view views_v
as select TBLS.*, DBS.NAME
from hiveMetastore.DBS, hiveMetastore.TBLS
where TBLS.DB_ID=DBS.DB_ID AND TBLS.TBL_TYPE='VIRTUAL_VIEW';
create view columns_v
as select COLUMNS_V2.*, TBLS.TBL_NAME, DBS.NAME
from hiveMetastore.DBS, hiveMetastore.TBLS, hiveMetastore.SDS, hiveMetastore.COLUMNS_V2
where DBS.DB_ID = TBLS.DB_ID AND TBLS.SD_ID = SDS.SD_ID AND SDS.CD_ID = COLUMNS_V2.CD_ID;
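A quick sanity check (sketch) that the views return rows once the metastore contains some databases and tables:
mysql -uroot -p -e "select * from sparkmetastore.databases_v; select * from sparkmetastore.tables_v;"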
11.Then create a MySQL account that spark-sql will use for the JDBC metadata queries:
CREATE USER 'spark'@'%' IDENTIFIED BY 'xxxx2018@spark';
GRANT SELECT ON sparkmetastore.* TO 'spark'@'%';
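To confirm remote access works for the new account (sketch, assuming MySQL is reachable on 192.168.124.132:3306):
mysql -h 192.168.124.132 -uspark -p -e "select * from sparkmetastore.databases_v;"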
12.In spark-sql, map the views back in so the metastore metadata can be queried from Spark SQL:
create database spark;
CREATE TABLE databases_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.databases_v", "user" "spark", "password" "xxxx2018@spark");
CREATE TABLE tables_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.tables_v", "user" "spark", "password" "xxxx2018@spark");
CREATE TABLE views_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.views_v", "user" "spark", "password" "xxxx2018@spark");
CREATE TABLE columns_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.columns_v", "user" "spark", "password" "xxxx2018@spark");
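Once the mappings exist, the metastore can be browsed from spark-sql, for example (sketch):
spark-sql -e "select * from spark.tables_v;"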
13.Start the Thrift server service with sbin/start-thriftserver.sh in the Spark directory.
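For example (sketch, assuming the default Thrift server port 10000), start the service and connect with the beeline client that ships with Spark:
~/sparkapp/spark-2.3.0-bin-hadoop2.7/sbin/start-thriftserver.sh
~/sparkapp/spark-2.3.0-bin-hadoop2.7/bin/beeline -u jdbc:hive2://centos01:10000 -n root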
14.Accessing HDFS remotely as root runs into permission problems.
Add this configuration to hdfs-site.xml:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>xxxx.com:50090</value>
</property>
<!-- Logging in to HDFS as root otherwise causes permission errors -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
<description>
If "true", enable permission checking in HDFS.
If "false", permission checking is turned off,
but all other behavior is unchanged.
Switching from one parameter value to the other does not change the mode,
owner or group of files or directories.
</description>
</property>
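After distributing the updated hdfs-site.xml to the other nodes (same scp pattern as for yarn-site.xml above), restart HDFS (sketch):
/root/hadoop/hadoop-2.7.6/sbin/stop-dfs.sh
/root/hadoop/hadoop-2.7.6/sbin/start-dfs.sh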
15.Configure spark-defaults.conf (be sure to store spark-warehouse on HDFS, otherwise errors occur):
spark.master yarn
spark.sql.warehouse.dir hdfs://centos01:9000/spark-warehouse
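A final end-to-end check (sketch; the table name smoke_test is just an example): create a throwaway table from spark-sql and confirm its directory shows up under the HDFS warehouse path.
spark-sql -e "create table smoke_test(id int);"
hdfs dfs -ls /spark-warehouse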