Spark Cluster Setup and MySQL Metadata Management

This Spark cluster is built on top of the Hadoop cluster from the previous post,
so the Spark version must be downloaded to match that Hadoop version.

1. Extract Spark and add its home directory (SPARK_HOME) to /etc/profile.

2. Install Scala and configure SCALA_HOME.
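    A minimal sketch of the /etc/profile additions for steps 1 and 2, assuming Spark is extracted to ~/sparkapp/spark-2.3.0-bin-hadoop2.7 (the jars path used later in this post) and Scala to /root/scala/scala-2.11.8; adjust to your actual install paths:
        # append to /etc/profile, then reload it with: source /etc/profile
        export SCALA_HOME=/root/scala/scala-2.11.8
        export SPARK_HOME=/root/sparkapp/spark-2.3.0-bin-hadoop2.7
        export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin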

3. Edit the spark-env.sh file in the Spark conf directory and add the following settings:
    export JAVA_HOME=/root/java/jdk1.8.0_181
    export HADOOP_HOME=/root/hadoop/hadoop-2.7.6
    export HADOOP_CONF_DIR=/root/hadoop/hadoop-2.7.6/etc/hadoop
    export SCALA_HOME=/root/scala/scala-2.11.8
    export SPARK_MASTER_IP=192.168.124.132

4. Edit the slaves file in the Spark conf directory:
    centos01
    centos02
    centos03

5. Distribute Spark to the other nodes, along with the /etc/profile file:
    scp -r /etc/profile root@centos02:/etc/
    scp -r ~/spark root@centos02:/root/

6. Test Spark's YARN submit mode. The examples/jars directory of the Spark install contains a Pi-calculation jar; run the command from that directory, or pass the full path to the jar:
    spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster spark-examples_2.11-2.3.0.jar 10

    Testing with spark-shell --master yarn throws an exception:
    2018-08-27 01:01:30 ERROR SparkContext:91 - Error initializing SparkContext.
    org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
    Open http://192.168.124.132:8088/cluster/app/application_1535337995441_0003 and
    check the Diagnostics field; it shows that the container exceeded its virtual memory limit:
    is running beyond virtual memory limits.
    Current usage: 40.9 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.

    Configure yarn-site.xml:
    <!-- The following settings were added to fix the spark-shell failure in yarn client mode; spark-submit likely hits the same problem. Either one of the two settings is enough on its own, though setting both is also fine. -->
    <!-- Whether the virtual memory check is enforced. If actual virtual memory use exceeds the limit, Spark running in client mode may fail with "Yarn application has already ended! It might have been killed or unable to launch application master." -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
        <description>Whether virtual memory limits will be enforced for containers</description>
    </property>
    <!-- Ratio of virtual to physical memory for containers; the default is 2.1, and with the default 1 GB of physical memory that allows only about 2.1 GB of virtual memory. -->
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
        <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
    </property>
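    After changing yarn-site.xml, the file has to be synced to every node and YARN restarted before the new limits take effect; a sketch assuming the Hadoop layout configured above:
        scp ~/hadoop/hadoop-2.7.6/etc/hadoop/yarn-site.xml root@centos02:/root/hadoop/hadoop-2.7.6/etc/hadoop/
        scp ~/hadoop/hadoop-2.7.6/etc/hadoop/yarn-site.xml root@centos03:/root/hadoop/hadoop-2.7.6/etc/hadoop/
        ~/hadoop/hadoop-2.7.6/sbin/stop-yarn.sh
        ~/hadoop/hadoop-2.7.6/sbin/start-yarn.sh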


7. Configure the Spark SQL metastore to be managed in MySQL.
    Check whether MySQL is already installed on the machine: rpm -qa | grep -i mysql
    wget http://repo.mysql.com/mysql57-community-release-el7-8.noarch.rpm
    rpm -ivh mysql57-community-release-el7-8.noarch.rpm
    yum install mysql-community-server

    After installation, restart MySQL: service mysqld restart
    Then look up the initial root password: grep "password" /var/log/mysqld.log
    Log in to MySQL and change the password: alter user 'root'@'localhost' identified by 'xxxx2018@spark';
    Flush privileges: flush privileges;
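    A quick check that the new password works, assuming the mysql client is on the PATH:
        mysql -uroot -p -e "select version();"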

8. Copy the JDBC driver jar into the Spark jars directory (note: the 5.x driver avoids the time-zone issue):
    Download the MySQL JDBC driver from https://dev.mysql.com/downloads/connector/j/ and make sure the driver version matches your MySQL version.
    Extract the rpm package: rpm2cpio mysql-connector-java-8.0.12-1.el7.noarch.rpm | cpio -div
    Copy the driver into the Spark jars directory: cp mysql-connector-java-8.0.12.jar ~/sparkapp/spark-2.3.0-bin-hadoop2.7/jars/
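    If you stay on the 8.x driver instead of dropping back to 5.x, the time-zone error can usually be avoided by adding serverTimezone to the JDBC URL, and the driver class becomes com.mysql.cj.jdbc.Driver; the corresponding values in the hive-site.xml of the next step would then look roughly like this (assuming UTC is acceptable for your data):
        <value>jdbc:mysql://localhost:3306/hiveMetastore?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=false&amp;serverTimezone=UTC</value>
        <value>com.mysql.cj.jdbc.Driver</value>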

9. Create a hive-site.xml file in the Spark conf directory:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hiveMetastore?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=false</value>
    <description>hiveMetastore: the metastore database kept in MySQL</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>mysql account</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>xxxx2018@spark</value>
    <description>mysql password</description>
  </property>
</configuration>
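    Before creating the views in the next step, the hiveMetastore schema has to exist in MySQL; the metastore tables are typically auto-created the first time Spark SQL runs with this hive-site.xml in place. A quick sanity check, assuming spark-sql and the mysql client are on the PATH:
        spark-sql -e "show databases;"
        mysql -uroot -p -e "use hiveMetastore; show tables;"    # tables such as DBS, TBLS and COLUMNS_V2 should now exist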

10. In MySQL, create a database for the Spark metadata views, and then the views themselves:
    mysql:
    create database sparkmetastore;
    use sparkmetastore;

    create view databases_v 
    as select DBS.* 
    from hiveMetastore.DBS;

    create view tables_v 
    as select TBLS.*, DBS.NAME 
    from hiveMetastore.DBS, hiveMetastore.TBLS 
    where TBLS.DB_ID=DBS.DB_ID AND TBLS.TBL_TYPE!='VIRTUAL_VIEW';

    create view views_v 
    as select TBLS.*, DBS.NAME 
    from hiveMetastore.DBS, hiveMetastore.TBLS 
    where TBLS.DB_ID=DBS.DB_ID AND TBLS.TBL_TYPE='VIRTUAL_VIEW';

    create view columns_v 
    as select COLUMNS_V2.*, TBLS.TBL_NAME, DBS.NAME 
    from hiveMetastore.DBS, hiveMetastore.TBLS, hiveMetastore.SDS, hiveMetastore.COLUMNS_V2
    where DBS.DB_ID = TBLS.DB_ID AND TBLS.SD_ID = SDS.SD_ID AND COLUMNS_V2.CD_ID = SDS.CD_ID;

11. Also create a MySQL user that Spark SQL will use for JDBC metadata queries:
    CREATE USER 'spark'@'%' IDENTIFIED BY 'xxxx2018@spark';
    GRANT SELECT ON sparkmetastore.* TO 'spark'@'%';
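    To confirm the grant took effect, a quick check from the root account (assuming the mysql client is on the PATH):
        mysql -uroot -p -e "show grants for 'spark'@'%';"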

12. In spark-sql, map the MySQL views back as JDBC tables so the metadata can be queried:
    create database spark;
    CREATE TABLE databases_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.databases_v","user" "spark", "password" "xxxx2018@spark")
    CREATE TABLE tables_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.tables_v","user" "spark", "password" "xxxx2018@spark")
    CREATE TABLE views_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.views_v","user" "spark", "password" "xxxx2018@spark")
    CREATE TABLE columns_v USING org.apache.spark.sql.jdbc OPTIONS("url" "jdbc:mysql://192.168.124.132:3306", "dbtable" "sparkmetastore.columns_v","user" "spark", "password" "xxxx2018@spark")
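    Once the mappings exist, the metadata can be queried from spark-sql like any other table; a couple of sample queries against columns exposed by the views above:
        select NAME from databases_v;
        select TBL_NAME, NAME from tables_v;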

13. Start the Thrift server with sbin/start-thriftserver.sh under the Spark directory.
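    The Thrift server listens on port 10000 by default, so a quick connection test with the beeline client shipped in Spark's bin directory might look like this (assuming the default port and that root is an acceptable user):
        beeline -u jdbc:hive2://192.168.124.132:10000 -n root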

14. When accessing HDFS remotely as root, permission errors can occur.
    Add the following to hdfs-site.xml:
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>xxxx.com:50090</value>
    </property>
    <!-- Disable permission checking, otherwise logging in to HDFS as root raises permission errors -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
        <description>
        If "true", enable permission checking in HDFS.
        If "false", permission checking is turned off,
        but all other behavior is unchanged.
        Switching from one parameter value to the other does not change the mode,
        owner or group of files or directories.
        </description>
    </property>

15. Configure spark-defaults.conf (make sure spark-warehouse is stored on HDFS, otherwise errors will occur):
    spark.master yarn
    spark.sql.warehouse.dir hdfs://centos01:9000/spark-warehouse
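    A quick way to confirm the warehouse location, assuming the HDFS client is on the PATH: create a throwaway table from spark-sql (the name warehouse_test below is just illustrative) and check that its directory appears under the configured path:
        spark-sql -e "create table warehouse_test(id int);"
        hdfs dfs -ls hdfs://centos01:9000/spark-warehouse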
