015 - Spark SQL Integration with Hive

1. Spark and Hive version requirements
    Spark 1.4.0
    Hive 0.13.0 (Note: do not use Hive 0.14 for this integration, or you will run into many problems)


2. Cluster planning (Hadoop runs in pseudo-distributed mode)

2.1 Hadoop pseudo-distributed installation plan (same layout as a full Hadoop cluster)
NameNode:        192.168.2.20
DataNode:        192.168.2.20
ResourceManager: 192.168.2.20
NodeManager:     192.168.2.20

2.2 Spark cluster plan
Master: 192.168.2.20   192.168.2.33
Worker: 192.168.2.33


2.3 Hive plan
Hive must be installed on both nodes, 192.168.2.20 and 192.168.2.33.

Note: Hive uses MySQL to store its metastore data.
[hadoop@mycluster conf]$ more hive-site.xml
<configuration>
<property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://mycluster:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
</property>

<property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>root</value>
</property>

<property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>123</value>
</property>
</configuration>
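
The metastore configuration above only works if the MySQL account named in hive-site.xml is allowed to connect from the cluster nodes. A minimal sketch of the MySQL side, assuming the MySQL server runs on mycluster and the root/123 credentials shown above (adjust to your own users and hosts):

[hadoop@mycluster ~]$ mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'root'@'%' IDENTIFIED BY '123';
mysql> FLUSH PRIVILEGES;

The hive database itself does not need to be created manually, because the connection URL carries createDatabaseIfNotExist=true.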

3. Spark configuration
3.1 Edit spark-env.sh under $SPARK_HOME/conf
[hadoop@mycluster ~]$ cd $SPARK_HOME
[hadoop@mycluster spark]$ more conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_IP=mycluster
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_76
export SCALA_HOME=/home/hadoop/app/scala
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0
export HIVE_HOME=/home/hadoop/app/hive
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.28.jar

Note: every Spark node must have the same configuration.

3.2 Copy Hive's hive-site.xml into $SPARK_HOME/conf

3.3 Copy the MySQL JDBC driver jar into $SPARK_HOME/lib (both copy steps are sketched below)
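
A sketch of steps 3.2 and 3.3 as shell commands, assuming Hive is installed under /home/hadoop/app/hive and Spark under /home/hadoop/app/spark (the paths and driver version follow the spark-env.sh above; adjust them to your layout), plus syncing to the second Spark node:

[hadoop@mycluster ~]$ cp /home/hadoop/app/hive/conf/hive-site.xml /home/hadoop/app/spark/conf/
[hadoop@mycluster ~]$ cp /home/hadoop/app/hive/lib/mysql-connector-java-5.1.28.jar /home/hadoop/app/spark/lib/
# keep every Spark node identical: copy the same files to 192.168.2.33 as well
[hadoop@mycluster ~]$ scp /home/hadoop/app/spark/conf/hive-site.xml hadoop@192.168.2.33:/home/hadoop/app/spark/conf/
[hadoop@mycluster ~]$ scp /home/hadoop/app/spark/lib/mysql-connector-java-5.1.28.jar hadoop@192.168.2.33:/home/hadoop/app/spark/lib/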

3.4 Spark prints a lot of INFO-level log messages; change the log level to WARN in log4j.properties to quiet it down
[hadoop@mycluster conf]$ more log4j.properties
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO


4. Starting spark-sql
4.1 Start the Spark cluster
cd $SPARK_HOME
sbin/start-all.sh
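
To confirm the standalone daemons came up, you can run jps on each node; a Master process should appear on the master node(s) and a Worker on 192.168.2.33 (the exact list also includes whatever Hadoop daemons are running). The standalone master web UI listens on port 8080 by default:

[hadoop@mycluster spark]$ jps
# then browse to http://mycluster:8080 to check that the Worker has registered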

4.2 Start the spark-sql CLI
bin/spark-sql
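
Launched with no arguments, bin/spark-sql typically runs in local mode. To run the CLI against the standalone cluster started in 4.1, the master URL can be passed explicitly; the host and the default port 7077 below follow the planning in section 2.2, and the memory setting is only illustrative:

[hadoop@mycluster spark]$ bin/spark-sql --master spark://mycluster:7077 --executor-memory 1g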

5. Verifying spark-sql
5.1 Run queries from the Hive CLI

(1) Launch the Hive CLI
[hadoop@mycluster hive]$ bin/hive

(2) Switch to the mydb database and list all tables
hive (default)> use mydb;
OK
Time taken: 1.833 seconds
hive (mydb)> show tables;
OK
tab_name
access_log
bucket_table
jfapp_pv
order
stu
stu_index_table
stu_view
t1
Time taken: 0.256 seconds, Fetched: 8 row(s)


(3) Run a count query
hive (mydb)> select count(*) from stu;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1441094211714_0002, Tracking URL = http://mycluster:8088/proxy/application_1441094211714_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0/bin/hadoop job  -kill job_1441094211714_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-01 04:05:58,315 Stage-1 map = 0%,  reduce = 0%
2015-09-01 04:06:18,500 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.54 sec
2015-09-01 04:06:28,524 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.78 sec
MapReduce Total cumulative CPU time: 4 seconds 780 msec
Ended Job = job_1441094211714_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.78 sec   HDFS Read: 291 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 780 msec
OK
_c0
9
Time taken:  85.028 seconds, Fetched: 1 row(s)

Note: Hive's count(*) runs as a MapReduce job, which makes it slow.

5.2 Use spark-sql just like Hive

(1) Launch the spark-sql CLI
[hadoop@mycluster spark]$ bin/spark-sql
(2) Switch to the mydb database and list all tables
spark-sql (default)> use mydb;
OK
result
Time taken: 1.293 seconds
spark-sql (default)> show tables;
tableName       isTemporary
access_log      false
bucket_table    false
jfapp_pv        false
order   false
stu     false
stu_index_table false
stu_view        false
t1      false
Time taken: 0.195 seconds, Fetched 8 row(s)


(3) Run the same count query
spark-sql (default)> select count(*) from stu;
_c0
9
Time taken : 5.894 seconds, Fetched 1 row(s)
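
The same Hive tables can also be queried programmatically instead of through the CLI. A minimal sketch using spark-shell and a HiveContext (the Spark 1.x entry point for Hive support), assuming the same configuration as above:

[hadoop@mycluster spark]$ bin/spark-shell
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)  // picks up hive-site.xml from $SPARK_HOME/conf
scala> hiveContext.sql("use mydb")
scala> hiveContext.sql("select count(*) from stu").show()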



Comparing the two runs, spark-sql is much faster than Hive for this query: roughly 85 seconds versus 6 seconds, about a 14x speed-up.
