1. Spark and Hive version requirements
spark-1.4.0
hive-0.13.0 (note: do not use Hive 0.14 for this integration; it causes many problems)
2. Cluster layout (Hadoop in pseudo-distributed mode)
2.1 Hadoop pseudo-distributed layout (installed the same way as a Hadoop cluster)
NameNode: 192.168.2.20
DataNode: 192.168.2.20
ResourceManager: 192.168.2.20
NodeManager: 192.168.2.20
2.2 Spark cluster layout
Master: 192.168.2.20, 192.168.2.33
Worker: 192.168.2.33
2.3 Hive layout
Hive must be installed on both nodes, 192.168.2.20 and 192.168.2.33.
Note: Hive stores its metadata in MySQL.
[hadoop@mycluster conf]$ more hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mycluster:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123</value>
  </property>
</configuration>
3. Spark configuration
3.1 Edit spark-env.sh under $SPARK_HOME/conf
Note: every Spark node must have the same configuration.
[hadoop@mycluster ~]$ cd $SPARK_HOME
[hadoop@mycluster spark]$ more conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_IP=mycluster
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_76
export SCALA_HOME=/home/hadoop/app/scala
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0
export HIVE_HOME=/home/hadoop/app/hive
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.28.jar
3.2 Copy Hive's hive-site.xml into $SPARK_HOME/conf.
3.3 Copy the MySQL JDBC driver jar into $SPARK_HOME/lib.
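Steps 3.2 and 3.3 amount to two file copies. A minimal sketch, assuming the install paths from spark-env.sh above and a Spark home of /home/hadoop/app/spark (adjust both to your environment; the jar name matches the one referenced in SPARK_CLASSPATH):

```shell
# Assumed install locations (see spark-env.sh above); adjust as needed
export HIVE_HOME=/home/hadoop/app/hive
export SPARK_HOME=/home/hadoop/app/spark

# 3.2: let Spark find the Hive metastore configuration
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/

# 3.3: put the MySQL JDBC driver on Spark's classpath
cp $HIVE_HOME/lib/mysql-connector-java-5.1.28.jar $SPARK_HOME/lib/
```

Remember to repeat both copies on every Spark node, since all nodes must share the same configuration.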
3.4 Spark prints a lot of INFO-level log output; change the log level in log4j.properties to WARN to quiet it.
[hadoop@mycluster conf]$ more log4j.properties
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
4. Start spark-sql
4.1 Start the Spark cluster
cd $SPARK_HOME
sbin/start-all.sh
4.2 Start spark-sql
bin/spark-sql
5. Verify spark-sql
5.1 Query from Hive
(1) Start the Hive CLI
[hadoop@mycluster hive]$ bin/hive
(2) Switch to the mydb database and list all tables
hive (default)> use mydb;
OK
Time taken: 1.833 seconds
hive (mydb)> show tables;
OK
tab_name
access_log
bucket_table
jfapp_pv
order
stu
stu_index_table
stu_view
t1
Time taken: 0.256 seconds, Fetched: 8 row(s)
(3) Run a count query
hive (mydb)> select count(*) from stu;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1441094211714_0002, Tracking URL = http://mycluster:8088/proxy/application_1441094211714_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0/bin/hadoop job -kill job_1441094211714_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-01 04:05:58,315 Stage-1 map = 0%, reduce = 0%
2015-09-01 04:06:18,500 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.54 sec
2015-09-01 04:06:28,524 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.78 sec
MapReduce Total cumulative CPU time: 4 seconds 780 msec
Ended Job = job_1441094211714_0002
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.78 sec HDFS Read: 291 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 780 msec
OK
_c0
9
Time taken: 85.028 seconds, Fetched: 1 row(s)
Note: Hive executes count through a MapReduce job, which is very inefficient.
5.2 Run the same operations in spark-sql
(1) Start the spark-sql CLI
[hadoop@mycluster spark]$ bin/spark-sql
(2) Switch to the mydb database and list all tables
spark-sql (default)> use mydb;
OK
result
Time taken: 1.293 seconds
spark-sql (default)> show tables;
tableName isTemporary
access_log false
bucket_table false
jfapp_pv false
order false
stu false
stu_index_table false
stu_view false
t1 false
Time taken: 0.195 seconds, Fetched 8 row(s)
(3) Run a count query
spark-sql (default)> select count(*) from stu;
_c0
9
Time taken: 5.894 seconds, Fetched 1 row(s)
Comparing the two runs, spark-sql is much faster than Hive for this query: about 85.0 seconds versus 5.9 seconds, roughly a 14x speedup.
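The speedup ratio can be checked directly from the two "Time taken" values reported above:

```shell
# Hive took 85.028 s, spark-sql took 5.894 s for the same count query
awk 'BEGIN { printf "%.1fx\n", 85.028 / 5.894 }'
# prints 14.4x
```

This is a single small query on a pseudo-distributed cluster, so the ratio mostly reflects MapReduce's per-job startup overhead rather than sustained throughput.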