Spark on Hive vs. Hive on Spark: Can't Tell Them Apart?
1 Spark on Hive
Spark on Hive means the Spark compute engine uses Hive as a data source. Spark can draw on JDBC, files, collections, and other sources; Hive is just one of the data sources Spark integrates (see the official Spark documentation for details). We enable the Hive data source by calling enableHiveSupport in code. Here, it is Spark that integrates Hive.
NOTE: here the client is Spark SQL, and Hive's only role is to supply the catalog (metadata).
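A minimal sketch of this setup from the command line (assumes a Spark build with Hive support and a hive-site.xml in $SPARK_HOME/conf pointing at the metastore):

```shell
# Spark SQL is the client here; Hive only supplies the metastore catalog.
spark-sql -e "SHOW DATABASES;"
# In application code, the equivalent switch is SparkSession.builder().enableHiveSupport().
```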
2 Hive on Spark
Hive on Spark: Hive on MapReduce is the default configuration. Just like MapReduce, Spark can serve as the query engine inside Hive's query processor, and Tez can likewise act as Hive's underlying engine. So here, it is Hive that integrates Spark (see the official Hive documentation for details).
This is similar to Hive on HBase, where Hive acts as a data-management tool over data stored in HBase, via Hive's HBase integration.
NOTE: here the client is the Hive CLI, and Spark serves only as the compute engine.
2.1 Version compatibility for Hive on Spark
Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is a list of Hive versions and their corresponding compatible Spark versions.
| Hive Version | Spark Version |
|---|---|
| master | 2.3.0 |
| 3.0.x | 2.3.0 |
| 2.3.x | 2.0.0 |
| 2.2.x | 1.6.0 |
| 2.1.x | 1.6.0 |
| 2.0.x | 1.5.0 |
| 1.2.x | 1.3.1 |
| 1.1.x | 1.2.0 |
2.2 Installing Spark
Hive currently supports Spark in standalone mode (with both client and cluster deploy modes) as well as Spark on YARN.
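A minimal sketch of wiring Hive to a Spark installation (the path is an assumption; either the environment variable or the Hive property works):

```shell
# Point Hive at the Spark installation; /opt/spark is a placeholder path.
export SPARK_HOME=/opt/spark
# Alternatively, inside the Hive session:
#   set spark.home=/opt/spark;
```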
2.3 YARN configuration for Hive on Spark
YARN must be configured to use the FairScheduler in place of the default CapacityScheduler, so that queries get a fair share of cluster resources.
See the FairScheduler documentation for configuration details.
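Concretely, the scheduler is switched in yarn-site.xml; a minimal sketch (the class name is the stock Hadoop FairScheduler):

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```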
2.4 Configuring Hive
- Before Hive 2.2.0, add spark-assembly-VERSION.jar to $HIVE_HOME/lib, either by copying it or by creating a symlink.
- From Hive 2.2.0 on, the assembly jar is no longer needed. For YARN mode, instead add the following to $HIVE_HOME/lib:
  - scala-library
  - spark-core
  - spark-network-common
- For local mode, follow the step-by-step guide on the official site.
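For Hive 2.2.0+ on YARN, the three jars above can be linked in like this (the jar file names and Scala/Spark versions are placeholders; match them to your Spark distribution):

```shell
# Symlink the required Spark jars into Hive's lib directory (versions are assumptions).
ln -s "$SPARK_HOME/jars/scala-library-2.11.8.jar"            "$HIVE_HOME/lib/"
ln -s "$SPARK_HOME/jars/spark-core_2.11-2.0.0.jar"           "$HIVE_HOME/lib/"
ln -s "$SPARK_HOME/jars/spark-network-common_2.11-2.0.0.jar" "$HIVE_HOME/lib/"
```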
2.5 Switching the execution engine to Spark
The following settings can be issued interactively in bin/hive or specified in hive-site.xml.
set hive.execution.engine=spark;
set spark.master=yarn-cluster;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=hdfs:///spark-log;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
The main settings are explained below:
spark.executor.memory
: Amount of memory to use per executor process.

spark.executor.cores
: Number of cores per executor.

spark.yarn.executor.memoryOverhead
: The amount of off-heap memory (in megabytes) to be allocated per executor when running Spark on YARN. This is memory that accounts for things like VM overheads, interned strings, and other native overheads. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this overhead covers that.

spark.executor.instances
: The number of executors assigned to each application.

spark.driver.memory
: The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.

spark.yarn.driver.memoryOverhead
: We recommend 400 (MB).
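To see how the executor settings interact, a rough sizing sketch (the max(384 MB, 10%) default for the overhead matches Spark 1.x/2.x on YARN; treat the numbers as illustrative):

```shell
# YARN is asked for executor.memory + memoryOverhead per executor container.
executor_memory_mb=512
# Default overhead in Spark 1.x/2.x on YARN: max(384, 10% of executor memory).
overhead_mb=$(( executor_memory_mb / 10 > 384 ? executor_memory_mb / 10 : 384 ))
echo "container size: $(( executor_memory_mb + overhead_mb )) MB"
```

With the 512m example above this comes to a 896 MB container request per executor, which is why small executor.memory values still consume noticeably more of the cluster than they suggest.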
2.6 Caching Spark's runtime jars on the YARN nodes
2.6.1 Before Hive 2.2.0
<!--set in hive-site.xml-->
<!--the spark-assembly jar (found under $SPARK_HOME/lib in Spark 1.x builds) must be uploaded to the HDFS path below-->
<property>
<name>spark.yarn.jar</name>
<value>hdfs://shufang101:8020/spark-assembly.jar</value>
</property>
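The referenced jar then has to exist at that HDFS path; a sketch of the upload (host and port copied from the example property above; the assembly jar lives under $SPARK_HOME/lib in Spark 1.x builds):

```shell
hdfs dfs -put "$SPARK_HOME"/lib/spark-assembly-*.jar hdfs://shufang101:8020/spark-assembly.jar
```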
2.6.2 Hive 2.2.0 and later
<!--set in hive-site.xml-->
<!--everything under $SPARK_HOME/jars must be uploaded to the HDFS directory below-->
<property>
<name>spark.yarn.jars</name>
<value>hdfs://shufang101:8020/spark-jars/*</value>
</property>
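And the matching upload for Hive 2.2.0+ (directory name copied from the example property above):

```shell
hdfs dfs -mkdir -p hdfs://shufang101:8020/spark-jars
hdfs dfs -put "$SPARK_HOME"/jars/* hdfs://shufang101:8020/spark-jars/
```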