Background
Hive's native execution engine is MapReduce. Because MapReduce does a great deal of disk I/O, it is barely acceptable for offline T+1 batch jobs; for interactive queries you would wait until the flowers wilt.
Is there a way to swap out the underlying engine? There certainly is. In fact MapReduce is essentially obsolete by now, replaced by Apache Tez, Spark, Impala, and the currently red-hot Flink.
This post will not cover them all; instead it introduces a setup that is still widely used, Spark on Hive: using spark-sql in place of hive-sql to operate on Hive, with Spark as the underlying engine. Thanks to Spark's in-memory computation and its DAG (directed acyclic graph) scheduler, queries commonly run 10-20x faster, and in extreme cases up to 100x.
Install Spark
First, you need Spark installed. For a Spark installation walkthrough, see the post: Spark从入门到放弃——Spark2.4.7安装和启动(二).
Add the hive-site.xml file
Copy the configuration file from your Hive installation into Spark's conf directory. The command is as follows; adjust the paths to match your own Hive and Spark installations:
cp /data/tools/apache-hive-2.3.5-bin/conf/hive-site.xml /data/tools/spark-2.4.7-bin-hadoop2.7/conf/
This is what enables Spark on Hive, i.e. using spark-sql to read and operate on the data stored in Hive.
Add mysql-connector-java-8.0.13.jar
Because Hive's metadata lives in MySQL, the MySQL driver jar that Hive uses, mysql-connector-java-8.0.13.jar, must be copied over as well:
cp /data/tools/apache-hive-2.3.5-bin/lib/mysql-connector-java-8.0.13.jar /data/tools/spark-2.4.7-bin-hadoop2.7/jars/
With that, Spark on Hive is truly in place.
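As a quick sanity check (a minimal sketch; it assumes a `spark-shell` session started on a node that can reach the Hive metastore), you can confirm that Spark actually sees the Hive databases:

```scala
// In spark-shell, `spark` is the pre-built SparkSession.
// If hive-site.xml was picked up, the Hive metastore's databases
// (e.g. dw) should appear here, not just "default".
spark.sql("show databases").show()
spark.sql("show tables in dw").show()
```

If only `default` shows up and it is empty, Spark is most likely using its own local metastore, which usually means hive-site.xml was not copied into the right conf directory.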
Testing spark-sql
First, test spark-sql itself: on node1, simply run spark-sql, as below.
[hadoop@node1 ~]$ spark-sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
……
After a long stream of log output, if nothing errors out you are basically in business (if something does, fix the corresponding issue first). Next, test a query against data in Hive. I created a student table in Hive beforehand, with just two columns, id and sname, and a single row:
| id | sname |
|---|---|
| 1 | rowyet |
Then our spark-sql test query looks like this:
spark-sql> select * from student;
20/12/14 02:27:33 INFO metastore.HiveMetaStore: 0: get_table : db=dw tbl=student
20/12/14 02:27:33 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=dw tbl=student
20/12/14 02:27:33 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 558.8 KB, free 412.7 MB)
20/12/14 02:27:34 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 51.4 KB, free 412.7 MB)
20/12/14 02:27:34 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on node1:37157 (size: 51.4 KB, free: 413.7 MB)
20/12/14 02:27:34 INFO spark.SparkContext: Created broadcast 1 from
20/12/14 02:27:35 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/14 02:27:35 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 3 output partitions
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Missing parents: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.2 KB, free 412.7 MB)
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 412.7 MB)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on node1:37157 (size: 4.4 KB, free: 413.7 MB)
20/12/14 02:27:35 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1184
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
20/12/14 02:27:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
20/12/14 02:27:35 INFO spark.ContextCleaner: Cleaned accumulator 36
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.144.138, executor 0, partition 0, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.144.140, executor 2, partition 1, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, 192.168.144.139, executor 1, partition 2, ANY, 7963 bytes)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.138:46066 (size: 4.4 KB, free: 1007.7 MB)
20/12/14 02:27:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.138:46066 (size: 51.4 KB, free: 1007.6 MB)
20/12/14 02:27:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.140:37936 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:28:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.140:37936 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:28:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 46439 ms on 192.168.144.138 (executor 0) (1/3)
20/12/14 02:28:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 59298 ms on 192.168.144.140 (executor 2) (2/3)
20/12/14 02:28:47 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.139:43240 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:29:01 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.139:43240 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:29:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 94410 ms on 192.168.144.139 (executor 1) (3/3)
20/12/14 02:29:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/12/14 02:29:09 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 94.516 s
20/12/14 02:29:09 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 94.576264 s
1 rowyet
Time taken: 97.013 seconds, Fetched 1 row(s)
All the usual Hive table operations, such as INSERT and SELECT, can be performed in spark-sql.
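For instance (a hedged sketch: the dw.student table is the one created above, and the inserted row is made up purely for illustration), the same statements can also be issued from spark-shell through spark.sql:

```scala
// DML and queries go through the same entry point.
// The inserted row (2, 'alice') is illustrative only.
spark.sql("insert into dw.student values (2, 'alice')")
spark.sql("select * from dw.student order by id").show()
```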
Extracting data with spark-sql inside spark-shell
A common scenario: you need to save the result of a spark-sql query to a file and hand it out to someone else. Since the cluster runs on Linux, the usual choice is to save it as a CSV file.
- repartition(1): to get a single output file, the data must be brought down to one partition. coalesce(1) would also work here, but coalesce(1) can bog down on large datasets, since it forces the whole upstream computation into a single task; repartition(1) is recommended.
- option("delimiter", "^"): use ^ as the column delimiter. I picked a character unlikely to appear in the column data; choose whatever suits your case.
- option("header", true): write the file with a header row.
- csv("/source/ods_rs_media_tbb_fview3_black_list.csv"): the output path, which lives on the Hadoop cluster's HDFS filesystem.
The full statement is below; afterwards you just pull the file down from HDFS (e.g. with hdfs dfs -get).
spark.sql("""
select building_id `项目id`
,type `类型`
from dw.ods_rs_media_tbb_fview3_black_list
where event_day='20201213'
""").repartition(1).write.option("delimiter", "^").option("header", true).csv("/source/ods_rs_media_tbb_fview3_black_list.csv")
Caveats when using spark-sql
Under the hood, spark-sql is of course just Spark, and Spark's default file format is parquet. So when creating Hive tables it is best to specify the parquet settings explicitly, along with snappy, the compression codec that pairs so well with parquet. For example:
CREATE EXTERNAL TABLE `dw.ods_rs_media_tbb_fview3_black_list`(
`building_id` bigint COMMENT '大楼ID',
`type` string COMMENT '类型')
COMMENT 'ods大楼黑名单'
PARTITIONED BY (
`event_week` int,
`event_day` string,
`event_hour` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://dw-cluster/hive/warehouse/ods/rs/media/ods_rs_media_tbb_fview3_black_list'
TBLPROPERTIES ('parquet.compression'='snappy')
;
- Use spark-shell to inspect the schema of the parquet files that spark-sql produced, and make sure it matches the column types in your CREATE TABLE statement; a mismatch easily leads to errors. For example, if building_id is IntegerType in the parquet file but the DDL declares building_id bigint COMMENT '大楼ID', then any spark-sql access to building_id on that table will fail with a type-incompatibility error. This is one more reason to prefer external tables: just drop the table, fix the column type in the DDL, recreate the table, and then run MSCK REPAIR TABLE ods_rs_media_tbb_fview3_black_list; to re-register the partition data, and the problem is solved.
scala> spark.read.parquet("/hive/warehouse/ods/rs/media/ods_rs_media_tbb_fview3_black_list/event_week=50/event_day=20201213/event_hour=00").schema.foreach(println)
StructField(building_id,LongType,true)
StructField(type,StringType,true)
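The drop-and-repair flow described above can be sketched as follows (an illustrative outline only; the DDL is the one shown earlier, with the mismatched column type corrected to match the parquet schema):

```scala
// External table: dropping it removes only the metastore entry,
// the parquet files on HDFS stay untouched.
spark.sql("drop table if exists dw.ods_rs_media_tbb_fview3_black_list")

// ...recreate the table here with the corrected column type
// (e.g. `building_id` int if the parquet schema says IntegerType)...

// Re-register the existing partition directories with the metastore.
spark.sql("MSCK REPAIR TABLE dw.ods_rs_media_tbb_fview3_black_list")
```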
That wraps up combining Hive with Spark; combinations of Hive with other technologies will be covered when we discuss those technologies. One point worth stressing: the query-speed gains from swapping out Hive's engine far outweigh the usual bag of so-called Hive optimization tricks, and it is the engine swap plus those tricks together that makes your Hive queries truly fly.