Hive from Beginner to Giving Up: Integrating Hive with Other Technologies, Spark on Hive (Part 17)

Background

Hive's native execution engine is MapReduce. Because MapReduce does a huge amount of disk I/O, it is barely tolerable for T+1 batch workloads; for interactive queries, you will be waiting until the flowers wilt.
Is there a way to swap out the underlying engine? Of course there is. MapReduce is essentially obsolete by now, replaced by Apache Tez, Spark, Impala, and the currently red-hot Flink.
We will not go through all of them here. Instead, this post covers a still widely used setup, Spark on Hive: use spark-sql instead of hive-sql to operate on Hive, with Spark as the underlying engine. Thanks to in-memory computation and the DAG (directed acyclic graph) scheduler, Spark typically runs 10-20x faster, and up to 100x in extreme cases.

Install Spark

First, you need Spark installed. For installation instructions, see the post Spark from Beginner to Giving Up: Installing and Starting Spark 2.4.7 (Part 2).

Add the hive-site.xml file

Copy the hive-site.xml from your Hive installation into Spark's conf directory with the command below, adjusting the paths to match your own Hive and Spark installations.

cp /data/tools/apache-hive-2.3.5-bin/conf/hive-site.xml /data/tools/spark-2.4.7-bin-hadoop2.7/conf/

The purpose is to enable Spark on Hive, i.e. to use spark-sql to read and operate on the data stored in Hive.
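
For context, this is the same mechanism your own Spark applications rely on (spark-sql and spark-shell do it for you): once Hive support is enabled, Spark reads hive-site.xml from its conf directory and talks to the metastore configured there. A minimal Scala sketch, assuming Spark 2.4; the application name is just an example:

import org.apache.spark.sql.SparkSession

// Enabling Hive support makes Spark pick up hive-site.xml from the conf directory
// (or the classpath) and use the Hive metastore it points to.
val spark = SparkSession.builder()
  .appName("spark-on-hive-demo")   // arbitrary example name
  .enableHiveSupport()
  .getOrCreate()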

Add mysql-connector-java-8.0.13.jar

Because Hive's metadata is stored in MySQL, the MySQL JDBC driver that Hive uses, mysql-connector-java-8.0.13.jar, needs to be copied over as well:

cp /data/tools/apache-hive-2.3.5-bin/lib/mysql-connector-java-8.0.13.jar /data/tools/spark-2.4.7-bin-hadoop2.7/jars/

With that, Spark on Hive is truly in place.
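
A quick sanity check from spark-shell (a sketch, assuming the two files above are in place): if the metastore connection works, the databases defined in Hive should be listed.

scala> spark.sql("show databases").show()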

Test spark-sql

Let's test spark-sql first. On node1, run spark-sql:

[hadoop@node1 ~]$ spark-sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/tools/spark-2.4.7-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/tools/hadoop-2.8.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
……

After a long burst of loading output, if nothing errors out you are basically good to go; if something does fail, fix the corresponding issue first. Then test a query against data that lives in Hive. I created a student table in Hive in advance, with just two columns, id and sname, and a single row:

Table 1: Records in the Hive table student

id      sname
1       rowyet

The spark-sql test query looks like this:

spark-sql> select * from student;
20/12/14 02:27:33 INFO metastore.HiveMetaStore: 0: get_table : db=dw tbl=student
20/12/14 02:27:33 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr      cmd=get_table : db=dw tbl=student
20/12/14 02:27:33 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 558.8 KB, free 412.7 MB)
20/12/14 02:27:34 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 51.4 KB, free 412.7 MB)
20/12/14 02:27:34 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on node1:37157 (size: 51.4 KB, free: 413.7 MB)
20/12/14 02:27:34 INFO spark.SparkContext: Created broadcast 1 from
20/12/14 02:27:35 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/14 02:27:35 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 3 output partitions
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Missing parents: List()
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376), which has no missing parents
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 8.2 KB, free 412.7 MB)
20/12/14 02:27:35 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.4 KB, free 412.7 MB)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on node1:37157 (size: 4.4 KB, free: 413.7 MB)
20/12/14 02:27:35 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1184
20/12/14 02:27:35 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (MapPartitionsRDD[6] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
20/12/14 02:27:35 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
20/12/14 02:27:35 INFO spark.ContextCleaner: Cleaned accumulator 36
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.144.138, executor 0, partition 0, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.144.140, executor 2, partition 1, ANY, 7963 bytes)
20/12/14 02:27:35 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, 192.168.144.139, executor 1, partition 2, ANY, 7963 bytes)
20/12/14 02:27:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.138:46066 (size: 4.4 KB, free: 1007.7 MB)
20/12/14 02:27:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.138:46066 (size: 51.4 KB, free: 1007.6 MB)
20/12/14 02:27:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.140:37936 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:28:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.140:37936 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:28:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 46439 ms on 192.168.144.138 (executor 0) (1/3)
20/12/14 02:28:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 59298 ms on 192.168.144.140 (executor 2) (2/3)
20/12/14 02:28:47 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.144.139:43240 (size: 4.4 KB, free: 1007.8 MB)
20/12/14 02:29:01 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.144.139:43240 (size: 51.4 KB, free: 1007.8 MB)
20/12/14 02:29:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 94410 ms on 192.168.144.139 (executor 1) (3/3)
20/12/14 02:29:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/12/14 02:29:09 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 94.516 s
20/12/14 02:29:09 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 94.576264 s
1       rowyet
Time taken: 97.013 seconds, Fetched 1 row(s)

Insert, select, and other operations on Hive tables can all be performed through spark-sql.
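
For instance, inserting a row and reading it back can also be done via spark.sql from spark-shell (introduced in the next section). A minimal sketch; the extra row inserted here is made up for illustration:

// dw.student is the demo table from above; the row (2, 'spark_on_hive') is a
// made-up example value.
spark.sql("INSERT INTO dw.student VALUES (2, 'spark_on_hive')")
spark.sql("SELECT id, sname FROM dw.student").show()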

Using spark-sql in spark-shell to extract data

A common scenario is that we need to save the result of a spark-sql query as a file and hand it out to others. Since the cluster runs on Linux, the result is usually saved as a CSV file. The options used in the snippet below are:

repartition(1): we want a single output file, so repartition to 1. coalesce(1) also works, but coalesce(1) tends to get stuck when the data volume is large, so repartition(1) is recommended;
option("delimiter", "^"): use ^ as the column delimiter; I picked a character that is unlikely to appear in the data, choose whatever suits your case;
option("header", true): write a header row into the output file;
csv("/source/ods_rs_media_tbb_fview3_black_list.csv"): the output path, which lives on the Hadoop cluster's HDFS file system;

The full statement is shown below; afterwards, just fetch the file from HDFS (e.g. with hdfs dfs -get).

spark.sql("""
select building_id `项目id`
      ,type `类型`
from dw.ods_rs_media_tbb_fview3_black_list
where event_day='20201213'
""").repartition(1).write.option("delimiter", "^").option("header", true).csv("/source/ods_rs_media_tbb_fview3_black_list.csv")

Notes on using spark-sql

  1. Under the hood spark-sql runs on Spark, and Spark's default file format is parquet, so when creating Hive tables it is best to specify the parquet-related settings, together with the snappy compression that pairs so well with parquet. For example:
CREATE EXTERNAL TABLE `dw.ods_rs_media_tbb_fview3_black_list`(
  `building_id` bigint COMMENT '大楼ID', 
  `type` string COMMENT '类型')
COMMENT 'ods大楼黑名单'
PARTITIONED BY ( 
  `event_week` int, 
  `event_day` string, 
  `event_hour` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://dw-cluster/hive/warehouse/ods/rs/media/ods_rs_media_tbb_fview3_black_list'
TBLPROPERTIES ('parquet.compression'='snappy')
;

  2. Use spark-shell to check the schema of the parquet files that spark-sql generates and make sure it matches the column types in the CREATE TABLE statement, otherwise exceptions are easy to hit. For example, if building_id in the parquet file is IntType but the table declares building_id bigint COMMENT '大楼ID', spark-sql will report a type-incompatibility error whenever it touches building_id on that table. This is one more reason to prefer external tables: simply drop the table, recreate it with the corrected column types, then run MSCK REPAIR TABLE ods_rs_media_tbb_fview3_black_list; to re-register the partition data, and the problem is solved. The schema check looks like this (see the repair sketch after it):
scala> spark.read.parquet("/hive/warehouse/ods/rs/media/ods_rs_media_tbb_fview3_black_list/event_week=50/event_day=20201213/event_hour=00").schema.foreach(println)
StructField(building_id,LongType,true)
StructField(type,StringType,true)
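
If the types do turn out to be inconsistent, the repair described above might look roughly like this (a sketch, run via spark.sql here, though the hive CLI works just as well; because the table is external, dropping it does not delete the underlying files):

// Drop only the table definition; the parquet files stay in place for an external table.
spark.sql("DROP TABLE IF EXISTS dw.ods_rs_media_tbb_fview3_black_list")
// Re-run the CREATE EXTERNAL TABLE statement above with the corrected column type
// (declared to match the parquet schema), then re-register the existing partitions:
spark.sql("MSCK REPAIR TABLE dw.ods_rs_media_tbb_fview3_black_list")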

That wraps up combining Hive with Spark; more combinations of Hive with other technologies will be covered in the posts on those technologies. One point worth emphasizing: swapping out Hive's execution engine brings far larger query-speed gains than the usual so-called Hive tuning tricks, and it is the combination of a new engine plus those tuning tricks that really makes Hive queries fly.
