如何写好一个hql

最新推荐文章于 2023-05-17 21:38:17 发布

¥程序猿¥

最新推荐文章于 2023-05-17 21:38:17 发布

阅读量1k

点赞数 2

分类专栏： Database 文章标签： sql hive 数据库

本文链接：https://blog.csdn.net/zhao_guo_feng/article/details/120866216

版权

Database 专栏收录该内容

27 篇文章 1 订阅

订阅专栏

本文介绍了Hive SQL查询的优化方法，包括列裁剪、分区裁剪、避免使用distinct、尽早过滤、使用insert into而非union all以及优化join操作。通过实例展示了如何通过选择性读取列、限制分区范围、合理使用group by、避免全表扫描和数据倾斜处理来提升查询效率。此外，还提到了mapjoin、空值处理和倾斜键的特殊优化策略。

摘要由CSDN通过智能技术生成

简介

作为一个数据开发工程师，hive sql是我们必备的技能，好的hql语句让我们事半功倍。

列裁剪

所谓列裁剪就是在查询时只读取需要的列。以我们的日历记录表为例：

#错误示范
select 
	*
from user_info 
where age = 18

#正确示范
select 
	uid
	,uname
	,sex 
from user_info 
where age = 18

当列很多时，如果select * ，全列扫描效率很低。
建议：不要图省事select *，应该进行列剪裁。

分区裁剪

所谓分区裁剪就是在查询时只读取需要的分区。以我们的日历记录表为例：

#错误示范
select 
	uid
	,uname
	,sex 
from user_info 
where age = 18

#正确示范
select 
	uid
	,uname
	,sex 
from user_info 
where dt_ymd >= '20190201' 
and dt_ymd <= '20190224'
and age = 18

当数据量很大时，如果不指定分区，全表扫描效率很低。
建议：按照时间分区的hive表，查询时要加上时间限制，默认从所有数据进行遍历。

不使用distinct

所谓distinct就是在查询时对相应字段数据进行去重。以我们的日历记录表为例：

#错误示范
select 
    distinct uid
	,uname
	,sex 
from user_info 

#正确示范
select 
    uid
	,uname
	,sex 
from user_info  
group by uid
	,uname
	,sex

当数据量很大时，如果使用distinct，效率很低。
建议：用group by替代。

尽早过滤

所谓尽早过滤，就是在提前查询对数据集筛选过滤，减少数据量。以我们的日历记录表为例：

#错误示范
select 
    user_id
    ,uid
	,uname
	,sex 
from user_info t1
left join
orer_info t2 on t1.user_id = t2.user_id
where t1.age = 18

#正确示范
select 
    uid
	,uname
	,sex 
from 
（select user_id,uid,uname,sex from  user_info  where age = 18）t1
left join
orer_info t2 on t1.user_id = t2.user_id

insert into代替union all

#错误示范
insert overwtite table xxxx
select a, b, c from table1
union all 
select d, e, f from table2
union all 
select h, i, j from table3
 
#正确示范 1
drop table if exists table xxxx;
insert into table xxxx
select a, b, c from table1;
insert into table xxxx
select d, e, f from table2;
insert into table xxxx
select h, i, j from table3;

#正确示范 2
insert into table xxxx1
select a, b, c from table1;
insert into table xxxx2
select d, e, f from table2;
insert into table xxxx3
select h, i, j from table3;

insert overwtite table xxxx
select a, b, c from xxxx1
union all 
select d, e, f from xxxx2
union all 
select h, i, j from xxxx3

union all替代join

#错误示范
select 
	ep.productid
	,productname
	,count(st.tduserid)
	,count(distinct sl.tduserid)
	,count(distinct sn.tduserid)
	,avg(sl.interval_level)
from(select 
		productid
		,productname 
	from xxx.product 
	where productid = '3006090'
	) ep
join(select 
		tduserid
		,productid 
	from xxx_page_ex 
	where l_date <= '2019-04-07' 
		and l_date >= date_add('2019-04-07', -6)
	) st
on ep.productid=st.productid
join(select 
		tduserid
		,interval_level
		,productid 
	from xxx_launch_ex 
	where l_date <= '2019-04-07' 
	and l_date >= date_add('2019-04-07', -6)
	) sl
on st.productid=sl.productid
join(select 
		tduserid
		,productid 
	from xxx_newuser_ex 
	where l_date <= '2019-04-07' 
	and l_date >= date_add('2019-04-07', -6)
	) sn
on sl.productid=sn.productid
group by ep.productid
	    ,productname
;
	
刚开始然后写出的hql语句基本没啥优化，然后在生产集群跑了20分钟没跑完。用union all只跑了1m26s，写起来可能复杂些，不多说代码如下：

#正确示范
 select 
       '2019-04-07' dates,
       '3006090' productid,
       max(pro) productname,
       sum(pv) pv,
       sum(uv) uv,
       cast(sum(duration) as decimal(10,4)) duration,
       sum(new_uv) new_uv
from 
(select productname pro,
       '0' pv,
       '0' uv,
       '0' duration,
       '0' new_uv
       from xxx.product where productid = '3006090'
union all
select '0' pro,
       count(tduserid) pv,
       '0' uv,
       '0' duration,
       '0' new_uv
       from xxx_page_ex 
       where l_date <= '2019-04-07' 
       and l_date >= date_add('2019-04-07', -6) 
       and productid = '3006090'
union all
select '0' pro,
       '0' pv,
       count(distinct tduserid) uv,
       avg(interval_level) duration,
       '0' new_uv
       from xxx_launch_ex 
       where l_date <= '2019-04-07' 
       and l_date >= date_add('2019-04-07', -6) 
       and productid = '3006090'
union all
select '0' pro,
       '0' pv,
       '0' uv,
       '0' duration,
       count(distinct tduserid) new_uv
       from xxx_newuser_ex 
       where l_date <= '2019-04-07' 
       and l_date >= date_add('2019-04-07', -6) 
       and productid = '3006090'
) t;

那么数值的可以用sum求和，当有汉字时怎么办呢，用max就可以解决这个问题。

join基础优化

join优化是一个复杂的话题，下面先说5点最基本的注意事项。

build table（小表）前置

在最常见的hash join方法中，一般总有一张相对小的表和一张相对大的表，小表叫build table，大表叫probe table。如下图所示。
在这里插入图片描述
Hive在解析带join的SQL语句时，会默认将最后一个表作为probe table，将前面的表作为build table并试图将它们读进内存。如果表顺序写反，probe table在前面，引发OOM的风险就高了。
在维度建模数据仓库中，事实表就是probe table，维度表就是build table。假设现在要将日历记录事实表和记录项编码维度表来join：

#错误写法
select a.event_type,a.event_code,a.event_desc,b.upload_time
from  (
select event_type,upload_time from calendar_record_log
where pt_date = 20190225
) a
inner join
calendar_event_code b
on a.event_type = b.event_type;

#正确写法
select a.event_type,a.event_code,a.event_desc,b.upload_time
from calendar_event_code a
inner join (
select event_type,upload_time from calendar_record_log
where pt_date = 20190225
) b on a.event_type = b.event_type;

多表join时key相同
这种情况会将多个join合并为一个MR job来处理，例如：

错误写法
select a.event_type,a.event_code,a.event_desc,b.upload_time
from calendar_event_code a
inner join (
select event_type,upload_time from calendar_record_log
where pt_date = 20190225
) b on a.event_type = b.event_type
inner join (
select event_type,event_code,upload_time from calendar_record_log_2
where pt_date = 20190225
) c on a.event_code = c.event_code;

正确写法
select a.event_type,a.event_code,a.event_desc,b.upload_time
from calendar_event_code a
inner join (
select event_type,upload_time from calendar_record_log
where pt_date = 20190225
) b on a.event_type = b.event_type
inner join (
select event_type,event_code,upload_time from calendar_record_log_2
where pt_date = 20190225
) c on a.event_type = c.event_type;

如果上面两个join的条件不相同，比如改成a.event_code = c.event_code，就会拆成两个MR job计算。
负责这个的是相关性优化器CorrelationOptimizer，它的功能除此之外还非常多，逻辑复杂，参考Hive官方的文档可以获得更多细节：https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer

利用map join特性

map join特别适合大小表join的情况。Hive会将build table和probe table在map端直接完成join过程，消灭了reduce，效率很高。

#错误写法
select 
     t1.order_id 
     ,t2.city_name
from edw.bs_order t1 
left join edw.bs_city t2 on t1.city_id = t2.city_id 
where t1.dt_ymd = '20191212'

#正确写法
select /*+ mapjoin(bs_city) */
     t1.order_id 
     ,t2.city_name
from edw.bs_order t1 
left join edw.bs_city t2 on t1.city_id = t2.city_id 
where t1.dt_ymd = '20191212'

上面的语句中加了一条map join hint，以显式启用map join特性。早在Hive 0.8版本之后，就不需要写这条hint了。map join还支持不等值连接，应用更加灵活。
map join的配置项是hive.auto.convert.join，默认值true，对应逻辑优化器是MapJoinProcessor。
还有一些参数用来控制map join的行为，比如hive.mapjoin.smalltable.filesize，当build table大小小于该值就会启用map join，默认值25000000（25MB）。还有hive.mapjoin.cache.numrows，表示缓存build table的多少行数据到内存，默认值25000。

注意：多表关联，可以使用mapjoin强制将某个表（数量少的表）写入内存中，以提高查询关联效率。

分桶表map join

map join对分桶表还有特别的优化。由于分桶表是基于一列进行hash存储的，因此非常适合抽样（按桶或按块抽样）。
它对应的配置项是hive.optimize.bucketmapjoin，优化器是BucketMapJoinOptimizer。但我们的业务中用分桶表较少，所以就不班门弄斧了，只是提一句。

倾斜均衡配置项

这个配置与上面group by的倾斜均衡配置项异曲同工，通过hive.optimize.skewjoin来配置，默认false。

如果开启了，在join过程中Hive会将计数超过阈值hive.skewjoin.key（默认100000）的倾斜key对应的行临时写进文件中，然后再启动另一个job做map join生成结果。通过hive.skewjoin.mapjoin.map.tasks参数还可以控制第二个job的mapper数量，默认10000。
再重复一遍，通过自带的配置项经常不能解决数据倾斜问题。join是数据倾斜的重灾区，后面还要介绍在SQL层面处理倾斜的各种方法。

优化SQL处理join数据倾斜

空值或无意义值

这种情况很常见，比如当事实表是日志类数据时，往往会有一些项没有记录到，我们视情况会将它置为null，或者空字符串、-1等。如果缺失的项很多，在做join时这些空值就会非常集中，拖累进度。

因此，若不需要空值数据，就提前写where语句过滤掉。需要保留的话，将空值key用随机方式打散，例如将用户ID为null的记录随机改为负值：

select a.uid,a.event_type,b.nickname,b.age
from (
select
  (case when uid is null then cast(rand()*-10240 as int) else uid end) as uid,
  event_type from calendar_record_log
where pt_date >= 20190201
) a left outer join (
select uid,nickname,age from user_info where status = 4
) b on a.uid = b.uid;

单独处理倾斜key

这其实是上面处理空值方法的拓展，不过倾斜的key变成了有意义的。一般来讲倾斜的key都很少，我们可以将它们抽样出来，对应的行单独存入临时表中，然后打上一个较小的随机数前缀（比如0~9），最后再进行聚合。SQL语句与上面的相仿，不再赘述。

不同数据类型

这种情况不太常见，主要出现在相同业务含义的列发生过逻辑上的变化时。
举个例子，假如我们有一旧一新两张日历记录表，旧表的记录类型字段是(event_type int)，新表的是(event_type string)。为了兼容旧版记录，新表的event_type也会以字符串形式存储旧版的值，比如’17’。当这两张表join时，经常要耗费很长时间。其原因就是如果不转换类型，计算key的hash值时默认是以int型做的，这就导致所有“真正的”string型key都分配到一个reducer上。所以要注意类型转换：

select a.uid,a.event_type,b.record_data
from calendar_record_log a
left outer join (
select uid,event_type from calendar_record_log_2
where pt_date = 20190228
) b on a.uid = b.uid and b.event_type = cast(a.event_type as string)
where a.pt_date = 20190228;

build table过大

有时，build table会大到无法直接使用map join的地步，比如全量用户维度表，而使用普通join又有数据分布不均的问题。这时就要充分利用probe table的限制条件，削减build table的数据量，再使用map join解决。代价就是需要进行两次join。举个例子：

select /*+mapjoin(b)*/ a.uid,a.event_type,b.status,b.extra_info
from calendar_record_log a
left outer join (
select /*+mapjoin(s)*/ t.uid,t.status,t.extra_info
from (select distinct uid from calendar_record_log where pt_date = 20190228) s
inner join user_info t on s.uid = t.uid
) b on a.uid = b.uid
where a.pt_date = 20190228;

分区在以下情况下失效，应注意使用。

只要对字段使用函数，该字段的索引不起作用

#错误写法
select 
	order_id
	,city_id
	,courier_id
	,user_id
from order_info
where substr(dt_ymd,1,6)='202109'

#正确写法
select 
	order_id
	,city_id
	,courier_id
	,user_id
from order_info
where dt_ymd>='20210901' 
and dt_ymd<='20210931'