sql/hql的代码积累与优化

九筒-

已于 2022-02-21 16:32:32 修改

阅读量421

点赞数

分类专栏：大数据数据挖掘文章标签： 1024程序员节 sql hql

于 2021-10-24 16:11:52 首次发布

本文链接：https://blog.csdn.net/weixin_44343282/article/details/120935605

版权

大数据同时被 2 个专栏收录

18 篇文章 0 订阅

订阅专栏

数据挖掘

3 篇文章 0 订阅

订阅专栏

在写数仓过程中，积累了一些sql代码优化技巧，总结分享一下（大多是hql代码）

优化

使用count(1) 替代 count(*)

一、从执行效果来看

count(1) and count()：
基本没差别
count()包括了所有的列，相当于行数，在统计结果的时候，不会忽略NULL
count(1) and count(列名)：
count(1) 会统计表中的所有的记录数，不会忽略NULL，包含字段为null 的记录。
count(列名) 会统计该列字段在表中出现的次数，会忽略字段为null 的情况，即不统计字段为null 的记录。

二、从执行效率来看
若列名为主键，count(列名)会比count(1)快
若列名不为主键，count(1)会比count(列名)快
若表多个列并且没有主键，则 count（1）的执行效率优于 count（）
若表有主键，则 select count（主键）的执行效率是最优的
若表只有一个字段，则 select count（）最优。
所以实际业务中一般用count(1)比较普遍，但是如果需要聚合多个列，则用count(列名)比较合适。

减少like的使用
业务场景中，曾误将精确匹配（in）的需求理解成模糊匹配（like），写过5个字段，每个字段用like模糊匹配千百条字符串。这个执行效率异常堪忧
语句中涉及到计算,计算使用的是聚合后的数值，将聚合函数作为子查询，聚合函数结果在外查询中直接调用

--避免二次聚合
select
cnt_a / cnt_b as rate
from(
	select
	count(a) as cnt_a,
	count(b) as cnt_b,
	--count(a) / count(b) as rate
	from xxx
) as t1

判空函数coalesce(a,0)
判断字段a为null时，用0替换

积累

多字段开窗函数
over()开窗函数可以对多个字段进行开窗

--对articleid和app开窗分组
max(readpv_or_playvv) over(partition by articleid, app) as readpv_or_playvv, --按articleid和app进行分组
max(likevv) over(partition by articleid, app) as likevv, 
max(commentpv) over(partition by articleid, app) as commentpv

count查询重复值

select
a,b,c
count(1) as cnt
from
where
group by a,b,c
having count(1) > 1

group by 多字段去重

select
a,b,c
from xxx
where xxx
group by a,b,c

order by 多字段排序

--先按rank升序，相同按id升序
select
id,
number,
dense_rank() over(order by number desc) rk
from passing_number
order by rk asc, id asc

distinct 多字段去重

Select
Distinct col1, col2, col3  --即可对多个字段去重，

select
Distinct(col1), col2, col3 --对单字段去重

in 多字段

select 
* 
from b 
where (aaa,bbb) not in ( 
	select 
		aaa,bbb 
	from a 
)

exist 多字段

select 
	* 
from b 
where not exists ( 
	select 
		* 
	from a 
	where a.aaa=b.aaa and a.bbb=b.bbb
)

使用开窗函数去重

--对a,b,c分组开窗排序，取出每组的第一个
select
	a,b,c
from
(
    select
    	a,b,c
    row_number() over(partition by a,b,c order by a) as rk
    from xxx
)t_rank
where rk  = 1

踩坑

avg()、sum()、count()对null值的处理

avg()统计字段中的null值不参与计算
sum()统计字段中的null值不参与计算
count(列名)，null值不参与计算；count(1)null值参与计算
踩坑：case when 不满足条件返回0时，0会参与avg()统计计算，导致计算结果偏小

avg(case when action_type = 3 and (action_result = 0 or action_result = 1) and evaluate_cost_time != -1 then evaluate_cost_time else 0 end) as avg_evaluate_cost_time,

应该用null替换0

avg(case when action_type = 3 and (action_result = 0 or action_result = 1) and evaluate_cost_time != -1 then evaluate_cost_time else NULL end) as avg_evaluate_cost_time

九筒-

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
sql/hql的代码积累与优化

在写数仓过程中，积累了一些sql代码优化技巧，总结分享一下（大多是hql代码）优化使用count(1) 替代 count(*)一、从执行效果来看count(1) and count()：基本没差别count()包括了所有的列，相当于行数，在统计结果的时候，不会忽略NULLcount(1) and count(列名)：count(1) 会统计表中的所有的记录数，不会忽略NULL，包含字段为null 的记录。count(列名) 会统计该列字段在表中出现的次数，会忽略字段为null 的情
复制链接

扫一扫