N reasons why big data SQL gets slow

G7N3F

Original link: http://aliyunhelp.doc.html

1. Slow map stage -- caused by hashmap

select ...
from (
select ds
,unique_id
,pre_page
from cbucdm.tmp_dwd_cn_log_app_ut_1
where ds='${bizdate}'
and pre_page is not null
) a
left outer join
(select t.*
,length(t.page_type_rule) rule_length
from cbucdm.dim_cn_app_page_ut t
where ds='${bizdate}'
and is_enable = 'Y'
) b
on 1=1
where a.pre_page rlike b.page_type_rule ;

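The logs show that the long-tail instances process relatively little data, yet a single instance takes up to 40 minutes. Profiling the value distribution of the table's columns shows that many pre_page values occur only once, while a few hot values occur very frequently, so the overall distribution is highly skewed. As a result, the hot keys are joined with the small table as a Cartesian product on the map side, which is very time-consuming and produces a long tail in the map stage. In this situation, distribute by rand() can be used to shuffle the data so that it is spread as evenly as possible.
The modified code is as follows: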

select ...
from (
select ds
,unique_id
,pre_page
from cbucdm.tmp_dwd_cn_log_app_ut_1
where ds='${bizdate}'
and pre_page is not null
distribute by rand()
) a
left outer join
(select t.*
,length(t.page_type_rule) rule_length
from cbucdm.dim_cn_app_page_ut t
where ds='${bizdate}'
and is_enable = 'Y'
) b
on 1=1
where a.pre_page rlike b.page_type_rule ;
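
The distribution analysis described in this section could be run with a query along the following lines. This is only a sketch, not the author's actual query: it reuses the table, partition filter, and column from the statement above, and the limit of 20 is an arbitrary value chosen for illustration.

```sql
-- Sketch only: lists the most frequent pre_page values for one partition.
select pre_page
,count(1) as cnt
from cbucdm.tmp_dwd_cn_log_app_ut_1
where ds='${bizdate}'
and pre_page is not null
group by pre_page
order by cnt desc
limit 20;   -- 20 is an arbitrary cut-off chosen for illustration
```

If a handful of values account for most of the rows while the rest appear once, the column is skewed in the way described here.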

2. Long tail on the reduce side caused by multiple distincts

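Distinct is a syntax supported by ODPS SQL for deduplicating a column. It matters a great deal: computing the number of paying buyers or the visitor UV over a period of time, for example, requires deduplication. In ODPS, Distinct is executed by combining the Distinct column with the Group By columns into a single Key and distributing the data to the reduce side by that Key.
Because of the Distinct operation, the data cannot first be aggregated by the Group By columns during the map-side shuffle to reduce the volume transferred. Instead, all of the data is sent to the reduce side. When the Keys are distributed unevenly, this causes a long tail on the reduce side. When several Distincts appear in the same SQL statement, the data is distributed several times, which not only inflates the data N times but also amplifies the long tail N times.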

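If the data is skewed, every distinct in the statement below (slow because of the hot, skewed reducer) and every count (aggregated on a single reducer) is slow. The statement repeats this distinct and count work once for each distinct it contains, so it runs very slowly.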

select
browser_type,
count(1) as pv,
count(distinct uniq_id) as uv,
count(distinct client_ip) as ip_cnt,
count(distinct session_id) as session_cnt,
count(distinct apay_aid) as apay_aid_cnt,
count(distinct apay_uid) as apay_uid_cnt
from dw_log where xx=xx group by xxx;

Reference: https://datavalley.github.io/2016/02/15/Hive%E4%B9%8BCOUNT-DISTINCT%E4%BC%98%E5%8C%96
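
A commonly used way to avoid count(distinct ...) is to group by the distinct column first and then count the resulting groups. The sketch below shows this for uniq_id only. It is not taken from the original post, and it assumes that the elided group by column is browser_type and that the placeholder filter xx=xx stays unchanged.

```sql
-- Sketch only: two-stage aggregation replacing count(distinct uniq_id).
-- Assumes the original statement groups by browser_type.
select browser_type
,sum(pv) as pv          -- total rows, same result as count(1) on the raw table
,count(uniq_id) as uv   -- one row per distinct uniq_id; NULL is skipped, as with count(distinct)
from (
select browser_type
,uniq_id
,count(1) as pv         -- partial count per (browser_type, uniq_id)
from dw_log
where xx=xx             -- placeholder filter kept from the original statement
group by browser_type
,uniq_id
) t
group by browser_type;
```

Because the inner query is a plain group by, it can be partially aggregated on the map side, so a hot uniq_id no longer sends all of its rows to one reducer. The other distinct columns could be handled with the same pattern and the results joined on browser_type. Whether this is actually faster depends on the data and should be verified on the real table.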

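As an aside, a plain select count(1) is not slow because the count can already be computed in the map stage. The reduce stage then only has to sum the counts produced by each map.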