Hive Optimizations for distinct (Part 2)

我的学长是王欣

Categories: Data Warehouse, hive, Big Data. Tags: hive, data warehouse, big data

Article link: https://blog.csdn.net/z3jjlzt/article/details/78981898

The previous post covered optimizing a single count(distinct xxx); this post covers optimizing multiple count(distinct xxx) expressions in one query.

0x00 Approach

The optimization builds on the earlier single-count approach and combines union all with the window function lag. The details follow.

0x01 Divide and Conquer

SELECT 
    pid, c1, c2 
FROM
    (select 
        pid, c1, lag(c2,1) over win c2, row_number() over win rn 
    from 
        (select -- compute c1
            pid, sum(tn) c1, null c2  
        from 
            (select 
                pid, substr(uid,1,4) tag, count(distinct substr(uid, 5)) tn 
            from 
                xxtable 
            group by  
                pid,substr(uid,1,4)
            )t1
        group by pid

        union all
        select -- compute c2
            pid, null c1, sum(tn1) c2  
        from 
            (select 
                pid, substr(cid,1,4) tag, count(distinct substr(cid, 5)) tn1 
            from
                xxtable 
            group by  
                pid,substr(cid,1,4)
            )t2
        group by pid
        )t3
    window win as (partition by pid order by c1)
    )t4
WHERE 
    rn = 2 -- which row number to keep depends on the actual data

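This method suits a small number of count distinct expressions; each one needs its own union all branch.
In the innermost subqueries t1 and t2, compute the partial counts for each column.
Over the union result t3, use the window function lag or lead (max also works) to bring the values onto a single row.
In the outermost query, filter t4 by row_number to keep the row you need.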
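
The max alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it runs in SQLite through Python's sqlite3 module rather than in Hive, the sample rows are made up, and the split at the fourth character simply mirrors the query above. It replaces the lag and row_number step with max(...) group by pid and compares the result with a plain count(distinct).

```python
import sqlite3

# Hypothetical sample rows (pid, uid, cid); the post only names the table xxtable.
ROWS = [
    ("p1", "aaaa0001", "cccc0001"),
    ("p1", "aaaa0001", "cccc0002"),
    ("p1", "aaaa0002", "dddd0001"),
    ("p1", "bbbb0001", "cccc0001"),
    ("p2", "aaaa0001", "cccc0001"),
    ("p2", "bbbb0009", "cccc0001"),
]

# Same inner subqueries as the post; the outer lag/row_number step is
# swapped for max(), which ignores the NULL placeholder in each branch.
SPLIT_AND_MERGE = """
SELECT pid, max(c1) AS c1, max(c2) AS c2
FROM (
    SELECT pid, sum(tn) AS c1, NULL AS c2
    FROM (
        SELECT pid, substr(uid, 1, 4) AS tag,
               count(DISTINCT substr(uid, 5)) AS tn
        FROM xxtable
        GROUP BY pid, substr(uid, 1, 4)
    ) t1
    GROUP BY pid
    UNION ALL
    SELECT pid, NULL AS c1, sum(tn1) AS c2
    FROM (
        SELECT pid, substr(cid, 1, 4) AS tag,
               count(DISTINCT substr(cid, 5)) AS tn1
        FROM xxtable
        GROUP BY pid, substr(cid, 1, 4)
    ) t2
    GROUP BY pid
) t3
GROUP BY pid
ORDER BY pid
"""

# Baseline: the unoptimized form with two count(distinct) expressions.
DIRECT = """
SELECT pid, count(DISTINCT uid) AS c1, count(DISTINCT cid) AS c2
FROM xxtable
GROUP BY pid
ORDER BY pid
"""


def run_both(rows=ROWS):
    """Load the rows into an in-memory table and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE xxtable (pid TEXT, uid TEXT, cid TEXT)")
    conn.executemany("INSERT INTO xxtable VALUES (?, ?, ?)", rows)
    optimized = conn.execute(SPLIT_AND_MERGE).fetchall()
    direct = conn.execute(DIRECT).fetchall()
    conn.close()
    return optimized, direct


if __name__ == "__main__":
    optimized, direct = run_both()
    print(optimized)
    print(direct)
```

With max there is no row number to pick, so the form stays the same as more count distinct branches are added. Whether it is faster than lag in Hive is not something this sketch measures.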

In testing on 50 million rows, the unoptimized query took 4.5 minutes and the optimized one took 1.5 minutes, a clear improvement.

0x10 Random Grouping

SELECT 
    pid, c1, c2 
FROM
    (select 
        pid, c1, lag(c2,1) over win c2, row_number() over win rn 
    from 
        (select 
            pid,sum(tc) c1 , null c2 
        from 
            (select 
                pid, count(1) tc,tag  
            from 
                (select 
                    pid, cast(rand() * 100 as bigint) tag, uid 
                from 
                    xxtable
                group by 
                    pid, uid
                )t1 
            group by pid, tag
            )t2
        group by pid

        union all
        select 
            pid, null c1 , sum(tc) c2 
        from 
            (select 
                pid, count(1) tc,tag  
            from 
                (select 
                    pid, cast(rand() * 100 as bigint) tag, cid 
                from 
                    xxtable
                group by pid, cid
                )t1 
            group by pid, tag
            )t2
        group by pid
        )t3
    window win as (partition by pid order by c1)
    )t4
WHERE 
    rn = 2 -- which row number to keep depends on the actual data

In testing on 50 million rows, the unoptimized query took 4.5 minutes and the optimized one took 40 seconds, an even larger improvement.
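
To see why random grouping still gives the exact count, here is a small sketch of one branch of the query (the uid branch). It is an illustration only, run in SQLite through Python's sqlite3 module rather than in Hive. The rand() function is registered from Python to stand in for Hive's rand(), and the generated rows are hypothetical. The inner group by pid, uid removes duplicates first, so each distinct uid lands in exactly one random bucket and the bucket counts add up to the distinct count.

```python
import random
import sqlite3

# One branch of the random-grouping query: dedupe, bucket randomly, count, sum.
RANDOM_GROUPING = """
SELECT pid, sum(tc) AS c1
FROM (
    SELECT pid, tag, count(1) AS tc
    FROM (
        SELECT pid, CAST(rand() * 100 AS INTEGER) AS tag, uid
        FROM xxtable
        GROUP BY pid, uid
    ) t1
    GROUP BY pid, tag
) t2
GROUP BY pid
ORDER BY pid
"""

# Baseline for comparison.
DIRECT = """
SELECT pid, count(DISTINCT uid)
FROM xxtable
GROUP BY pid
ORDER BY pid
"""


def make_rows(n=2000, seed=7):
    """Generate hypothetical (pid, uid) rows with many duplicates."""
    rng = random.Random(seed)
    return [(f"p{rng.randrange(3)}", f"u{rng.randrange(300):04d}") for _ in range(n)]


def compare(rows):
    """Run the bucketed query and the direct count(distinct) on the same rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite has no rand(); register one that returns a float in [0, 1),
    # as an assumed stand-in for Hive's rand().
    conn.create_function("rand", 0, random.random)
    conn.execute("CREATE TABLE xxtable (pid TEXT, uid TEXT)")
    conn.executemany("INSERT INTO xxtable VALUES (?, ?)", rows)
    bucketed = conn.execute(RANDOM_GROUPING).fetchall()
    direct = conn.execute(DIRECT).fetchall()
    conn.close()
    return bucketed, direct


if __name__ == "__main__":
    bucketed, direct = compare(make_rows())
    print(bucketed == direct)
```

One caveat: the generated rows contain no NULL uid. group by pid, uid keeps a NULL group that count(1) then counts, while count(distinct uid) ignores NULLs. If the column can be NULL, filter those rows out first.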
