Hive: Implementing Permutations and Combinations with a Join

Latest recommended article published on 2023-12-14 17:58:27

ziyin_2013

Categories: hive, data processing. Tags: hive, data processing

Article link: https://blog.csdn.net/ziyin_2013/article/details/126087498

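Problem
We have order data that records whether each user has purchased a given product. The first goal is to count the users who purchased both products of any pair, which resembles a combination. The second goal is to count users who purchased two products with the order taken into account, for example bought 300055 first and then 300056 versus bought 300056 first and then 300055, counted as two separate cases, which resembles a permutation. In other languages a loop would be the natural approach, but loops are awkward to implement in Hive, so a join can be used instead.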

Solution
Join the table to itself on uid, then apply different filter conditions to produce either permutations or combinations.

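Join: relate the two copies of the table on uid.
Users who purchased both products of a pair, C(3,2); filter: where tmp1.result=1 and tmp2.result=1 and tmp1.pid<tmp2.pid
Users who purchased at least one product of a pair, C(3,2); filter: where (tmp1.result=1 or tmp2.result=1) and (tmp1.pid<tmp2.pid)
Users who purchased both products of a pair, with (A, B) and (B, A) treated as distinct ordered pairs, A(3,2); filter: where tmp1.result=1 and tmp2.result=1 and tmp1.pid!=tmp2.pid
Users who purchased at least one product of a pair, with (A, B) and (B, A) treated as distinct ordered pairs, A(3,2); filter: where (tmp1.result=1 or tmp2.result=1) and (tmp1.pid!=tmp2.pid)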
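The counts C(3,2) and A(3,2) above presumably assume three distinct products. As a quick worked check, with p_1, p_2, p_3 as placeholder labels that are not in the original data, the pid<pid filter keeps one row per unordered pair, while pid!=pid keeps both orderings:

```latex
\begin{align*}
C(3,2) &= \frac{3!}{2!\,(3-2)!} = 3:
  \quad \{p_1,p_2\},\ \{p_1,p_3\},\ \{p_2,p_3\} \\
A(3,2) &= \frac{3!}{(3-2)!} = 6:
  \quad (p_1,p_2),\ (p_2,p_1),\ (p_1,p_3),\ (p_3,p_1),\ (p_2,p_3),\ (p_3,p_2)
\end{align*}
```

So the ordered variant always returns exactly twice as many product pairs as the unordered one.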

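select
    distinct
    tmp1.uid as uid,
    concat(tmp1.pid,"_",tmp2.pid) as pid,  -- product ids of the pair
    concat(tmp1.result,"_",tmp2.result) as result  -- purchase flags of the pair (no trailing comma before from)
from
    (select
        pid,
        uid,
        result
    from test) tmp1
-- The pid comparison in every where clause below drops rows with a NULL side,
-- so an inner join returns the same rows as this full join.
full join
    (select
        pid,
        uid,
        result
    from test) tmp2
on tmp1.uid=tmp2.uid
where tmp1.result=1 and tmp2.result=1  and tmp1.pid<tmp2.pid  -- unordered pairs, both products purchased
--where (tmp1.result=1 or tmp2.result=1) and (tmp1.pid<tmp2.pid)  -- unordered pairs, at least one product purchased
--where tmp1.result=1 and tmp2.result=1  and tmp1.pid!=tmp2.pid  -- ordered pairs, both products purchased
--where (tmp1.result=1 or tmp2.result=1) and (tmp1.pid!=tmp2.pid)  -- ordered pairs, at least one product purchased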
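
The query above returns one row per user and product pair. To get to the actual count per pair, one option is to aggregate on the pair. The following is only a sketch under assumptions: the CTE test_sample stands in for the test table, the users u1 to u3 and the third product 300057 are made up for illustration, and pid is assumed to be a string.

```sql
-- Hypothetical sample rows standing in for the test table.
-- u1 bought 300055 and 300056; u2 bought all three; u3 bought only 300056.
with test_sample as (
    select 'u1' as uid, '300055' as pid, 1 as result
    union all select 'u1' as uid, '300056' as pid, 1 as result
    union all select 'u1' as uid, '300057' as pid, 0 as result
    union all select 'u2' as uid, '300055' as pid, 1 as result
    union all select 'u2' as uid, '300056' as pid, 1 as result
    union all select 'u2' as uid, '300057' as pid, 1 as result
    union all select 'u3' as uid, '300055' as pid, 0 as result
    union all select 'u3' as uid, '300056' as pid, 1 as result
    union all select 'u3' as uid, '300057' as pid, 0 as result
)
select
    concat(tmp1.pid, '_', tmp2.pid) as pid_pair,
    count(distinct tmp1.uid) as user_cnt  -- users who bought both products of the pair
from test_sample tmp1
join test_sample tmp2
on tmp1.uid = tmp2.uid
where tmp1.result = 1 and tmp2.result = 1 and tmp1.pid < tmp2.pid
group by concat(tmp1.pid, '_', tmp2.pid);
```

With these sample rows the result should be 300055_300056 with 2 users, and 300055_300057 and 300056_300057 with 1 user each. Replacing tmp1.pid < tmp2.pid with tmp1.pid != tmp2.pid gives the ordered variant, where each pair also appears reversed with the same count.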