How to avoid join operations in Hive SQL when the data volume is very large

时代新人0-0

Category: Data Warehouse. Tags: hive, sql

Article link: https://blog.csdn.net/qq_39889944/article/details/139287531

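When querying very large tables in Hive, a map join is not possible because neither side is small enough to be held in memory, and we may also decide not to use a skew join or an SMB (sort-merge-bucket) join. In that situation, for certain application scenarios, the SQL can be specially designed so that no join is needed at all. A typical example follows: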

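Suppose there is a user-follower table (t_user_follower) with two columns: the user ID (user_id) and the follower list (follower_ids). The follower list is a comma-separated string of the IDs of the users who follow that user. The task is to find the IDs of users who follow each other.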

We can create a sample table with the statements below; in a real scenario the table would be very large.

create table t_user_follower
(
  user_id      string comment 'user ID',
  follower_ids string comment 'follower list'
);

-- Insert sample data
insert into t_user_follower values
('0001','0002,0003'),
('0002','0001,0003'),
('0003','0004'),
('0004','0001,0002');
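
For contrast, the join-based formulation that the rest of this article avoids could be sketched roughly as follows. This baseline is not from the original post; the CTE name `edges` and the aliases `a` and `b` are assumptions chosen for illustration. It explodes the follower list into one row per (user, follower) pair and then self-joins that result, which on a very large table means shuffling the exploded rows for both sides of the join.

```sql
-- Baseline for comparison only: self-join on the exploded edge list.
with edges as (
  select
    user_id,
    follower_id
  from t_user_follower
  lateral view explode(split(follower_ids, ',')) t as follower_id
)
select
  a.user_id,
  a.follower_id
from edges a
join edges b
  on  a.user_id     = b.follower_id   -- b is the reverse direction of a
  and a.follower_id = b.user_id
where a.user_id < a.follower_id;      -- report each mutual pair only once
```

On the sample data this should report the single pair 0001 and 0002. The join-free approach described next reaches the same answer with one pass over the exploded rows and one group by.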

The solution is as follows:

-- Step 1: use lateral view and explode to turn the values in the follower_ids column into multiple rows
select
  user_id,
  follower_ids,
  follower_id
from t_user_follower
lateral view explode(split(follower_ids,',')) t as follower_id;
-- Step 2: concatenate user_id and follower_id in sorted order, so that the smaller ID always comes first.
-- For example, "0001 follows 0002" and "0002 follows 0001" both produce the new column value 0001,0002
select
  user_id,
  follower_ids,
  if(user_id < follower_id,
     concat_ws(',', user_id, follower_id),
     concat_ws(',', follower_id, user_id)) as friend,
  follower_id
from t_user_follower
lateral view explode(split(follower_ids,',')) t as follower_id;
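
The ordered concatenation in step 2 can also be written without the `if`, as in the sketch below. It assumes a Hive version that provides the built-in `least` and `greatest` functions and that they compare strings the same way the `<` operator does; this is an alternative spelling for illustration, not the original author's query. The trailing comments list the `friend` value each exploded row of the sample data should produce (row order is not guaranteed).

```sql
-- Alternative to the if(...) expression in step 2, assuming least/greatest are available.
select
  user_id,
  follower_id,
  concat_ws(',', least(user_id, follower_id), greatest(user_id, follower_id)) as friend
from t_user_follower
lateral view explode(split(follower_ids, ',')) t as follower_id;

-- friend values for the sample data:
--   0001 / 0002 -> 0001,0002
--   0001 / 0003 -> 0001,0003
--   0002 / 0001 -> 0001,0002
--   0002 / 0003 -> 0002,0003
--   0003 / 0004 -> 0003,0004
--   0004 / 0001 -> 0001,0004
--   0004 / 0002 -> 0002,0004
```

Only the key 0001,0002 appears twice, which is what step 3 relies on. Note that the comparison is a string comparison, so it orders IDs consistently but only matches numeric order when the IDs are zero-padded to the same width, as they are in the sample data.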
-- Step 3: group by the friend column. If the same friend value appears twice in a group,
-- both directions exist, so the two users follow each other.
select
  friend
from
(
  select
    user_id,
    follower_ids,
    if(user_id < follower_id,
       concat_ws(',', user_id, follower_id),
       concat_ws(',', follower_id, user_id)) as friend,
    follower_id
  from t_user_follower
  lateral view explode(split(follower_ids,',')) t as follower_id
) tt
group by friend
having count(*) = 2;
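
The `count(*) = 2` test assumes each (user, follower) direction occurs exactly once. If a follower list could contain the same ID twice, or a user could appear on more than one row, a single direction might be counted twice. A slightly more defensive variant, offered here as a sketch under that assumption rather than as part of the original solution, counts distinct source users instead:

```sql
-- Variant of step 3 that tolerates duplicate IDs inside follower_ids.
select
  friend
from
(
  select
    user_id,
    if(user_id < follower_id,
       concat_ws(',', user_id, follower_id),
       concat_ws(',', follower_id, user_id)) as friend
  from t_user_follower
  lateral view explode(split(follower_ids, ',')) t as follower_id
) tt
group by friend
having count(distinct user_id) = 2;  -- both directions must be present

-- On the sample data above this returns one row: 0001,0002
```

Both versions need only a single group by over the exploded rows, so the table is scanned once and no join is involved.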