1 求(event userid time)格式下不同event(比如浏览、点击、购买等行为)的pv uv。
//最普通的算法,数据倾斜,超慢,不合适
select count(userid) as pv, count(distinct userid) as uv from table group by event;
//先看错误的优化方法
with tablea as(
select str(event+random) as fixevent, count(userid) as pv1, count(distinct userid) as uv1 from table group by str(event+random)
)
select
substr(fixevent,5) as event,
sum(pv1) as pv,
sum(uv1) as uv //错误!!两阶段聚合不适合计算distinct的uv情况。
from tablea
group by substr(fixevent,5);
//正确做法,第一步先形成(userid event num)的数据结构,num代表同一个userid触发了event多少次。
//并且第一次聚合大大减少了数据量,将相同的时间合并,统一用num来表示,过滤了很多数据。
with tablea as(
select event, userid, count(*) as num from table group by event, userid
)
slect count(*) as uv, sum(num) as pv from tablea group by event;
1.2 在flink纯实时计算pv uv 不分窗口
keyby(event+userid).flatmap(阶段1).key