在clickhouse中通过字段进行分组排序最后取所需要的前N条
记录一下分组取TOP N的经验,以后说不定有机会要翻出来看一下当时的思路,在此先声明,本人只是初学者,写的SQL语句很烂。我在这里只是记录一下我在学习过程中遇到的某种案例的解决思路,并不是标准答案
查询要求
按产品维度统计2017年每月的业绩,指标包括 不同品类的采购量、采购额,每个产品里采购量top 10的商品
解决思路
1、先对产品和产品类别里的所有商品进行分组排序处理,分组先后顺序是产品 --> 月份 --> 商品,得出产品维度下每月所有的商品采购额的排序表t,排序方式是将采购额降序处理,即采购额从多到少的方式
SELECT
a1.chanpinID,
a1.chanpinming ,
toYear(a.dingdanriqi) as nian,
toMonth(a.dingdanriqi) as yue,
a.shangpinID,
a.shangpinming ,
SUM(a.caigouliang) as pfshu,
SUM(a.caigoue) as cge
FROM
dingdanbiao a
left join chanpinbiao a1 on a.chanpID = a1.chanpinID
where
toYear(a.dingdanriqi) = 2017
group by
a1.chanpinID ,
a1.chanpinming ,
a.shangpinID ,
a.shangpinming ,
nian,
yue
order by
a1.chanpinID ASC ,
yue,
cge DESC
2、通过clickehouse的groupArray()函数将t表里的每个产品每个月采购额排名前10的商品的ID存放进groupArray数组里
SELECT
chanpinID,
chanpinming,
nian,
yue,
groupArray(10)(shangpinID) as shpin
from
(SELECT
a1.chanpinID,
a1.chanpinming ,
toYear(a.dingdanriqi) as nian,
toMonth(a.dingdanriqi) as yue,
a.shangpinID,
a.shangpinming ,
SUM(a.caigouliang) as pfshu,
SUM(a.caigoue) as cge
FROM
dingdanbiao a
left join chanpinbiao a1 on a.chanpID = a1.chanpinID
where
toYear(a.dingdanriqi) = 2017
group by
a1.chanpinID ,
a1.chanpinming ,
a.shangpinID ,
a.shangpinming ,
nian,
yue
order by
a1.chanpinID ASC ,
yue,
cge DESC)
group by
chanpinID,
chanpinming,
yue,
nian
order by
chanpinID ,
yue
3、再通过array join将存放在groupArray数组里的shangpinID取出来形成前10商品ID表t1
SELECT
chanpinID,
chanpinming,
nian,
yue,
shpin
FROM
(SELECT
chanpinID,
chanpinming,
nian,
yue,
groupArray(10)(shangpinID) as shpin
from
(SELECT
a1.chanpinID,
a1.chanpinming ,
toYear(a.dingdanriqi) as nian,
toMonth(a.dingdanriqi) as yue,
a.shangpinID,
a.shangpinming ,
SUM(a.caigouliang) as pfshu,
SUM(a.caigoue) as cge
FROM
dingdanbiao a
left join chanpinbiao a1 on a.chanpID = a1.chanpinID
where
toYear(a.dingdanriqi) = 2017
group by
a1.chanpinID ,
a1.chanpinming ,
a.shangpinID ,
a.shangpinming ,
nian,
yue
order by
a1.chanpinID ASC ,
yue,
cge DESC)
group by
chanpinID,
chanpinming,
yue,
nian
order by
chanpinID ,
yue)
ARRAY JOIN shpin
4、最后通过内连接inner join匹配排序表t和前10商品表t1,下面是整体的SQL语句
SELECT
t.chanpinID as pleiID,
t.chanpinming as plei,
t.nian ,
t.yue,
t.shangpinID as spinID,
t.shangpinming as spin,
t.caigouliang,
t.caigoue
FROM
(SELECT
a1.chanpinID,
a1.chanpinming ,
toYear(a.dingdanriqi) as nian,
toMonth(a.dingdanriqi) as yue,
a.shangpinID,
a.shangpinming ,
SUM(a.caigouliang) as pfshu,
SUM(a.caigoue) as cge
FROM
dingdanbiao a
left chanpinbiao a1 on a.chanpID = a1.chanpinID
where
toYear(a.dingdanriqi) = 2017
group by
a1.chanpinID ,
a1.chanpinming ,
a.shangpinID ,
a.shangpinming ,
nian,
yue
order by
a1.chanpinID ASC ,
yue,
cge DESC) t
inner join
(SELECT
chanpinID,
chanpinming,
nian,
yue,
shpin
FROM
(SELECT
chanpinID,
chanpinming,
nian,
yue,
groupArray(10)(shangpinID) as shpin
from
(SELECT
a1.chanpinID,
a1.chanpinming ,
toYear(a.dingdanriqi) as nian,
toMonth(a.dingdanriqi) as yue,
a.shangpinID,
a.shangpinming ,
SUM(a.caigouliang) as pfshu,
SUM(a.caigoue) as cge
FROM
dingdanbiao a
left join chanpinbiao a1 on a.chanpID = a1.chanpinID
where
toYear(a.dingdanriqi) = 2017
group by
a1.chanpinID ,
a1.chanpinming ,
a.shangpinID ,
a.shangpinming ,
nian,
yue
order by
a1.chanpinID ASC ,
yue,
cge DESC)
group by
chanpinID,
chanpinming,
yue,
nian
order by
chanpinID ,
yue)
ARRAY JOIN shpin) t1
on t1.shpin = t.chanpID and t1.yue = t.yue
运行上面的SQL语句后最后输出的结果就是所需要的结果表
总结
我认为这种分组取TOP N的思路大致是一样的:对所需要的数据进行分组排序,用groupArray()函数对分组后所需的值先取出来存放进数组里,然后通过array join子句将数组里的值都列出来。
我在这里解释一下为什么不使用clickhouse里的topK函数,在官方的技术文档里,topK函数的定义
“返回指定列中近似最常见值的数组,生成的数组按值的近似频率降序排序(而不是值本身),此函数不提供保证的结果。 在某些情况下,可能会发生错误,并且可能会返回不是最高频的值。我们建议使用 N < 10
值,N
值越大,性能越低。最大值 N = 65536
。”
topK函数文档:https://clickhouse.tech/docs/zh/sql-reference/aggregate-functions/reference/topk/
也就是说topK虽然高效但不保证最后求值的准确率。对于数据量较少的分组排序取top N时,topK函数的确是不错的选择,但从准确率方面来考虑的话,groupArray函数显然是更好的选择。
至于SQL语句的冗余问题,等SQL技术提高了并且空闲的时候再来修正。