【Clickhouse】Clickhouse 分析函数 window functions 窗口函数

最新推荐文章于 2024-05-02 18:20:34 发布

九师兄

最新推荐文章于 2024-05-02 18:20:34 发布

阅读量2k

点赞数 2

分类专栏： clickhouse 文章标签： clickhouse 分析函数 window

原文链接：https://vkingnew.blog.csdn.net/article/details/106781788

版权

clickhouse 专栏收录该内容

138 篇文章 626 订阅 ¥19.90 ¥99.00

订阅专栏

文章目录

1.概述
2.Top N:
3.案例

1.概述

转载：Clickhouse 分析函数 window functions 窗口函数

clickhouse 暂时么有提供标准SQL的分析函数，但是有计划添加此功能，预计2020年第四季度的早期实现。

在clickhouse 2020年第四季度之前可以使用clickhouse提供的函数来实现。

1.1.窗口函数：

Online Anallytical Processing，联机分析处理，可以对数据库数据进行实时分析处理

1.2.标准SQL语法

分析函数 over（partition by 列名 order by 列名 ）

1.3.分析函数分类：

聚合类

avg(列名)、sum(列名)、count(列名)、max(列名)、min(列名)

排名类

row_number() 按照值排序时产生一个自增编号，不会重复
rank() 按照值排序时产生一个自增编号，值相等时会重复，会产生空位
dense_rank() 按照值排序时产生一个自增编号，值相等时会重复，不会产生空位

注意：排名类分析函数不需要任何参数。

其他类

lag(列名,往前的行数,[行数为null时的默认值，不指定为null])
lead(列名,往后的行数,[行数为null时的默认值，不指定为null])
ntile(n) 用于将分组数据按照顺序切分成n片，返回当前切片值，如果切片不均匀，默认增加第一个切片的分布。

概述：
window function时 SQL：2003 新加入的标准。
常见的分析函数：

1.排名(rownumber,rank,dense_rank)
 
2.sum() over()
 
3.count() over()
 
4.lead()/lag() over()
 
5.ntile(n) over()

1.排名函数：

Clickhouse没有直接提供对应的开窗函数，需要使用clickhouse提供的函数来变相实现，主要涉及以下数组函数：

arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq

简单示例：

Clickhouse> with (select [100,200,300,200,400,500] ) as arr select arrayEnumerate(arr) as rownumber,arrayEnumerateDense(arr) as dense_rank,arrayEnumerateUniq(arr) as uniq_rank;
 
WITH 
    (
        SELECT [100, 200, 300, 200, 400, 500]
    ) AS arr
SELECT 
    arrayEnumerate(arr) AS rownumber, 
    arrayEnumerateDense(arr) AS dense_rank, 
    arrayEnumerateUniq(arr) AS uniq_rank
 
┌─rownumber─────┬─dense_rank────┬─uniq_rank─────┐
│ [1,2,3,4,5,6] │ [1,2,3,2,4,5] │ [1,1,1,2,1,1] │
└───────────────┴───────────────┴───────────────┘
 
1 rows in set. Elapsed: 0.004 sec.

arrayEnumerate 等同于row_number（）
arrayEnumerateDense 等同于dense_rank()
arrayEnumerateUniq 只返回了元素第一次出现的次数

数据准备：

CREATE TABLE t_data
ENGINE = Memory AS
WITH 
    (
        SELECT ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', '90', '80', '90', '88', '90', '100', '77', '99', '90', '80']
    ) AS dict
SELECT 
    dict[(number % 10) + 1] AS id, 
    dict[number - 10] AS score
FROM system.numbers
LIMIT 10
 
SELECT *
FROM t_data
 
┌─id─┬─score─┐
│ A  │ 90    │
│ A  │ 80    │
│ A  │ 90    │
│ A  │ 88    │
│ B  │ 90    │
│ B  │ 100   │
│ B  │ 77    │
│ C  │ 99    │
│ C  │ 90    │
│ C  │ 80    │
└────┴───────┘
 
10 rows in set. Elapsed: 0.002 sec.

– 先查询：

SELECT 
    id, 
    groupArray(score) AS arr, 
    arrayEnumerate(arr) AS rownumber, 
    arrayEnumerateDense(arr) AS dense_rank, 
    arrayEnumerateUniq(arr) AS uniq_rank
FROM 
(
    SELECT *
    FROM t_data
    ORDER BY score DESC
)
GROUP BY id
ORDER BY id ASC
 
┌─id─┬─arr───────────────────┬─rownumber─┬─dense_rank─┬─uniq_rank─┐
│ A  │ ['90','90','88','80'] │ [1,2,3,4] │ [1,1,2,3]  │ [1,2,1,1] │
│ B  │ ['90','77','100']     │ [1,2,3]   │ [1,2,3]    │ [1,1,1]   │
│ C  │ ['99','90','80']      │ [1,2,3]   │ [1,2,3]    │ [1,1,1]   │
└────┴───────────────────────┴───────────┴────────────┴───────────┘
 
3 rows in set. Elapsed: 0.004 sec.

– 最终的查询：

SELECT 
    id, 
    score, 
    rownumber, 
    dense_rank, 
    uniq_rank
FROM 
(
    SELECT 
        id, 
        groupArray(score) AS arr, 
        arrayEnumerate(arr) AS rownumber, 
        arrayEnumerateDense(arr) AS dense_rank, 
        arrayEnumerateUniq(arr) AS uniq_rank
    FROM 
    (
        SELECT *
        FROM t_data
        ORDER BY score DESC
    )
    GROUP BY id
    ORDER BY id ASC
)
ARRAY JOIN 
    arr AS score, 
    rownumber, 
    dense_rank, 
    uniq_rank
ORDER BY 
    id ASC, 
    rownumber ASC, 
    dense_rank ASC
 
┌─id─┬─score─┬─rownumber─┬─dense_rank─┬─uniq_rank─┐
│ A  │ 90    │         1 │          1 │         1 │
│ A  │ 90    │         2 │          1 │         2 │
│ A  │ 88    │         3 │          2 │         1 │
│ A  │ 80    │         4 │          3 │         1 │
│ B  │ 90    │         1 │          1 │         1 │
│ B  │ 77    │         2 │          2 │         1 │
│ B  │ 100   │         3 │          3 │         1 │
│ C  │ 99    │         1 │          1 │         1 │
│ C  │ 90    │         2 │          2 │         1 │
│ C  │ 80    │         3 │          3 │         1 │
└────┴───────┴───────────┴────────────┴───────────┘
 
10 rows in set. Elapsed: 0.006 sec.

可以看到rank()函数的还是没有实现。

2.Top N:

取每个用户的Top 2 的分数：

可以看到id的分数可能有重复，需要使用distinct去重。

Clickhouse> select id,score,d_rank from (select id,score,d_rank from(select id,groupArray(score) as arr,arrayEnumerateDense(arr) AS d_rank from(select distinct id,score from t_data order by score desc) group by id) array join arr as score,d_rank order by id asc,d_rank asc) where d_rank <3;
 
 
SELECT 
    id, 
    score, 
    d_rank
FROM 
(
    SELECT 
        id, 
        score, 
        d_rank
    FROM 
    (
        SELECT 
            id, 
            groupArray(score) AS arr, 
            arrayEnumerateDense(arr) AS d_rank
        FROM 
        (
            SELECT DISTINCT 
                id, 
                score
            FROM t_data
            ORDER BY score DESC
        )
        GROUP BY id
    )
    ARRAY JOIN 
        arr AS score, 
        d_rank
    ORDER BY 
        id ASC, 
        d_rank ASC
)
WHERE d_rank < 3
 
┌─id─┬─score─┬─d_rank─┐
│ A  │ 90    │      1 │
│ A  │ 88    │      2 │
│ B  │ 90    │      1 │
│ B  │ 77    │      2 │
│ C  │ 99    │      1 │
│ C  │ 90    │      2 │
└────┴───────┴────────┘
 
6 rows in set. Elapsed: 0.014 sec.

3.同比环比计算：

同比：今年当月和去年的同一个月的数据，比如现在是2020年7月份，则需要和去年7月份的数据比较。英文名为Year-over-Year

环比：今年当月和今年的上一个月的数据比较，比如现在是2020年7月份则需要和2020年的6月份的数据比较，英文名为Month-over-Month

同比增长率= (本月数据-去年同月数据）/去年同月的数据

环比增长率=（本月数据 - 上月数据)/上月数据

在clickhouse中没有提供标准SQL种的lead 和lag函数。我们可以借助于clickhouse的提供的neighbor函数来实现。

neighbor(column, offset[, default_value])

The result of the function depends on the affected data blocks and the order of data in the block.
If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result.

Parameters

column — A column name or scalar expression.
offset — The number of rows forwards or backwards from the current row of column. Int64.
default_value — Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected.
Returned values

Value for column in offset distance from current row if offset value is not outside block bounds.
Default value for column if offset value is outside block bounds. If default_value is given, then it will be used.
Type: type of data blocks affected or default value type.

参考：
https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/
数据准备：

Clickhouse> with toDate('2019-01-01') as start_date select toStartOfMonth(start_date + (number*32)) month_start,(number+10)*100 amount from numbers(24);
 
WITH toDate('2019-01-01') AS start_date
SELECT 
    toStartOfMonth(start_date + (number * 32)) AS month_start, 
    (number + 10) * 100 AS amount
FROM numbers(24)
 
┌─month_start─┬─amount─┐
│  2019-01-01 │   1000 │
│  2019-02-01 │   1100 │
│  2019-03-01 │   1200 │
│  2019-04-01 │   1300 │
│  2019-05-01 │   1400 │
│  2019-06-01 │   1500 │
│  2019-07-01 │   1600 │
│  2019-08-01 │   1700 │
│  2019-09-01 │   1800 │
│  2019-10-01 │   1900 │
│  2019-11-01 │   2000 │
│  2019-12-01 │   2100 │
│  2020-01-01 │   2200 │
│  2020-02-01 │   2300 │
│  2020-03-01 │   2400 │
│  2020-04-01 │   2500 │
│  2020-05-01 │   2600 │
│  2020-06-01 │   2700 │
│  2020-07-01 │   2800 │
│  2020-08-01 │   2900 │
│  2020-10-01 │   3000 │
│  2020-11-01 │   3100 │
│  2020-12-01 │   3200 │
│  2021-01-01 │   3300 │
└─────────────┴────────┘
 
24 rows in set. Elapsed: 0.003 sec. 
 
 
WITH toDate('2019-01-01') AS start_date
SELECT 
    toStartOfMonth(start_date + (number * 32)) AS month_start, 
    (number + 10) * 100 AS amount, 
    neighbor(amount, -12) AS prev_year_amount, 
    neighbor(amount, -1) AS prev_month_amount, 
    if(prev_year_amount = 0, -999, round((amount - prev_year_amount) / prev_year_amount, 2)) AS year_over_year, 
    if(prev_month_amount = 0, -999, round((amount - prev_month_amount) / prev_month_amount, 2)) AS month_over_month
FROM numbers(24)
 
┌─month_start─┬─amount─┬─prev_year_amount─┬─prev_month_amount─┬─year_over_year─┬─month_over_month─┐
│  2019-01-01 │   1000 │                0 │                 0 │           -999 │             -999 │
│  2019-02-01 │   1100 │                0 │              1000 │           -999 │              0.1 │
│  2019-03-01 │   1200 │                0 │              1100 │           -999 │             0.09 │
│  2019-04-01 │   1300 │                0 │              1200 │           -999 │             0.08 │
│  2019-05-01 │   1400 │                0 │              1300 │           -999 │             0.08 │
│  2019-06-01 │   1500 │                0 │              1400 │           -999 │             0.07 │
│  2019-07-01 │   1600 │                0 │              1500 │           -999 │             0.07 │
│  2019-08-01 │   1700 │                0 │              1600 │           -999 │             0.06 │
│  2019-09-01 │   1800 │                0 │              1700 │           -999 │             0.06 │
│  2019-10-01 │   1900 │                0 │              1800 │           -999 │             0.06 │
│  2019-11-01 │   2000 │                0 │              1900 │           -999 │             0.05 │
│  2019-12-01 │   2100 │                0 │              2000 │           -999 │             0.05 │
│  2020-01-01 │   2200 │             1000 │              2100 │            1.2 │             0.05 │
│  2020-02-01 │   2300 │             1100 │              2200 │           1.09 │             0.05 │
│  2020-03-01 │   2400 │             1200 │              2300 │              1 │             0.04 │
│  2020-04-01 │   2500 │             1300 │              2400 │           0.92 │             0.04 │
│  2020-05-01 │   2600 │             1400 │              2500 │           0.86 │             0.04 │
│  2020-06-01 │   2700 │             1500 │              2600 │            0.8 │             0.04 │
│  2020-07-01 │   2800 │             1600 │              2700 │           0.75 │             0.04 │
│  2020-08-01 │   2900 │             1700 │              2800 │           0.71 │             0.04 │
│  2020-10-01 │   3000 │             1800 │              2900 │           0.67 │             0.03 │
│  2020-11-01 │   3100 │             1900 │              3000 │           0.63 │             0.03 │
│  2020-12-01 │   3200 │             2000 │              3100 │            0.6 │             0.03 │
│  2021-01-01 │   3300 │             2100 │              3200 │           0.57 │             0.03 │
└─────────────┴────────┴──────────────────┴───────────────────┴────────────────┴──────────────────┘
 
24 rows in set. Elapsed: 0.003 sec.

结论：

3.案例

clickhouse数据模型之有序漏斗分析

参考：

参考：
https://en.wikipedia.org/wiki/SQL_window_function
https://en.wikipedia.org/wiki/SQL:2003

https://github.com/ClickHouse/ClickHouse/issues/9887
window view：
https://github.com/ClickHouse/ClickHouse/pull/8331

https://github.com/ClickHouse/ClickHouse/pull/5925
特殊函数：
sequenceMatch
windowfunnel

https://github.com/ClickHouse/ClickHouse/issues/5485
https://github.com/ClickHouse/ClickHouse/issues/5132
https://github.com/ClickHouse/ClickHouse/issues/1469

https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/

九师兄

关注

2
点赞
踩
6

收藏

觉得还不错? 一键收藏
0
评论
【Clickhouse】Clickhouse 分析函数 window functions 窗口函数

1.概述转载：Clickhouse 分析函数 window functions 窗口函数clickhouse 暂时么有提供标准SQL的分析函数，但是有计划添加此功能，预计2020年第四季度的早期实现。在clickhouse 2020年第四季度之前可以使用clickhouse提供的函数来实现。1.1.窗口函数：Online Anallytical Processing，联机分析处理，可以对数据库数据进行实时分析处理1.2.标准SQL语法分析函数 over（partition by 列名 orde
复制链接

扫一扫