PostgreSQL aggregate function 4 : Hypothetical-Set Aggregate Functions

最新推荐文章于 2023-06-16 10:53:59 发布

Postgresql中国用户会

最新推荐文章于 2023-06-16 10:53:59 发布

阅读量433

点赞数

分类专栏：数据挖掘转载文章标签： postgresql 数据库函数

转载同时被 2 个专栏收录

108 篇文章 0 订阅

订阅专栏

数据挖掘

11 篇文章 0 订阅

订阅专栏

Postgres2015全国用户大会将于11月20至21日在北京丽亭华苑酒店召开。本次大会嘉宾阵容强大，国内顶级PostgreSQL数据库专家将悉数到场，并特邀欧洲、俄罗斯、日本、美国等国家和地区的数据库方面专家助阵:

Postgres-XC项目的发起人铃木市一(SUZUKI Koichi)
Postgres-XL的项目发起人Mason Sharp
pgpool的作者石井达夫(Tatsuo Ishii)
PG-Strom的作者海外浩平(Kaigai Kohei)
Greenplum研发总监姚延栋
周正中(德哥), PostgreSQL中国用户会创始人之一
汪洋，平安科技数据库技术部经理
……

 
 2015年度PG大象会报名地址：http://postgres2015.eventdove.com/
PostgreSQL中国社区： http://postgres.cn/
PostgreSQL专业1群： 3336901（已满）
PostgreSQL专业2群： 100910388
PostgreSQL专业3群： 150657323

本文讲一下聚合函数的最后一个分类, Hypothetical-Set Aggregate Functions.

这类聚合函数还有对应的窗口函数, 首先来看一下对应窗口函数的用法.

`rank()`	`bigint`	rank of the current row with gaps; same as `row_number` of its first peer
`dense_rank()`	`bigint`	rank of the current row without gaps; this function counts peer groups
`percent_rank()`	`double precision`	relative rank of the current row: (`rank` - 1) / (total rows - 1)
`cume_dist()`	`double precision`	relative rank of the current row: (number of rows preceding or peer with current row) / (total rows)

the four ranking functions are defined so that they give the same answer for any two peer rows.

rank 返回值在分组内的等级, 如果值有重复的话, 跳级处理.

dense_rank 返回值在分组内的等级, 如果值有重复的话, 不跳级处理.

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 即截至当前记录等级一共有多少行除以本组的总行数.

看一个例子比较明白.


  
  
   
   postgres=# select *,row_number() over(partition by info order by id),rank() over(partition by info order by id),dense_rank() over(partition by info order by id),percent_rank() over(partition by info order by id),cume_dist() over(partition by info order by id) from test;
  
  
  
  
   
     id  | info  | row_number | rank | dense_rank | percent_rank |     cume_dist     
  
  
  
  
   
   ------+-------+------------+------+------------+--------------+-------------------
  
  
  
  
   
       1 | test1 |          1 |    1 |          1 |            0 |                 1
  
  
  
  
   
       2 | test2 |          1 |    1 |          1 |            0 | 0.444444444444444
  
  
  
  
   
       2 | test2 |          2 |    1 |          1 |            0 | 0.444444444444444
  
  
  
  
   
       2 | test2 |          3 |    1 |          1 |            0 | 0.444444444444444
  
  
  
  
   
       2 | test2 |          4 |    1 |          1 |            0 | 0.444444444444444
  
  
  
  
   
       3 | test2 |          5 |    5 |          2 |          0.5 | 0.555555555555556
  
  
  
  
   
       4 | test2 |          6 |    6 |          3 |        0.625 | 0.666666666666667
  
  
  
  
   
       5 | test2 |          7 |    7 |          4 |         0.75 | 0.777777777777778
  
  
  
  
   
       6 | test2 |          8 |    8 |          5 |        0.875 | 0.888888888888889
  
  
  
  
   
       7 | test2 |          9 |    9 |          6 |            1 |                 1
  
  
  
  
   
       8 | test3 |          1 |    1 |          1 |            0 |               0.5
  
  
  
  
   
     100 | test3 |          2 |    2 |          2 |            1 |                 1
  
  
  
  
   
    1000 | test4 |          1 |    1 |          1 |            0 |                 1
  
  
  
  
   
   (13 rows)

算法 :


  
  
   
   以info='test2'这个组为例 : 
  
  
  
  
   
   
    
        2 | test2 |          1 |    1 |          1 |            0 | 0.444444444444444
   
   
   
   
    
        2 | test2 |          2 |    1 |          1 |            0 | 0.444444444444444
   
   
   
   
    
        2 | test2 |          3 |    1 |          1 |            0 | 0.444444444444444
   
   
   
   
    
        2 | test2 |          4 |    1 |          1 |            0 | 0.444444444444444
   
   
  
  
  
  
   
   id=2 的 rank和dense_rank都是1.
  
  
  
  
   
   
    
    percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数
   
   
   
   
   
   
    
    cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 截至当前记录等级一共有多少行除以本组的总行数.
   
   
   
   
    
    所以
   
   
   
   
    
    percent_rank = (1-1)/(9-1)=0
   
   
   
   
    
    cume_dist = (4)/(9) = 0.444444444444444
   
   
   
   
    
    

   
   
  
  
  
  
   
       3 | test2 |          5 |    5 |          2 |          0.5 | 0.555555555555556
  
  
  
  
   
   rank = 5, 跳级
  
  
  
  
   
   dense_rank = 2, 不跳级
  
  
  
  
   
   
    
    percent_rank = (5-1)/(9-1)=0.5
   
   
   
   
    
    cume_dist = (5)/(9) = 0.555555555555556

接下来回到正题, 我们看看这些窗口函数的另一种用法, 聚合用法.

Each of the aggregates listed in Table 9-52 is associated with a window function of the same name defined in Section 9.21. In each case, the aggregate result is the value that the associated window function would have returned for the "hypothetical" row constructed from args, if such a row had been added to the sorted group of rows computed from the sorted_args.

Table 9-52. Hypothetical-Set Aggregate Functions

Function	Direct Argument Type(s)	Aggregated Argument Type(s)	Return Type	Description
`rank(args) WITHIN GROUP (ORDER BY sorted_args)`	`VARIADIC` `"any"`	`VARIADIC` `"any"`	`bigint`	rank of the hypothetical row, with gaps for duplicate rows
`dense_rank(args) WITHIN GROUP (ORDER BY sorted_args)`	`VARIADIC` `"any"`	`VARIADIC` `"any"`	`bigint`	rank of the hypothetical row, without gaps
`percent_rank(args) WITHIN GROUP (ORDER BYsorted_args)`	`VARIADIC` `"any"`	`VARIADIC` `"any"`	`double precision`	relative rank of the hypothetical row, ranging from 0 to 1
`cume_dist(args) WITHIN GROUP (ORDER BY sorted_args)`	`VARIADIC` `"any"`	`VARIADIC` `"any"`	`double precision`	relative rank of the hypothetical row, ranging from 1/`N` to 1

For each of these hypothetical-set aggregates, the list of direct arguments given in args must match the number and types of the aggregated arguments given in sorted_args. Unlike most built-in aggregates, these aggregates are not strict, that is they do not drop input rows containing nulls. Null values sort according to the rule specified in the ORDER BY clause.

这些用法比较奇特, 其实是要返回给定参数在集合中的位置.

例如 :

如果我们给一个参数值是2.2, 应该排在以上数据中的第三行.

例子 :


  
  
   
   
    
    postgres=# select * from test order by info,id;
   
   
   
   
    
      id  | info  
   
   
   
   
    
    ------+-------
   
   
   
   
    
        1 | test1
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        3 | test2
   
   
   
   
    
        4 | test2
   
   
   
   
    
        5 | test2
   
   
   
   
    
        6 | test2
   
   
   
   
    
        7 | test2
   
   
   
   
    
        8 | test3
   
   
   
   
    
      100 | test3
   
   
   
   
    
     1000 | test4
   
   
   
   
    
    (13 rows)
   
   
  
  
  
  
   
   

  
  
  
  
   
   
    
    postgres=# select info,rank(4.9) within group (order by id),dense_rank(4.9) within group (order by id) from test group by info;
   
   
   
   
    
     info  | rank | dense_rank 
   
   
   
   
    
    -------+------+------------
   
   
   
   
    
     test1 |    2 |          2
   
   
   
   
    
     test2 |    7 |          4
   
   
   
   
    
     test3 |    1 |          1
   
   
   
   
    
     test4 |    1 |          1
   
   
   
   
    
    (4 rows)

4.9在test1这个分组, 排名第2, 并且这个分组只有1个值, 所以没有gap.

重点关注test2这个组, 这个组有9个值, 其中有4个重复值2, 所以4.9在这里排名需要考虑gap.

rank 返回7, 即4.9在这里考虑GAP排名第7

dense_rank 返回4, 即4.9在这里不考虑GAP排名第4.

又如 :


  
  
   
   
    
    postgres=# select info,rank(5) within group (order by id),dense_rank(5) within group (order by id) from test group by info;
   
   
   
   
    
     info  | rank | dense_rank 
   
   
   
   
    
    -------+------+------------
   
   
   
   
    
     test1 |    2 |          2
   
   
   
   
    
     test2 |    7 |          4
   
   
   
   
    
     test3 |    1 |          1
   
   
   
   
    
     test4 |    1 |          1
   
   
   
   
    
    (4 rows)
   
   
  
  
  
  
   
   
    
    postgres=# select info,rank(5.1) within group (order by id),dense_rank(5.1) within group (order by id) from test group by info;
   
   
   
   
    
     info  | rank | dense_rank 
   
   
   
   
    
    -------+------+------------
   
   
   
   
    
     test1 |    2 |          2
   
   
   
   
    
     test2 |    8 |          5
   
   
   
   
    
     test3 |    1 |          1
   
   
   
   
    
     test4 |    1 |          1
   
   
   
   
    
    (4 rows)

最后要看计算0~1代表位置的聚合函数 percent_rank和 cume_dist .

算法

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 截至当前记录等级一共有多少行除以本组的总行数.

例子1 :


  
  
   
   postgres=# select info,rank(4.9) within group (order by id),dense_rank(4.9) within group (order by id),percent_rank(4.9) within group (order by id),cume_dist(4.9) within group (order by id) from test group by info;
  
  
  
  
   
    info  | rank | dense_rank |   percent_rank    |     cume_dist     
  
  
  
  
   
   -------+------+------------+-------------------+-------------------
  
  
  
  
   
    test1 |    2 |          2 |                 1 |                 1
  
  
  
  
   
    test2 |    7 |          4 | 0.666666666666667 |               0.7
  
  
  
  
   
    test3 |    1 |          1 |                 0 | 0.333333333333333
  
  
  
  
   
    test4 |    1 |          1 |                 0 |               0.5
  
  
  
  
   
   (4 rows)

同样以test2为分组, 讲解算法. 把4.9插入到这个分组后. 数据应该变成 :


  
  
   
       2 | test2
  
  
  
  
   
       2 | test2
  
  
  
  
   
       2 | test2
  
  
  
  
   
       2 | test2
  
  
  
  
   
       3 | test2
  
  
  
  
   
       4 | test2
  
  
  
  
   
       4.9 | test2  # 计算位置
  
  
  
  
   
       5 | test2
  
  
  
  
   
       6 | test2
  
  
  
  
   
       7 | test2

一共10行.

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 截至当前记录等级一共有多少行除以本组的总行数.

所以4.9对应的 percent_rank 和 cume_dist 分别为 :

percent_rank = (rank - 1) / (total rows - 1) = (7-1)/(10-1) = 0.666666666666667

cume_dist = (7)/10 = 0.7

例子2 :


  
  
   
   
    
    postgres=# select info,rank(5) within group (order by id),dense_rank(5) within group (order by id),percent_rank(5) within group (order by id),cume_dist(5) within group (order by id) from test group by info;
   
   
   
   
    
     info  | rank | dense_rank |   percent_rank    |     cume_dist     
   
   
   
   
    
    -------+------+------------+-------------------+-------------------
   
   
   
   
    
     test1 |    2 |          2 |                 1 |                 1
   
   
   
   
    
     test2 |    7 |          4 | 0.666666666666667 |               0.8
   
   
   
   
    
     test3 |    1 |          1 |                 0 | 0.333333333333333
   
   
   
   
    
     test4 |    1 |          1 |                 0 |               0.5
   
   
   
   
    
    (4 rows)
   
   
  
  
  
  
   
   

  
  
  
  
   
   插入计算值5后, 数据变成
  
  
  
  
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        3 | test2
   
   
   
   
    
        4 | test2
   
   
   
   
    
        5 | test2  # 计算位置, 即参数值
   
   
   
   
    
        5 | test2
   
   
   
   
    
        6 | test2
   
   
   
   
    
        7 | test2

依旧10行. 但是截至当前记录等级一共有多少行? 注意是8了.

percent_rank = (rank - 1) / (total rows - 1) = (7-1)/(10-1) = 0.666666666666667

cume_dist = (8)/10 = 0.8

例子3 :


  
  
   
   
    
    postgres=# select info,rank(5.1) within group (order by id),dense_rank(5.1) within group (order by id),percent_rank(5.1) within group (order by id),cume_dist(5.1) within group (order by id) from test group by info;
   
   
   
   
    
     info  | rank | dense_rank |   percent_rank    |     cume_dist     
   
   
   
   
    
    -------+------+------------+-------------------+-------------------
   
   
   
   
    
     test1 |    2 |          2 |                 1 |                 1
   
   
   
   
    
     test2 |    8 |          5 | 0.777777777777778 |               0.8
   
   
   
   
    
     test3 |    1 |          1 |                 0 | 0.333333333333333
   
   
   
   
    
     test4 |    1 |          1 |                 0 |               0.5
   
   
   
   
    
    (4 rows)
   
   
  
  
  
  
   
   

  
  
  
  
   
   插入计算值5.1后, 数据变成 : 
  
  
  
  
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        2 | test2
   
   
   
   
    
        3 | test2
   
   
   
   
    
        4 | test2
   
   
   
   
    
        5 | test2
   
   
   
   
    
        5.1 | test2  # 计算位置, 即参数值
   
   
   
   
    
        6 | test2
   
   
   
   
    
        7 | test2

结果自己验证吧.

例子4 :


   
   
    
    postgres=# select info,rank(5) within group (order by id desc),dense_rank(5) within group (order by id desc),percent_rank(5) within group (order by id desc),cume_dist(5) within group (order by id desc) from test group by info;
   
   
   
   
    
     info  | rank | dense_rank |   percent_rank    | cume_dist 
   
   
   
   
    
    -------+------+------------+-------------------+-----------
   
   
   
   
    
     test1 |    1 |          1 |                 0 |       0.5
   
   
   
   
    
     test2 |    3 |          3 | 0.222222222222222 |       0.4
   
   
   
   
    
     test3 |    3 |          3 |                 1 |         1
   
   
   
   
    
     test4 |    2 |          2 |                 1 |         1
   
   
   
   
    
    (4 rows)

插入计算值5后, 数据变成 :

7 | test2

6 | test2

5 | test2 # 注意, 这才是计算位置, 即插入位置.

5 | test2

4 | test2

3 | test2

2 | test2

结果自己验证吧.

[参考]

1. http://blog.163.com/digoal@126/blog/static/16387704020152223539859/

2. http://blog.163.com/digoal@126/blog/static/1638770402015224124337/

3. http://blog.163.com/digoal@126/blog/static/1638770402015379286873/

4. http://www.postgresql.org/docs/devel/static/functions-window.html

5. http://www.postgresql.org/docs/devel/static/functions-aggregate.html

Postgresql中国用户会

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
PostgreSQL aggregate function 4 : Hypothetical-Set Aggregate Functions

本文讲一下聚合函数的最后一个分类, Hypothetical-Set Aggregate Functions.这类聚合函数还有对应的窗口函数, 首先来看一下对应窗口函数的用法.rank()bigintrank of the current row with gaps; same as row_number of its first peer
复制链接

扫一扫

专栏目录