PostgreSQL Aggregate Functions Explained - 6: Ordered-Set Aggregates

flowerspring

德哥 2015-12-14 17:14:00

Examples of ordered-set aggregates.

Table 9-51. Ordered-Set Aggregate Functions

Function	Direct Argument Type(s)	Aggregated Argument Type(s)	Return Type	Description
`mode() WITHIN GROUP (ORDER BY sort_expression)`		any sortable type	same as sort expression	returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results)
`percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)`	double precision	double precision or interval	same as sort expression	continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed
`percentile_cont(fractions) WITHIN GROUP (ORDER BY sort_expression)`	double precision[]	double precision or interval	array of sort expression's type	multiple continuous percentile: returns an array of results matching the shape of the fractions parameter, with each non-null element replaced by the value corresponding to that percentile
`percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression)`	double precision	any sortable type	same as sort expression	discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction
`percentile_disc(fractions) WITHIN GROUP (ORDER BY sort_expression)`	double precision[]	any sortable type	array of sort expression's type	multiple discrete percentile: returns an array of results matching the shape of the fractions parameter, with each non-null element replaced by the input value corresponding to that percentile

All the aggregates listed in Table 9-51 ignore null values in their sorted input. For those that take a fraction parameter, the fraction value must be between 0 and 1; an error is thrown if not. However, a null fraction value simply produces a null result.

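mode is the easiest to understand: it returns the most frequent value (or expression value) in the group. If several values tie for the highest frequency, one of them is chosen arbitrarily (per the table above, the first one).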

postgres=# create table test(id int, info text);

CREATE TABLE

postgres=# insert into test values (1,'test1');

INSERT 0 1

postgres=# insert into test values (1,'test1');

INSERT 0 1

postgres=# insert into test values (1,'test2');

INSERT 0 1

postgres=# insert into test values (1,'test3');

INSERT 0 1

postgres=# insert into test values (2,'test1');

INSERT 0 1

postgres=# insert into test values (2,'test1');

INSERT 0 1

postgres=# insert into test values (2,'test1');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# insert into test values (3,'test4');

INSERT 0 1

postgres=# select * from test;

id | info

----+-------

1 | test1

1 | test1

1 | test2

1 | test3

2 | test1

2 | test1

2 | test1

3 | test4

3 | test4

3 | test4

3 | test4

3 | test4

(12 rows)

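Take the most frequent info across all rows. The result could be either test1 or test4, because both occur 5 times.

The return type of mode is the same as the type of the expression after ORDER BY.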

postgres=# select mode() within group (order by info) from test;

mode

-------

test1

(1 row)
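To make the tie-breaking concrete, here is a minimal Python sketch of the mode semantics described above. It is not PostgreSQL's implementation; the function name `mode_within_group` is made up for illustration, and breaking ties by the smallest value in sort order is an assumption chosen to match the test1 result above (the documentation only promises an arbitrary choice).

```python
from collections import Counter

def mode_within_group(values):
    """Most frequent non-null value; ties go to the smallest value in sort order.

    Hypothetical helper for illustration only, not PostgreSQL code.
    """
    counts = Counter(v for v in values if v is not None)  # nulls are ignored
    if not counts:
        return None  # assumption: no non-null input gives a null result
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

# The info column of the 12-row test table above.
info = ["test1", "test1", "test2", "test3",
        "test1", "test1", "test1",
        "test4", "test4", "test4", "test4", "test4"]

print(mode_within_group(info))  # test1 and test4 both occur 5 times
```

Both candidates have the same count here, so the answer depends entirely on the tie-breaking rule.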

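Grouping by info and then taking the most frequent info is meaningless: every group contains a single distinct info value, so the result is just the list of distinct info values.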

postgres=# select mode() within group (order by info) from test group by info;

mode

-------

test1

test2

test3

test4

(4 rows)

Grouping by id and taking the most frequent info within each group, on the other hand, is meaningful.

postgres=# select mode() within group (order by info) from test group by id;

mode

-------

test1

test1

test4

(3 rows)

For id=1, the most frequent info is test1, which occurs 2 times, as the per-group counts show:

postgres=# select id,info,count(*) from test group by id,info;

id | info | count

----+-------+-------

1 | test1 | 2

1 | test3 | 1

3 | test4 | 5

1 | test2 | 1

2 | test1 | 3

(5 rows)

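To return the mode together with its frequency, you can use the row_number() window function as follows. (When two values tie within a group, which one gets rn=1 is not deterministic unless you add a tie-breaker such as info to the window's ORDER BY.)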

postgres=# select id,info,cnt from (select id,info,cnt,row_number() over(partition by id order by cnt desc) as rn from (select id,info,count(*) cnt from test group by id,info) t) t where t.rn=1;

id | info | cnt

----+-------+-----

1 | test1 | 2

2 | test1 | 3

3 | test4 | 5

(3 rows)
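The same "count per (id, info), then keep the top-ranked row per id" idea can be sketched in plain Python. This is only an illustration of the logic of the query above; `mode_with_count` is a hypothetical name, and the secondary sort on the value is an added tie-breaker that the SQL above does not have.

```python
from collections import Counter, defaultdict

# The (id, info) rows of the 12-row test table above.
rows = [(1, "test1"), (1, "test1"), (1, "test2"), (1, "test3"),
        (2, "test1"), (2, "test1"), (2, "test1"),
        (3, "test4"), (3, "test4"), (3, "test4"), (3, "test4"), (3, "test4")]

def mode_with_count(rows):
    """Return (id, most frequent info, count) for each id."""
    per_group = defaultdict(Counter)
    for group_id, info in rows:
        per_group[group_id][info] += 1          # count(*) ... group by id, info
    result = []
    for group_id in sorted(per_group):
        counts = per_group[group_id]
        # Rank by count descending, then by value so that ties are deterministic.
        info, cnt = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append((group_id, info, cnt))
    return result

for row in mode_with_count(rows):
    print(row)
```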

Again, the return type of mode matches the type of the ORDER BY expression:

postgres=# select mode() within group (order by id) from test;

mode

------

3

(1 row)

postgres=# select mode() within group (order by id+1) from test;

mode

------

4

(1 row)

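The other four functions are related to the distribution of the data. Each takes a position between 0 and 1 and returns the value (or expression value) found at that position in the sorted input.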

src/backend/utils/adt/orderedsetaggs.c

if (percentile < 0 || percentile > 1 || isnan(percentile))

ereport(ERROR,

(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),

errmsg("percentile value %g is not between 0 and 1",

percentile)));
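As a rough illustration of the argument rules (a fraction outside 0..1 or NaN is an error, while a null fraction just yields null), here is a small Python sketch. `check_fraction` is a hypothetical helper, and mapping SQL NULL to Python `None` is an assumption of the sketch.

```python
import math

def check_fraction(percentile):
    """Return the fraction unchanged, or raise if it is not within [0, 1].

    Hypothetical helper mirroring the range check quoted above.
    """
    if percentile is None:
        return None  # a null fraction produces a null result, not an error
    if percentile < 0 or percentile > 1 or math.isnan(percentile):
        raise ValueError(
            f"percentile value {percentile:g} is not between 0 and 1")
    return percentile

print(check_fraction(0.5))
print(check_fraction(None))
try:
    check_fraction(1.5)
except ValueError as exc:
    print(exc)
```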

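You also need to distinguish between the continuous and the discrete distribution. The examples below reuse the table name test with new data, so drop the earlier table first (drop table test;).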

postgres=# create table test(id int, info text);

CREATE TABLE

postgres=# insert into test values (1,'test1');

INSERT 0 1

postgres=# insert into test values (2,'test2');

INSERT 0 1

postgres=# insert into test values (3,'test2');

INSERT 0 1

postgres=# insert into test values (4,'test2');

INSERT 0 1

postgres=# insert into test values (5,'test2');

INSERT 0 1

postgres=# insert into test values (6,'test2');

INSERT 0 1

postgres=# insert into test values (7,'test2');

INSERT 0 1

postgres=# insert into test values (8,'test3');

INSERT 0 1

postgres=# insert into test values (100,'test3');

INSERT 0 1

postgres=# insert into test values (1000,'test4');

INSERT 0 1

postgres=# select * from test;

id | info

------+-------

1 | test1

2 | test2

3 | test2

4 | test2

5 | test2

6 | test2

7 | test2

8 | test3

100 | test3

1000 | test4

(10 rows)

The median of a continuous distribution can be obtained with percentile_cont(0.5).

postgres=# select percentile_cont(0.5) within group (order by id) from test;

percentile_cont

-----------------

5.5

(1 row)

How is this 5.5 computed? See the algorithm quoted at the end of this article:

If (CRN = FRN = RN) then the result is

(value of expression from row at RN)

Otherwise the result is

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)

Explanation:

N = number of rows in the current group = 10

RN = (1 + fraction*(N-1)) = (1+0.5*(10-1)) = 5.5

CRN = ceiling(RN) = 6

FRN = floor(RN) = 5

value of expression for row at FRN: the value in row FRN of the current group = 5

value of expression for row at CRN: the value in row CRN of the current group = 6

So the median is:

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN) =

(6-5.5)*(5) + (5.5 - 5)*(6) = 5.5;
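The calculation above can be checked with a short Python sketch of the RN/FRN/CRN interpolation. This is a from-scratch illustration of the formula, not PostgreSQL's code; `percentile_cont` here is a plain Python function, and treating `None` as SQL NULL is an assumption.

```python
import math

def percentile_cont(fraction, values):
    """Continuous percentile: linear interpolation between adjacent rows."""
    data = sorted(v for v in values if v is not None)  # nulls are ignored
    if fraction is None or not data:
        return None
    if not 0 <= fraction <= 1:                         # also rejects NaN
        raise ValueError("fraction must be between 0 and 1")
    n = len(data)
    rn = 1 + fraction * (n - 1)      # 1-based, possibly fractional row number
    frn, crn = math.floor(rn), math.ceil(rn)
    if frn == crn:                   # RN falls exactly on a row
        return data[frn - 1]
    return (crn - rn) * data[frn - 1] + (rn - frn) * data[crn - 1]

ids = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000]
print(percentile_cont(0.5, ids))     # 5.5
```

With N = 10 the row number is 5.5, so the result is halfway between the 5th and 6th values.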

Grouping by info:

postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;

percentile_cont | info

-----------------+-------

1 | test1

4.5 | test2

54 | test3

1000 | test4

(4 rows)

Verifying the value 4.5 | test2:

2 | test2

3 | test2

4 | test2

5 | test2

6 | test2

7 | test2

N = number of rows in the current group = 6

RN = (1 + fraction*(N-1)) = (1+0.5*(6-1)) = 3.5

CRN = ceiling(RN) = 4

FRN = floor(RN) = 3

value of expression for row at FRN: the value in row FRN of the current group = 4

value of expression for row at CRN: the value in row CRN of the current group = 5

So the median is:

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN) =

(4-3.5)*(4) + (3.5 - 3)*(5) = 4.5;

When the argument is an array, the result is also an array:

postgres=# select percentile_cont(array[0.5, 1]) within group (order by id) from test;

percentile_cont

-----------------

{5.5,1000}

(1 row)
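A hedged Python sketch of the array form, limited to one-dimensional input: each non-null fraction is replaced by its interpolated value and null fractions stay null. `percentile_cont_multi` is a hypothetical name, and returning `None` for an empty input is an assumption of the sketch.

```python
import math

def percentile_cont_multi(fractions, values):
    """Apply the continuous-percentile formula to each fraction in a list."""
    data = sorted(v for v in values if v is not None)  # nulls are ignored
    if not data:
        return None  # assumption: no input rows gives a null result
    n = len(data)
    results = []
    for p in fractions:
        if p is None:
            results.append(None)     # a null fraction stays null
            continue
        rn = 1 + p * (n - 1)         # 1-based row number for this fraction
        frn, crn = math.floor(rn), math.ceil(rn)
        if frn == crn:
            results.append(data[frn - 1])
        else:
            results.append((crn - rn) * data[frn - 1]
                           + (rn - frn) * data[crn - 1])
    return results

ids = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000]
print(percentile_cont_multi([0.5, 1], ids))  # [5.5, 1000]
```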

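Next, the discrete distribution: percentile_disc returns the value (or expression value) of the first row whose row number is greater than or equal to the specified fraction of the row count.

For example: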

postgres=# select id from test;

id

------

1

2

3

4

5

6

7

8

100

1000

(10 rows)

The group has 10 rows. Position 0.5 means the value (or expression value) of the first row whose row number is >= 0.5*10 = 5.

postgres=# select percentile_disc(0.5) within group (order by id) from test;

percentile_disc

-----------------

5

(1 row)

postgres=# select percentile_disc(0.5) within group (order by id^2) from test;

percentile_disc

-----------------

25

(1 row)

With 0.11, the result is the value of the first row whose row number is >= 0.11*10 = 1.1, that is, row 2.

postgres=# select percentile_disc(0.11) within group (order by id) from test;

percentile_disc

-----------------

2

(1 row)
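The row-number rule used in these examples can be sketched in Python as "take row ceil(fraction * N), but never less than row 1". This is an illustration of the rule as described here, not PostgreSQL's source; the function name and the `None`-as-NULL convention are assumptions.

```python
import math

def percentile_disc(fraction, values):
    """Discrete percentile: first row whose row number is >= fraction * N."""
    data = sorted(v for v in values if v is not None)  # nulls are ignored
    if fraction is None or not data:
        return None
    if not 0 <= fraction <= 1:                         # also rejects NaN
        raise ValueError("fraction must be between 0 and 1")
    row = max(1, math.ceil(fraction * len(data)))      # 1-based row number
    return data[row - 1]

ids = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000]
print(percentile_disc(0.5, ids))   # 5: row 5 is the first row with number >= 5
print(percentile_disc(0.11, ids))  # 2: row 2 is the first row with number >= 1.1
```

Unlike the continuous version, the result is always one of the input values, so it also works for types that cannot be interpolated.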

Another example:

postgres=# select id,info,count(*) over (partition by info) from test;

id | info | count

------+-------+-------

1 | test1 | 1

2 | test2 | 6

3 | test2 | 6

4 | test2 | 6

5 | test2 | 6

6 | test2 | 6

7 | test2 | 6

8 | test3 | 2

100 | test3 | 2

1000 | test4 | 1

(10 rows)

Now with grouping. Look at the test2 group: it has 6 rows and 0.3*6 = 1.8, so the second row of the group is returned.

postgres=# select info,percentile_disc(0.3) within group (order by id) from test group by info;

info | percentile_disc

-------+-----------------

test1 | 1

test2 | 3

test3 | 8

test4 | 1000

(4 rows)

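[Note]

The percentile is computed over the values of the expression; it is not the percentile of the column with the expression applied afterwards.

The queries below verify this (you can also read the source code):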

postgres=# select percentile_cont(0.5) within group (order by id^2),info from test group by info;

percentile_cont | info

-----------------+-------

1 | test1

20.5 | test2

5032 | test3

1000000 | test4

(4 rows)

postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;

percentile_cont | info

-----------------+-------

1 | test1

4.5 | test2

54 | test3

1000 | test4

(4 rows)

postgres=# select 4.5^2;

?column?

---------------------

20.2500000000000000

(1 row)

postgres=# select 54^2;

?column?

----------

2916

(1 row)
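The same point can be reproduced outside the database. In this Python sketch `statistics.median` stands in for percentile_cont(0.5), which is an assumption of the sketch: for an even number of rows both average the two middle values. The ids are those of the test2 group.

```python
import statistics

test2_ids = [2, 3, 4, 5, 6, 7]

# Percentile of the expression: square first, then take the median.
median_of_squares = statistics.median(x ** 2 for x in test2_ids)

# Percentile of the column, with the expression applied afterwards.
square_of_median = statistics.median(test2_ids) ** 2

print(median_of_squares)  # 20.5
print(square_of_median)   # 20.25
```

The first number matches the 20.5 returned for test2 with order by id^2, and the second matches 4.5^2, so the aggregate clearly works on the expression values.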

[References]

1. http://www.postgresql.org/docs/devel/static/functions-aggregate.html

2. http://blog.163.com/digoal@126/blog/static/16387704020152223539859/

3. http://blog.163.com/digoal@126/blog/static/1638770402015224124337/

4. http://blog.163.com/digoal@126/blog/static/16387704020137124851944

5. src/backend/utils/adt/orderedsetaggs.c

6. Algorithm:

PERCENTILE_CONT explained:

The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, you can compute the row number you are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1 + P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

The final result will be:

If (CRN = FRN = RN) then the result is

(value of expression from row at RN)

Otherwise the result is

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)

PERCENTILE_DISC explained:

The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value. This expression must be constant within each aggregate group. The ORDER BY clause takes a single expression that can be of any type that can be sorted.

For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.
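The CUME_DIST formulation can be sketched in Python as well. Here the cumulative distribution of a value is taken to be the number of rows less than or equal to it divided by the total row count; the function name is hypothetical and the sketch assumes non-null, sortable input.

```python
from bisect import bisect_right

def percentile_disc_by_cume_dist(p, values):
    """Smallest value whose cumulative distribution is >= p."""
    data = sorted(values)
    n = len(data)
    for v in data:
        # cume_dist(v) = (number of rows <= v) / (total rows)
        cume_dist = bisect_right(data, v) / n
        if cume_dist >= p:
            return v
    return None  # only reached when there are no rows

ids = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000]
print(percentile_disc_by_cume_dist(0.5, ids))   # 5, since cume_dist(5) = 0.5
print(percentile_disc_by_cume_dist(0.11, ids))  # 2, since cume_dist(2) = 0.2
```

For the sample data this gives the same answers as counting row numbers, which is why the two descriptions of percentile_disc agree.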

MEDIAN in detail. Oracle has a dedicated function for computing the median, which is effectively PERCENTILE_CONT(0.5):

MEDIAN is an inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns the middle value or an interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

This function takes as arguments any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. If you specify only expr, then the function returns the same data type as the numeric data type of the argument. If you specify the OVER clause, then Oracle Database determines the argument with the highest numeric precedence, implicitly converts the remaining arguments to that data type, and returns that data type.

The result of MEDIAN is computed by first ordering the rows. Using N as the number of rows in the group, Oracle calculates the row number (RN) of interest with the formula RN = (1 + 0.5*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

The final result will be:

if (CRN = FRN = RN) then

(value of expression from row at RN)

else

(CRN - RN) * (value of expression for row at FRN) +

(RN - FRN) * (value of expression for row at CRN)
