有序集是一种强大的特性,其想法实际上很简单:数据被正常分组,然后将每个分组中的数据按照给定条件排序,最后在排序好的数据上做计算。
装载测试数据
测试数据内容见:测试数据oil_ext.txt
CREATE TABLE t_oil (
region text,
country text,
year int,
production int,
consumption int
);
COPY t_oil
FROM
PROGRAM 'cat /home/postgres/oil_ext.txt';
percentile_函数
中位数计算:
SELECT
region,
percentile_disc(0.5) WITHIN GROUP (
ORDER BY
production
)
FROM
t_oil
GROUP BY
1;
region | percentile_disc
---------------+-----------------
Middle East | 1082
North America | 3054
(2 rows)
percentile_disc将跳过该组的50%并返回想要的值。其效果类似于:
SELECT
production
FROM
t_oil
WHERE
region = 'Middle East'
ORDER BY
production
LIMIT 1 OFFSET (
SELECT
round(count(production) * 0.5) - 1
FROM
t_oil
WHERE
region = 'Middle East'
);
production
------------
1082
(1 row)
可以把有序集和分组集一起使用:
SELECT
region,
percentile_disc(0.5) WITHIN GROUP (
ORDER BY
production
)
FROM
t_oil
GROUP BY
ROLLUP (1);
region | percentile_disc
---------------+-----------------
Middle East | 1082
North America | 3054
| 1696
(3 rows)
PostgreSQL提供了两种percentile_函数。percentile_disc返回一个值,它是数据集中实际包含的值;percentile_cont将在找不到精确匹配时插值。
SELECT
percentile_disc(0.62) WITHIN GROUP (
ORDER BY
id
),
percentile_cont(0.62) WITHIN GROUP (
ORDER BY
id
)
FROM
generate_series(1, 5) AS id;
percentile_disc | percentile_cont
-----------------+-----------------
4 | 3.48
(1 row)
mode函数
mode函数可用在分组中查找最频繁的值。
SELECT
production,
count(*)
FROM
t_oil
WHERE
country = 'Other Middle East'
GROUP BY
production
ORDER BY
2 DESC
limit 4;
production | count
------------+-------
50 | 5
48 | 5
52 | 5
53 | 4
(4 rows)
3个不同的production值都出现了5次,当然,mode函数只会返回其中一个:
SELECT
country,
mode() WITHIN GROUP (
ORDER BY
production
)
FROM
t_oil
WHERE
country = 'Other Middle East'
GROUP BY
1;
country | mode
-------------------+------
Other Middle East | 48
(1 row)