Select first row in each GROUP BY group?
stackflow上面的一个问题。用窗口函数比较简单,但是那些没有窗口函数的数据库怎么办?
id | customer | total ---+----------+------ 1 | Joe | 5 2 | Sally | 3 3 | Joe | 2 4 | Sally | 1
WITH summary AS (
SELECT p.id,
p.customer,
p.total,
ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) AS ranks
FROM PURCHASES p)
SELECT s.*
FROM summary s
WHERE s.ranks = 1
所以给出通用方法:
SELECT MIN(x.id), -- change to MAX if you want the highest
x.customer,
x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
MAX(total) AS max_total
FROM PURCHASES p
GROUP BY p.customer) y ON y.customer = x.customer
AND y.max_total = x.total
GROUP BY x.customer, x.total
PS:原博还提到了一种Postresql中特有的解法:DISTINCT ON ()
SELECT DISTINCT ON (customer)
id, customer, total
FROM purchases
ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2)
id, customer, total
FROM purchases
ORDER BY 2, 3 DESC, 1;
If total
can be NULL (won't hurt either way, but you'll want to match existing indexes):
...
ORDER BY customer, total DESC NULLS LAST, id;
If total
can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST
like demonstrated.
--如果total可以为空,则最可能希望具有最大非空值的行。最后添加空值。具体可参照:
其实有点不明白distinct on,看前辈的博客点击打开链接。还用了IN 子查询
DISTINCT ON ( expression [, …] )把记录根据[, …]的值进行分组,分组之后仅返回每一组的第一行。需要注意的是,如果你不指定ORDER BY子句,返回的第一条的不确定的。如果你使用了ORDER BY 子句,那么[, …]里面的值必须靠近ORDER BY子句的最左边。
1. 当没用指定ORDER BY子句的时候返回的记录是不确定的。
postgres=# select distinct on(course)id,name,course,score from student;
id | name | course | score
----+--------+--------+-------
10 | 周星驰 | 化学 | 83
8 | 周星驰 | 外语 | 88
2 | 周润发 | 数学 | 99
14 | 黎明 | 物理 | 90
6 | 周星驰 | 语文 | 91
(5 rows)
2. 获取每门课程的最高分
postgres=# select distinct on(course)id,name,course,score from student order by course,score desc;
id | name | course | score
----+--------+--------+-------
5 | 周润发 | 化学 | 87
13 | 黎明 | 外语 | 95
2 | 周润发 | 数学 | 99
14 | 黎明 | 物理 | 90
6 | 周星驰 | 语文 | 91
(5 rows)
3. 如果指定ORDER BY 必须把分组的字段放在最左边
postgres=# select distinct on(course)id,name,course,score from student order by score desc;
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on(course)id,name,course,score from student ...
4. 获取每门课程的最高分同样可以使用IN子句来实现
postgres=# select * from student where(course,score) in(select course,max(score) from student group by course);
id | name | course | score
----+--------+--------+-------
2 | 周润发 | 数学 | 99
5 | 周润发 | 化学 | 87
6 | 周星驰 | 语文 | 91
13 | 黎明 | 外语 | 95
14 | 黎明 | 物理 | 90
(5 rows)
原文还提到
在 row_number() over(), distinct on和in子句之间有一个小区别
,主要是因为前两个方法是用行号,且行号唯一。解决办法就是用rank()窗口函数,让同成绩的行号出现重复。
下面是一位大神提供的6种方法,有些需要在PG中实现。而且这个大神的建表语句也让我学习了。
Queries
1. row_number()
in CTE, (see other answer) 公用表达式
WITH cte AS (
SELECT id, customer_id, total
, row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
)
SELECT id, customer_id, total
FROM cte
WHERE rn = 1;
2. row_number()
in subquery (my optimization) 子查询
SELECT id, customer_id, total
FROM (
SELECT id, customer_id, total
, row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
) sub
WHERE rn = 1;
3. DISTINCT ON
(see other answer)
SELECT DISTINCT ON (customer_id)
id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC, id;
4. rCTE with LATERAL
subquery (see here) 递归和LATERAL
在FROM 或者JOIN子句的子查询里面可以关联查询FROM子句或者JOIN子句的另一边的子句或者表.
见这一篇→点击打开链接
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id > c.customer_id -- lateral reference
ORDER BY customer_id, total DESC
LIMIT 1
) u
)
SELECT id, customer_id, total
FROM cte
ORDER BY customer_id;
5. customer
table with LATERAL
(see here)
SELECT l.*
FROM customer c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id = c.customer_id -- lateral reference
ORDER BY total DESC
LIMIT 1
) l;
6. array_agg()
with ORDER BY
(see other answer)
SELECT (array_agg(id ORDER BY total DESC))[1] AS id
, customer_id
, max(total) AS total
FROM purchases
GROUP BY customer_id;
Results(性能)
Execution time for above queries with EXPLAIN ANALYZE
(and all options off), best of 5 runs.
All queries used an Index Only Scan on purchases2_3c_idx
(among other steps). Some of them just for the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per customer_id
1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4. 92.922 ms
5. 37.679 ms -- winner
6. 189.495 ms
B. The same with Postgres 9.5
1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4. 78.032 ms
5. 33.944 ms -- winner
6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per customer_id
1. 381.573 ms
2. 311.976 ms
3. 124.074 ms -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms
参考资料:https://www.oschina.net/translate/postgresqls-powerful-new-join-type-lateral