SQLite | Group By 和 Order By 子句_sqlite3.8 聚合函数 order by-CSDN博客

本文链接：https://blog.csdn.net/weixin_45488228/article/details/104436653

文章目录

1. Group by and Order by

1. Group by and Order by

我们在上一篇中介绍了 Where 子句，接下来我们将使用 Group by 和 Order by 子句，对数据进行聚合和排序。

使用Jupyter Notebook 运行 SQL 语句需安装 ipython-sql
%sql 以及 %%sql 为在 Notebook 中运行 SQL 语句，在 SQLite 命令行或 SQLite Stiduo 中不需要 %sql 或 %%sql

载入 SQL 以及连接 SQLite：

%load_ext sql
%sql sqlite:///DataBase/weather_stations.db

'Connected: @DataBase/weather_stations.db'

本文将使用 weather_stations.db 数据库，其中包含了 STATION_DATA 表。

首先查看 STATION_DATA 表中的数据：

%sql select * from station_data limit 0,5; -- 筛选前五行

 * sqlite:///DataBase/weather_stations.db
Done.

station_number	report_code	year	month	day	dew_point	station_pressure	visibility	wind_speed	temperature	precipitation	snow_depth	fog	rain	hail	thunder	tornado
143080	34DDA7	2002	12	21	33.8	987.4	3.4	0.2	36	0	None	1	1	1	1	1
766440	39537B	1998	10	1	72.7	1014.6	5.9	6.7	83.3	0	None	0	0	0	0	0
176010	C3C6D5	2001	5	18	55.7	None	7.3	4.3	69.1	0	None	0	0	0	0	0
125600	145150	2007	10	14	33	None	6.9	2.5	39.7	0	None	0	0	0	0	0
470160	EF616A	1967	7	29	65.6	None	9.2	1.2	72.4	0.04	None	0	0	0	0	0

1.1 Group Records

首先从最简单的聚合方法开始：计数：

%%sql
select count(*) as record_cound from station_data;

 * sqlite:///DataBase/weather_stations.db
Done.

record_cound
28000

count(*) 意味着计算记录的长度，你也可以和其他 SQL 操作符结合起来使用，比如 where，我们可以这样计算 tornado 出现的次数：

%%sql
select count(*) as record_count from station_data
where tornado == 1;

 * sqlite:///DataBase/weather_stations.db
Done.

record_count
3000

我们找到了 3000 条包含 tornado 的记录，但如果我们想要按年计数呢？我们可以这样写：

%%sql
select year, 
count(*) as record_count 
from station_data
where tornado == 1
group by year
limit 0,3; -- 只展示前三条

 * sqlite:///DataBase/weather_stations.db
Done.

year	record_count
1937	3
1941	3
1942	3

我们现在可以看到每年的计数，让我们拆分下这个查询来看看怎么执行的：

select year,              -- 1. 首先，我们选择了 year（select year）
count(*) as record_count  -- 2. 然后我们用 **count(\*)** 对筛选的记录进行了计数
from station_data
where tornado == 1        -- 3. 我们筛选了 tornado 为 true 的数据
group by year             -- 4. 最后，按年进行分类

我们也可以在多个 field 上进行聚合：

%%sql
select year, month,
count(*) as record_count
from station_data
where tornado == 1
group by year, month
limit 0,3;

 * sqlite:///DataBase/weather_stations.db
Done.

year	month	record_count
1937	7	3
1941	8	3
1942	10	3

此外，在使用 group by 时，我们可以也用 序数位置（ordinal positions）：

%%sql
select year, month,
count(*) as record_count
from station_data
where tornado == 1
group by 1, 2 -- ordinal positions
limit 0,5;

 * sqlite:///DataBase/weather_stations.db
Done.

year	month	record_count
1937	7	3
1941	8	3
1942	10	3
1943	1	3
1943	4	3

不是所有的平台都支持 ordinal positions，例如 Oracle 和 SQL Server，就只能写全称

1.2 Ordering Records

需要注意到，我们通过 group 得到的数据中 month 并不是按自然月份排序的，所以字哦好就是同时使用 oreder by 操作符来进行排序，如果你想要先按年份排序，再按月份排序，你只需要添加：

%%sql
select year, month,
count(*) as record_count
from station_data
where tornado == 1
group by 1, 2 -- ordinal positions
order by 1, 2 -- order by 同样支持 ordinal positions
limit 0,5;

 * sqlite:///DataBase/weather_stations.db
Done.

year	month	record_count
1937	7	3
1941	8	3
1942	10	3
1943	1	3
1943	4	3

order by 默认是按升序（ASC）排列的，然而你可能更对近期的数据感兴趣，你可以通过添加 DESC 来指定排序方式：

%%sql
select year, month,
count(*) as record_count
from station_data
where tornado == 1
group by year, month
order by year DESC, month
limit 0,5;

 * sqlite:///DataBase/weather_stations.db
Done.

year	month	record_count
2010	3	6
2009	1	3
2009	2	3
2009	4	2
2009	5	6

1.3 Aggregate Functions

我们已经使用 count(*) 来对记录进行计数了，但还有其他的一些聚合函数（AggregateyFunctions），
如 sum()、min()、max() 和 avg()。我们可以在特定的列上使用聚合函数来进行计算。

图1 SQLite 内置聚合函数

但首先让我们来看看 count() 的另一种使用方式， count() 可以用于除了计数以外的其他用途。如果你不使用 * ，
而是指定某一列，那么它将会计算所有非缺失值（non-null）的个数。举个例子，我们可以计算 snow_depth 中非缺失值的个数：

%%sql
select count(snow_depth) as recorded_snow_depth_count
from station_data

 * sqlite:///DataBase/weather_stations.db
Done.

recorded_snow_depth_count
1552

让我们进一步看看聚合函数，如果你想要看看你从 2000 年开始每个月的平均温度，你可以先筛选 2000 年的记录，
然后按月份分组，最后计算平均温度：

%%sql
select month, avg(temperature) as avg_temp
from station_data
where year >= 2000
group by month
limit 0,3;

 * sqlite:///DataBase/weather_stations.db
Done.

month	avg_temp
1	41.55585443037976
2	38.98063127690104
3	48.975062656641576

sum() 是另一个常见的聚合操作符，为了得到 2000 年至今每年的下雪深度，你可以这样查询：

%%sql
select year, sum(snow_depth) as total_snow
from station_data
where year >= 2000
group by year
limit 0,3;

 * sqlite:///DataBase/weather_stations.db
Done.

year	total_snow
2000	685.8999999999999
2001	391.90000000000003
2002	437.69999999999993

你可以在一次查询中多次使用聚合操作，我们将 2000 年以来的下雪总量、下雨总量和最大降雨量分别统计出来，并保留两位小数：

%%sql
select year,
round(sum(snow_depth), 2) as total_snow,
round(sum(precipitation), 2) as total_precipitation,
round(max(precipitation), 2) as max_precipitation
from station_data
where year >= 2000
group by year
limit 0,3;

 * sqlite:///DataBase/weather_stations.db
Done.

year	total_snow	total_precipitation	max_precipitation
2000	685.9	27.57	0.87
2001	391.9	38.15	2.95
2002	437.7	43.06	5.0

1.4 The Having Statement

假设你想要基于一个聚合值来筛选记录，你的第一反应应该是使用 where 子句。确实， where 子句可以
用来筛选记录，但是却无法用于聚合值上。举个例子，如果你想使用 where 子句筛选出总下雨量大于 30 的记录，
就会出现以下错误：

%%sql
select year, sum(precipitation) as total_precipitation
from station_data
where total_precipitation > 30
group by year
limit 0,3;

 * sqlite:///DataBase/weather_stations.db
(sqlite3.OperationalError) misuse of aggregate: sum()
[SQL: select year, sum(precipitation) as total_precipitation
from station_data
where total_precipitation > 30
group by year
limit 0,3;]
(Background on this error at: http://sqlalche.me/e/e3q8)

为什么不起作用呢？首先我们来看下聚合的原理，首先程序一行一行的扫描，找出那些在 where 子句
上成立的数据，然后再进行聚合。然而在聚合前并没有 total_precipitation 这一列数据，因此出错。

当你想在聚合值上执行 where 这个方法时，只能使用 having 这个关键词：

%%sql
select year,
sum(precipitation) as total_precipitation
from station_data
group by year
having total_precipitation > 30
limit 0,3

 * sqlite:///DataBase/weather_stations.db
Done.

year	total_precipitation
1973	35.07999999999996
1974	42.209999999999994
1975	48.25999999999997

having 相当于聚合版的 where，但并不是所有平台都支持在 aliases 上使用 having ，
如 Oracle（group by 也不行），这意味着当你使用 having 时需要再输入一次聚合函数，像这样：

%%sql
select year,
sum(precipitation) as total_preicipitation
from station_data
group by year
having sum(precipitation) > 30
limit 0,3

 * sqlite:///DataBase/weather_stations.db
Done.

year	total_preicipitation
1973	35.07999999999996
1974	42.209999999999994
1975	48.25999999999997

1.5 Getting Distinct Records

当我们使用 **select from** 时，记录中可能会包含重复值，如果你只想要返回**唯一值（distinct records）**，
你可以使用 **select distinct from**，比如我们的 station_data，表中 station_number 一列包含
了 28000 个值，但你通过 **select distinct from** 后会发现其中是 6368 个值不断重复出现组成的

%%sql
select count(station_number) as duplicate_num
from station_data;

 * sqlite:///DataBase/weather_stations.db
Done.

duplicate_num
28000

%%sql
select count(distinct station_number) as distinct_num
from station_data;

 * sqlite:///DataBase/weather_stations.db
Done.

distinct_num
6368

参考资料

[1] Thomas Nield.Getting Started with SQL[M].US: O’Reilly, 2016: 29-37