最近看到Datawhale出了套sql进阶教程, 链接地址:https://www.heywhale.com/home/competition/648d5f4a0f80cbce101aa718/content
随即试着进行练手,以下内容记录学习的内容和答题闯关过程(文中内容都是基于sqlite,在python环境中运行,部分函数,标识符与其他数据库如mysql等函数、用法可能会有一定冲突)本文包含其中关卡一和二,剩余关卡可移步至:
关卡一:SQL 基础速览:以 SQLite 为例
本章主要介绍sql的一般写法,内部逻辑等,内容较为基础,如果需要学习sql基础的同学,也可以移步至我之前写的文档进行学习:
【阿里云天池龙珠计划SQL训练营】SQL复习Task1_数据求学家的博客-CSDN博客
【阿里云天池龙珠计划SQL训练营】SQL复习Task2_数据求学家的博客-CSDN博客
【阿里云天池龙珠计划SQL训练营】SQL复习Task3_数据求学家的博客-CSDN博客
【阿里云天池龙珠计划SQL训练营】SQL复习Task4_数据求学家的博客-CSDN博客
【阿里云天池龙珠计划SQL训练营】SQL复习Task5_数据求学家的博客-CSDN博客
【阿里云天池龙珠计划SQL训练营】SQL复习Task6综合练习题_数据求学家的博客-CSDN博客
直接跳到答题区:
题目要求:使用泰坦尼克数据集,按照 10 岁为一个年龄段(0-9,10-19,20-29,30-39,40-49,50-59)且 60 岁以上(包括 60 岁)统一为一个年龄段(60+), 计算每个年龄段的用户的幸存率(幸存人数/总人数,使用 round 保留 2 位小数),将查询结果按照幸存率从高到低排序,你将回答如下几个问题:
Q1:最高的幸存率是多少?
Q2:最低的幸存率发生在哪一个组?
Q3:20-29 岁和 40-49 岁的乘客,哪组的幸存率更高?
Q4:没有提供年龄信息的乘客们的幸存率是多少?
SELECT
(case
when Age >=0 and Age <=9 then '0-9'
when Age >=10 and Age <=19 then '10-19'
when Age >=20 and Age <=29 then '20-29'
when Age >=30 and Age <=39 then '30-39'
when Age >=40 and Age <=49 then '40-49'
when Age >=50 and Age <=59 then '50-59'
when Age >=60 then '60' end),
count(*),
sum(Survived),
round(sum(Survived)*1.0/count(*),2)a
FROM example_table
group by
(case
when Age >=0 and Age <=9 then '0-9'
when Age >=10 and Age <=19 then '10-19'
when Age >=20 and Age <=29 then '20-29'
when Age >=30 and Age <=39 then '30-39'
when Age >=40 and Age <=49 then '40-49'
when Age >=50 and Age <=59 then '50-59'
when Age >=60 then '60' end)
order by a
LIMIT 20
[('60', 26, 7, 0.27), (None, 177, 52, 0.29), ('20-29', 220, 77, 0.35), ('40-49', 89, 34, 0.38), ('10-19', 102, 41, 0.4), ('50-59', 48, 20, 0.42), ('30-39', 167, 73, 0.44), ('0-9', 62, 38, 0.61)]
关卡二:复杂数据结构处理
本章主要介绍了用sql直接处理json等格式的数据,这其实在数据开发等实际工作过程中也是会遇到的。
1.学习内容
PassengerId | Survived | passengerInfo | ticketInfo | |
---|---|---|---|---|
0 | 1 | 0 | {"Name": "Braund, Mr. Owen Harris", "Sex": "ma... | {"Pclass": 3, "Ticket": "A/5 21171", "Fare": 7... |
1 | 2 | 1 | {"Name": "Cumings, Mrs. John Bradley (Florence... | {"Pclass": 1, "Ticket": "PC 17599", "Fare": 71... |
2 | 3 | 1 | {"Name": "Heikkinen, Miss. Laina", "Sex": "fem... | {"Pclass": 3, "Ticket": "STON/O2. 3101282", "F... |
3 | 4 | 1 | {"Name": "Futrelle, Mrs. Jacques Heath (Lily M... | {"Pclass": 1, "Ticket": "113803", "Fare": 53.1... |
4 | 5 | 0 | {"Name": "Allen, Mr. William Henry", "Sex": "m... | {"Pclass": 3, "Ticket": "373450", "Fare": 8.05... |
1)json_extract(用于提取某个键的值)
SELECT
PassengerId
,json_extract(passengerInfo,'$.Age') -- 一个路径表达式
FROM example_table_json
(1, 22.0)
其中,$.Age’里面 '.Age' 代表我们希望查询这个 JSON 字符串当中名为 'Age' 的字段值
2)json_patch(可以将两个 JSON 字段进行整合,类似于 Python 当中合并两个字典)
SELECT
PassengerId
,passengerInfo
,ticketInfo
,json_patch(passengerInfo,ticketInfo) -- 等价于 passengerInfo + ticketInfo
FROM example_table_json
(1, '{"Name": "Braund, Mr. Owen Harris", "Sex": "male", "Age": 22.0, "SibSp": 1, "Parch": 0}', '{"Pclass": 3, "Ticket": "A/5 21171", "Fare": 7.25, "Cabin": null, "Embarked": "S"}', '{"Name":"Braund, Mr. Owen Harris","Sex":"male","Age":22.0,"SibSp":1,"Parch":0,"Pclass":3,"Ticket":"A/5 21171","Fare":7.25,"Embarked":"S"}')
3) json_remove(从指定的 JSON 字段中去掉某些特定的字段,可以用于反选)
json_remove(passengerInfo,'$.Age','$.Sex')
4)json_insert(在 JSON 当中插入新的键值对)
,json_insert(ticketInfo,'$.Name',json_extract(passengerInfo,'$.Name')) -- 将 passengerInfo 当中的 Name 字段插入
,json_insert(ticketInfo,'$.Ticket','长期饭票') -- 已经存在的字段将不会被更新
5)json_replace(在 JSON 当中替换匹配的键所对应的值)
,json_replace(ticketInfo,'$.Name',json_extract(passengerInfo,'$.Name')) -- 不匹配的值将不会被更新
,json_replace(ticketInfo,'$.Ticket','长期饭票') -- 匹配的值会被更新
6)json_set(json_set() = json_replace() + json_insert。即对于匹配的值进行替换,对于不匹配的值将进行插入。)
,json_set(ticketInfo,'$.Name',json_extract(passengerInfo,'$.Name')) -- 不匹配的值将会做插入
,json_set(ticketInfo,'$.Ticket','长期饭票') -- 匹配的值会被更新
(1, '{"Name": "Braund, Mr. Owen Harris", "Sex": "male", "Age": 22.0, "SibSp": 1, "Parch": 0}', '{"Pclass": 3, "Ticket": "A/5 21171", "Fare": 7.25, "Cabin": null, "Embarked": "S"}', '{"Pclass":3,"Ticket":"A/5 21171","Fare":7.25,"Cabin":null,"Embarked":"S","Name":"Braund, Mr. Owen Harris"}', '{"Pclass":3,"Ticket":"长期饭票","Fare":7.25,"Cabin":null,"Embarked":"S"}')
7)数组的表示
'$.a[2]' 代表获取 a 字段下的数组中的第 3 个数值,例如 {"a":[1,3,5,7],"b":1} 将会返回数字 5,下方代码中json_group_array将普通数据变为了数组
SELECT
json_extract(passengerInfo,'$.Age') AS Age
,json_extract(json_group_array(PassengerId),'$[0]') -- 注意这里没有 . 了
,json_extract(json_group_array(PassengerId),'$[0]','$[3]') -- 查询数组中的多个值,如果数组长度不够则自动返回 NULL
FROM example_table_json
WHERE Survived==1 AND json_extract(passengerInfo,'$.Age') IS NOT NULL
AND Age>40
GROUP BY json_extract(passengerInfo,'$.Age')
[(41.0, 273, '[273,null]'), (42.0, 289, '[289,622]'), (43.0, 780, '[780,null]'), (44.0, 195, '[195,null]'), (45.0, 188, '[188,707]'), (47.0, 872, '[872,null]'), (48.0, 461, '[461,713]'), (49.0, 53, '[53,797]'), (50.0, 260, '[260,527]'), (51.0, 766, '[766,null]'), (52.0, 450, '[450,null]'), (53.0, 572, '[572,null]'), (54.0, 497, '[497,null]'), (55.0, 16, '[16,null]'), (56.0, 648, '[648,null]'), (58.0, 12, '[12,null]'), (60.0, 367, '[367,null]'), (62.0, 571, '[571,null]'), (63.0, 276, '[276,null]'), (80.0, 631, '[631,null]')]
json_extract(json_group_array(PassengerId),'$[#-1]')
其中的[#-1]代表数组中的最后一个数 ,# 符号代表的是当前数组的长度
8)将普通数据变为json格式
# 这里我们使用的是 example_table 这张表,是不含 JSON 字符串格式数据的
# 通过 json_object 函数构造 json 数据结构
# 为了节省空间,我们这里只构造 passengerInfo
SELECT
PassengerId
,Survived
/*
json_object 的输入格式:json_object(字段名A,字段值A,字段名B,字段值B...)
*/
,json_object('Sex',Sex,'Age',Age,'SibSp',SibSp,'Parch',Parch) AS passengerInfo
FROM example_table
(1, 0, '{"Sex":"male","Age":22.0,"SibSp":1,"Parch":0}')
2.答题
Q1:查询 example_table_json 中的 passengerInfo 字段时,下列哪个操作一定会改变字段值
A . json_extract(passengerInfo,'$.Sex')
B . json_insert(passengerInfo,'$.Sex','unknown')
C . json_replace(passengerInfo,'$.PassengerName',json_extract(passengerInfo,'$.Name'))
D . json_set(passengerInfo,'$.Age','2002')
Q2:参考 1)-7)中的代码,按照 ticketInfo 中的 PClass 分组并查看每个组内的幸存情况。请问下面哪个判断是正确的:
A . Pclass 为 1的分组中的前三名乘客全部生还
B . Pclass 为 2 的分组中,前 20 名乘客的生还率超过了 50%
C . Pclass 为 3 的分组中,前 5 名乘客无一生还
D . 三个分组中各自的 第 10 名乘客合计 2人生还
Q3:参考之前的学习内容,按照 passengerInfo 中的 Age 分组后(去掉 Age 为 None 的组),查看每个组的生存情况,将每个组的倒数第三位的数字设置为 -1 。请问最后一共加入了多少 -1
答案:
1.D
2.B
SELECT json_extract(ticketinfo,'$.Pclass') AS Pclass ,json_group_array(survived) ,json_extract(json_group_array(survived),'$[9]') FROM example_table_json GROUP BY json_extract(ticketinfo,'$.Pclass') ''' [(1, '[1,1,0,1,1,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,1,0,1,1,1,0,0,1,1,0,0,1,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,0,1,1,1,0,1,0,1,1,0,0,0,1,1,0,0,0,1,0,1,0,1,1,1,0,1,1,0,0,1,1,1,0,0,1,0,1,0,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,0,1,1,1,1,0,0,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,0,1,0,1,0,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1]', 0), (2, '[1,1,1,0,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,1,1,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,1,0,0,0,1,0,1,1,1,1,0,0,0,1,1,0,0,0,1,1,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,1,1,0,1,0,0,1,1,0,0,1,1,1,1,0,1,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,1,1,1,0,0]', 1), (3, '[0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,1,0,1,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,1,1,1,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,1,0,0,0,1,0,0,0,0,1,1,0,0,1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0]', 0)] '''
3.56
SELECT COUNT(DISTINCT AGE) FROM (SELECT json_extract(passengerInfo,'$.Age') AS AGE ,json_extract(json_group_array(survived),'$[#-3]') A FROM example_table_json WHERE json_extract(passengerInfo,'$.Age') IS NOT NULL GROUP BY AGE) WHERE A !=-1