数据分析 | SQL基础查询语句+例题详解

ZoraAvo

已于 2022-07-01 15:30:22 修改

阅读量7.3k

点赞数 2

分类专栏：数分基础文章标签：大数据 sql 数据分析阿里云

于 2022-03-20 16:00:55 首次发布

本文链接：https://blog.csdn.net/m0_46224483/article/details/123364570

版权

数分基础专栏收录该内容

8 篇文章 3 订阅

订阅专栏

SQL

语法结构：select--from--where--group by--having--order by--limit

运行顺序：

from--where--group by--having--select(窗口函数在此发生,并创建表头)--order by--limit--select

select 字段名1, 字段名2, 字段名3 from 表名

结果会根据select后面的字段顺序进行展示

（*）是通配符，即select* from world，就会按默认列顺序查询world表中所有列

select 字段名 as 别名（as可省略），可以将结果显示的表头展示为别名

select name as 国家名, continent 大洲, population 人口 from world

去重：即展示的数据不会重复出现

select distinct continent from world

在select distinct后加多个字段，即对字段组合形成的数据去重（理解为每一行都不会一模一样）

select distinct name,continent from world

select中还可以直接进行计算

select name, gdp, population, gdp/population 人均gdp from world;

select 字段名 from 表名 where 条件

select name,gdp/population 人均gdp
from world
where population>=200000000;

is null即是空值的，一般在数据库里显示为<null>

='null'，此处的null是个字符串，注意区分

select name, population from world
where name ='Sweden'
or name = 'Norway'
or name ='Denmark';

in会筛选出字段中所有与括号内数据相等的行

in多条件的逻辑是or

select name, population from world
where name in ('Sweden', 'Norway', 'Denmark');

通配符：

% 表示任何字段出现任意次数

_ 表示任何字符出现一次

模糊查询：where 字段名 like '通配符+字符'

查询国家名中以C开头以ia结尾的国家

/*查询国家名中以C开头以ia结尾的国家*/
select name from world
where name like 'C%ia' ;

/*查询国家名中第二个字符为t的国家*/
select name from world
where name like '_t%' ;

sql中，and的优先级高于or

/*查询国家名中含有三个a且面积大于60万(600000)的国家及其面积，
或者人口大于13亿(1300000000)且面积大于500万(5000000)的国家及其面积*/
select name, area from world
where (name like '%a%a%a%' and area > 600000)
or ( population > 1300000000
and area > 5000000 ) ;

between and 是包括边界的

/*查询1980年除诺贝尔化学奖和诺贝尔医学奖外其余奖项获奖者的所有信息*/
select *
from nobel
where yr = 1980
and subject not in ('Chemistry','Medicine') ;

select 字段名 from 表名 order by 字段名 asc/desc

order by 字段名 asc|desc 规定查询出的结果集显示的顺序

asc指定该字段升序排序，desc为降序排序，不写则默认为升序排序

/*查询姓名以Sir开头的获奖者(winner)，获奖年份(yr)和科目(subject)，
查询结果按照年份从近到远排序，再按照姓名顺序升序排序*/
select winner, yr, subject from nobel
where winner like 'Sir%'
order by yr desc,winner asc ；

/*查询1984年所有获奖者的姓名和奖项科目。
结果将诺贝尔化学奖和物理学奖排在最后，
然后按照科目排序，再按照获奖者姓名排序*/
select * from nobel
where yr = 1984
order by subject in ('Chemistry','Physics'),
subject, winner ;

select 字段名 from 表名 order by 字段名 asc/desc limit [位置偏移量,]行数

limit子句是可选项，行数是子句中的必选参数，参数位置偏移量是可选参数

order by 可以对数值进行排序，limit n返回前n条数据

limit x,n意味从x+1行开始返回n行

/*查询人口数第4到第7的国家和人口*/
select name, population from world
order by population desc
limit 3, 4 ;

limit & offset

1) selete * from testtable limit 2,1;

2) selete * from testtable limit 2 offset 1;

注意：

1.数据库数据计算是从0开始的

2.offset X是跳过X个数据，limit Y是选取Y个数据

3.limit X,Y 中X表示跳过X个数据，读取Y个数据

这两个都是能完成需要，但是他们之间是有区别的：

1)是从数据库中第三条开始查询，取一条数据，即第三条数据读取，一二条跳过

2)是从数据库中的第二条数据开始查询两条数据，即第二条和第三条。

聚合函数 & group by

group by的意义就是相当于创建了一个数据透视表，该透视表经过分组去重。然后方便根据数据透视表中选定的字段进行聚合运算。

group by 字段1, 字段2, 字段3 (顺序是有意义的）

sum() avg() max() min()

sum、avg、max、min这些函数必须指定字段进行聚合运算，无法使用*通配符，同时会忽略空值行

select sum(population) 人口总数
from world
where continent = 'Africa' ;

计算并显示表格总行数 select count(*) from world ;

/*查询每个大洲（continent）和大洲内的国家（name）数量*/
select continent, count (name)
from world 
group by continent ;

逻辑：1.将国家按照大洲进行分区；2.根据分区进行去重分组，即剩下只有单独的各大洲；以上就是group by的运行。接下来运行select。3.select出continent，再将分区内的多行数据聚合计算成一行数据；在此题即count计算每个分区内的name个数

可以理解为group by就是创建了一张数据透视表

/*查询2013至2015年每年每个科目的获奖人数，
结果按年份从大到小，人数从大到小排序*/
select yr, subject, count (winner) 获奖人数
from nobel
where yr between 2013 and 2015
group by yr, subject 
order by yr desc, count (winner) desc;

使用group by子句时，select只能使用聚合函数和group by引用过的字段，否则会报错

/*查询每个大洲和该大洲里人口数超过1千万的国家的数量*/
select continent, count (name) 'num of countries'
from world
where population > 10000000
group by continent ;

having

having 表达式限定分组聚合后的查询行必须满足的条件

having核心子句是可选项，使用该子句是为了对group by分组后的数据进行筛选

/*查询总人口数量至少为1亿（100000000）的大洲*/
select continent, sum (population)
from world
group by continent
having sum (population) >= 100000000 ;

where后不能跟聚合函数，但having可以

/*查询总人口数至少为3亿的大洲和其平均gdp，
其中只有gdp高于200亿且人口数大于6000万
或者gdp低于80亿且首都中含有三个a的国家的计入计算，
最后按国家数从大到小排序，只显示第一行*/
select continent, avg (gdp)
from world
where ( gdp > 20000000000 and population > 60000000 )
or ( gdp < 8000000000 and capital like '%a%a%a%' )
group by continent 
having sum (population) >= 300000000
order by count (name) desc
limit 1 ;

sql运行原理：

from--where--group by--having--order by--limit--select

执行from语句从数据库中调取复制一份表格

执行where语句在复制的表格中筛选出符合条件的数据行

执行group by语句依据指定字段对筛选后的数据分区，将依据的字段去重分组，相当于Excel建立了一个数据透视表，添加了行标签

执行having语句筛选满足条件的分组

执行order by语句对筛选后的数据进行排序

执行limit语句对排序后的数据限制显示的行

执行select语句，提取最后要显示的字段

其它常见函数

round (x,y)

round函数对x值进行四舍五入，精确到小数点后y位

y为负值时，保留小数点左边相应的位数为0，不进行四舍五入

例如：round(3.15,1)返回3.2，round(14.15,-1)返回10

concat (s1, s2, ...)

concat函数返回连接参数s1、s2等产生的字符串

concat内可以放字段名

任一参数为null时，则返回null

例如：concat( 'My', ' ', 'SQL' )返回My SQL，concat('My',null,'SQL')返回null

/*查询首都和名称，其中首都需是国家名称的扩展
例如：答案中应该包括墨西哥城（Mexico City），因为它比墨西哥（Mexico）更长，
而不应该将卢森堡（Luxembourg）包括在内，因为首都名与国家名相同*/
select name, capital
from world
where capital like concat ('%', name, '%') 
and name != capital ;

replace (s, s1, s2)

replace函数使用字符串s2代替s中所有的s1

例如：replace ( 'MySQLMySQL', 'SQL', 'sql' )返回MysqlMysql

left (s, n) right (s, n) substring (s, n, len)

left函数返回字符串s最左边n个字符

right函数返回字符串s最右边n个字符

substring函数返回字符串s从第n个字符起取长度为len的子字符串，n也可以为负值，则从倒数第n个字符起取长度为len的子字符串，没有len值则取从第n个字符起到最后一位

例如：left('abcdefg',3)返回abc，right('abcdefg',3)返回efg，substring('abcdefg',2,3)返回bcd，substring('abcdefg',-2,3)返回fg，substring('abcdefg',2)返回bcdefg

/*查询国家名称及其首都名称都以相同的字母开头的国家名及其首都，
且不能包括国家名称和首都名称完全相同的情况*/
select name, capital
from world
where left (capital,1) = left (name, 1)
and capital != name ;

cast (x as type)

cast函数将一个类型的x值转换为另一个类型的值

type参数可以填写char(n)、date、time、datetime、decimal等转换为对应的数据类型

year (date) month (date) day (date)

date可以是年月日组成的日期，也可以是年月日时分秒组成的日期时间

year(date)返回日期格式中的年份

month(date)返回日期格式中的月份

day(date)返回年日期格式中的日份

例如：year('2021-08-03')返回2021，month('2021-08-03')返回8，day('2021-08-03')返回3

date_add (date, interval expr type) date_sub (date, interval expr type)

对指定起始时间进行加减操作

date用来指定起始时间,可以是年月日组成的日期，也可以是年月日时分秒组成的日期时间

expr用来指定从起始时间添加或减去的时间间隔

type指示expr被解释的方式，type可以可以是以下值

date_add函数对起始时间进行加操作，date_sub函数对起始时间进行减操作

例如：date_add('2021-08-03 23:59:59',interval 1 second)返回2021-08-04 24:00:00，date_sub('2021-08-03 23:59:59',interval 2 month)返回2021-06-03 23:59:59

datediff (date1, date2)

计算两个日期之间间隔的天数

datediff函数由date1-date2计算出间隔的时间，只有date的日期部分参与计算，时间不参与

例如：datediff('2021-06-08','2021-06-01')返回7，datediff('2021-06-08 23:59:59','2021-06-01 21:00:00')返回7，datediff('2021-06-01','2021-06-08')返回-7

date_format (date, format)

将日期和时间格式化

例如：date_format('2018-06-01 16:23:12','%b %d %Y %h:%i %p')返回Jun 01 2018 04:23 PM，date_format('2018-06-01 16:23:12','%Y/%d/%m')返回2018/01/06

if (expr, v1, v2)

如果表达式expr是true返回值v1，否则返回v2

例如：if(1<2,'Y','N')返回Y，if(1>2,'Y','N')返回N

case expr when v1 then r1 [when v2 then r2] ...[else rn] end

例如：case when 1<0 then 'T' else 'F' end 返回F

1<0的结果为false，因此函数返回值为else后面的F

窗口函数 over( [partition by 字段名] [order by 字段名 asc|desc] )

over()中两个子句为可选项，partition by指定分区依据，order by指定排序依据

窗口函数只能写在select子句中

partition by只分区，不去重

窗口函数中没有partition by子句时，不对数据分区，直接整个表为一个区

order by可以接窗口函数

【排序窗口函数】

排序窗口函数中，order by子句是必选项

排序窗口函数中，order by子句在分区内依据指定字段和排序方法对数据排序

rank()over()

跳跃式排序，比如数值为99、99、90、89，那么得到排名1、1、3、4

dense_rank()over()

并列连续型排序，比如数值为99、99、90、89，那么得到排名1、1、2、3

row_number()over()

连续型排序，比如数值为99、99、90、89，那么得到排名1、2、3、4

/*查询每一年S14000021选区中所有候选人所在的团体(party)和得票数(votes)，
并对每一年中的所有候选人根据选票数的高低赋予名次，
选票数最高则为1，第二名则为2，后续以此类推，最后根据团体(party)和年份(yr)排序*/
select yr, party, votes,
rank()over(partition by yr order by votes desc) as posn
from ge
where constituency = 'S14000021'
order by party, yr ;

/*查询2017年选区为 'S14000024' 的所有候选人所在团体（party）和其选票数（votes）、
还有候选人得票数在选区内对应的的排名。结果按团队party排序*/
select party, votes,
rank () over (partition by constituency order by votes desc) as posn
from ge
where yr = 2017 and constituency = 'S14000024'
order by party ;

【偏移分析函数】

lag (字段名, 偏移量[,默认值]) over() 想要的数据往上偏移

lead (字段名,偏移量[,默认值]) over()

/*查询法国和德国1月每天新增确诊人数，
最后显示国家名、标准日期（2020-01-27）、当天截至时间累计确诊人数、
昨天截至时间累计确诊人数、每天新增确诊人数，按照截至时间排序*/
SET @@sql_mode='ANSI';
select name,
date_format (whn,'%Y-%m-%d')日期,
confirmed 截止今天确诊数,
lag (confirmed,1) over(partition by name order by whn)截止昨天确诊数,
(confirmed - lag (confirmed,1)over(partition by name order by whn))今日新增数
from covid
where name in ('France','Germany') and month(whn) = 1 
order by whn ;

SET @@sql_mode='ANSI';
select name, date_format (whn,'%Y-%m-%d')date,
(confirmed -lag(confirmed,1)over(partition by name order by whn))
from covid
where name = 'Italy' and weekday(whn) = 0
order by whn ;

表连接

表连接的本质，就是表1的一行去匹配表2，找到了匹配的就生成，然后继续检索，直到表2所有能对应的都产生新行了。然后表1的下一行重新开始对表2的检索，如此循环往复。因此表连接可能产生更多的行。

where a.xx is null

【内连接】

select 字段名

from 表名1 inner join 表名2 on 表名1.字段名 = 表名2.字段名

注意内连接inner可以省略，直接使用join默认为内连接

两个表但凡有null值，即没连接上，就会去除，即只保留连接上的数据

【左连接】

select 字段名

from 表名1 left join 表名2 on 表名1.字段名 = 表名2.字段名

左边的表的所有行都保留，右边表没匹配上的数据全给删掉

【右连接】

select 字段名

from 表名1 right join 表名2 on 表名1.字段名 = 表名2.字段名

多表连接时，若字段名不唯一，要指定字段的表名（即表名.字段名）

join例1：

/*查询至少出演过第1主角30次的演员名*/
select name
from actor join casting on actor.id = casting.actorid
where ord = 1
group by name
having count(movieid) >= 30

join例2：

/*查询每场比赛，每个球队的得分情况，按照以下格式显示*/
select mdate, 
team1, 
sum(case when teamid = team1 then 1 else 0 end) score1,
team2,
sum(case when teamid = team2 then 1 else 0 end) score2
from game join goal on game.id = goal.matchid
group by mdate, team1, team2 ;

三表连接实例（将同个表连接两次）：

select e.emp_no, (a.salary - b.salary) growth
from employees e join salaries a on e.emp_no = a.emp_no
and to_date = '9999-01-01'
join salaries b on e.emp_no = b.emp_no
and e.hire_date = b.from_date
order by growth

子查询

子查询优先与主查询运行

/*查询跟阿尔及尼亚（Argentina）和澳大利亚（Australia）
在同一大洲的所有国家名及其所属大洲， 并按照国家名进行排序*/
select
name,continent
from world
where continent in (
select continent
from world
where name in ('Argentina','Australia')
)
order by name ；

若子查询在from里，需要给该子查询命名一个别名，告诉电脑这是新的什么名字的表格

/*已知爱丁堡选区编号为S14000021至S14000026，
查询2017年所有在爱丁堡选区内票数最高的议员
的获得票数、所在的选区及其党派*/
select party ,constituency, votes
from(
select party, constituency, votes,
rank() over(partition by constituency order by votes desc) position
from ge
where yr = '2017' 
and constituency between 'S14000021' and 'S14000026'
) as rankk
where rankk.position = 1 ;

/*查询在欧洲(Europe)人均gdp大于英国 (United Kingdom)的国家名*/
select name
from world
where continent = 'Europe'
and (gdp/population) > (
select gdp/population pcGDP
from world
where name = 'United Kingdom'
)

/*查询所有国家人口均≤25000000的大洲，及其国家名（name）和人口（population）*/
select name, continent, population
from world
where continent not in (
select distinct continent
from world
where population > 25000000 ) ;

此例有两种答案：

/*查找每个大陆（continent）中最大的国家（按区域area），
显示该大洲（continent），国家名（name）和面积（area）*/
select name, continent, area
from world
where (continent,area) in (
select continent, max(area)
from world
group by continent ) ;

/*查找每个大陆（continent）中最大的国家（按区域area），
显示该大洲（continent），国家名（name）和面积（area）*/
select name, continent, area
from(
select name, continent, area,
rank() over(partition by continent order by area desc) arank
from world ) as n
where n.arank = 1 ；

sql云端数据库（datagrip）

describe ddm.shop 描述ddm数据库的shop表

在datagrip中导出result为csv格式，在excel中打开可能会显示乱码。解决方法：用记事本打开该csv文件-在记事本中另存为-编码格式改为ANSI。再重新用excel打开即可正常显示。

场景实例：

/*场景：运营需要查看旗下所有品牌和门店在2019年12月1日至7日在美团上的GMV和下单人数*/
select 日期,  品牌名称, 门店名称,GMV, 下单人数
from ddm.shop
where 日期 between '2019-12-01' and '2019-12-07'
and 平台 = 'meituan' ;

/*场景：查询所有门店每天的GMV和CPC消耗*/
select shop.日期, 门店名称, GMV, cpc总费用
from ddm.shop shop join ddm.cpc cpc
    on shop.日期 = cpc.日期
    and shop.门店ID = cpc.门店ID ;

/*场景：业务提新需求了，要看旗下所有品牌各门店12月1日至7日期间在所有平台上的总GMV和总下单人数*/
select 品牌名称, 门店名称, sum(GMV), sum(下单人数)
from ddm.shop
where 日期 between '2019.12.01' and '2019.12.07'
group by 品牌名称, 门店名称 ;

/*业务看了眼刚才的数据，表示只想看累计GMV在3万以上，并且下单人数在200人以上的门店*/
select 品牌名称, 门店名称, sum(GMV), sum(下单人数)
from ddm.shop
where 日期 between '2019.12.01' and '2019.12.07'
group by 品牌名称, 门店名称
having sum(GMV) > 30000 and sum(下单人数) > 200 ;

/*场景：查询2020年饿了么平台上每个门店GMV最高那天的日期和GMV*/
select 门店名称, 日期, GMV
from (
select 门店名称, 日期, GMV,
       rank() over (partition by 门店名称 order by GMV desc ) position
from ddm.shop
where year(日期) = '2020'
and 平台 = 'eleme' ) as gmvrank
where gmvrank.position = 1 ;

Excel连接数据库

安装Mysql驱动

依次安装课程【Excel&Tableau数据库连接驱动下】文件夹下的驱动文件

先安装VC_redist.x64.exe

再安装mysql-connector-odbc-8.0.26-winx64.msi

最后安装mysql-connector-odbc-8.0.26-winx32.msi

【查看Excel是32位还是64位】

【配置ODBC】

控制面板>>>系统和安全>>>管理工具>>>ODBC数据源64位（如果Excel是32位，则配置ODBC数据源32位）

用户DSN>>>添加>>>选择"MySQL ODBC 8.0 Unicode Driver"

填写参数，IP Sever就是申请的外网地址。database可以不写

点击test进行测试

【Excel使用ODBC从MySQL获取数据】

打开Excel

数据选项卡>>>获取数据>>>自其他源>>>从ODBC

数据源名称选择配置ODBC时写的名称（这里为mysqldata）

点击高级选项，输入SQL代码来指定读取的数据

这里输入代码，获取整个数据表中的数据（不输入SQL代码会报错）

select * from ddm.shop

可能需要再次输入数据库账号密码，输入后点击连接

点击加载，从MySQL数据库中获取数据到Excel

【基于数据库获取的数据制作图表】

日期GMV折线图

sum等函数引用整列后，可以基于刷新后的数据进行计算

真实业务场景下，基于连接数据库的原表制作自动化报表，只需要刷新数据就可以完成报表的更新

Tableau连接数据库

Mysql驱动已经在Excel部分安装完毕

直接选择连接到服务器，选择Mysql，填写数据库参数

可以拖拽数据库，也可以写自定义SQL

ZoraAvo

关注

2
点赞
踩
34

收藏

觉得还不错? 一键收藏
2
评论
数据分析 | SQL基础查询语句+例题详解

数分系统知识目的：及时发现异常，找到数据之间的因果关系一、观测——通过技术手段获取数据，并对数据进行分析和测量，即获取数据制作报表、图表、仪表盘观察：采集、储存、展示数据【采集数据】1.解析系统日志2.埋点3.通过传感器采集4.爬虫解析别人的网站，抓取别人的数据5.API(Application Programming Interface)【储存数据】各种数据库，hive, MySQL, PostgreSQL, SQLServer, Prest
复制链接

扫一扫