常用SQL总结

꧁꫞ND꫞꧂

已于 2023-03-24 12:38:25 修改

阅读量729

点赞数 1

分类专栏： hive spark 文章标签： sql 数据库

于 2022-07-01 16:45:59 首次发布

本文链接：https://blog.csdn.net/Baron_ND/article/details/125561699

版权

spark 同时被 2 个专栏收录

11 篇文章 0 订阅

订阅专栏

hive

7 篇文章 1 订阅

订阅专栏

1.时间函数

当前日期	select current_date();	2022-06-26
当前日期-加	select date_add(current_date(),1);	2022-06-27
当前日期-减	select date_sub(current_date(),1);	2022-06-25
指定日期-加	select date_add('2022-06-06',1);	2022-06-07
当前日期-自定义格式	select date_format(current_date(),'yyyyMMdd');	20220626
当前日期-减-自定义格式	select date_format(date_sub(current_date(),1), 'yyyy-MM-dd');	2022-06-25
开始日期减去结束日期的天数	select datediff('2022-06-26','2022-06-1');	25

当年第一天	select trunc(current_date,'YY');	2022-01-01
当月第一天	select trunc(current_date,'MM');	2022-06-01

当前的日期和时间	select now();	2022-06-22 19:25:29.893
当前的日期和时间	select CURRENT_TIMESTAMP();	2022-06-22 19:25:49.84
当前的日期和时间-自定义格式	select date_format(CURRENT_TIMESTAMP(), 'yyyy-MM-dd HH:mm:ss');	2022-06-22 19:25:49.84
当前的日期和时间-自定义格式	select date_format(CURRENT_TIMESTAMP(), 'yyyy-MM-dd');	2022-06-22
当前的日期和时间-转化为日期	select to_date(CURRENT_TIMESTAMP());	2022-06-26
日期时间转日期	select to_date('2022-06-26 11:22:33');	2022-06-26

获得当前时区的UNIX时间戳	select unix_timestamp();	1655897243
将当前时间戳转换为UTC时间	select from_unixtime(unix_timestamp());	2022-06-22 19:28:35
将时间戳转化为UTC时间	select from_unixtime(188888888);	1975-12-27 13:08:08
时间戳的自定义格式转化	select from_unixtime(188888888,'yyyy-MM-dd');	1975-12-27
	select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss');	2022-06-22 19:19:44
	select from_unixtime(unix_timestamp(),'yyyyMMdd');	20220622
	select from_unixtime(unix_timestamp() - 86400 ,'yyyyMM01');	20220601

获取日期的年	select year('2022-06-26 11:22:33');	2022
获取日期的月	select month('2022-06-26 11:22:33');	6
获取日期的天	select day('2022-06-26 11:22:33');	26
获取日期的小时	select hour('2022-06-26 11:22:33');	11
获取日期的分钟	select minute('2022-06-26 11:22:33');	22
获取日期的秒	select second('2022-06-26 11:22:33');	33

2.字符串处理

字符串连接函数	select concat('abc','def','gh');	abcdefgh
带分隔符字符串连接函数	select concat_ws(',','abc','def','gh');	abc,def,gh

字符串截取函数-用法1	select substr('123456789',3);	3456789
字符串截取函数-用法1	select substring('123456789',3);	3456789
字符串截取函数-用法2	select substr('123456789',3,4);	3456
字符串截取函数-用法2	select substring('123456789',3,4);	3456

字符串长度函数	select length('abcedfg');	7
字符串反转函数	select reverse('abcedfg');	gfdecba

去除字符串两边的空格	select trim(' abc ');	abc
去除字符串左边的空格	select ltrim(' abc ');	abc
去除字符串右边的空格	select rtrim(' abc ');	abc

字符串转大写函数	select upper('abCD');	ABCD
字符串转大写函数	select ucase('abCD');	ABCD
字符串转小写函数	select lower('abCD');	abcd
字符串转小写函数	select lcase('abCD');	abcd

类型转换（遇到类型不兼容报错的，可以转化一下）	select cast(123 as string);
类型转换（遇到类型不兼容报错的，可以转化一下）	select cast('123' as int);

3.开窗函数

1.开窗函数简介

与聚合函数一样，开窗函数也是对行集组进行聚合计算，但是它不像普通聚合函数那样每组只返回一个值，开窗函数可以为每组返回多个值，因为开窗函数所执行聚合计算的行集组是窗口。

2.为什么需要开窗函数

在开窗函数出现之前存在着很多用 SQL 语句很难解决的问题，很多都要通过复杂的相关子查询或者存储过程来完成。为了解决这些问题，在 2003 年 ISO SQL 标准加入了开窗函数，开窗函数的使用使得这些经典的难题可以被轻松的解决。

分数表-创建临时表的sql

with score_table as(

SELECT

1 AS class_id,

1001 AS student_id,

90 AS score

UNION

SELECT

1 AS class_id,

1002 AS student_id,

100 AS score

UNION

SELECT

1 AS class_id,

1003 AS student_id,

100 AS score

UNION

SELECT

2 AS class_id,

1004 AS student_id,

100 AS score

UNION

SELECT

2 AS class_id,

1005 AS student_id,

80 AS score

)

select class_id,student_id,score from score_table order by 1,3;

分数表：

class_id	student_id	score
1	1001	90
1	1002	100
1	1003	100
2	1005	80
2	1004	100

求每个班级的的学生人数与平均分

使用group by

在每条分数数据上，加上本班级的平均分等

1.使用group by求出班级平均分，然后再和分数表进行关联

2.聚合开窗函数：

每个学生在班级内的排名

排序开窗函数

3.开窗函数语法

开窗函数格式：函数名(列) OVER(选项)

例如：

row_number() over(partition by ... order by ...)

rank() over(partition by ... order by ...)

dense_rank() over(partition by ... order by ...)

count() over(partition by ... order by ...)

max() over(partition by ...)

min() over(partition by ...)

sum() over(partition by ...)

avg() over(partition by ...)

first_value() over(partition by ... order by ...)

last_value() over(partition by ... order by ...)

lag() over(partition by ... order by ...)

lead() over(partition by ... order by ...）

OVER 关键字表示把函数当成开窗函数而不是聚合函数。

PARTITION BY 子句：

开窗函数的 OVER 关键字后括号中的可以使用 PARTITION BY 子句来定义行的分区来供进行聚合计算。与 GROUP BY 子句不同，PARTITION BY 子句创建的分区是独立于结果集的，创建的分区只是供进行聚合计算的，而且不同的开窗函数所创建的分区也不互相影响

ORDER BY子句：

开窗函数中可以在OVER关键字后的选项中使用ORDER BY子句来指定排序规则，而且有的开窗函数还要求必须指定排序规则。使用ORDER BY子句可以对结果集按照指定的排序规则进行排序，并且在一个指定的范围内进行聚合运算

select

name,

score,

sum(score) over(order by score range between 2 preceding and 2 following) s1, -- 当前行的score值加减2的范围内的所有行

sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- 当前行+前后2行，一共5行

sum(score) over(partition by name order by score range between unbounded preceding and unbounded following) s3, -- 全部行，不做限制

sum(score) over(partition by name order by score rows between unbounded preceding and unbounded following) s4, -- 全部行，不做限制

sum(score) over(order by score) s5, -- 第一行到当前行（和当前行相同score值的所有行都会包含进去）

sum(score) over(partition by name order by score rows between unbounded preceding and current row) s6, -- 第一行到当前行（和当前行相同score值的其他行不会包含进去，这是和上面的区别）

sum(score) over(partition by name order by score rows between 3 preceding and current row) s7, -- 当前行+往前3行

sum(score) over(partition by name order by score rows between 3 preceding and 1 following) s8, --当前行+往前3行+往后1行

sum(score) over(partition by name order by score rows between current row and unbounded following) s9 --当前行+往后所有行

from

stu_score

order by

score;

4.常用的开窗函数

-- 聚合开窗函数

count(); -- 窗口内总条数

sum(); -- 窗口内数据的和

min(); -- 窗口内最小值

max(); -- 窗口内最大值

avg(); -- 窗口内的平均值

-- 排序开窗函数

row_number(); -- 从1开始，按照顺序，生成分组内记录的序列

rank(); -- 生成数据项在分组中的排名，排名相等会在名次中留下空位

dense_rank(); -- 生成数据项在分组中的排名，排名相等会在名次中不会留下空位

ntile(n); -- 将分区中已排序的行划分为大小尽可能相等的指定数量的排名的组，并返回给定行所在的组的排名。

percent_rank(); -- 计算给定行的百分比排名。可以用来计算超过了百分之多少的人。(当前行的rank值-1)/(分组内的总行数-1),如360小助手开机速度超过了百分之多少的人。

cume_dist(); -- 计算某个窗口或分区中某个值的累积分布。假定升序排序，则使用以下公式确定累积分布：小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中，x 等于 order by 子句中指定的列的当前行中的值。

-- 其他窗口函数

FIRST_VALUE(); -- 返回分区中的第一个值。

LAST_VALUE(); -- 返回分区中的最后一个值。

LAG(col,n,default); -- 用于统计窗口内往上第n个值。

LEAD(col,n,default); -- 用于统计窗口内往下第n个值。

4.行列转换

列转行

EXPLODE(col)：将hive一列中复杂的array或者map结构拆分成多行。

LATERAL VIEW

用法：LATERAL VIEW udtf(expression) tableAlias AS columnAlias

解释：用于和split, explode等UDTF一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合

select

name,

cast(bind_id as int) as province_id

from

(

select

name,

split(regexp_extract(sale_region, '^\\\[(.*)]$', 1), ',') as sale_region

from info.ods_xmbuyn_xm_branch_company

) t lateral view explode(t.sale_region) myTable as bind_id;

结果对比：

--查询sale_region：

+---------+--------------+--+

| name | sale_region |

+---------+--------------+--+

| 安徽分公司 | [13] |

| 福建分公司 | [14] |

| 甘肃分公司 | [29,30,31] |

| 广东分公司 | [20,22] |

--转换后结果：

+---------+--------------+--+

| name | province_id |

+---------+--------------+--+

| 安徽分公司 | 13 |

| 福建分公司 | 14 |

| 甘肃分公司 | 29 |

| 甘肃分公司 | 30 |

| 甘肃分公司 | 31 |

| 广东分公司 | 20 |

| 广东分公司 | 22 |

行转列

CONCAT_WS(separator, str1, str2,...)：它是一个特殊形式的 CONCAT()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL，返回值也将为 NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间;

COLLECT_SET(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生array类型字段，可以搭配group合并同一组内数据；

COLLECT_LIST(col)：他和COLLECT_SET不同的是不会对数据去重。

select

t.base,

concat_ws('|', collect_set(t1.name)) name

from

(select

name,

concat(星座, ",", 血型) base

from person_info) t

group by

t.base;

结果对比：相当于列转行反向转换

5.可读性类

with

with... as...这个语法会将查询的结果集保存在内存中，不重复查询；

with子句必须在引用的select语句之前定义,同级with关键字只能使用一次,多个只能用逗号分割；最后一个with 子句与下面的查询之间不能有逗号，只通过右括号分割,with 子句的查询必须用括号括起来.

with table_a as (

select

city_id,

city_name

from table_a

),

table_b as (

select * from table_c

)

select * from table_b limit 22;

with... as... 也是可以嵌套使用的

with table_a as (

with tmp_a as (

select

city_id,

city_name

from table_a

)

select * from tmp_a

),

table_b as (

select * from table_c

)

select * from table_b limit 22;

cache

spark语法，将查出来的结果缓存到内存中，后续多次使用

cache table c_sku as select * from tmp_a；

...

uncache table c_sku;

temporary view

spark语法，如果在整个会话中多次使用某次的查询结果的话，可以把该次查询结果创建为临时视图，方便后续使用。保存的仅仅是一段查询语句的逻辑，而不是查询的结果集，使用一次就触发一次查询，如果逻辑复杂，不建议使用。

create temporary view t1 as

select

trade_date,

sec_code,

sec_name

from tra_stock_info_ss;

6.其他技巧

union/union all:在查询几个相同结构数据源的时候，可以使用union/union all 来合并结果集，他们都会按默认的字段排序，所以查询语句要求同序列的字段顺序相同；union：相同的记录合并去重返回，union all不会去重；
Case when : 一旦满足了某一个WHEN, 则这一条数据就会退出CASE WHEN , 而不再考虑其他CASE;
distinct/group by:都可以实现去重的效果，但是查询多个字段的去重指标的时候，建议使用group by，distinct 是针对要查询的全部字段去重，而 group by 可以针对要查询的全部字段中的部分字段去重，它的作用主要是：获取数据表中以分组字段为依据的其他统计数据。
coalesce( value1,value2,… ) 取第一个非null数据
if( value1 is null, value2, value1) 有妙用，另类列转行

-- T+1的分公司销量

select 分公司,

sum(销售额) 月销售额,

sum(if substr(cast(ymd as string),7,2)="01",销售额,0) 销售额01,

sum(if substr(cast(ymd as string),7,2)="02",销售额,0) 销售额02,

.....

sum(if substr(cast(ymd as string),7,2)="31",销售额,0) 销售额31,

from xxx

where ymd <= from_unixtime(unix_timestamp() - 86400,'yyyyMMdd') -- 昨日

and ymd >= from_unixtime(unix_timestamp() - 86400,'yyyyMM01') -- 昨日的月

group by 1 order by 2 desc;

....

7.查询引擎的选取

默认presto会比较快，预发略有差别

大表presto查不出来的，选用spark引擎查询

很大搞不定的，考虑写个spark任务把数据计算出来落到一个临时表里面

8.SQL调优

大表最好是限定条件过滤条件，inner join等来过滤数据

꧁꫞ND꫞꧂

关注

1
点赞
踩
4

收藏

觉得还不错? 一键收藏
打赏
0
评论
常用SQL总结

1.时间函数当前日期select current_date();2022-06-26当前日期-加select date_add(current_date(),1);2022-06-27当前日期-减select date_sub(current_date(),1);2022-06-25指定日期-加select date_add('2022-06-06',1);2022-06-07当前日期-自定义格式select date_format(current_date(),'yyyyMMdd');20220626当前
复制链接

扫一扫