自己总结sql用法

最新推荐文章于 2023-04-18 17:04:14 发布

SmartShylyBoy

最新推荐文章于 2023-04-18 17:04:14 发布

阅读量529

点赞数

分类专栏： hive 文章标签： sql hive sql hql hql with as distinct

本文链接：https://blog.csdn.net/SmartShylyBoy/article/details/84819585

版权

hive 专栏收录该内容

3 篇文章 0 订阅

订阅专栏

**
一篇超级好的sql执行文档
https://www.cnblogs.com/cdf-opensource-007/p/6502556.html

- 2018/12/05 with as 用法

一、hivesql中（注意：mysql不支持），with as语句的作用是相当于创建了一个中间表，加载到内存中，这样在后续的使用中极大的提高速度（不用建表，insert数据到中间表；加载内存，使用数据速度快）。
二、使用
WITH t1 AS (
SELECT *
FROM carinfo
),
t2 AS (
SELECT *
FROM car_blacklist
)
SELECT *
FROM t1, t2

注意：这里必须要整体作为一条sql查询，即with as语句后不能加分号，不然会报错。

三、注意事项

with子句必须在引用的select语句之前定义,同级with关键字只能使用一次,多个只能用逗号分割；最后一个with 子句与下面的查询之间不能有逗号，只通过右括号分割,with 子句的查询必须用括号括起来.

以下写法会报错：

with t1 as (select * from carinfo)
with t2 as (select * from car_blacklist)
select * from t1,t2
1
2
3
with t1 as (select * from carinfo);
select * from t1
1
2
2.如果定义了with子句，但其后没有跟select查询，则会报错！

以下写法会报错：

with t1 as (select * from carinfo)
1
正确写法（没有使用 t1没关系，其后有select就行）：

with t1 as (select * from carinfo)
select * from carinfo
1
2
3.前面的with子句定义的查询在后面的with子句中可以使用。但是一个with子句内部不能嵌套with子句！

正确写法：

with t1 as (select * from carinfo),
t2 as (select t1.id from t1)
select * from t2

- 2018/12/13 使用group up代替distinct

近期我司大数据部门事故频发，distinct一张表的四万分区的表，zookeeper产生四万余lock，大数据部门查的是使用distinct，产生严重的数据倾斜。
Hive去重统计
相信使用Hive的人平时会经常用到去重统计之类的吧，但是好像平时很少关注这个去重的性能问题，但是当一个表的数据量非常大的时候，会发现一个简单的count(distinct order_no)这种语句跑的特别慢，和直接运行count(order_no)的时间差了很多，于是研究了一下。
先说结论:能使用group by代替distinc就不要使用distinct，例子：

实际论证
order_snap为订单的快照表总记录条数763191489，即将近8亿条记录,总大小:108.877GB,存储的是公司所有的订单信息，表的字段大概有20个,其中订单号是没有重复的,所以在统计总共有多少订单号的时候去重不去重结果都一样，我们来看看:
统计所有的订单有多少条条数，一个count函数就可以搞定的sql性能如何。

DISTINCT
1
2
3
4
5
6
7
select count(distinct order_no) from order_snap;
Stage-Stage-1: Map: 396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 2 hours 11 minutes 55 seconds 670 msec
OK
_c0
763191489
Time taken: 1818.864 seconds, Fetched: 1 row(s)
GROUP BY
1
2
3
4
5
6
7
8
select count(t.order_no) from (select order_no from order_snap group by order_no) t;
Stage-Stage-1: Map: 396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS
Stage-Stage-2: Map: 177 Reduce: 1 Cumulative CPU: 280.22 sec HDFS Read: 472596 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 16 seconds 920 msec
OK
_c0
763191489
Time taken: 244.192 seconds, Fetched: 1 row(s)
结论:第二种写法的性能是第一种的7.448499541倍
注意到为什么会有这个差异，Hadoop其实就是处理大数据的，Hive并不怕数据有多大，怕的就是数据倾斜,我们看看两者的输出信息:

1
2
3
4

distinct

Stage-Stage-1: Map: 396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS

group by

Stage-Stage-1: Map: 396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS
发现猫腻了没有，使用distinct会将所有的order_no都shuffle到一个reducer里面，这就是我们所说的数据倾斜，都倾斜到一个reducer这样性能能不低么？再看第二个，直接按订单号分组，起了457个reducer，将数据分布到多台机器上执行，时间当然快了.
由于没有手动指定Reduce的个数，Hive会根据数据的大小动态的指定Reduce大小，你也可以手动指定

hive> set mapred.reduce.tasks=100；

类似这样,所以如果数据量特别大的情况下，尽量不要使用distinct吧。
但是如果你想在一条语句里看总记录条数以及去重之后的记录条数，那没有办法过滤，所以你有两个选择，要么使用两个sql语句分别跑，然后union all或者就使用普通的distinct。具体来说得看具体情况，直接使用distinct可读性好，数据量如果不大的话推荐使用，如果数据太大了，性能受到影响了，再考虑优化

-2018/12/14 join的用法（相比union他是横向扩展，union 是纵向做加法，也就是说join后的展示是多了列，union是多了行）

数据库在通过连接两张或多张表来返回记录时，都会生成一张中间的临时表，然后再将这张临时表返回给用户。

  在使用left jion时，on和where条件的区别如下：
  1、 on条件是在生成临时表时使用的条件，它不管on中的条件是否为真，都会返回左边表中的记录。
  2、where条件是在临时表生成好后，再对临时表进行过滤的条件。这时已经没有left join的含义（必须返回左边表的记录）了，条件不为真的就全部过滤掉。

有时为了得到完整的结果，我们需要从两个或更多的表中获取结果。我们就需要执行 join。
join（也就是inner join）
写法：
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons, Orders
WHERE Persons.Id_P = Orders.Id_P
或者
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons
INNER JOIN Orders
ON Persons.Id_P = Orders.Id_P
ORDER BY Persons.LastName

除了我们在上面的例子中使用的 INNER JOIN（内连接），我们还可以使用其他几种连接。

下面列出了您可以使用的 JOIN 类型，以及它们之间的差异。

JOIN: 如果表中有至少一个匹配，则返回行
LEFT JOIN: 即使右表中没有匹配，也从左表返回所有的行
RIGHT JOIN: 即使左表中没有匹配，也从右表返回所有的行
FULL JOIN: 只要其中一个表中存在匹配，就返回行

不管是什么join只要不写on或者on的条件出问题，比如都是null会导致产生笛卡儿积，产生爆炸数据量。

-2018/12/14 union和union all用法

SQL UNION 操作符
UNION 操作符用于合并两个或多个 SELECT 语句的结果集。

请注意，UNION 内部的 SELECT 语句必须拥有相同数量的列。列也必须拥有相似的数据类型。同时，每条 SELECT 语句中的列的顺序必须相同。

SQL UNION 语法
SELECT column_name(s) FROM table_name1
UNION
SELECT column_name(s) FROM table_name2
注释：默认地，UNION 操作符选取不同的值。如果允许重复的值，请使用 UNION ALL。

SQL UNION ALL 语法
SELECT column_name(s) FROM table_name1
UNION ALL
SELECT column_name(s) FROM table_name2
另外，UNION 结果集中的列名总是等于 UNION 中第一个 SELECT 语句中的列名。

2019\3\7 case when 用法

例如 http://www.cnblogs.com/qingyunzong/p/8747656.html 中，第二个案例。
涉及到纵向比较的时候，先行转列，横向比较比较简单。设计case when 的用法，是每一行开始遍历 case XXX when XXX 生成一行，一个学号有几个课程就有几行，再用group By 加 sid max(shuxue),max(yuwen) 合成一行，再作对比！！

2019\3\7 行转列不止只有case when 还有explode（）用法，行转列可用collect_set ()[] 和 collect_list()[] 和

https://www.cnblogs.com/cc11001100/p/9043946.html

collect_list()不去重，collect_set() 去重，他们一般会配合group
By 函数【】表示这个数组的第几位

https://www.cnblogs.com/linehrr-freehacker/p/3309088.html
explode（）将数组转成列

UDTF
UDTF将单一输入行转化为多个输出行，并且在使用UDTF时，select语句中不能包含其他的列，UDTF不支持嵌套，也不支持group by 、sort by等语句。如果想避免上述限制，需要使用lateral view语法，案例：
select a.timestamp, get_json_object(a.appevents, ‘ $eventid'), get_json_object(a.appenvets, '$ .eventname’) from log a;
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, ‘eventid’, ‘eventname’) b as f1, f2;

其中，get_json_object为UDF函数，json_tuple为UDTF函数。
UDTF函数在某些应用场景下可以大大提高hql语句的性能，如需要多次解析json或者xml数据的应用场景。

2019\3\8 Hive几种排序的特点

   1.order by 全局排序

   2.sort by  非全局排序

   3.distribute by hash散列分区，常和sort by同时使用。即分区又排序，需要设置mapreduce.job.reduces的个数

   4.cluster by 当distribute by 和sort by的字段相同时，等同于cluster by.可以看做特殊的distribute + sort

2019\3\8 Multi-group by 是hive的一个非常好的特性，请举例说明

from A
insert overwrite table B
select A.a, count(distinct A.b) group by A.a
insert overwrite table C
select A.c, count(distinct A.b) group by A.c

2019\3\8 行转列：

LATERAL VIEW的使用：
侧视图的意义是配合explode（或者其他的UDTF），一个语句生成把单行数据拆解成多行后的数据结果集。
语法：
select mobile_no,lala,wawa from credit.aa
lateral view explode(split(first_bill_date,’-’)) goods as lala
lateral view explode(split(created_at,’:’))kk as wawa;

（ lateral view explode(split(created_at,’:’))kk lateral view explode(split(created_at,’:’)) 相当于一个虚拟表，kk相当于这个虚拟表的表名，通过虚拟表与原表credit.aa笛卡尔积得到我们想要的表）

98997229723 2017 2016-05-28 20
98997229723 2017 52
98997229723 2017 06
98997229723 05 2016-05-28 20
98997229723 05 52
98997229723 05 06
98997229723 01 2016-05-28 20
98997229723 01 52
98997229723 01 06

因为

只是使用explode的话，要使用mobile的话就不行了
hive> select mobile_no, explode(split(first_bill_date,’-’)) from credit.aa ;
FAILED: SemanticException [Error 10081]: UDTF’s are not supported outside the SELECT clause, nor nested in expressions

只能

hive> select explode(split(first_bill_date,’-’)) from credit.aa ;
Query ID = bdpms_20190308144343_868a2a86-6b8f-45c5-87b9-79ea92c4730e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
。。。。。

2019\3\11 hql的 over()：

over()必须和分析函数一块使用，比如row_number() 、sum() …

2019\3\21 hql的宏：

Hive中常被忽视的利器——宏
大数据 hadoop hive 3.1k 次阅读 · 读完需要 12 分钟
我们都知道Hive中有UDF（user defined function）——即用户自定义函数，但是由于UDF是Java编写的，代码中堆变量的内存回收完全不受开发者控制，而UDF程序又是嵌套在Hive SQL中执行的，对规模较大的表，就往往会出现由于UDF内存回收不及时造成的out-of-memory错误。因此，在生产环境中，UDF是严格受限的。那么，怎么办呢？

幸好，大多数情况下我们并不是真正需要（不得不用）UDF，大多数时候我们仅仅只是需要把一些虽然繁琐但其实结构简单的逻辑封装起来以便重复使用。举个栗子：

SELECT
if(birthday is not null and length(birthday)=4,
case
when birthday >= ‘0321’ and birthday <= ‘0420’ then ‘白羊座’
when birthday >= ‘0421’ and birthday <= ‘0520’ then ‘金牛座’
when birthday >= ‘0521’ and birthday <= ‘0621’ then ‘双子座’
when birthday >= ‘0622’ and birthday <= ‘0722’ then ‘巨蟹座’
when birthday >= ‘0723’ and birthday <= ‘0822’ then ‘狮子座’
when birthday >= ‘0823’ and birthday <= ‘0922’ then ‘处女座’
when birthday >= ‘0923’ and birthday <= ‘1022’ then ‘天秤座’
when birthday >= ‘1023’ and birthday <= ‘1121’ then ‘天蝎座’
when birthday >= ‘1122’ and birthday <= ‘1221’ then ‘射手座’
when birthday >= ‘1222’ and birthday <= ‘1231’ then ‘摩羯座’
when birthday >= ‘0101’ and birthday <= ‘0119’ then ‘摩羯座’
when birthday >= ‘0120’ and birthday <= ‘0218’ then ‘水瓶座’
when birthday >= ‘0219’ and birthday <= ‘0320’ then ‘双鱼座’
else null
end, null) as zodiac
FROM
employee;
这么长一大段其实只是完成了一个很简单的逻辑——把生日转换成星座，如果这个逻辑需要在多个地方重复使用，那代码就会变得非常难看，几乎没法维护。这时往往我们就想到要去定义一个UDF，但其实没必要，杀鸡焉用牛刀。现在轮到本文的主角隆重出场了。我们可以创建一个宏：
–根据生日推算星座

DROP TEMPORARY MACRO IF EXISTS getZodiacFromBirth;
CREATE TEMPORARY MACRO getZodiacFromBirth(birthday string)
if(birthday is not null and length(birthday)=4,
case
when birthday >= ‘0321’ and birthday <= ‘0420’ then ‘白羊座’
when birthday >= ‘0421’ and birthday <= ‘0520’ then ‘金牛座’
when birthday >= ‘0521’ and birthday <= ‘0621’ then ‘双子座’
when birthday >= ‘0622’ and birthday <= ‘0722’ then ‘巨蟹座’
when birthday >= ‘0723’ and birthday <= ‘0822’ then ‘狮子座’
when birthday >= ‘0823’ and birthday <= ‘0922’ then ‘处女座’
when birthday >= ‘0923’ and birthday <= ‘1022’ then ‘天秤座’
when birthday >= ‘1023’ and birthday <= ‘1121’ then ‘天蝎座’
when birthday >= ‘1122’ and birthday <= ‘1221’ then ‘射手座’
when birthday >= ‘1222’ and birthday <= ‘1231’ then ‘摩羯座’
when birthday >= ‘0101’ and birthday <= ‘0119’ then ‘摩羯座’
when birthday >= ‘0120’ and birthday <= ‘0218’ then ‘水瓶座’
when birthday >= ‘0219’ and birthday <= ‘0320’ then ‘双鱼座’
else null
end, null);
然后，我们前面这个SQL就可以简化成下面这个版本了。

select getZodiacFromBirth(birthday) from employee;
是不是很清爽呢？其实大多数不需要循环结构的逻辑，基本上都可以用宏来处理。它不仅可以用来做字段值的转换映射，也可以做逻辑校验。比如：

–判断身份证号是否合法

DROP TEMPORARY MACRO IS_VALID_IDNO;
CREATE TEMPORARY MACRO IS_VALID_IDNO(idno string)
IF(idno rlike ‘¹\d{5}(19|20)\d{2}(0[1-9]|1[0-2])([0-2]\d|3[0-1])\d{4} $OR idno rlike '^[1-9]\\d{5}\\d{2}(0[1-9]|1[0-2])([0-2]\\d|3[0-1])\\d{3}$ ’,true,false);
聪明如你，现在应该已经可以举一反三的运用宏这柄利器了，打字很累，我也就不再举例了。
最后，说一下宏的局限性。宏只能是临时宏，只在本次会话中可见、有效。因此你需要将宏脚本放在SQL脚本的头部。

但是 mysql语法有所不同

CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
RETURN (
# Write your MySQL query statement below.

);
END

2019\5\5 hql的别名：

–select * from dm_jtftp_bdpms.tmp_scy_20190117_live15_detail c

和

select * from dm_jtftp_bdpms.tmp_scy_20190117_live15_detail_internal c where bdpms_etl_time=20190427000000
这样的区别就是第一个是视图，他们的范围就是视图，但是如果是全量表 c
加分区的话这样也是不行的，必须是（select * from dm_jtftp_bdpms.tmp_scy_20190117_live15_detail_internal where bdpms_etl_time=20190427000000）c

2019、5、9 如果统计数相同取最近时间

row_number() OVER(partition by mbl_nr order by cust_nm_cnt desc,dt_id desc) as rak

2019/5/9 计算最大联系徐天数内调用量最大，距今时间等：

–m1_optr_3elmts_max_cntd_inqry_dy_cnt int comment ‘近1个月运营商三要素最长连续查询天数’,
–m1_optr_3elmts_max_cntd_inqry_end_dt date comment ‘近1个月运营商三要素最长连续查询天数结束日期’,
–m1_optr_3elmts_max_cntd_inqry_totl_cnt int comment ‘近1个月运营商三要素最长连续查询期间总查询次数’,
(select
mbl_nr as mbl_no,
cons_day as m1_optr_3elmts_max_cntd_inqry_dy_cnt,
3elmts_cons_end_day_1m as m1_optr_3elmts_max_cntd_inqry_end_dt,
sum_cnt as m1_optr_3elmts_max_cntd_inqry_totl_cnt
from
(select sortconsday.*,row_number() over(partition by mbl_nr order by cons_day desc) rnk from
(select mbl_nr,min(dt_id_date) 3elmts_cons_start_day_1m,max(dt_id_date) 3elmts_cons_end_day_1m,contiday.id1-contiday.id2 diff,max(id1)-min(id1)+1 cons_day,sum(cnt) sum_cnt
from(select mbl_nr,dt_id_date,cnt,datediff(dt_id_date,‘2017-01-01’) id1, row_number() over (partition by mbl_nr order by dt_id_date asc) id2 from
(select mbl_nr,substr(dt_id,1,10) as dt_id_date ,count(1) cnt
from app_label_credit_bdpms.lbl_acct_trsn_dly_smmry_addprod_clasfct_fial
where bdpms_etl_time>($(ct.convertTo(4).add(-1).get())) and prod_clasfct_three_by_prod_id=‘3elmts’ and incr_col=‘RQST_RCVD_TS’
group by mbl_nr,dt_id)sortdt)contiday
group by mbl_nr,contiday.id1-contiday.id2)sortconsday) t where rnk = 1) sanyaosu_1mth_ae
on aa.mbl_no=sanyaosu_1mth_ae.mbl_no

解析步骤：
一、
select mbl_nr,substr(dt_id,1,10) as dt_id_date ,count(1) cnt --手机号、日期、这个日期总调用次数
from app_label_credit_bdpms.lbl_acct_trsn_dly_smmry_addprod_clasfct_fial
where bdpms_etl_time>($(ct.convertTo(4).add(-1).get())) and prod_clasfct_three_by_prod_id=‘3elmts’ and incr_col=‘RQST_RCVD_TS’
group by mbl_nr,dt_id

select mbl_nr,dt_id_date,cnt,datediff(dt_id_date,‘2017-01-01’) id1, row_number() over (partition by mbl_nr order by dt_id_date asc) id2 from
()sortdt --手机号码、日期、调用时间至今、那时间排名

二、
select mbl_nr,min(dt_id_date) 3elmts_cons_start_day_1m,max(dt_id_date) 3elmts_cons_end_day_1m,contiday.id1-contiday.id2 diff,max(id1)-min(id1)+1 cons_day,sum(cnt) sum_cnt
from()contiday --手机号、这个时间差中最小、最大时间、这个时间段（时间差）、连续天数、这个手机号码调用的这个时间段中的所有次调用
group by mbl_nr,contiday.id1-contiday.id2

三、
select
mbl_nr as mbl_no, --使用最大连续天数;手机号码、连续天数、结束日期、这个时间段的调用次数
cons_day as m1_optr_3elmts_max_cntd_inqry_dy_cnt,
3elmts_cons_end_day_1m as m1_optr_3elmts_max_cntd_inqry_end_dt,
sum_cnt as m1_optr_3elmts_max_cntd_inqry_totl_cnt
from
(select sortconsday.*,row_number() over(partition by mbl_nr order by cons_day desc) rnk from
()sortconsday) t where rnk = 1

2019/07/17 translate和replace

1.translate
语法：TRANSLATE(char, from, to)

用法：返回将出现在from中的每个字符替换为to中的相应字符以后的字符串。
若from比to字符串长，那么在from中比to中多出的字符将会被删除。
三个参数中有一个是空，返回值也将是空值。
举例：SQL> select translate(‘abcdefga’,‘abc’,‘wo’) 返回值 from dual;
返回值
-------
wodefgw
分析：该语句要将’abcdefga’中的’abc’转换为’wo’,
由于’abc’中’a’对应’wo’中的’w’,
故将’abcdefga’中的’a’全部转换成’w’;
而’abc’中’b’对应’wo’中的’o’,
故将’abcdefga’中的’b’全部转换成’o’;
'abc’中的’c’在’wo’中没有与之对应的字符,
故将’abcdefga’中的’c’全部删除;
简单说来，就是将from中的字符转换为to中与之位置对应的字符，
若to中找不到与之对应的字符，返回值中的该字符将会被删除。
在实际的业务中，可以用来删除一些异常数据，
比如表a中的一个字段t_no表示电话号码，
而电话号码本身应该是一个由数字组成的字符串，
为了删除那些含有非数字的异常数据，
就用到了translate函数：
SQL> delete from a，
where length(translate(trim(a.t_no),
‘0123456789’ || a.t_no,
‘0123456789’)) <> length(trim(a.t_no));

2.replace

语法：REPLACE(char, search_string,replacement_string)
用法：将char中的字符串search_string全部转换为字符串replacement_string。

举例：SQL> select REPLACE(‘fgsgswsgs’, ‘fk’ ,‘j’) 返回值 from dual;
返回值
---------
fgsgswsgs

        SQL> select REPLACE('fgsgswsgs', 'sg' ,'eeerrrttt') 返回值 from dual;
        返回值
        -----------------------
        fgeeerrrtttsweeerrrttts

分析：第一个例子中由于’fgsgswsgs’中没有与’fk’匹配的字符串，
故返回值仍然是’fgsgswsgs’；
第二个例子中将’fgsgswsgs’中的字符串’sg’全部转换为’eeerrrttt’。

总结：综上所述，replace与translate都是替代函数，

只不过replace针对的是字符串，而translate针对的是单个字符。