数据分析必备之sql细节
字符串截取
数据分析数据清洗时对表中某字段截取部分的需求,这种场景有时直接通过数据库操作来实现比通过python等代码实现要更方便快捷些,mysql、hive有很多字符串函数可以用来处理这些需求,如字符串截取总结:left()、right()、substring()、substring_index()。
一.从左开始截取字符串
用法:left(str, length),即:left(被截取字符串, 截取长度)
SELECT LEFT(‘www.yuanrengu.com’,8)
结果为:www.yuan
二.从右开始截取字符串
用法:right(str, length),即:right(被截取字符串, 截取长度)
SELECT RIGHT(‘www.yuanrengu.com’,6)
结果为:gu.com
三.截取特定长度的字符串
用法:
substring(str, pos),即:substring(被截取字符串, 从第几位开始截取)
substring(str, pos, length),即:substring(被截取字符串,从第几位开始截取,截取长度)
1.从字符串的第9个字符开始读取直至结束
SELECT SUBSTRING(‘www.yuanrengu.com’, 9)
结果为:rengu.com
2.从字符串的第9个字符开始,只取3个字符
SELECT SUBSTRING(‘www.yuanrengu.com’, 9, 3)
结果为:ren
3.从字符串的倒数第6个字符开始读取直至结束
SELECT SUBSTRING(‘www.yuanrengu.com’, -6)
结果为:gu.com
4.从字符串的倒数第6个字符开始读取,只取2个字符
SELECT SUBSTRING(‘www.yuanrengu.com’, -6, 2)
结果为:gu
四.按关键字进行读取
用法:substring_index(str, delim, count),即:substring_index(被截取字符串,关键字,关键字出现的次数)
1.截取第二个“.”之前的所有字符
SELECT SUBSTRING_INDEX(‘www.yuanrengu.com’, ‘.’, 2);
结果为:www.yuanrengu
2.截取倒数第二个“.”之后的所有字符
SELECT SUBSTRING_INDEX(‘www.yuanrengu.com’, ‘.’, -2);
结果为:yuanrengu.com
3.如果关键字不存在,则返回整个字符串
SELECT SUBSTRING_INDEX(‘www.yuanrengu.com’, ‘sprite’, 1);
结果为:www.yuanrengu.com
字符串拼接
一、拼接函数
concat(),concat_ws()与group_concat()函数
1、concat()
用于将多个字符拼接起来,CONCAT(str1,str2,…),如果有某个字符为空的情况结果为空
SELECT CONCAT('FIRST ', 'SECOND');
-- FIRST SECOND
SELECT CONCAT('FIRST ', NULL, 'SECOND');
-- NULL
2、concat_ws()函数
用于在拼接的时候指定分隔符,CONCAT_WS(separator,str1,str2,…),如果有某个字符为空的情况结果不为空
SELECT CONCAT_WS(',','1','2','3','4')
-- 1,2,3,4
SELECT CONCAT_WS(',','1','2', NULL, '3','4');
-- 1,2,3,4
分组拼接
用法:
select id,
concat_ws(’+’,sort_array(collect_set(name))) as grp_name
from taget_table1
group by id
字符串拼接
用法:
select concat_ws(’-’,str1,str2,str3,str4,str5) as strs
from taget_table2
3、group_concat()函数
返回一个字符串结果,该结果由分组中的值连接而成,GROUP_CONCAT(expr)
GROUP_CONCAT([DISTINCT] expr [,expr ...] [ORDER BY {unsigned_integer | col_name | formula} [ASC | DESC] [,col ...]]
[SEPARATOR str_val])
通过使用DISTINCT排除重复值,使用ORDER BY进行排序,SEPARATOR是一个字符串,用作结果集的分隔符,缺省值为逗号,可以通过指定SEPARATOR “”去掉分隔符。
可以通过group_concat_max_len设置最大长度,如果最大长度被限制,结果集将被剪切到这个最大长度:
SET [SESSION | GLOBAL] group_concat_max_len = unsigned_integer;
使用info表为例:
+----------+----+--------------------------+
| locus | id | journal |
+----------+----+--------------------------+
| AB086827 | 1 | Unpublished |
| AB086827 | 2 | Submitted (20-JUN-2002) |
| AF040764 | 23 | Unpublished |
| AF040764 | 24 | Submitted (31-DEC-1997) |
+----------+----+--------------------------+
在select下使用直接group_concat结果是先拼接再加分隔符:
SELECT GROUP_CONCAT(id,locus) FROM info;
-- 1AB086827,2AB086827,3AF040764,4AF040764
在select下使用group by后的group_concat结果是先分组在拼接的:
SELECT locus,GROUP_CONCAT(id) FROM info WHERE locus IN('AB086827','AF040764') GROUP BY locus;
| locus | GROUP_CONCAT(id) |
+----------+------------------+
| AB086827 | 1,2 |
| AF040764 | 23,24 |
使用distinct过滤重复值,使用order by排序,使用separator重新设置分隔符:
SELECT locus,GROUP_CONCAT(distinct id ORDER BY id DESC SEPARATOR '_') FROM info WHERE locus IN('AB086827','AF040764') GROUP BY l
| locus | GROUP_CONCAT(distinct id ORDER BY id DESC SEPARATOR '_') |
+----------+----------------------------------------------------------+
| AB086827 | 2_1 |
| AF040764 | 24_23 |
组合使用group_concat与concat_ws:
SELECT locus,GROUP_CONCAT(concat_ws(', ',id,journal) ORDER BY id DESC SEPARATOR '. ') FROM info WHERE locus IN('AB086827','AF040764') GROUP BY locus;
+----------+--------------------------------------------------------------------------+
| locus | GROUP_CONCAT(concat_ws(', ',id,journal) ORDER BY id DESC SEPARATOR '. ') |
+----------+--------------------------------------------------------------------------+
| AB086827 | 2, Submitted (20-JUN-2002). 1, Unpublished |
| AF040764 | 24, Submitted (31-DEC-1997) . 23, Unpublished |
https://blog.csdn.net/u011305680/article/details/52233124
二、字符串有序拼接
1、字母排序
由collect_list形成的列表经过concat_ws拼接后顺序具有随机性,要保证列表有序只需要在生成列表后使用sort_array函数进行排序即可,示例如下:
SELECT
memberid,
regexp_replace(
concat_ws('-',
sort_array(
collect_list(
concat_ws(':',cast(legcount as string),airways)
)
)
),'\\d\:','') hs
from
(
select 1 as memberid,'A' as airways,2 as legcount
union ALL
select 1 as memberid,'B' as airways,3 as legcount
union ALL
select 1 as memberid,'C' as airways,4 as legcount
union ALL
select 1 as memberid,'D' as airways,1 as legcount
union ALL
select 1 as memberid,'E' as airways,8 as legcount
) as t
group by memberid
https://blog.csdn.net/madaokuma/article/details/81544849
2、数字排序
select 'a' as category, 19 as duration
union all
select 'b' as category, 15 as duration
union all
select 'c' as category, 12 as duration
union all
select 'd' as category, 53 as duration
union all
select 'e' as category, 27 as duration
union all
select 'f' as category, 9 as duration;
category | duration
b | 15
f | 9
e | 27
c | 12
d | 53
a | 19
想要多行转一行并且按照duration排序,形成如下效果d,e,a,b,c,f
concat_ws(’,’,collect_set(category)) 但是得到的结果却是乱序的,产生这个问题的根本原因自然在MapReduce,如果启动了多于一个mapper/reducer来处理数据,select出来的数据顺序就几乎肯定与原始顺序不同了。
解决方法可以把mapper数固定成1,或者把rank加进来再进行一次排序,拼接完之后把rank去掉:
select
regexp_replace(
concat_ws(',',
sort_array(
collect_list(
concat_ws(':',lpad(cast(duration_rank as string),5,'0'),cast(category as string))
)
)
),
'\\d+\:','')
from
(select
category
,row_number() over (order by cast(duration as int) desc) duration_rank
from
(select 'a' as category, 19 as duration
union all
select 'b' as category, 15 as duration
union all
select 'c' as category, 12 as duration
union all
select 'd' as category, 53 as duration
union all
select 'e' as category, 27 as duration
union all
select 'f' as category, 9 as duration) t
) T;
duration_rank 必须要在高位补足够的0对齐,因为排序的是字符串而不是数字,如果不补0的话,按字典序排序就会变成1, 10, 11, 12, 13, 2, 3, 4…,又不对了。将排序的结果拼起来之后,用regexp_replace函数替换掉冒号及其前面的数字,大功告成。
https://cloud.tencent.com/developer/article/1634334
应用:
select
key,
substring(
regexp_replace(
regexp_replace(
concat_ws(',',
sort_array(
collect_list(
concat_ws('flag',lpad(cast(id_rank as string),5,'0'),cast(score as string))
)
)
),
'\\d+\flag','-'),
',-','-'),
2) as line_score
from
(
select *,
row_number() over (partition by key order by cast(id as int)) classid_rank
-- 按照key分组,id大小排序
from t1
)
group by key;
字符串判断
用法:instr()
返回字符串的位置,没有返回0
一、.判断字符串是否含有目标字符串str1……
用法:
select nvl(instr('原始字符串','目标字符串1'),0)>0
二.判断是否存在目标字符串1或1或3
用法:
select nvl(instr('原始字符串','目标字符串1'),0)
+ nvl(instr('原始字符串','目标字符串2'),0)
+ nvl(instr('原始字符串','目标字符串3'),0)
>0
三.判断目标字符串1或2或3或4……或n出现的次数是否超过3次
用法:
select if(nvl(instr('原始字符串','目标字符串1'),0)>0,1,0)
+ if(nvl(instr('原始字符串','目标字符串2'),0)>0,1,0)
+ ……
+ if(nvl(instr('原始字符串','目标字符串n'),0)>0,1,0)
>0
字符串替换
REGEXP_REPLACE(inputString, regexString, replacementString)
第一个参数:表中字段
第二个参数:正则表达式
第三个参数:要替换称为的字符
https://stackoverflow.com/questions/43415974/sort-array-order-by-a-different-column-hive
https://blog.csdn.net/madaokuma/article/details/81544849
条件统计 case when与聚合函数
一、case when()+ count() = count([distinct] case when c1 then v1 else null end)
用法:
select
count(distinct case when score>60 then uid else null end) as cnt_60,
count(distinct case when score>80 and cast(score_rank as int)>=2 then uid else null end) as cnt_80,
count(distinct case when score>100 and cast(score_rank as int)>=3 then uid else null end) as cnt_100,
from table1
table1:uid-term-score 学生各个学科成绩表,
cnt_60为各科成绩大于60的学生个数(去重)
二、case when() + sum() = sum(case when c1 then v1 else v2 end)
用法:
select col1,col2,sum(case when col1>col2 then value1 else value2 end)
group by col1,col2
jason字符串处理
一、读取jason字符串数据
用法:action ‘{key:a,data{page:1,filter:0}}’
select action[‘key’],action.key,action.data[‘page’] from maidian_table
返回值:a a 1
二、字符串转化jason格式
用法:select to_json(struct(’{1,2,3:4}’))
返回值:{“col1”:"{1,2,3:4}"}
正则表达式
';'转义
用法:select ‘;’
返回值:;
一列转多列
Lateral View
Lateral View与用户自定义生成函数即UDTF(如explode()或者split()等)结合使用。UDTF:为每一个输入行生成0个或者多个输出行。Lateral View将UDTF应用于基础表的每一行,然后将输出行连接到输入行,以形成具有所提供的表别名的虚拟表。
基本用法:
lateral view:LATERAL VIEW udtf(expression) tableAlias AS columnAlias
explode
LATERAL VIEW 与 posexplode
I have a hive table with the following schema:
COOKIE | PRODUCT_ID | CAT_ID | QTY
1234123 [1,2,3] [r,t,null] [2,1,null]
How can I normalize the arrays so I get the following result
COOKIE | PRODUCT_ID | CAT_ID | QTY
1234123 [1] [r] [2]
1234123 [2] [t] [1]
1234123 [3] null null
I have tried the following:however the result comes out as a Cartesian product.
select concat_ws(’|’,visid_high,visid_low) as cookie
,pid
,catid
,qty
from table
lateral view explode(productid) ptable as pid
lateral view explode(catalogId) ptable2 as catid
lateral view explode(qty) ptable3 as qty
正确做法
SELECT COOKIE ,
ePRODUCT_ID,
eCAT_ID,
eQTY
FROM TABLE
LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_ID AS seqp, ePRODUCT_ID
LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported
WHERE seqp = seqc AND seqc = seqq;
参考链接:https://stackoverflow.com/questions/20667473/hive-explode-lateral-view-multiple-arrays