Where 语句中的不等于: != <> , 使用时会把为NULL的值去掉,所以有的时候需要补上。例如:
where rate!=‘D’ or rate is null;
或者
where rate<>‘D’ or rate is null;
Where和having的区别
Where执行是在聚合函数之前,having执行是在聚合函数之后;
Where是对聚合之前的数据做过滤的,having是对聚合之后的数据做过滤的
和group by 的关系
Where 用在 group by 之前,sql 会在分组之前计算where语句
Having 用在 group by 之后,sql会在分组之后计算having语句,查询结果中只返回满足having条件的组。
Group by order by 于having where 的位置先后
顺序是:
where…group by…having…order by (这里的group by 与 having 的位置可以调换,但是having放在后面效率高!),即where最前,order by最后。
数学函数
Round(double d,int n):返回保留n位小数的近似d值
hive (default)> select round(3.1415,2) round from dual;
3.14
Floor(double d):返回小于d的最大整值
hive (default)> select floor(3.1415) round from dual;
3
Ceil(double d):返回大于d的最小整值
hive (default)> select ceil(3.1415) round from dual;
4
Rand(int seed):返回随机数,seed是随机因子
hive (default)> select rand(10) from dual;
0.7304302967434272
hive (default)> select rand(1) from dual;
0.7308781907032909
hive (default)> select rand(2) from dual;
0.7311469360199058
Bin(int d):计算二进制d的string值
hive (default)> select bin(8) from dual;
1000
日期函数
To_date(string timestamp):返回时间字符串中的日期部分
hive (default)> select to_date('1970-01-01 00:00:00') from dual;
1970-01-01
date_format 日期格式化函数:格式化日期时候如果第一个参数为字符串,那么只能识别yyyy-MM-dd 类型的数据, : 第一个参数只能是String 类型或者日期类型
hive (default)> desc function date_format;
date_format(date/timestamp/string, fmt) - converts a date/timestamp/string to a value of string in the format specified by the date format fmt.
hive (default)> select date_format(to_date('2021-09-01'),'yyyy-MM') from dual;
2021-09
current_date:返回当前日期
hive (default)> select current_date from dual;
2021-09-05
year(date):返回日期date的年,类型为int如year(‘2019-01-01’)=2019
hive (default)> select year('2021-09-05') from dual;
2021
month(date):返回日期date的月,类型为int,如month(‘2019-01-01’)=1
hive (default)> select month('2021-09-05') from dual;
9
day(date): 返回日期date的天,类型为int,如day(‘2019-01-01’)=1
hive (default)> select day('2021-09-05') from dual;
5
weekofyear(date1):返回日期date1位于该年第几周
hive (default)> select weekofyear('2021-01-04') from dual;
1
datediff(date1,date2):返回日期date1与date2相差的天数
hive (default)> select datediff('2021-09-05','2021-09-01') from dual;
4
date_add(date1,int1):返回日期date1加上int1的日期
hive (default)> select date_add('2021-09-01',3) from dual;
2021-09-04
date_sub(date1,int1):返回日期date1减去int1的日期
hive (default)> select date_sub('2021-09-05',3) from dual;
2021-09-02
months_between(date1,date2):返回date1与date2相差月份
hive (default)> select months_between('2021-05-04','2021-09-04') from dual;
-4.0
add_months(date1,int1):返回date1加上int1个月的日期,int1可为负数
hive (default)> select add_months('2021-05-04',2) from dual;
2021-07-04
last_day(date1):返回date1所在月份最后一天
hive (default)> select last_day('2021-05-04') from dual;
2021-05-31
next_day(date1,day1):返回日期date1的下个星期day1的日期
hive (default)> select next_day('2021-09-04','MO') from dual;
2021-09-06
trunc(date1,string1):返回日期最开始年份或月份。string1可为年(YYYY/YY/YEAR)或月(MONTH/MON/MM)
hive (default)> select trunc('2021-09-02','YYYY') from dual;
2021-01-01
hive (default)> select trunc('2021-09-02','MM') from dual;
2021-09-01
unix_timestamp():返回当前时间的unix时间戳,可指定日期格式
hive (default)> select unix_timestamp('2021-09-01','yyyy-mm-dd') from dual;
1609430940
from_unixtime():返回unix时间戳的日期,可指定格式。如select from_unixtime(unix_timestamp('2019-
03-06','yyyy-mm-dd'),'yyyymmdd')='20190306'
hive (default)> select from_unixtime(unix_timestamp('2021-09-01','yyyy-mm-dd'),'yyyymmdd') from dual;
20210901
条件函数
if(boolean,t1,t2):若布尔值成立,则返回t1,反正返回t2
hive (default)> select if(1>2,100,200) from dual;
200
coalesce(v0,v1,v2):返回参数中的第一个非空值,若所有值均为null,则返回null
hive (default)> select coalesce(null,1,2) from dual;
1
hive (default)> select coalesce(null,null,2) from dual;
2
hive (default)> select coalesce(null,null,null) from dual;
NULL
isnull(a):若a为null则返回true,否则返回false
hive (default)> select isnull(null) from dual;
true
hive (default)> select isnull('word') from dual;
false
字符串函数
length(string1):返回字符串长度
hive (default)> select length('world') from dual;
5
concat(string1,string2):返回拼接string1及string2后的字符串
hive (default)> select concat('hello','world') from dual;
helloworld
concat_ws(sep,string1,string2):返回按指定分隔符拼接的字符串
hive (default)> select concat_ws('-','hello','world') from dual;
hello-world
lower(string1):返回小写字符串,同lcase(string1)。upper()/ucase():返回大写字符串
hive (default)> select lower('Hive') from dual;
hive
trim(string1):去字符串左右空格,ltrim(string1):去字符串左空格。rtrim(string1):去字符串右空格
hive (default)> select trim(' Hello ') from dual;
Hello
repeat(string1,int1):返回重复string1字符串int1次后的字符串
hive (default)> select repeat('hello',3) from dual;
hellohellohello
reverse(string1):返回string1反转后的字符串。如reverse(‘abc’)返回’cba’
hive (default)> select reverse('hello') from dual;
olleh
rpad(string1,len1,pad1):以pad1字符右填充string1字符串,至len1长度。如rpad(‘abc’,5,‘1’)返回’abc11’。lpad():左填充
hive (default)> select rpad('abc',5,'de') from dual;
abcde
select rpad('abc',5,'def') from dual;
hive (default)> select rpad('abc',5,'def') from dual;
abcde
split(string1,pat1):以pat1正则分隔字符串string1,返回数组。如split(‘a,b,c’,’,’)返回[“a”,“b”,“c”]
hive (default)> select split('hello,world,hadoop',',') from dual;
["hello","world","hadoop"]
substr(string1,index1,int1):以index位置起截取int1个字符。如substr(‘abcde’,1,2)返回’ab’
hive (default)> select substr('abcdefg',2,3) from dual;
bcd
egexp_replace 字符串替换函数,将源字符替换成为目标字符
聚合函数
count():统计行数
sum(col1):统计指定列和
avg(col1):统计指定列平均值
min(col1):返回指定列最小值
max(col1):返回指定列最大值
表生成函数
explode (array):返回多行array中对应的元素。explode常用来做行列转换。
hive (default)> select explode(array('A','B','C')) from dual;
OK
col
A
B
C
案例一
hive (default)> select * from advertise;
OK
advertise.pageid advertise.ad_list
front_page ["1","2","3"]
contract_page ["3","4","5"]
hive (default)> SELECT pageid, adid FROM advertise LATERAL VIEW explode(ad_list) adTable AS adid;
OK
pageid adid
front_page 1
front_page 2
front_page 3
contract_page 3
contract_page 4
contract_page 5
SELECT adid, count(1) cnt
FROM advertise LATERAL VIEW explode(ad_list) adTable AS adid
GROUP BY adid;
adid cnt
1 1
2 1
3 2
4 1
5 1
hive (default)> select * from ad_v2;
OK
ad_v2.col1 ad_v2.col2
["1","2"] ["a","b","c"]
["3","4"] ["d","e","f"]
Time taken: 0.057 seconds, Fetched: 2 row(s)
hive (default)> SELECT myCol1, myCol2 FROM ad_v2
> LATERAL VIEW explode(col1) myTable1 AS myCol1
> LATERAL VIEW explode(col2) myTable2 AS myCol2;
OK
mycol1 mycol2
1 a
1 b
1 c
2 a
2 b
2 c
3 d
3 e
3 f
4 d
4 e
4 f
行列转换
https://blog.csdn.net/u010003835/article/details/106632597
Hive 行列转换:
https://blog.csdn.net/u010003835/article/details/106604698#comments_17639212
行转列
方式一:通过 GROUP BY + CASE WHEN + 聚合函数
1.有用户学习情况的 uid, 科目,分数 多行数据。已知每个人所学分为 英语,数学,语文3个科目。
uid, subject,score 转换为 uid,math_score, chinese_score, english_score
INSERT OVERWRITE TABLE user_subject_score2
SELECT
uid
,MAX(
CASE
WHEN subject = 'math' THEN score
ELSE -1
END
) AS chinese_score
,MAX(
CASE
WHEN subject = 'chinese' THEN score
ELSE -1
END
) AS chinese_score
,MAX(
CASE
WHEN subject = 'english' THEN score
ELSE -1
END
) AS chinese_score
FROM user_subject_score1
GROUP BY uid
;
+--------------------------+---------------------------------+------------------------------------+------------------------------------+
| user_subject_score2.uid | user_subject_score2.math_score | user_subject_score2.chinese_score | user_subject_score2.english_score |
+--------------------------+---------------------------------+------------------------------------+------------------------------------+
| 1 | 80.0 | 76.0 | 80.0 |
| 2 | 88.0 | 88.0 | -1.0 |
| 3 | 66.0 | 30.0 | -1.0 |
+--------------------------+---------------------------------+------------------------------------+------------------------------------+
方式二: 通过 GROUP BY + CONCAT_WS + COLLECT_LIST
有 用户id ,订单id 求 用户的所有订单列表
uid, order_id -> uid,order_ids
我们看下这两个函数的使用方法
CONCAT_WS
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator. |
+----------------------------------------------------+
COLLECT_LIST
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| collect_list(x) - Returns a list of objects with duplicates |
+----------------------------------------------------+
use data_warehouse_test;
CREATE TABLE IF NOT EXISTS user_order (
uid BIGINT
,order_id BIGINT
);
CREATE TABLE IF NOT EXISTS user_orders (
uid BIGINT
,order_ids STRING
);
INSERT OVERWRITE TABLE user_order VALUES
(1, 112)
,(1, 123)
,(2, 234)
,(2, 21)
,(3, 821)
;
列转行
方式一 :采用 UNION ALL 的方式
SELECT campaign_id, campaign_name, 'day1_num', day1_num
FROM explode_laterview_org
UNION ALL
SELECT campaign_id, campaign_name, 'day2_num', day2_num
FROM explode_laterview_org
UNION ALL
SELECT campaign_id, campaign_name, 'day3_num', day3_num
FROM explode_laterview_org
UNION ALL
SELECT campaign_id, campaign_name, 'day4_num', day4_num
FROM explode_laterview_org
UNION ALL
SELECT campaign_id, campaign_name, 'day5_num', day5_num
FROM explode_laterview_org
UNION ALL
SELECT campaign_id, campaign_name, 'day6_num', day6_num
FROM explode_laterview_org
UNION ALL
SELECT campaign_id, campaign_name, 'day7_num', day7_num
FROM explode_laterview_org
;
方式一在指标少时方便使用,当行转列的指标比较多时代码量会比较大,维护困难
方式二:使用lateral view和str_to_map
SELECT
campaign_id, campaign_name, type, num
FROM explode_laterview_org
LATERAL VIEW
EXPLODE(
STR_TO_MAP(
CONCAT(
'day1_num=',CAST (day1_num AS STRING),
'&day2_num=',CAST (day2_num AS STRING),
'&day3_num=',CAST (day3_num AS STRING),
'&day4_num=',CAST (day4_num AS STRING),
'&day5_num=',CAST (day5_num AS STRING),
'&day6_num=',CAST (day6_num AS STRING),
'&day7_num=',CAST (day7_num AS STRING)
)
,'&', '=')
) lateral_table AS type, num
;
方式二极大的减少重复的代码量,需要熟悉lateral view和str_to_map,concat等函数使用,concat主要拼接key-value形式存储的字符串,如在一个项目开发中使用一个sql处理出十多个指标,后期需要列存储各指标值时,使用方式二就可以使代码减少数10倍,极大提升代码可读性和可维护性,同时key值可以自定义。
窗口函数
Hive 中的OVER函数解析:https://blog.csdn.net/qq_43081842/article/details/102833424
Hive中的窗口分析函数:https://blog.csdn.net/qq_43081842/article/details/104459384
进阶函数
collect_set():把同一个分组不同行的数据聚合成为一个集合,用下标可以取某一个
对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
select course,collect_set(area),avg(score) from student group by course;
result: chinese ["sh","bj"] 89
math ["bj"] 90
get_json_object (解析json)
get_json_object(param1,"$.param2")
param1:需要被解析的json字段
param2:数组就用 [0,1,2...] 0,1,2是数组对应的元素,遇若jsonObject直接用 ".key"取出想要获取的value。
[{"name":"Tom","sex":"男","age":"22"},{"name":"Jack","sex":"女"age":"32"}]
1、如果需要获取第一个json对象,hive语句如下
SELECT get_json_object(pjson,"$.[0]") FROM person;
得到第一个json对象
{"name":"Tom","sex":"男","age":"22"}
2、如果要获取第一个json对象中的name属性的值:
SELECT get_json_object(pjson,"$.[0].age") FROM person;
Hivesql中——array_contains()函数的使用
Array_contains()解析
Hive 中的array_contains函数于SQL中的in关键字操作类似,用于判定包含(array_contains)或不包含(!Array_contains)关系。与 in 不同的是array_contains可以用于判定一张表中同一个id的多条记录中的同一字段是否包含指定的一个或多个值。需要注意字段类型保持一致,若不一致则需要进行强制类型转换。
语法:
array_contains(数组,值),返回布尔类型值。
array_contains(数组,值)
1
array_contains()函数在where条件中的案例
解决使用某个字段条件过多,查询条件失效
select *
from login
where dt='20130101' and (ver !='3.1' or ver !='3.2' or ver != '4.0' or ver != '5.2');
select * from login where dt='20130101'
and (ver !='3.1' and ver !='3.2' and ver != '4.0' and ver != '5.2');
这个语句的目的是从一个用户登陆表中,查询20130101当天,版本不是3.1,3.2,4.0,5.2的登陆用户记录。但是,在hive0.90执行,最后返回的是20130101当天全部的登陆用户记录,后面的查询条件看起来是被忽略了。这里看起来同一个字段做这种多个!=或者=的条件查询,过多会失效。
于是修改了语句,如下:
select *
from login
where dt=20130101
and !array_contains(split('3.1,3.2,4.0,5.2',','),ver);
array_contians()直接应用案例:
1、使用array_contians()对监控指标打标签,判断报警级别。
select
task_id,
task_code,
task_version,
case
when array_contains(collect_set(tag_id), 599)
or array_contains(collect_set(tag_id), 1225)
then 'L1-级别'
when array_contains(collect_set(tag_id), 600)
or array_contains(collect_set(tag_id), 1226)
then 'L2-级别'
when array_contains(collect_set(tag_id), 601)
or array_contains(collect_set(tag_id), 1240)
then 'L3-级别'
when array_contains(collect_set(tag_id), 1166)
or array_contains(collect_set(tag_id), 1241)
then '鹰眼监控汇总'
else '非鹰眼监控任务'
end as core_tag_name
from
fdm.fdm_buffalo_uniontask_tag_theme_info_new
where
dt = sysdate( - 1)
group by
task_id,
task_code,
task_version
Hive窗口函数
LEAD LEAD(col,n, default_val):往后第n行数据 col 列名 n 往后第几行 默认为1 默认值 默认null
LAG LAG(col,n,default_val):往前第n行数据 col 列名 n 往前第几行 默认为1 默认值 默认null
FIRST_VALUE 在当前窗口下的第一个值 FIRST_VALUE (col,true/false) 如果设置为true,则跳过空值。
LAST_VALUE 在当前窗口下的最后一个值 LAST_VALUE (col,true/false)如果设置为true,则跳过空值。
rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING