hive常用函数小结

最新推荐文章于 2024-07-24 15:00:45 发布

泶月

最新推荐文章于 2024-07-24 15:00:45 发布

阅读量51

点赞数

文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/shenmingyue_xy/article/details/132688095

版权

Where 语句中的不等于： != <> , 使用时会把为NULL的值去掉，所以有的时候需要补上。例如：

where rate!=‘D’ or rate is null;

或者

where rate<>‘D’ or rate is null;

Where和having的区别

Where执行是在聚合函数之前，having执行是在聚合函数之后；

Where是对聚合之前的数据做过滤的，having是对聚合之后的数据做过滤的

和group by 的关系

Where 用在 group by 之前，sql 会在分组之前计算where语句

Having 用在 group by 之后，sql会在分组之后计算having语句，查询结果中只返回满足having条件的组。

Group by order by 于having where 的位置先后

顺序是：

where…group by…having…order by （这里的group by 与 having 的位置可以调换，但是having放在后面效率高！），即where最前，order by最后。

数学函数

Round(double d,int n):返回保留n位小数的近似d值

hive (default)> select round(3.1415,2) round from dual;

3.14

Floor(double d):返回小于d的最大整值

hive (default)> select floor(3.1415) round from dual;

Ceil(double d):返回大于d的最小整值

hive (default)> select ceil(3.1415) round from dual;

Rand(int seed):返回随机数，seed是随机因子

hive (default)> select rand(10) from dual;

0.7304302967434272

hive (default)> select rand(1) from dual;

0.7308781907032909

hive (default)> select rand(2) from dual;

0.7311469360199058

Bin(int d):计算二进制d的string值

hive (default)> select bin(8) from dual;

1000

日期函数

To_date(string timestamp):返回时间字符串中的日期部分

hive (default)> select to_date('1970-01-01 00:00:00') from dual;

1970-01-01

date_format 日期格式化函数：格式化日期时候如果第一个参数为字符串，那么只能识别yyyy-MM-dd 类型的数据, : 第一个参数只能是String 类型或者日期类型

hive (default)> desc function date_format;

date_format(date/timestamp/string, fmt) - converts a date/timestamp/string to a value of string in the format specified by the date format fmt.

hive (default)> select date_format(to_date('2021-09-01'),'yyyy-MM') from dual;

2021-09

current_date:返回当前日期

hive (default)> select current_date from dual;

2021-09-05

year(date)：返回日期date的年,类型为int如year(‘2019-01-01’)=2019

hive (default)> select year('2021-09-05') from dual;

2021

month(date)：返回日期date的月,类型为int,如month(‘2019-01-01’)=1

hive (default)> select month('2021-09-05') from dual;

day(date): 返回日期date的天,类型为int,如day(‘2019-01-01’)=1

hive (default)> select day('2021-09-05') from dual;

weekofyear(date1)：返回日期date1位于该年第几周

hive (default)> select weekofyear('2021-01-04') from dual;

datediff(date1,date2):返回日期date1与date2相差的天数

hive (default)> select datediff('2021-09-05','2021-09-01') from dual;

date_add(date1,int1):返回日期date1加上int1的日期

hive (default)> select date_add('2021-09-01',3) from dual;

2021-09-04

date_sub(date1,int1):返回日期date1减去int1的日期

hive (default)> select date_sub('2021-09-05',3) from dual;

2021-09-02

months_between(date1,date2):返回date1与date2相差月份

hive (default)> select months_between('2021-05-04','2021-09-04') from dual;

-4.0

add_months(date1,int1):返回date1加上int1个月的日期，int1可为负数

hive (default)> select add_months('2021-05-04',2) from dual;

2021-07-04

last_day(date1):返回date1所在月份最后一天

hive (default)> select last_day('2021-05-04') from dual;

2021-05-31

next_day(date1,day1):返回日期date1的下个星期day1的日期

hive (default)> select next_day('2021-09-04','MO') from dual;

2021-09-06

trunc(date1,string1):返回日期最开始年份或月份。string1可为年(YYYY/YY/YEAR)或月(MONTH/MON/MM)

hive (default)> select trunc('2021-09-02','YYYY') from dual;

2021-01-01

hive (default)> select trunc('2021-09-02','MM') from dual;

2021-09-01

unix_timestamp():返回当前时间的unix时间戳，可指定日期格式

hive (default)> select unix_timestamp('2021-09-01','yyyy-mm-dd') from dual;

1609430940

from_unixtime():返回unix时间戳的日期，可指定格式。如select from_unixtime(unix_timestamp('2019-

03-06','yyyy-mm-dd'),'yyyymmdd')='20190306'

hive (default)> select from_unixtime(unix_timestamp('2021-09-01','yyyy-mm-dd'),'yyyymmdd') from dual;

20210901

条件函数

if(boolean,t1,t2):若布尔值成立，则返回t1,反正返回t2

hive (default)> select if(1>2,100,200) from dual;

200

coalesce(v0,v1,v2):返回参数中的第一个非空值,若所有值均为null,则返回null

hive (default)> select coalesce(null,1,2) from dual;

hive (default)> select coalesce(null,null,2) from dual;

hive (default)> select coalesce(null,null,null) from dual;

NULL

isnull(a):若a为null则返回true，否则返回false

hive (default)> select isnull(null) from dual;

true

hive (default)> select isnull('word') from dual;

false

字符串函数

length(string1):返回字符串长度

hive (default)> select length('world') from dual;

concat(string1,string2):返回拼接string1及string2后的字符串

hive (default)> select concat('hello','world') from dual;

helloworld

concat_ws(sep,string1,string2):返回按指定分隔符拼接的字符串

hive (default)> select concat_ws('-','hello','world') from dual;

hello-world

lower(string1):返回小写字符串，同lcase(string1)。upper()/ucase()：返回大写字符串

hive (default)> select lower('Hive') from dual;

hive

trim(string1):去字符串左右空格，ltrim(string1):去字符串左空格。rtrim(string1):去字符串右空格

hive (default)> select trim(' Hello ') from dual;

Hello

repeat(string1，int1)：返回重复string1字符串int1次后的字符串

hive (default)> select repeat('hello',3) from dual;

hellohellohello

reverse(string1):返回string1反转后的字符串。如reverse(‘abc’)返回’cba’

hive (default)> select reverse('hello') from dual;

olleh

rpad(string1,len1,pad1):以pad1字符右填充string1字符串，至len1长度。如rpad(‘abc’,5,‘1’)返回’abc11’。lpad()：左填充

hive (default)> select rpad('abc',5,'de') from dual;

abcde

select rpad('abc',5,'def') from dual;

hive (default)> select rpad('abc',5,'def') from dual;

abcde

split(string1,pat1):以pat1正则分隔字符串string1,返回数组。如split(‘a,b,c’,’,’)返回[“a”,“b”,“c”]

hive (default)> select split('hello,world,hadoop',',') from dual;

["hello","world","hadoop"]

substr(string1,index1,int1):以index位置起截取int1个字符。如substr(‘abcde’,1,2)返回’ab’

hive (default)> select substr('abcdefg',2,3) from dual;

bcd

egexp_replace 字符串替换函数，将源字符替换成为目标字符

聚合函数

count():统计行数

sum(col1):统计指定列和

avg(col1):统计指定列平均值

min(col1):返回指定列最小值

max(col1):返回指定列最大值

表生成函数

explode (array):返回多行array中对应的元素。explode常用来做行列转换。

hive (default)> select explode(array('A','B','C')) from dual;

col

案例一

hive (default)> select * from advertise;

advertise.pageid advertise.ad_list

front_page ["1","2","3"]

contract_page ["3","4","5"]

hive (default)> SELECT pageid, adid FROM advertise LATERAL VIEW explode(ad_list) adTable AS adid;

pageid adid

front_page 1

front_page 2

front_page 3

contract_page 3

contract_page 4

contract_page 5

SELECT adid, count(1) cnt

FROM advertise LATERAL VIEW explode(ad_list) adTable AS adid

GROUP BY adid;

adid cnt

1 1

2 1

3 2

4 1

5 1

hive (default)> select * from ad_v2;

ad_v2.col1 ad_v2.col2

["1","2"] ["a","b","c"]

["3","4"] ["d","e","f"]

Time taken: 0.057 seconds, Fetched: 2 row(s)

hive (default)> SELECT myCol1, myCol2 FROM ad_v2

> LATERAL VIEW explode(col1) myTable1 AS myCol1

> LATERAL VIEW explode(col2) myTable2 AS myCol2;

mycol1 mycol2

1 a

1 b

1 c

2 a

2 b

2 c

3 d

3 e

3 f

4 d

4 e

4 f

行列转换

https://blog.csdn.net/u010003835/article/details/106632597

Hive 行列转换：

https://blog.csdn.net/u010003835/article/details/106604698#comments_17639212

行转列

方式一：通过 GROUP BY + CASE WHEN + 聚合函数

1.有用户学习情况的 uid, 科目，分数多行数据。已知每个人所学分为英语，数学，语文3个科目。

uid, subject,score 转换为 uid，math_score, chinese_score, english_score

INSERT OVERWRITE TABLE user_subject_score2

SELECT

uid

,MAX(

CASE

WHEN subject = 'math' THEN score

ELSE -1

END

) AS chinese_score

,MAX(

CASE

WHEN subject = 'chinese' THEN score

ELSE -1

END

) AS chinese_score

,MAX(

CASE

WHEN subject = 'english' THEN score

ELSE -1

END

) AS chinese_score

FROM user_subject_score1

GROUP BY uid

;

+--------------------------+---------------------------------+------------------------------------+------------------------------------+

+--------------------------+---------------------------------+------------------------------------+------------------------------------+

| 1 | 80.0 | 76.0 | 80.0 |

| 2 | 88.0 | 88.0 | -1.0 |

| 3 | 66.0 | 30.0 | -1.0 |

+--------------------------+---------------------------------+------------------------------------+------------------------------------+

方式二：通过 GROUP BY + CONCAT_WS + COLLECT_LIST

有用户id ,订单id 求用户的所有订单列表

uid, order_id -> uid，order_ids

我们看下这两个函数的使用方法

CONCAT_WS

+----------------------------------------------------+

| tab_name |

+----------------------------------------------------+

| concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator. |

+----------------------------------------------------+

COLLECT_LIST

+----------------------------------------------------+

| tab_name |

+----------------------------------------------------+

| collect_list(x) - Returns a list of objects with duplicates |

+----------------------------------------------------+

use data_warehouse_test;

CREATE TABLE IF NOT EXISTS user_order (

uid BIGINT

,order_id BIGINT

);

CREATE TABLE IF NOT EXISTS user_orders (

uid BIGINT

,order_ids STRING

);

INSERT OVERWRITE TABLE user_order VALUES

(1, 112)

,(1, 123)

,(2, 234)

,(2, 21)

,(3, 821)

;

列转行

方式一：采用 UNION ALL 的方式

SELECT campaign_id, campaign_name, 'day1_num', day1_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day2_num', day2_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day3_num', day3_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day4_num', day4_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day5_num', day5_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day6_num', day6_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day7_num', day7_num

FROM explode_laterview_org

;

方式一在指标少时方便使用，当行转列的指标比较多时代码量会比较大，维护困难

方式二：使用lateral view和str_to_map

SELECT

campaign_id, campaign_name, type, num

FROM explode_laterview_org

LATERAL VIEW

EXPLODE(

STR_TO_MAP(

CONCAT(

'day1_num=',CAST (day1_num AS STRING),

'&day2_num=',CAST (day2_num AS STRING),

'&day3_num=',CAST (day3_num AS STRING),

'&day4_num=',CAST (day4_num AS STRING),

'&day5_num=',CAST (day5_num AS STRING),

'&day6_num=',CAST (day6_num AS STRING),

'&day7_num=',CAST (day7_num AS STRING)

)

,'&', '=')

) lateral_table AS type, num

;

方式二极大的减少重复的代码量，需要熟悉lateral view和str_to_map，concat等函数使用，concat主要拼接key-value形式存储的字符串，如在一个项目开发中使用一个sql处理出十多个指标，后期需要列存储各指标值时，使用方式二就可以使代码减少数10倍，极大提升代码可读性和可维护性，同时key值可以自定义。

窗口函数

Hive 中的OVER函数解析：https://blog.csdn.net/qq_43081842/article/details/102833424

Hive中的窗口分析函数：https://blog.csdn.net/qq_43081842/article/details/104459384

进阶函数

collect_set()：把同一个分组不同行的数据聚合成为一个集合，用下标可以取某一个

对于非group by字段，可以用Hive的collect_set函数收集这些字段，返回一个数组；

select course,collect_set(area),avg(score) from student group by course;

result: chinese ["sh","bj"] 89

math ["bj"] 90

get_json_object （解析json）

get_json_object(param1,"$.param2")

param1：需要被解析的json字段

param2：数组就用 [0,1,2...] 0,1,2是数组对应的元素,遇若jsonObject直接用 ".key"取出想要获取的value。

[{"name":"Tom","sex":"男","age":"22"},{"name":"Jack","sex":"女"age":"32"}]

1、如果需要获取第一个json对象，hive语句如下

SELECT get_json_object(pjson,"$.[0]") FROM person;

得到第一个json对象

{"name":"Tom","sex":"男","age":"22"}

2、如果要获取第一个json对象中的name属性的值：

SELECT get_json_object(pjson,"$.[0].age") FROM person;

Hivesql中——array_contains()函数的使用

Array_contains()解析

Hive 中的array_contains函数于SQL中的in关键字操作类似，用于判定包含（array_contains）或不包含（！Array_contains）关系。与 in 不同的是array_contains可以用于判定一张表中同一个id的多条记录中的同一字段是否包含指定的一个或多个值。需要注意字段类型保持一致，若不一致则需要进行强制类型转换。

语法：

array_contains(数组，值)，返回布尔类型值。

array_contains(数组，值)

array_contains()函数在where条件中的案例

解决使用某个字段条件过多，查询条件失效

select *

from login

where dt='20130101' and (ver !='3.1' or ver !='3.2' or ver != '4.0' or ver != '5.2');

select * from login where dt='20130101'

and (ver !='3.1' and ver !='3.2' and ver != '4.0' and ver != '5.2');

这个语句的目的是从一个用户登陆表中，查询20130101当天，版本不是3.1,3.2,4.0,5.2的登陆用户记录。但是，在hive0.90执行，最后返回的是20130101当天全部的登陆用户记录，后面的查询条件看起来是被忽略了。这里看起来同一个字段做这种多个!=或者=的条件查询，过多会失效。

于是修改了语句，如下:

select *

from login

where dt=20130101

and !array_contains(split('3.1,3.2,4.0,5.2',','),ver);

array_contians()直接应用案例：

1、使用array_contians()对监控指标打标签，判断报警级别。

select

task_id,

task_code,

task_version,

case

when array_contains(collect_set(tag_id), 599)

or array_contains(collect_set(tag_id), 1225)

then 'L1-级别'

when array_contains(collect_set(tag_id), 600)

or array_contains(collect_set(tag_id), 1226)

then 'L2-级别'

when array_contains(collect_set(tag_id), 601)

or array_contains(collect_set(tag_id), 1240)

then 'L3-级别'

when array_contains(collect_set(tag_id), 1166)

or array_contains(collect_set(tag_id), 1241)

then '鹰眼监控汇总'

else '非鹰眼监控任务'

end as core_tag_name

from

fdm.fdm_buffalo_uniontask_tag_theme_info_new

where

dt = sysdate( - 1)

group by

task_id,

task_code,

task_version

Hive窗口函数

LEAD LEAD(col,n, default_val)：往后第n行数据 col 列名 n 往后第几行默认为1 默认值默认null

LAG LAG(col,n,default_val)：往前第n行数据 col 列名 n 往前第几行默认为1 默认值默认null

FIRST_VALUE 在当前窗口下的第一个值 FIRST_VALUE （col,true/false）如果设置为true，则跳过空值。

LAST_VALUE 在当前窗口下的最后一个值 LAST_VALUE （col,true/false）如果设置为true，则跳过空值。

rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

泶月

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive常用函数小结

方式二极大的减少重复的代码量，需要熟悉lateral view和str_to_map，concat等函数使用，concat主要拼接key-value形式存储的字符串，如在一个项目开发中使用一个sql处理出十多个指标，后期需要列存储各指标值时，使用方式二就可以使代码减少数10倍，极大提升代码可读性和可维护性，同时key值可以自定义。=或者=的条件查询，过多会失效。[{"name":"Tom","sex":"男","age":"22"},{"name":"Jack","sex":"女"age":"32"}]
复制链接

扫一扫