A Summary of Common Hive Functions

Inequality in a WHERE clause can be written as != or <>. Both drop rows where the value is NULL, so you sometimes need to add those rows back explicitly. For example:

where rate != 'D' or rate is null;

or

where rate <> 'D' or rate is null;
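
A quick sketch of why the extra predicate matters, assuming a hypothetical table t(id, rate) in which some rows have a NULL rate:

select * from t where rate != 'D';                     -- rows with rate IS NULL are silently dropped

select * from t where rate != 'D' or rate is null;     -- keeps the NULL rows as well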

The difference between WHERE and HAVING

WHERE runs before aggregate functions; HAVING runs after them.

WHERE filters the data before aggregation; HAVING filters the results after aggregation.

Relationship with GROUP BY

WHERE goes before GROUP BY: SQL evaluates the WHERE clause before grouping.

HAVING goes after GROUP BY: SQL evaluates the HAVING clause after grouping, and only groups that satisfy the HAVING condition are returned.

Order of GROUP BY, ORDER BY, HAVING and WHERE

The order is:

where … group by … having … order by (the GROUP BY and HAVING here can be written in either order, but putting HAVING after GROUP BY is more efficient!), i.e. WHERE comes first and ORDER BY last.
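
A minimal sketch that puts the clauses together, borrowing the student(course, area, score) table used later in this post:

select course, avg(score) as avg_score
from student
where area = 'bj'          -- rows filtered before grouping
group by course
having avg(score) >= 60    -- groups filtered after aggregation
order by avg_score desc;   -- applied last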

Math functions

round(double d, int n): returns d rounded to n decimal places

hive (default)> select round(3.1415,2) round from dual;

3.14

floor(double d): returns the largest integer not greater than d

hive (default)> select floor(3.1415) round from dual;

3

ceil(double d): returns the smallest integer not less than d

hive (default)> select ceil(3.1415) round from dual;

4

rand(int seed): returns a random number; seed is the random seed

hive (default)> select rand(10) from dual;

0.7304302967434272

hive (default)> select rand(1) from dual;

0.7308781907032909

hive (default)> select rand(2) from dual;

0.7311469360199058

bin(int d): returns the binary representation of d as a string

hive (default)> select bin(8) from dual;

1000

Date functions

to_date(string timestamp): returns the date part of a timestamp string

hive (default)>  select to_date('1970-01-01 00:00:00') from dual;

1970-01-01

date_format, the date-formatting function: if the first argument is a string, only values in yyyy-MM-dd form are recognized; the first argument must be a string, date, or timestamp

hive (default)> desc function date_format;

date_format(date/timestamp/string, fmt) - converts a date/timestamp/string to a value of string in the format specified by the date format fmt.

hive (default)> select date_format(to_date('2021-09-01'),'yyyy-MM') from dual;

2021-09

current_date: returns the current date

hive (default)> select current_date from dual;

2021-09-05

year(date): returns the year of date as an int, e.g. year('2019-01-01') = 2019

hive (default)> select year('2021-09-05') from dual;

2021

month(date): returns the month of date as an int, e.g. month('2019-01-01') = 1

hive (default)> select month('2021-09-05') from dual;

9

day(date): returns the day of date as an int, e.g. day('2019-01-01') = 1

hive (default)> select day('2021-09-05') from dual;

5

weekofyear(date1): returns the week of the year that date1 falls in

hive (default)> select weekofyear('2021-01-04') from dual;

1

datediff(date1, date2): returns the number of days from date2 to date1

hive (default)> select datediff('2021-09-05','2021-09-01') from dual;

4

date_add(date1, int1): returns date1 plus int1 days

hive (default)> select date_add('2021-09-01',3) from dual;

2021-09-04

date_sub(date1, int1): returns date1 minus int1 days

hive (default)> select date_sub('2021-09-05',3) from dual;

2021-09-02

months_between(date1, date2): returns the number of months between date1 and date2

hive (default)> select months_between('2021-05-04','2021-09-04') from dual;

-4.0

add_months(date1, int1): returns date1 plus int1 months; int1 may be negative

hive (default)> select add_months('2021-05-04',2) from dual;

2021-07-04

last_day(date1): returns the last day of the month that date1 falls in

hive (default)> select last_day('2021-05-04') from dual;

2021-05-31

next_day(date1, day1): returns the date of the next weekday day1 after date1

hive (default)> select next_day('2021-09-04','MO') from dual;

2021-09-06

trunc(date1, string1): truncates date1 to the start of its year or month; string1 can be a year unit (YYYY/YY/YEAR) or a month unit (MONTH/MON/MM)

hive (default)> select trunc('2021-09-02','YYYY') from dual;

2021-01-01

hive (default)> select trunc('2021-09-02','MM') from dual;

2021-09-01

unix_timestamp(): returns the Unix timestamp of the current time; a date string and a format can also be passed in

hive (default)> select unix_timestamp('2021-09-01','yyyy-mm-dd') from dual;

1609430940

from_unixtime(): converts a Unix timestamp to a date string, optionally in a given format, e.g. select from_unixtime(unix_timestamp('2019-03-06','yyyy-mm-dd'),'yyyymmdd') = '20190306'

hive (default)> select from_unixtime(unix_timestamp('2021-09-01','yyyy-mm-dd'),'yyyymmdd') from dual;

20210901
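
Note that in these date patterns MM means month while mm means minutes, so for a date string like '2021-09-01' the pattern you normally want is 'yyyy-MM-dd'; the lowercase examples above still round-trip, but the intermediate timestamp is parsed with 09 as the minutes rather than the month. A corrected sketch, with the expected output:

hive (default)> select from_unixtime(unix_timestamp('2021-09-01','yyyy-MM-dd'),'yyyyMMdd') from dual;

20210901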

Conditional functions

if(boolean, t1, t2): returns t1 if the condition is true, otherwise t2

hive (default)> select if(1>2,100,200) from dual;

200

coalesce(v0, v1, v2): returns the first non-null argument; if all arguments are null, returns null

hive (default)> select coalesce(null,1,2) from dual;

1

hive (default)> select coalesce(null,null,2) from dual;

2

hive (default)> select coalesce(null,null,null) from dual;

NULL

isnull(a): returns true if a is null, otherwise false

hive (default)> select isnull(null) from dual;

true

hive (default)> select isnull('word') from dual;

false

String functions

length(string1): returns the length of the string

hive (default)> select length('world') from dual;

5

concat(string1, string2): returns string1 concatenated with string2

hive (default)> select concat('hello','world') from dual;

helloworld

concat_ws(sep, string1, string2): returns the strings joined with the given separator

hive (default)> select concat_ws('-','hello','world') from dual;

hello-world

lower(string1): returns the lowercased string, same as lcase(string1); upper()/ucase() return the uppercased string

hive (default)> select lower('Hive') from dual;

hive

trim(string1): strips spaces from both ends; ltrim(string1) strips leading spaces; rtrim(string1) strips trailing spaces

hive (default)> select trim(' Hello ') from dual;

Hello

repeat(string1, int1): returns string1 repeated int1 times

hive (default)> select repeat('hello',3) from dual;

hellohellohello

reverse(string1): returns string1 reversed, e.g. reverse('abc') returns 'cba'

hive (default)> select reverse('hello') from dual;

olleh

rpad(string1, len1, pad1): right-pads string1 with pad1 up to length len1, e.g. rpad('abc', 5, '1') returns 'abc11'; lpad() pads on the left

hive (default)> select rpad('abc',5,'de') from dual;

abcde

hive (default)> select rpad('abc',5,'def') from dual;

abcde

split(string1, pat1): splits string1 by the regular expression pat1 and returns an array, e.g. split('a,b,c', ',') returns ["a","b","c"]

hive (default)> select split('hello,world,hadoop',',') from dual;

["hello","world","hadoop"]

substr(string1, index1, int1): takes int1 characters starting at position index1, e.g. substr('abcde', 1, 2) returns 'ab'

hive (default)> select substr('abcdefg',2,3) from dual;

bcd

regexp_replace, the string-replacement function: replaces occurrences of a source pattern with a target string
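
A minimal example of regexp_replace(source, pattern, replacement); the output shown is what the call is expected to return:

hive (default)> select regexp_replace('2021-09-05','-','/') from dual;

2021/09/05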

Aggregate functions

count(): counts rows

sum(col1): sum of the given column

avg(col1): average of the given column

min(col1): minimum of the given column

max(col1): maximum of the given column
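
A short sketch that combines these, assuming the student(course, area, score) table that appears in the collect_set section later in this post:

select course,
       count(*)   as cnt,
       sum(score) as total_score,
       avg(score) as avg_score,
       min(score) as min_score,
       max(score) as max_score
from student
group by course;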

Table-generating functions

explode(array): returns one row per element of the array. explode is commonly used for row/column transposition.

hive (default)> select explode(array('A','B','C')) from dual;

OK

col

A

B

C

Example 1

hive (default)> select * from advertise;

OK

advertise.pageid advertise.ad_list

front_page ["1","2","3"]

contract_page ["3","4","5"]

hive (default)> SELECT pageid, adid  FROM advertise LATERAL VIEW explode(ad_list) adTable AS adid;

OK

pageid adid

front_page 1

front_page 2

front_page 3

contract_page 3

contract_page 4

contract_page 5

SELECT adid, count(1) cnt

FROM advertise LATERAL VIEW explode(ad_list) adTable AS adid

GROUP BY adid;

adid cnt

1      1

2      1

3      2

4      1

5      1

hive (default)> select * from ad_v2;

OK

ad_v2.col1 ad_v2.col2

["1","2"] ["a","b","c"]

["3","4"] ["d","e","f"]

Time taken: 0.057 seconds, Fetched: 2 row(s)

hive (default)> SELECT myCol1, myCol2 FROM ad_v2

              > LATERAL VIEW explode(col1) myTable1 AS myCol1

              > LATERAL VIEW explode(col2) myTable2 AS myCol2;

OK

mycol1 mycol2

1 a

1 b

1 c

2 a

2 b

2 c

3 d

3 e

3 f

4 d

4 e

4 f

Row/column transposition

https://blog.csdn.net/u010003835/article/details/106632597

Hive row/column transposition:

 https://blog.csdn.net/u010003835/article/details/106604698#comments_17639212

Rows to columns

Method 1: GROUP BY + CASE WHEN + aggregate functions

1. We have multiple rows of per-user study data: uid, subject, score. Each user studies three subjects: English, math, and Chinese.

Convert uid, subject, score into uid, math_score, chinese_score, english_score.

INSERT OVERWRITE TABLE user_subject_score2
SELECT
    uid
    ,MAX(
        CASE
            WHEN subject = 'math' THEN score
            ELSE -1
        END
    ) AS math_score
    ,MAX(
        CASE
            WHEN subject = 'chinese' THEN score
            ELSE -1
        END
    ) AS chinese_score
    ,MAX(
        CASE
            WHEN subject = 'english' THEN score
            ELSE -1
        END
    ) AS english_score
FROM user_subject_score1
GROUP BY uid
;

+--------------------------+---------------------------------+------------------------------------+------------------------------------+

| user_subject_score2.uid  | user_subject_score2.math_score  | user_subject_score2.chinese_score  | user_subject_score2.english_score  |

+--------------------------+---------------------------------+------------------------------------+------------------------------------+

| 1                        | 80.0                            | 76.0                               | 80.0                               |

| 2                        | 88.0                            | 88.0                               | -1.0                               |

| 3                        | 66.0                            | 30.0                               | -1.0                               |

+--------------------------+---------------------------------+------------------------------------+------------------------------------+

Method 2: GROUP BY + CONCAT_WS + COLLECT_LIST

Given rows of user id and order id, produce the list of all orders for each user:

uid, order_id  ->  uid, order_ids

Let's first look at how these two functions are used; the aggregation query itself is sketched after the table definitions below.

CONCAT_WS

+----------------------------------------------------+

|                      tab_name                      |

+----------------------------------------------------+

| concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator. |

+----------------------------------------------------+

COLLECT_LIST

+----------------------------------------------------+

|                      tab_name                      |

+----------------------------------------------------+

| collect_list(x) - Returns a list of objects with duplicates |

+----------------------------------------------------+

use data_warehouse_test;

CREATE TABLE IF NOT EXISTS user_order (

uid BIGINT

,order_id BIGINT

);

CREATE TABLE IF NOT EXISTS user_orders (

uid BIGINT

,order_ids STRING

);

INSERT OVERWRITE TABLE user_order VALUES

(1, 112)

,(1, 123)

,(2, 234)

,(2, 21)

,(3, 821)

;
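
With the two tables in place, the aggregation itself is a short sketch along these lines (COLLECT_LIST returns an array and CONCAT_WS needs strings, hence the CAST):

INSERT OVERWRITE TABLE user_orders
SELECT
    uid
    ,CONCAT_WS(',', COLLECT_LIST(CAST(order_id AS STRING))) AS order_ids
FROM user_order
GROUP BY uid;

For the sample rows above this should produce (1, '112,123'), (2, '234,21') and (3, '821'), though the order of the ids within each list is not guaranteed.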

Columns to rows

Method 1: UNION ALL

SELECT campaign_id, campaign_name, 'day1_num', day1_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day2_num', day2_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day3_num', day3_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day4_num', day4_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day5_num', day5_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day6_num', day6_num

FROM explode_laterview_org

UNION ALL

SELECT campaign_id, campaign_name, 'day7_num', day7_num

FROM explode_laterview_org

;

Method 1 is convenient when there are only a few metrics, but when there are many metric columns to transpose the code grows large and becomes hard to maintain.

Method 2: LATERAL VIEW and str_to_map

SELECT

    campaign_id, campaign_name, type, num

FROM explode_laterview_org

LATERAL VIEW

    EXPLODE(

            STR_TO_MAP(

                    CONCAT(

                        'day1_num=',CAST (day1_num AS STRING),

                        '&day2_num=',CAST (day2_num AS STRING),

                        '&day3_num=',CAST (day3_num AS STRING),

                        '&day4_num=',CAST (day4_num AS STRING),

                        '&day5_num=',CAST (day5_num AS STRING),

                        '&day6_num=',CAST (day6_num AS STRING),

                        '&day7_num=',CAST (day7_num AS STRING)

                    )

                ,'&', '=')

        ) lateral_table AS type, num

;

Method 2 greatly reduces duplicated code, but it requires familiarity with LATERAL VIEW, str_to_map, concat and related functions; concat is used here to build a string of key=value pairs. For example, in one project a single SQL statement produced more than ten metrics, and when those metric columns later had to be turned back into rows, Method 2 cut the code size by roughly a factor of ten and made it far more readable and maintainable; the key names can also be customized.

Window functions

Hive's OVER clause explained: https://blog.csdn.net/qq_43081842/article/details/102833424

Window/analytic functions in Hive: https://blog.csdn.net/qq_43081842/article/details/104459384

Advanced functions

collect_set(): aggregates the values from different rows of the same group into a set (an array); an individual element can be accessed with a subscript.

For columns that are not in the GROUP BY, Hive's collect_set function can collect their values and return an array;

select course,collect_set(area),avg(score) from student group by course;

result: chinese  ["sh","bj"]  89

  math     ["bj"] 90
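
Since collect_set returns an array, a single element can be picked out by index, e.g.:

select course, collect_set(area)[0] as one_area, avg(score) as avg_score from student group by course;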

get_json_object (parsing JSON)

get_json_object(param1,"$.param2")

param1: the JSON column to be parsed

param2: for an array, use [0,1,2...], where 0,1,2 are the element indexes; when you reach a JSON object, use ".key" directly to pull out the value you want.

[{"name":"Tom","sex":"男","age":"22"},{"name":"Jack","sex":"女"age":"32"}]

1. To get the first JSON object, the Hive statement is:

SELECT get_json_object(pjson,"$.[0]") FROM person;

which returns the first JSON object:

{"name":"Tom","sex":"男","age":"22"}

2. To get the value of the age attribute of the first JSON object:

SELECT get_json_object(pjson,"$.[0].age") FROM person;
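
For a plain JSON object (not wrapped in an array), the ".key" form is used directly. A small sketch, with the expected output shown under the query:

hive (default)> select get_json_object('{"name":"Tom","sex":"男","age":"22"}','$.name') from dual;

Tom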

Using the array_contains() function in Hive SQL

array_contains() explained

Hive's array_contains function works much like the SQL IN keyword; it is used to express containment (array_contains) or non-containment (!array_contains). Unlike IN, array_contains can test whether a field, across multiple records sharing the same id in a table, contains one or more given values. Note that the types must match; if they do not, an explicit cast is needed.

Syntax:

array_contains(array, value): returns a boolean value.

An example of array_contains() in a WHERE clause

It addresses the case where stacking many conditions on the same field makes the filter stop working as intended.

select *

from login

where dt='20130101'  and (ver !='3.1' or ver !='3.2'  or ver != '4.0'  or ver != '5.2');

select * from login where dt='20130101'

and (ver !='3.1' and ver !='3.2' and ver != '4.0' and ver != '5.2');

The goal of these statements is to query, from a user login table, the 20130101 login records whose version is not 3.1, 3.2, 4.0 or 5.2. Run on Hive 0.9.0, however, the result was every login record for 20130101, as if the later conditions had been ignored; chaining many != or = conditions on the same field like this seemed to stop filtering. (Note that the OR form above is always true for any non-null ver, since ver cannot equal all four values at once, which by itself explains the unfiltered result.)

So the statement was rewritten as follows:

select *

from login

where dt=20130101

and !array_contains(split('3.1,3.2,4.0,5.2',','),ver);

A direct application of array_contains():

1. Use array_contains() to tag monitoring metrics and decide the alert level.

select

    task_id,

    task_code,

    task_version,

    case

        when array_contains(collect_set(tag_id), 599)

            or array_contains(collect_set(tag_id), 1225)

        then 'L1-级别'

        when array_contains(collect_set(tag_id), 600)

            or array_contains(collect_set(tag_id), 1226)

        then 'L2-级别'

        when array_contains(collect_set(tag_id), 601)

            or array_contains(collect_set(tag_id), 1240)

        then 'L3-级别'

        when array_contains(collect_set(tag_id), 1166)

            or array_contains(collect_set(tag_id), 1241)

        then '鹰眼监控汇总'

        else '非鹰眼监控任务'

    end as core_tag_name

from

    fdm.fdm_buffalo_uniontask_tag_theme_info_new

where

    dt = sysdate( - 1)

group by

    task_id,

    task_code,

task_version

Hive window functions

LEAD(col, n, default_val): the value of col from the n-th row after the current row; col is the column, n is how many rows ahead (default 1), and default_val (default NULL) is returned when no such row exists.

LAG(col, n, default_val): the value of col from the n-th row before the current row; col is the column, n is how many rows back (default 1), and default_val (default NULL) is returned when no such row exists.

FIRST_VALUE(col, true/false): the first value of col in the current window; if the second argument is true, NULL values are skipped.

LAST_VALUE(col, true/false): the last value of col in the current window; if the second argument is true, NULL values are skipped.

rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING: a frame clause that makes the window span the entire partition.
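
A sketch that pulls these together, assuming a hypothetical orders(uid, order_date, amount) table. Note the explicit frame on LAST_VALUE: without ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, the default frame ends at the current row, so LAST_VALUE would simply return the current row's value.

select
    uid,
    order_date,
    amount,
    lag(amount, 1, 0)  over (partition by uid order by order_date) as prev_amount,
    lead(amount, 1, 0) over (partition by uid order by order_date) as next_amount,
    first_value(amount) over (partition by uid order by order_date) as first_amount,
    last_value(amount) over (
        partition by uid order by order_date
        rows between unbounded preceding and unbounded following
    ) as last_amount
from orders;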
