Commonly Used Hive Functions

三只喵爪

Article tags: hive, hadoop, big data

Article link: https://blog.csdn.net/m0_51392732/article/details/126796412

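Functions

These are study notes; corrections are welcome.

An entry marked with hive> is a function specific to Hive; an unmarked entry also works in Oracle.

System functions

List the system functions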

hive> show functions;

Show the usage of a built-in function

hive> desc function upper;

hive> desc function extended upper;

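Date functions [important]

Date arithmetic

hive> datediff('endTime', 'startTime')

Returns the number of days between the two dates, computed as endTime minus startTime.
Note that both dates must be in 'yyyy-MM-dd' format; otherwise the result is NULL.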

hive> select datediff('2019-06-25','2019-06-20'); 
5
hive> select datediff('2019-06-25','2019-06-27');
-2

hive> date_sub('yyyy-MM-dd', n/-m)

Returns the date n days before the start date; a negative argument -m returns the date m days after it.

hive> select date_sub('2019-06-25',4);
2019-06-21
hive> select date_sub('2019-06-25',-2);
2019-06-27

hive> date_add('yyyy-MM-dd', n/-m)

Returns the date n days after the start date; a negative argument -m returns the date m days before it.

hive> select date_add('2019-06-24',5);
2019-06-29
hive> select date_add('2019-06-24',-2);
2019-06-22

Current date and time

hive> select current_date; 
hive> select unix_timestamp(); 
-- current_timestamp is recommended; the parentheses are optional 
hive> select current_timestamp();

Timestamp to date

hive> select from_unixtime(1505456567); 
hive> select from_unixtime(1505456567, 'yyyyMMdd'); 
hive> select from_unixtime(1505456567, 'yyyy-MM-dd HH:mm:ss');

Date to timestamp

hive> select unix_timestamp('2019-09-15 14:23:00');

Day of the month

hive> select dayofmonth(current_date);

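Last day of the month:

hive> select last_day(current_date);

First day of the current month:

hive> select date_sub(current_date, dayofmonth(current_date)-1);

First day of next month:

hive> select add_months(date_sub(current_date, dayofmonth(current_date)-1), 1);

String to date (the string must start with a date in yyyy-MM-dd format; a trailing time part is dropped)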

select to_date('2020-01-01'); 
select to_date('2020-01-01 12:12:12');

Format a date, timestamp, or string as a standard date/time string

hive> select date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss'); 
hive> select date_format(current_date(), 'yyyyMMdd'); 
hive> select date_format('2020-06-01', 'yyyy-MM-dd HH:mm:ss');

Compute each employee's years of service in the emp table

hive> select *, round(datediff(current_date, hiredate)/365,1) workingyears from emp;

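Array functions

1. hive> collect_list and collect_set

Both collect the values of one column within each group into an array. The difference is that collect_list keeps duplicates while collect_set removes them. The elements are displayed separated by commas.

name       sort
Zhang San  Chinese
Zhang San  Chinese
Zhang San  Math
Li Si      Chinese
Li Si      Chinese
Li Si      Math

hive> select name,collect_list(sort) from stu group by name;
Zhang San  ["Chinese","Chinese","Math"]
Li Si  ["Chinese","Chinese","Math"]

hive> select name,collect_set(sort) from stu group by name;
Zhang San  ["Chinese","Math"]
Li Si  ["Chinese","Math"]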

String functions

Convert to lowercase: lower

select lower("HELLO WORLD");

Convert to uppercase: upper

select upper(ename) ename from emp;

String length: length

select length(ename) ename from emp;

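String concatenation: concat / ||

select empno||"+"||ename as idname from emp;
select concat(empno,"+",ename) as idname from emp;
-- both return empno+ename

hive> Join with a separator: concat_ws(separator, [string | array(string)]+)

Joins the arguments into one string using the given separator.

hive> select concat_ws('.', 'www', array('lagou', 'com'));
hive> select concat_ws(" ", ename, job) from emp;

Substring: substr(str, x, y)
Extracts y characters from str starting at position x. Positions start at 1. A positive x counts from the left and a negative x counts from the right. If y is omitted, the substring runs to the end of the string.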

SELECT substr('www.lagou.com', 5); 
--lagou.com
SELECT substr('www.lagou.com', -5); 
--u.com
SELECT substr('www.lagou.com', 5, 5);
--lagou

hive> Split a string: split. The separator is a regular expression, so '.' must be escaped. Splitting on '.' here cuts the string into 3 fields.

hive> select split("www.lagou.com", "\\.");

hive> Rows to one delimited column: concat_ws, collect_set, cast

Table T has the structure a string, b string, c int, with the data
c d 1
c d 2
c d 3
e f 4
e f 5
e f 6
The desired result is
c d 1,2,3
e f 4,5,6

select a,b,
       concat_ws(',', collect_set(cast(c as string)))
from T group by a,b;

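The collect_set function used above does two things. First, it removes duplicate elements within each group. Second, it gathers the third-column values of the rows in the same group into one collection. Combined with concat_ws, these elements are joined into a comma-separated string. Because concat_ws only accepts strings or arrays of strings, c is cast to string first.

lpad(string str, int len, string pad): left-pads str with pad to a length of len (when str is shorter than len)
rpad(string str, int len, string pad): right-pads str with pad to a length of len (when str is shorter than len)
trim(string A): removes the spaces at both ends of the string; spaces in the middle are kept.
The one-sided variants are ltrim(string A) and rtrim(string A).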
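
The padding and trimming functions above can be tried on literals. The following is a minimal sketch; the input strings are made up for illustration, and concat is only used to make the remaining spaces visible.

```sql
-- Illustrative literals only (not from the original notes)
select lpad('7', 3, '0');                    -- 007
select rpad('ab', 5, '*');                   -- ab***
-- When str is already longer than len, the result is cut to len characters
select lpad('hive', 2, '0');                 -- hi
-- Brackets show which spaces survive
select concat('[', trim('  a b  '), ']');    -- [a b]
select concat('[', ltrim('  a b  '), ']');   -- [a b  ]
select concat('[', rtrim('  a b  '), ']');   -- [  a b]
```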

Math functions

-- Rounding (half up): round

select round(314.15926);     --314
select round(314.15926, 2);  --314.16
select round(314.15926, -2); --300

-- Round up: ceil

select ceil(3.1415926); --4

-- Round down: floor

select floor(3.1415926);  --3

-- Other math functions include absolute value, powers, square root, logarithms, trigonometric functions, and so on
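
A few of the other math functions mentioned above, as a short sketch. The argument values are chosen only for illustration; these functions return double for the inputs shown, so the results print with a decimal point.

```sql
-- Illustrative values only
select abs(-3.5);     -- 3.5   absolute value
select pow(2, 10);    -- 1024.0   power
select sqrt(16);      -- 4.0   square root
select ln(1);         -- 0.0   natural logarithm
select log10(100);    -- 2.0   base-10 logarithm
select sin(0);        -- 0.0   sine, argument in radians
```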

Conditional functions [important]

hive> Column-level if: if(boolean testCondition, T valueTrue, T valueFalseOrNull)

-- Classify emp salaries into grades: below 1500, 1500 to below 3000, 3000 and above 
select sal, if (sal<1500, 1, if (sal < 3000, 2, 3)) from emp;

Column-level case when
- Syntax

case   
       when a then b 
      [when c then d]
      [else e]
end

case   a
       when b then c 
      [when d then e]
      [else f]
end  -- in this form a is compared with b (and d) for equality

-- The following two statements are equivalent 
select ename, deptno, 
  case deptno 
       when 10 then 'accounting' 
       when 20 then 'research' 
       when 30 then 'sales' 
       else 'unknown' end deptname 
from emp; 

select ename, deptno, 
  case 
       when deptno=10 then 'accounting' 
       when deptno=20 then 'research' 
       when deptno=30 then 'sales' 
       else 'unknown' end deptname 
from emp;

COALESCE(T v1, T v2, ..., T vn), n≥2

Returns the first non-NULL argument; if all arguments are NULL, returns NULL

select sal, coalesce(comm1,comm2,comm3,0) from emp;
select sal, nvl(nvl(nvl(comm1,comm2),comm3),0) from emp;

nvl(T value, T default_value)

Returns the second argument if the first is NULL; otherwise returns the first

select sal+nvl(comm,0) sumsal from emp;

hive> isnull / isnotnull

hive> select * from emp where isnull(comm); 
hive> select * from emp where isnotnull(comm);

nullif(x, y): returns NULL if x and y are equal; otherwise returns x

SELECT nullif('x', 'y');

UDTF functions [important]

explode, the "exploding" function

-- Splits a complex array or map value held in one row into multiple rows
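
As a minimal sketch of explode, the queries below use inline literals and a one-row inline subquery instead of a real table. The aliases item, k, v, t, x, and tag are hypothetical names picked for illustration.

```sql
-- An array becomes one row per element
select explode(array('a', 'b', 'c')) as item;
-- a
-- b
-- c

-- A map becomes one row per entry, with a key column and a value column
select explode(map('k1', 'v1', 'k2', 'v2')) as (k, v);

-- To keep other columns next to the exploded values, use lateral view
select t.id, x.tag
from (select 1 as id, array('x', 'y') as tags) t
lateral view explode(t.tags) x as tag;
-- 1  x
-- 1  y
```

A bare explode in the select list cannot be mixed with other columns, which is why the last query goes through lateral view.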

Regular expressions

regexp_replace
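
The notes give only the function name here, so the following is a hedged sketch of how regexp_replace(string, pattern, replacement) behaves. The input strings are illustrative; the pattern is a Java regular expression.

```sql
-- Remove every match of the pattern
select regexp_replace('foobar', 'oo|ar', '');       -- fb
-- Strip the dashes from a date string
select regexp_replace('2019-06-25', '-', '');       -- 20190625
-- Replace each run of digits with a single '#'
select regexp_replace('a1b22c333', '[0-9]+', '#');  -- a#b#c#
```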

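Aggregate functions

An aggregate function returns one row per group and skips NULL values (they are not included in the calculation)

MAX() -- maximum

MIN() -- minimum

AVG() -- average

SUM() -- sum

COUNT() -- count

COUNT() works on the rows of the table itself. The other four work on the data inside the table

COUNT(column) skips the rows where that column is NULL

COUNT(*) returns the number of rows in the table (the one after from); rows whose columns are all NULL are counted too

It is equivalent to COUNT(constant), for example COUNT(1): the constant is never NULL, much like a condition that always holds, so all-NULL rows are counted as well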
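
The NULL handling described above can be checked with a small inline data set. This is a sketch only: the three comm values (300, NULL, 500) and the column aliases are made up for illustration.

```sql
-- Three rows, one of which has a NULL comm
select count(*)    as cnt_all,    -- 3: every row counts
       count(1)    as cnt_one,    -- 3: same as count(*)
       count(comm) as cnt_comm,   -- 2: the NULL row is skipped
       sum(comm)   as sum_comm,   -- 800
       avg(comm)   as avg_comm    -- 400.0: divides by 2, not by 3
from (
  select 300 as comm
  union all select cast(null as int) as comm
  union all select 500 as comm
) t;
```

Note that avg divides by the number of non-NULL values, so it differs from sum(comm)/count(*) whenever NULLs are present.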

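Window functions [important]

The over keyword

Using an aggregate function as an analytic function: aggregate_function OVER()

A window function produces a value for every row. If over has no arguments, the window is the entire result set by default.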

select ename, sal, sum(sal)over() AS salsum, 
       CONCAT(ROUND((sal/sum(sal) over())*100,2),'%') AS pct
from emp;

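The partition by clause

Partitions the rows inside the over window by a column and computes the statistic per partition; the window size equals the partition size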

select deptno,ename, sal, 
sum(sal) over(partition by deptno) salsum from emp;

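The order by clause

The order by clause sorts the input rows; used together with an aggregate function it produces a running (cumulative) calculation

-- With order by added, sum adds up from the first row of the partition to the current row (rows tied on sal with the current row are included)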
select ename, sal, deptno, 
sum(sal) over(partition by deptno order by sal) salsum 
from emp;

The window clause

rows between … and …
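
The notes leave the frame bounds elided, so the following sketch only shows a few common ways to fill them in, reusing the emp table from the earlier examples. The aliases running_sum, neighbor_sum, and remaining_sum are hypothetical.

```sql
select ename, sal, deptno,
  -- from the first row of the partition up to the current row
  sum(sal) over (partition by deptno order by sal
                 rows between unbounded preceding and current row) as running_sum,
  -- the previous row, the current row, and the next row
  sum(sal) over (partition by deptno order by sal
                 rows between 1 preceding and 1 following) as neighbor_sum,
  -- from the current row to the last row of the partition
  sum(sal) over (partition by deptno order by sal
                 rows between current row and unbounded following) as remaining_sum
from emp;
```

Unlike the default frame used when only order by is given, a rows frame counts physical rows, so rows tied on sal are not pulled into the running sum together.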

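Ranking functions

All of them start from 1 and produce the rank of each row within its partition.

row_number(): ranks increase sequentially with no repeats, e.g. 1, 2, 3, 4, ...

RANK(): equal values share a rank and leave a gap afterwards, e.g. 1, 2, 2, 4, 5, ...

DENSE_RANK(): equal values share a rank and leave no gap afterwards, e.g. 1, 2, 2, 3, 4, ...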

select cname, sname, score, 
  row_number() over (partition by cname order by score desc) rank1, 
  rank() over (partition by cname order by score desc) rank2, 
  dense_rank() over (partition by cname order by score desc) rank3 
from t2;
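
A common use of these ranking functions is picking the top N rows per group. The following is a sketch under the assumption that t2 has the columns used above; the cutoff of 3 and the aliases rn and ranked are chosen for illustration.

```sql
-- Top 3 scores in each cname; the window function runs in the subquery
-- because its result cannot be filtered in the same query's where clause
select cname, sname, score
from (
  select cname, sname, score,
         row_number() over (partition by cname order by score desc) as rn
  from t2
) ranked
where rn <= 3;
```

Swapping row_number() for rank() or dense_rank() would keep all rows tied at the cutoff instead of exactly three per group.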