七函数

最新推荐文章于 2022-03-13 21:55:28 发布

andyonlines

最新推荐文章于 2022-03-13 21:55:28 发布

阅读量259

点赞数

分类专栏： Hive 学习笔记文章标签： hive 大数据 hadoop 函数

本文链接：https://blog.csdn.net/andyonlines/article/details/104788434

版权

Hive 学习笔记专栏收录该内容

10 篇文章 1 订阅

订阅专栏

文章目录

1 函数的分类
2 函数的查看
3.NVL
- 3.1介绍
- 3.2使用
4.字符串拼接函数
- 4.1 concat
- 4.2 concat_ws
5.行转列函数
- 5.1含义
- 5.2 collect_set
- 5.3 collect_list
6.判断句式
- 6.1 if
- 6.2 case-when
7.列转行
- 7.1 含义
- 7.2 explode
8 练习
- 8.1 练习1
8.2 练习2
- 8.2.1建表
- 8.2.2 思路1
8.3 练习3
- 8.3.1 建表
- 8.3.2 需求
8.4 练习
- 8.4.1数据
- 8.4.2 建表
- 8.4.3 需求
9 窗口函数
- 9.1 定义
- 9.2 语法
- 9.3 window clause
- 9.4 案例
10 常用日期函数
11 自定义UDF函数
- 11.1 编写函数
- 11.2 引入函数

1 函数的分类

UDF(user define function)：用户定义的一进一出的函数。

UDTF(user define table function): 用户定义的表生成函数！一进多出！

UDAF(user define aggregation function)：用户定义的聚集函数！多进一出！

函数根据来源分为系统函数和用户自定义的函数！

2 函数的查看

注意：用户自定义的函数是以库为单位！在创建这个函数时，必须在要使用的库进行创建！否则需要用库名.函数名使用函数！

查看所有的函数：

show functions

查看某个函数的介绍：

desc function 函数名

查看某个函数的详细介绍：

desc function extended 函数名

3.NVL

3.1介绍

nvl(value,default_value) - Returns default value if value is null else returns value
当value是null值时，返回default_value,否则返回value

3.2使用

一般用在计算前对null的处理上！

求有奖金人的平均奖金： avg聚集函数默认忽略Null

select avg(comm) from emp;

求所有人的平均奖金：提前处理null值！

select avg(nvl(comm,0)) from emp;

4.字符串拼接函数

4.1 concat

描述：

concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN or concat(bin1, bin2, ... binN) - returns the concatenation of bytes in binary data  bin1, bin2, ... binN
Returns NULL if any argument is NULL

concat可以完成多个字符串的拼接！

一旦拼接的字符串中有一个NULL值，返回值就为NULL!

在concat拼接前，一定要先保证数据没有为NULL的！

4.2 concat_ws

描述：

concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator.

返回多个字符串或字符串数组的拼接结果，拼接时，每个字符串会使用separator作为分割

不受NULL值影响， NULL值会被忽略！

5.行转列函数

5.1含义

行转列： 1列N行转为 1列1行

通常是聚集函数！

5.2 collect_set

描述：

collect_set(x) - Returns a set of objects with duplicate elements eliminated

返回一组去重后的数据组成的set集合！

5.3 collect_list

描述：

collect_list(x) - Returns a list of objects with duplicates

返回一组数据组成的list集合！不去重！

6.判断句式

6.1 if

类似三元运算符！做单层判断！

语法：

if('条件判断','为true时','为false时')

示例：

select empno,ename,sal,if(sal<1500,'Poor Gay','Rich Gay') from emp;

6.2 case-when

类似swith-case！做多层判断！

语法：

case 列名
	when 值1 then  值2
	when 值3 then  值4
	when 值5 then  值6
	...
	else 值7
end

示例：

select empno,ename,job,case job when 'CLERK' then 'a' when 'SALESMAN' then 'b' else 'c' end from emp;

7.列转行

7.1 含义

列传行： 1列1行转为 N列N行

7.2 explode

描述：

explode(a) - separates the elements of array a into multiple rows, or the elements of a map into multiple rows and columns

explode使用的对象是array或map！

explode函数可以将一个array中的元素分割为N行1列！

select explode(friends) from default.t1 where name='songsong';

explode函数可以将一个 map中的元素(entry)分割为N行2列!

select explode(children) from default.t1 where name='songsong';

注意： explode函数在查询时不能写在select之外，也不能嵌套在表达式中！

在select中写了explode函数，select中只能有explode函数不能有别的表达式!

8 练习

8.1 练习1

8.1.1 建表

数据：

悟空    A       男
大海    A       男
宋宋    B       男
凤姐    A       女
婷姐    B       女
婷婷    B       女

需求：求出不同部门男女各多少人。结果如下

A     2       1
B     1       2

建表：

create table emp_sex(
name string, 
dept_id string, 
sex string) 
row format delimited fields terminated by "\t";

加载数据：

load data local inpath '/home/andy/hivedatas/test1' into table emp_sex;

8.2 思路1

关键词：不同部门，男女，各多少人

思路1：①按部门分组,group by

②按男女性别过滤,where

③求每个部门中记录的条数，count(*)

sql：

select t1.dept_id,male_count,female_count
from
(select  dept_id,count(*) male_count
from emp_sex
where sex='男'
group by dept_id)t1
join
(select  dept_id,count(*) female_count
from emp_sex
where sex='女'
group by dept_id)t2
on t1.dept_id=t2.dept_id

此条SQL会启动3个Job，因为嵌套了子查询！

在写hivesql时，尽量避免子查询！

8.3 思路2

关键词：不同部门，男女，各多少人

思路2：用sum来累加统计！如果统计的是男人，只要这条记录是男性，记1，否则记0！

对以上值进行累加求和可以求出总的男性数量！

根据部门进行分组即可

SQL：

select dept_id,sum(if(sex='男',1,0)) male_count,
sum(if(sex='女',1,0)) female_count
from emp_sex
group by dept_id

8.2 练习2

8.2.1建表

create table person_info(
name string, 
constellation string, 
blood_type string) 
row format delimited fields terminated by "\t";

加载数据：

load data local inpath '/home/andy/hivedatas/test2' into table person_info;

需求：把星座和血型一样的人归类到一起，结果如下：

射手座,A            大海|凤姐
白羊座,A            孙悟空|猪八戒
白羊座,B            宋宋

8.2.2 思路1

关键词：星座，血型一致的人归类到一起！

思路： ①按照星座，血型分组

②结果有两列

第一列是星座和血型，使用,拼接后的结果 concat()

第二列是name列多行数据转一行数据的拼接的结果，collect_list和concat_ws

sql：

select concat(constellation,',',blood_type),concat_ws('|',collect_list(name))
from  person_info
group by constellation,blood_type

在group by的语句中，select后面的表达式可以写什么？

select只能写group by 后面的字段和聚集函数中的字段

8.3 练习3

8.3.1 建表

create table movie_info(
    movie string, 
    category array<string>) 
row format delimited fields terminated by "\t"
collection items terminated by ",";

加载数据：

load data local inpath '/home/andy/hivedatas/test3' into table movie_info;

8.3.2 需求

将电影名称和电影的类型展开！

错误：

select t2.movie,t1.c1
from movie_info t2
left join 
(select explode(category)  c1 from movie_info) t1

在Join时，由于没有关联字段，造成笛卡尔集，并不能达到我们预期的效果！

需求：炸裂的临时结果集中的每一行，可以和炸裂之前的所在行的其他字段进行join!

hive 提供了支持此需求的实现，称为LATERAL VIEW(侧写)

语法：

select 临时列名，其他字段
from 表名
-- 将 UDTF函数执行的返回的结果集临时用 临时表名代替，结果集返回的每一列，按照顺序分别以临时--列名代替
lateral view UDTF() 临时表名 as 临时列名,...

示例：

select  movie,col1
from movie_info
lateral view explode(category) tmp1 as col1

8.4 练习

8.4.1数据

jack|tom|jerry	阳光男孩|肌肉男孩|直男	晒太阳|健身|说多喝热水
marry|nancy	阳光女孩|肌肉女孩|腐女	晒太阳|健身|看有内涵的段子

8.4.2 建表

create table person_info2(names array<string>,tags array<string>,hobbys array<string>)
row format delimited fields terminated by '\t'
collection items terminated by '|'

加载数据：

load data local inpath '/home/andy/hivedatas/test4' into table person_info2;

8.4.3 需求

将names,hobbys,tags拆分，组合！

结果如下：

jack	阳光男孩	晒太阳
jack	阳光男孩	健身
jack	阳光男孩	说多喝热水
jack	肌肉男孩	晒太阳
jack	肌肉男孩	健身
jack	肌肉男孩	说多喝热水
......

sql:

select name,tag,hobby
from person_info2
lateral view explode(names) tmp1 as name
lateral view explode(hobbys) tmp1 as hobby
lateral view explode(tags) tmp1 as tag

9 窗口函数

9.1 定义

在mysql5.5,5.6版本，不支持窗口函数！在oracle和sqlserver中支持窗口函数！

hive支持窗口函数！

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

窗口函数=函数+窗口

函数：要运行的函数！只有以下函数称为窗口函数！

开窗函数：

LEAD: 用来返回当前行以下行的数据！

用法： LEAD (列名 [,offset] [,default])

offset是偏移量，默认为1，

default：取不到值就使用默认值代替

LAG: 用来返回当前行以上行的数据！

用法： LAG (列名 [,offset] [,default])

offset是偏移量，默认为1，

default：取不到值就使用默认值代替

FIRST_VALUE: 返回指定列的第一个值

用法： FIRST_VALUE(列名，[false是否忽略null值])

LAST_VALUE:返回指定列的最后一个值

用法： LAST_VALUE(列名，[false是否忽略null值])

标准的聚集函数：MAX,MIN,AVG,COUNT,SUM

分析排名函数：

RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

窗口：函数在运行时，计算的结果集的范围！

窗口函数指以上特定函数在运算时，可以自定义一个窗口（计算的范围）

9.2 语法

函数 over( [partition by 字段1,字段2] [order by 字段 asc|desc] [window clause] )

partition by : 根据某些字段对整个数据集进行分区！

order by: 对分区或整个数据集中的数据按照某个字段进行排序！

注意：如果对数据集进行了分区，那么窗口的范围不能超过分区的范围！

既窗口必须在区内指定！

9.3 window clause

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

定义起始行和终止行的范围即可！

两个特殊情况：

①When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

当over()既没有写order by，也没有写window 子句，此时窗口默认等同于上无边界到下无边界（整个数据集）

②When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

当over()中，指定了order by 但是没有指定 window 子句，此时窗口默认等同于上无边界到当前行

支持Over()，但是不支持在over中定义windows子句的函数：

The OVER clause supports the following functions, but it does not support a window with them (see HIVE-4797):

Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.

Lead and Lag functions

9.4 案例

9.4.1 建表

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

加载数据：

load data local inpath '/home/andy/hivedatas/test5' into table business;

需求：

(1）查询在2017年4月份购买过的顾客及总人数
(2）查询顾客的购买明细及月购买总额
(3）查询顾客的购买明细要将cost按照日期进行累加
(4）查询顾客的购买明细及顾客上次的购买时间
(5) 查询顾客的购买明细及顾客下次的购买时间
(6) 查询顾客的购买明细及顾客本月第一次购买的时间
(7) 查询顾客的购买明细及顾客本月最后一次购买的时间
(8) 查询顾客的购买明细及顾客最近三次cost花费
(9) 查询前20%时间的订单信息

9.4.2需求

9.4.2.1 查询在2017年4月份购买过的顾客及总人数

思路： ①where 过滤 2017年4月份的数据

②求顾客总人数

总人数不等于总人次

查询在2017年4月份购买过的总人数

select '2017-04',count(*)
from
(select name
from business
where year(orderdate)=2017 and month(orderdate)=4
group by name)tmp

select *
from business
where substring(orderdate,1,7)='2017-04'

错误：

select count(*)
from business
where year(orderdate)=2017 and month(orderdate)=4
group by name

聚集函数是分组后在组内统计，聚集函数默认工作的范围(窗口)是组内！

窗口函数对窗口中的每一条记录都进行计算！

使用窗口函数：需要指定count()运行的窗口大小为整个结果集而不是组内！

查询在2017年4月份购买过的顾客及总人数

select name,count(*) over(ROWS BETWEEN UNBOUNDED  PRECEDING AND UNBOUNDED FOLLOWING)
from business
where year(orderdate)=2017 and month(orderdate)=4
group by name

统计明细，为每条明细都附加一个总的结果，一般使用窗口函数！

如果只要结果不要明细，没有必要使用窗口函数！

以上窗口函数还可以简写：

select name,count(*) over()
from business
where year(orderdate)=2017 and month(orderdate)=4
group by name

9.4.2.2 查询顾客的购买明细及月购买总额

按照顾客和月份分区，在区内计算所有的购买金额的总和！

 select *,sum(cost) over(partition by name,substring(orderdate,1,7) ) total_month_cost
 from business

9.4.2.3 查询顾客的购买明细要将cost按照日期进行累加

 select *,sum(cost) over(partition by name order by orderdate ) total_cost
 from business

9.4.2.4 查询顾客的购买明细及顾客上次的购买时间

lag不支持在over()中定义window子句！

 select *,lag(orderdate,1,'无') over(partition by name order by orderdate ) total_cost
 from business

9.4.2.5 查询顾客的购买明细及顾客下次的购买时间

 select *,lead(orderdate,1,'无') over(partition by name order by orderdate ) total_cost
 from business

9.4.2.6 查询顾客的购买明细及顾客本月第一次购买的时间

让窗口可以取到第一个值

 select *,first_value(orderdate) over(partition by name,substring(orderdate,1,7) order by orderdate ) first_date
 from business

9.4.2.7查询顾客的购买明细及顾客本月最后一次购买的时间

让窗口可以取到最后一个值

 select *,last_value(orderdate) over(partition by name,substring(orderdate,1,7) order by orderdate 
ROWS BETWEEN UNBOUNDED  PRECEDING AND UNBOUNDED FOLLOWING ) last_date
 from business

9.4.2.8查询顾客的购买明细及顾客最近三次cost花费

10 常用日期函数

hive默认解析的日期必须是： 2019-11-24 08:09:10

unix_timestamp:返回当前或指定时间的时间戳	
from_unixtime：将时间戳转为日期格式
current_date：当前日期
current_timestamp：当前的日期加时间
*to_date：抽取日期部分
year：获取年
month：获取月
day：获取日
hour：获取时
minute：获取分
second：获取秒
weekofyear：当前时间是一年中的第几周
dayofmonth：当前时间是一个月中的第几天
* months_between： 两个日期间的月份，前-后
* add_months：日期加减月
* datediff：两个日期相差的天数，前-后
* date_add：日期加天数
* date_sub：日期减天数
* last_day：日期的当月的最后一天

*常用取整函数
round： 四舍五入
ceil：  向上取整
floor： 向下取整

常用字符串操作函数
upper： 转大写
lower： 转小写
length： 长度
* trim：  前后去空格
lpad： 向左补齐，到指定长度
rpad：  向右补齐，到指定长度
* regexp_replace： SELECT regexp_replace('100-200', '(\d+)', 'num')='num-num
	使用正则表达式匹配目标字符串，匹配成功后替换！

集合操作
size： 集合（map和list）中元素的个数
map_keys： 返回map中的key
map_values: 返回map中的value
* array_contains: 判断array中是否包含某个元素
sort_array： 将array中的元素排序

11 自定义UDF函数

11.1 编写函数

①引入依赖

 <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
  </dependency>

②自定义类，继承UDF类

提供多个evaluate()方法，返回不能是void类型，必须有返回值！可以返回null！

public class MyUDF  extends UDF {

    public String evaluate(String str){

        return "hello "+str;
    }
}

11.2 引入函数

③打包，上传到$HIVE_HOME/auxlib

④重启hive，之后创建函数

create function 函数名 as '函数的全类名'

用户自定义的函数有库的范围，在哪个库下创建，就默认在这个库下使用！否则需要使用库名.函数名调用！

andyonlines

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
七函数

文章目录1 函数的分类2 函数的查看3.NVL3.1介绍3.2使用4.字符串拼接函数4.1 concat4.2 concat_ws5.行转列函数5.1含义5.2 collect_set5.3 collect_list6.判断句式6.1 if6.2 case-when7.列转行7.1 含义7.2 explode8 练习8.1 练习18.1.1 建表8.2 思路18.3 思路28.2 练习28.2.1...
复制链接

扫一扫