Hive Functions Study Notes

3.1 Rows to Columns

Rows-to-columns here means collapsing multiple rows into a single column: multiple rows in, one row out.

[Figure: rows to columns (img/18_hangzhuanlie.png)]

Functions involved

concat  string concatenation; no separator can be specified
Syntax
    concat(element1,element2,element3...)
Example
	select concat('I','Love','Hive');
Behavior: if any element is NULL, the result is NULL
	select concat('I','Love','Hive',null);
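	-- result: NULL (a single NULL element makes the whole result NULL)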
	
concat_ws  string concatenation with a separator
Syntax
	concat_ws(separator,e1,e2,e3....|array<String>);
Example
	select concat_ws('-','I','Love','Hive');
	select concat_ws('-',`array`('aa','bb','cc','dd'));
Behavior: NULL elements are skipped, so the result is non-NULL as long as at least one element is non-NULL
	select concat_ws('-','I','Love','Hive',null);
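	-- result: I-Love-Hive (the NULL is skipped)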
	
collect_list  merges a column's values across multiple rows into a single row, without deduplication
Syntax
	collect_list(colName)
Example
	select collect_list(price) from t_product;

collect_set  merges a column's values across multiple rows into a single row, with deduplication
Syntax
	 collect_set(colName)
Example
	select collect_set(price) from t_product;

Ordinary aggregate functions such as count, sum, max, and avg are also used here.

Prepare the data

vi emp.tsv
name gender job
柳岩	女	班主任
唐嫣	女	讲师
金莲	女	班主任
大郎	男	讲师
小庆	男	讲师
武松	男	班主任
-- Expected result:
女,班主任	柳岩|金莲
女,讲师	 唐嫣
男,班主任	武松
男,讲师	 大郎|小庆

create table t_emp
(
    name   string,
    gender string,
    job    string
) row format delimited fields terminated by "\t";

load data local inpath '/root/emp.tsv' into table t_emp;

Queries

[Figure: merged gender_job column (img/19_hangzhuanlie.png)]

-- Merge gender and job into a single column
select concat_ws(',', gender, job) as gender_job, name
from t_emp;

-- Produce the final result
with t1 as (select concat_ws(',', gender, job) as gender_job, name
            from t_emp)
select t1.gender_job, concat_ws('|', collect_list(name))
from t1
group by t1.gender_job;

Count, for each job, the total number of people and the number per gender, and list who they are; expected result:

[Figure: per-job aggregation result (img/20_hangzhuanlie.png)]

select job,
       count(*)                                    as total,
       sum(case gender when '男' then 1 else 0 end) as man,
       sum(case gender when '女' then 1 else 0 end) as woman,
       concat_ws('|', collect_list(name))          as names
from t_emp
group by job;

3.2 Columns to Rows

Columns-to-rows here means turning one row of data into multiple rows: one row in, multiple rows out.

[Figure: columns to rows (img/21_liezhuanhang.png)]

Functions involved

explode  expands each element of a collection (array or map) into its own row
Syntax
    explode(Map|Array)
Example
	select explode(`array`(1,2,3,4,5));
	select explode(split('a,b,c,d',','));
	select explode(`map`('liuyan','38','tangyan','18'));
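	-- the map version produces two columns (key, value), one row per entry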

Create the table and prepare the data

 name		category
《八佰》	战争,动作,爱国,剧情
《当幸福来敲门》	剧情,励志,怀旧,心理,故事
《悬崖之上》	战争,爱国,抗日,谍战

create table t_movie
(
    name     string,
    category string
) row format delimited fields terminated by "\t";

load data local inpath '/root/movie.txt' overwrite into table t_movie;

select * from t_movie;

Queries

[Figure: explode result (img/22_liezhuanhang.png)]

-- First, explode all the categories
select explode(split(category, ',')) from t_movie;

-- But adding name to the query causes an error:
-- UDTF's are not supported outside the SELECT clause, nor nested in expressions
select name, explode(split(category, ',')) from t_movie; -- error

We can see that when the SELECT list contains only the explode expression, the query runs fine; but as soon as the SELECT list contains explode together with other columns, it fails. How should we understand this error? Why can't other columns appear next to explode in the SELECT list?

UDTF syntax restriction

  • explode is a UDTF (table-generating function); the result of running explode can be thought of as a virtual table whose data comes from the source table;
  • selecting only source-table columns works, and selecting only the virtual table's columns works too, but you cannot query just the source table and expect to get back columns from both. Put plainly: there are two tables, and you cannot query only one of them while asking for columns that belong to both.

[Figure: UDTF restriction (img/23_explode.png)]

How do we solve this?

1. At the pure SQL level, the solution is to join the two tables (see the sketch after this list).

2. Hive provides a dedicated syntax, lateral view, designed to pair with UDTFs like explode and satisfy exactly this need.
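A rough sketch of the join idea in option 1 (illustrative only; the lateral view below is the idiomatic fix). It explodes the categories into a one-column derived table, then joins it back to the source table with find_in_set, which matches a token exactly inside a comma-separated list:

-- Sketch: manually join the source table with the exploded "virtual table"
select m.name, c.cat
from t_movie m
cross join (select distinct cat
            from (select explode(split(category, ',')) as cat from t_movie) tmp) c
where find_in_set(c.cat, m.category) > 0;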

-- lateral view comes with the join condition built in: t_movie is joined with t1
select name, t1.category from t_movie lateral view explode(split(category,',')) t1 as category;

[Figure: lateral view result (img/24_explode.png)]

-- Add a movie with no categories (its category will be null)
vi movie2.txt
<战狼> 
load data local inpath '/root/movie2.txt'  into table t_movie;

select  * from t_movie;

-- When we query now, the 战狼 movie does not show up
select name, t1.category
from t_movie lateral view explode(split(category, ',')) t1 as category;

-- To show it, write lateral view outer; outer keeps rows for which the UDTF returns null
select name, t1.category
from t_movie lateral view outer explode(split(category, ',')) t1 as category;

3.3 Enhanced Aggregation

The enhanced-aggregation constructs grouping sets, cube, and rollup are mainly used in OLAP-style multidimensional analysis, where a "dimension" is the angle or perspective from which you look at a problem.

Let's prepare some data so the examples below make the meaning of each construct clear.

2018-03,2018-03-10,cookie1
2018-03,2018-03-10,cookie5
2018-03,2018-03-12,cookie7
2018-04,2018-04-12,cookie3
2018-04,2018-04-13,cookie2
2018-04,2018-04-13,cookie4
2018-04,2018-04-16,cookie4
2018-03,2018-03-10,cookie2
2018-03,2018-03-10,cookie3
2018-04,2018-04-12,cookie5
2018-04,2018-04-13,cookie6
2018-04,2018-04-15,cookie3
2018-04,2018-04-15,cookie2
2018-04,2018-04-16,cookie1
--Create the table and load the data
CREATE TABLE cookie_info
(
    month    STRING,
    day      STRING,
    cookieid STRING
) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

load data local inpath '/root/cookie.txt' into table cookie_info;

select *
from cookie_info;
3.3.1 grouping_sets

grouping sets is a convenient way of writing multiple GROUP BY clauses in a single SQL statement; it is equivalent to a UNION ALL of the GROUP BY result sets for the different dimensions. GROUPING__ID indicates which grouping set each result row belongs to.

SELECT
    month,
    day,
    COUNT(DISTINCT cookieid) AS nums,
    GROUPING__ID
FROM cookie_info
GROUP BY month,day
    GROUPING SETS (month,day) -- this is the key part
ORDER BY GROUPING__ID;

--Equivalent to
SELECT month, NULL, COUNT(DISTINCT cookieid) AS nums, 1 AS GROUPING__ID
FROM cookie_info
GROUP BY month
UNION ALL
SELECT NULL as month, day, COUNT(DISTINCT cookieid) AS nums, 2 AS GROUPING__ID
FROM cookie_info
GROUP BY day;
--Another example
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS nums,
       GROUPING__ID
FROM cookie_info
GROUP BY month, day
    GROUPING SETS ( month, day, (month, day)) -- three grouping sets: month, day, (month,day)
ORDER BY GROUPING__ID;

--Equivalent to
SELECT month, NULL, COUNT(DISTINCT cookieid) AS nums, 0 AS GROUPING__ID
FROM cookie_info
GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS nums, 1 AS GROUPING__ID
FROM cookie_info
GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS nums, 2 AS GROUPING__ID
FROM cookie_info
GROUP BY month, day;
3.3.2 cube

cube aggregates over every combination of the GROUP BY dimensions.

For cube with n dimensions, the total number of combinations is 2^n.

For example, a cube over the 3 dimensions a, b, c covers the combinations: (a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),()

SELECT
    month,
    day,
    COUNT(DISTINCT cookieid) AS nums,
    GROUPING__ID
FROM cookie_info
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;

--Equivalent to
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS nums, 0 AS GROUPING__ID
FROM cookie_info
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS nums, 1 AS GROUPING__ID
FROM cookie_info
GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS nums, 2 AS GROUPING__ID
FROM cookie_info
GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS nums, 3 AS GROUPING__ID
FROM cookie_info
GROUP BY month, day;
3.3.3 rollup

rollup is a subset of cube: it aggregates level by level, starting from the leftmost dimension.

For example, ROLLUP over the 3 dimensions a, b, c covers the combinations: (a,b,c),(a,b),(a),()

-- month and day
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS nums,
       GROUPING__ID
FROM cookie_info
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;
-- Equivalent to: overall, month, (month, day)
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS nums, 0 AS GROUPING__ID
FROM cookie_info
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS nums, 1 AS GROUPING__ID
FROM cookie_info
GROUP BY month
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS nums, 3 AS GROUPING__ID
FROM cookie_info
GROUP BY month, day;
-- day and month
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS nums,
       GROUPING__ID
FROM cookie_info
GROUP BY day, month
WITH ROLLUP
ORDER BY GROUPING__ID;

-- Equivalent to: overall, day, (day, month)
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS nums, 0 AS GROUPING__ID
FROM cookie_info
UNION ALL
SELECT null, day, COUNT(DISTINCT cookieid) AS nums, 1 AS GROUPING__ID
FROM cookie_info
GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS nums, 3 AS GROUPING__ID
FROM cookie_info
GROUP BY day, month;

3.4 Window Functions *****

Window functions, also known as windowing or OLAP functions, are characterized by taking their input from a "window" of one or more rows of the SELECT statement's result set.
They can be loosely described as aggregate-like functions; but whereas a regular GROUP BY aggregation hides the individual rows being aggregated and emits a single output row per group, a window function still has access to the individual rows after aggregating and can attach aggregate values to each of those rows in the result set.

[Figure: window function overview (img/25_window.png)]
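The stu_mark table used below is never created in these notes; here is a minimal assumed schema, inferred from the columns the queries reference (sname, subject, score):

-- Assumed schema for stu_mark (not part of the original notes)
create table stu_mark
(
    sname   string,  -- student name
    subject string,  -- subject
    score   int      -- score
) row format delimited fields terminated by ',';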

select * from stu_mark;
-- Regular grouped query: total score per student
select sname,sum(score) from stu_mark group by  sname;

[Figure: grouped query result (img/26_win.png)]

-- Grouping with a window function
select sname,subject,score,sum(score) over (partition by sname) as total_score  from stu_mark;

[Figure: window function result (img/27_win.png)]

3.4.1 Window function syntax
Function(arg1, ..., argn) OVER( [PARTITION BY <...>] [ORDER BY <...>] [<window_expression>] )


-- Function(arg1, ..., argn) can be any of the following:
    -- aggregate functions: e.g. sum, max, avg
    -- ranking functions: e.g. rank, row_number
    -- analytic functions: e.g. lead, lag, first_value
-- OVER [PARTITION BY <...>] works like GROUP BY: it splits the rows into groups, and each group can be thought of as a window
	-- without PARTITION BY, all rows of the table form a single group
-- [ORDER BY <...>] specifies how rows are ordered within each group; ASC and DESC are supported
-- [<window_expression>] specifies the range of rows the computation operates on; the default is all rows of the window (or, once ORDER BY is present, the rows from the start of the window up to the current row)

Using window functions

Create the table and prepare the data

vi orders.txt

name cdate   money
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

create table t_orders
(
    name  string,
    cdate string,
    money double
) row format delimited fields terminated by ',';

load data local inpath '/root/orders.txt' into table t_orders;

select * from t_orders;
-- Total order amount per user
select name ,sum(money) from t_orders group by name;
-- Number of orders per month: a regular count + group by aggregation
select substr(cdate,0,7)  month ,count(1)  from t_orders group by substr(cdate , 0 , 7);

-- sum + window function: four usage patterns in total
-- sum(...) over()                            sum over all rows of the table
-- sum(...) over(order by ...)                running (cumulative) sum over the whole table (see the sketch at the end of this subsection)
-- sum(...) over(partition by ...)            sum over all rows within each group
-- sum(...) over(partition by...order by...)  running sum within each group

-- Total order amount across all users
-- sum(...) over(): sum over all rows of the table
select * ,sum(money) over() as total_money  from t_orders;
-- Total order amount per user
-- sum(...) over(partition by ...): sum over all rows within the group
select * ,sum(money) over(partition by name) as user_money  from t_orders;

-- Each user's order amounts accumulated in date order
-- sum(...) over(partition by...order by...): running sum within each group
select name,money,sum(money) over(partition by name order by cdate) as user_money  from t_orders;
+-------+--------+-------------+
| name  | money  | user_money  |
+-------+--------+-------------+
| jack  | 10.0   | 10.0        |
| jack  | 46.0   | 56.0        |
| jack  | 55.0   | 111.0       |
| jack  | 23.0   | 134.0       |
| jack  | 42.0   | 176.0       |
| mart  | 62.0   | 62.0        |
| mart  | 68.0   | 130.0       |
| mart  | 75.0   | 205.0       |
| mart  | 94.0   | 299.0       |
| neil  | 12.0   | 12.0        |
| neil  | 80.0   | 92.0        |
| tony  | 15.0   | 15.0        |
| tony  | 29.0   | 44.0        |
| tony  | 50.0   | 94.0        |
+-------+--------+-------------+
-- Monthly order totals, accumulated day by day within each month
select concat(substr(cdate,6,2),'月') as month, cdate, sum(money) over (partition by substr(cdate,0,7) order by cdate) from t_orders;
+--------+-------------+---------------+
| month  |    cdate    | sum_window_0  |
+--------+-------------+---------------+
| 01月    | 2017-01-01  | 10.0          |
| 01月    | 2017-01-02  | 25.0          |
| 01月    | 2017-01-04  | 54.0          |
| 01月    | 2017-01-05  | 100.0         |
| 01月    | 2017-01-07  | 150.0         |
| 01月    | 2017-01-08  | 205.0         |
| 02月    | 2017-02-03  | 23.0          |
| 04月    | 2017-04-06  | 42.0          |
| 04月    | 2017-04-08  | 104.0         |
| 04月    | 2017-04-09  | 172.0         |
| 04月    | 2017-04-11  | 247.0         |
| 04月    | 2017-04-13  | 341.0         |
| 05月    | 2017-05-10  | 12.0          |
| 06月    | 2017-06-12  | 80.0          |
+--------+-------------+---------------+
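The second pattern above, sum(...) over(order by ...), has no example of its own; here is a quick sketch, a running total over the entire table in date order:

-- Running total across the whole table; note that under the default RANGE frame,
-- rows sharing the same cdate are peers and receive the same running total
select name, cdate, money,
       sum(money) over (order by cdate) as running_total
from t_orders;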
3.4.2 Window expressions
  • With the full sum(...) over(partition by ... order by ...) syntax, aggregation is cumulative; the default behavior accumulates from the first row of the partition up to the current row.
  • A window expression gives us control over that row range, e.g. 2 rows back or 3 rows ahead.
 The keywords are rows between, with these options:
- preceding: rows before the current row
- following: rows after the current row
- current row: the current row
- unbounded: no bound
- unbounded preceding: from the start of the partition
- unbounded following: to the end of the partition

Code examples

-- From the previous row to the current row
select name,
       money,
       sum(money) over (partition by name order by cdate rows between 1 preceding and current row ) as user_money
from t_orders;

-- From the current row to the next row
select name,
       money,
       sum(money) over (partition by name order by cdate rows between current row and 1 following ) as user_money
from t_orders;

-- From the previous row to the next row
select name,
       money,
       sum(money) over (partition by name order by cdate rows between 1 preceding and 1 following ) as user_money
from t_orders;
-- From the current row to the last row
select name,
       money,
       sum(money)
           over (partition by name order by cdate rows between current row and unbounded following ) as user_money
from t_orders;

-- From the first row to the last row, i.e. all rows in the group
select name,
       money,
       sum(money)
           over (partition by name order by cdate rows between unbounded preceding and unbounded following ) as user_money
from t_orders;
3.4.3 Numbering functions
row_number: within each group, assigns each row a unique, monotonically increasing sequence number starting from 1, ignoring ties;
rank: within each group, assigns sequence numbers starting from 1; tied rows share the same number and the following numbers are skipped;
dense_rank: within each group, assigns sequence numbers starting from 1; tied rows share the same number and no numbers are skipped;
select name,
       money,
       row_number() over (partition by name order by money) as r_num,
       rank() over (partition by name order by money)       as rank,
       dense_rank() over (partition by name order by money) as ds_rank
from t_orders;

[Figure: row_number vs rank vs dense_rank (img/28_num.png)]

-- Find the date and amount of each user's highest-value order
select name,
       cdate,
       money,
       row_number() over (partition by name order by money desc) as r_num
from t_orders;

with t1 as (select name,
                   cdate,
                   money,
                   row_number() over (partition by name order by money desc) as r_num
            from t_orders)
select name, cdate, money
from t1
where r_num = 1;

+-------+-------------+--------+
| name  |    cdate    | money  |
+-------+-------------+--------+
| jack  | 2017-01-08  | 55.0   |
| mart  | 2017-04-13  | 94.0   |
| neil  | 2017-06-12  | 80.0   |
| tony  | 2017-01-07  | 50.0   |
+-------+-------------+--------+
ntile
Distributes the rows of each group into a specified number of buckets and assigns each row its bucket number.
If the rows cannot be divided evenly, lower-numbered buckets are filled first, and bucket sizes differ by at most 1.
A typical requirement: sort the data, split it into three parts, and hand the business just one of them. How do you pull out, say, the middle third?
That is exactly what ntile is for (see the sketch after the figure below).
select name,
       money,
       ntile(3) over (partition by name order by money) as r_num
from t_orders;

[Figure: ntile(3) buckets (img/29_ntile.png)]
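A sketch of pulling out the "middle third" mentioned above, bucketing each user's orders by amount (illustrative; adjust the partitioning and ordering as needed):

with t1 as (select name, money,
                   ntile(3) over (partition by name order by money) as n
            from t_orders)
select name, money
from t1
where n = 2; -- bucket 2 of 3 = the middle third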

-- Orders in the earliest 20% of dates: ntile(5), then keep bucket 1
select *, ntile(5) over (order by cdate) as n
from t_orders;

with t1 as (select *,
                   ntile(5) over (order by cdate) as n
            from t_orders)
select name, cdate, money
from t1
where n = 1;

+-------+-------------+--------+
| name  |    cdate    | money  |
+-------+-------------+--------+
| jack  | 2017-01-01  | 10.0   |
| tony  | 2017-01-02  | 15.0   |
| tony  | 2017-01-04  | 29.0   |
+-------+-------------+--------+
3.4.4 Analytic window functions
  • lag(col,n,default) returns the value n rows before the current row within the window

    The first argument is the column, the second is the offset n (optional, default 1), and the third is a default value (returned when the row n back is NULL; if unspecified, NULL is returned);

  • lead(col,n,default) returns the value n rows after the current row within the window

    Same arguments as lag, but looking n rows ahead;

  • first_value returns the first value in the group, after sorting, up to the current row

  • last_value returns the last value in the group, after sorting, up to the current row

-- lag: n rows back
select name,
       money,
       lag(money, 1) over (partition by name order by money)    as num1,
       lag(money, 1, 0) over (partition by name order by money) as num2
from t_orders;

--lead: n rows ahead
select name,
       money,
       lead(money, 1) over (partition by name order by money)    as num1,
       lead(money, 1, 0) over (partition by name order by money) as num2
from t_orders;

--first_value: the first row so far
select name,
       money,
       first_value(money) over (partition by name order by money) as num
from t_orders;
--last_value: under the default frame the current row is the last row, so this returns the current row's value
select name,
       money,
       last_value(money) over (partition by name order by money ) as num
from t_orders;
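To get the true last value of each group rather than the current row's value, widen the frame explicitly:

select name,
       money,
       last_value(money) over (partition by name order by money
           rows between unbounded preceding and unbounded following) as num
from t_orders;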
-- Find each customer's previous purchase date
select name,
       cdate,
       lag(cdate, 1) over (partition by name order by cdate) as last_date
from t_orders;

3.5 Query Exercises

3.5.1 Consecutive logins
uid login_date
001,2017-02-05 12:00:00
001,2017-02-05 14:00:00
001,2017-02-06 13:00:00
001,2017-02-07 12:00:00
001,2017-02-08 12:00:00
001,2017-02-10 14:00:00
002,2017-02-05 13:00:00
002,2017-02-06 12:00:00
002,2017-02-06 14:00:00
002,2017-02-08 12:00:00
002,2017-02-09 16:00:00
002,2017-02-10 12:00:00
003,2017-01-31 13:00:00
003,2017-01-31 12:00:00
003,2017-02-01 12:00:00
004,2017-02-02 12:00:00
004,2017-02-03 12:00:00
004,2017-02-10 12:00:00
004,2017-03-01 12:00:00


create table  t_login_user(
                              uid string,
                              login_date string
)row format delimited fields terminated by ",";

load data local inpath "/root/login_user.txt" overwrite into table t_login_user;

select * from t_login_user;

Find users who logged in on 2 consecutive days

Method 1

[Figure: consecutive logins, method 1 (img/30_login.png)]

-- To find users with n consecutive login days, we can spot a pattern in the login data:
-- 2 consecutive days: the user's next login date = the day after this login
-- 3 consecutive days: the login after next = two days after this login, and so on
-- So: partition by user id, order by login date, compute the next login date with lead, compute the day after this login with a date function, and if the two are equal the user logged in on two consecutive days

-- Drop duplicate logins on the same day
select distinct uid,date_format(login_date,'yyyy-MM-dd') from t_login_user;
-- On top of the dedup: group by user, order by login date, and compute what the next login date would be if the logins were consecutive
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user )
    select * ,
           date_add(login_date,1) as next_date,
           lead(login_date,1,0) over (partition by uid order by login_date) as next_login
from t1;
-- Users whose computed next date equals the next row's actual login date logged in on 2 consecutive days
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user ),
     t2 as (select *,
    date_add(login_date,1) as next_date,
    lead(login_date,1,0) over (partition by uid order by login_date) as next_login
    from t1 )
select distinct uid from t2 where t2.next_date == t2.next_login;

-- Users who logged in on 3 consecutive days
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user ),
     t2 as (select *,
    date_add(login_date,2) as next_date,
    lead(login_date,2,0) over (partition by uid order by login_date) as next_login
    from t1 )
select distinct uid from t2 where t2.next_date == t2.next_login;

-- Template: users with n consecutive login days
select *,
    -- the date n-1 days after this login
    date_add(login_date, n-1) as next_date,
    -- partition by user, order by login date, take the login date n-1 rows ahead
    lead(login_date, n-1, 0) over (partition by uid order by login_date) as next_login
from t1;
-- Users who logged in on at least 4 consecutive days (this uses the row_number technique shown in Method 2 below)
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (
         select uid,
                date_format(login_date, 'yyyy-MM-dd')                     as interval_date,
                row_number() over (partition by uid order by login_date ) as rn
         from t1
     ),
     t3 as (
         select *, date_sub(interval_date, rn) as login_date
         from t2
     )
select uid, count(1)
from t3
group by uid, login_date
having count(1) >= 4;

Method 2

[Figure: consecutive logins, method 2 (img/31_login.png)]

-- Drop duplicate logins on the same day
select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date
from t_login_user;
-- Window function: partition by user, order by login date, assign row numbers
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user)
select *,
       row_number() over (partition by uid order by login_date) as rn
from t1;

-- login date minus row number = anchor date; consecutive days share the same anchor
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *,
                   row_number() over (partition by uid order by login_date) as rn
            from t1)
select *,
       date_sub(login_date, rn) as interval_date
from t2;

-- Group by user and anchor date, then count: >= 2 means 2 consecutive days, >= n means n consecutive days
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2)
select uid, count(1) as login_count
from t3
group by uid, interval_date
having count(1) >= 2;

Top-N per group

Find each user's longest consecutive-login streak

[Figure: longest login streak (img/32_login.png)]

-- Each user's consecutive-login streak lengths
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2)
select uid, count(1) as login_count
from t3
group by uid, interval_date;

-- Number the streaks within each user
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
     t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date)
select *, row_number() over (partition by uid order by login_count desc) as rn
from t4;

-- Top-N
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
     t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date),
     t5 as (select *, row_number() over (partition by uid order by login_count desc) as rn from t4)
select *
from t5
where rn <= 1;

3.5.2 Whack-a-Mole Game
vi ds.txt
uid,hit,m
1,1,0
1,2,1
1,3,1
1,4,1
1,5,0
1,6,0
1,7,1
2,1,1
2,2,1
2,3,1
2,4,1
2,5,1
3,1,1
3,2,1
3,3,1
3,4,0
3,5,0
3,6,1
3,7,0
3,8,1

create table tb_ds(
      uid int ,  -- user id
      hit int ,  -- attempt number
      m int      -- whether it hit: 1 = hit, 0 = miss
)
row format delimited fields terminated by ','  ;
load data local inpath '/root/ds.txt' into table tb_ds ;

select  * from tb_ds;

Find each user's longest run of consecutive hits

[Figure: consecutive hits (img/33_hit.png)]

--Records where the mole was hit
select *
from tb_ds
where m = 1;

-- Partition by user and number the hit records
select uid, hit, row_number() over (partition by uid order by hit) as rn
from tb_ds
where m = 1;

-- hit - rn: consecutive hits share the same difference
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1)
select *, (hit - rn) as sub
from t1

-- Group to get each user's consecutive-hit runs
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1),
     t2 as (select *, (hit - rn) as sub from t1)
select uid, count(*) as hit_count
from t2
group by uid, sub;

-- Per user, take the maximum run length
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1),
    t2 as (select *, (hit-rn) as sub from t1 ),
    t3 as (select uid, count(*) as hit_count from t2 group by uid, sub )
select uid, max(hit_count) as hit_count
from t3
group by uid;
3.5.3 WordCount
Why Studying History Matters
Studying a subject that you feel pointless is never a fun or easy task.
If you're study history, asking yourself the question "why is history important" is a very good first step.
History is an essential part of human civilization.
You will find something here that will arouse your interest, or get you thinking about the significance of history.

Count how many times each word appears

[Figure: word count (img/34_wc.png)]

create table t_wc(
    line string
)row format delimited lines terminated by "\n";

load data local inpath '/root/word.txt' overwrite into table t_wc;
select * from t_wc;
-- Use a regex to strip special characters: replace everything except word characters, ' and whitespace with the empty string
select regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', "") as word_line
from t_wc;
-- Use split to cut each line into an array
select split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")
from t_wc;
-- Explode the array with the explode function
select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word
from t_wc;
-- Group to count each word
with t1 as (select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word from t_wc)
select word, count(1) as word_count
from t1
group by word;

3.6 JSON Data Processing

JSON is one of the most common structured formats for storing and processing data. Companies often store data on HDFS as JSON, and when building a data warehouse that JSON needs to be processed and analyzed, so Hive must be able to parse and read JSON-formatted data.

{"movie":"1240","rate":"5","timeStamp":"978294260","uid":"4"}
{"movie":"2987","rate":"4","timeStamp":"978243170","uid":"5"}
{"movie":"2333","rate":"4","timeStamp":"978242607","uid":"5"}
{"movie":"1175","rate":"5","timeStamp":"978244759","uid":"5"}
{"movie":"39","rate":"3","timeStamp":"978245037","uid":"5"}
{"movie":"288","rate":"2","timeStamp":"978246585","uid":"5"}
{"movie":"2337","rate":"5","timeStamp":"978243121","uid":"5"}
{"movie":"1535","rate":"4","timeStamp":"978245513","uid":"5"}
{"movie":"1392","rate":"4","timeStamp":"978245645","uid":"5"}
3.6.1 Parsing JSON with functions

Hive provides two functions dedicated to parsing JSON strings: get_json_object and json_tuple. Both can extract each field of a JSON record to build a table; get_json_object pulls out one field per call, while json_tuple is a UDTF that extracts several fields in a single pass.

Create the table

Treat each JSON record as a single string
create table test_json(
	str string
) ;
load  data  local inpath '/root/movie.txt' into table test_json ;

select * from test_json;

Parse with get_json_object

select get_json_object(str, "$.movie")     movie,
       get_json_object(str, "$.rate")      rate,
       get_json_object(str, "$.timeStamp") ts,
       get_json_object(str, "$.uid")       uid
from test_json;

Parse with json_tuple

select json_tuple(str, "movie", "rate", "timeStamp", "uid") as (movie, rate, ts, uid)
from test_json;
3.6.2 Parsing with JSONSerDe

With the function approach, the data is loaded into the table as a raw JSON string and then parsed by the JSON functions; that is quite flexible, but when the whole file is JSON it becomes tedious. To simplify handling JSON files, Hive ships a SerDe parser built specifically for them: just declare the JsonSerDe when creating the table, and every column is parsed out of the JSON file automatically.

create table test_json2
(
    movie       string,
    rate        string,
    `timeStamp` string,
    uid         string
)
-- use the JSON SerDe to parse each record
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

load  data  local inpath '/root/movie.txt' into table test_json2 ;

select * from test_json2;
desc formatted  test_json2;

3.7 Multi-byte Delimiter Handling

Hive's default serialization class is LazySimpleSerDe, which only supports single-byte (char) delimiters when loading text data, such as comma, tab, or space; the default delimiter is '\001'. For a given file's delimiter, we use row format delimited at table-creation time so that each table column maps one-to-one onto each file column.

But in practice you may run into special data:

Case 1: the row's field delimiter is a multi-byte sequence, e.g. "||" or "--"

01||周杰伦||中国||台湾||男||七里香
02||刘德华||中国||香港||男||笨小孩

Case 2: the fields themselves contain the delimiter

192.168.88.134 [08/Nov/2020:10:44:32 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328

LazySimpleSerDe cannot handle either case; the loaded data comes out wrong. There are several ways to solve this; here we choose RegexSerDe, which loads rows via a regular expression.

  • Besides LazySimpleSerDe, the most commonly used SerDe, Hive has many built-in SerDe classes;
  • Official docs: https://cwiki.apache.org/confluence/display/Hive/SerDe
  • Different SerDes parse and load different kinds of data files; common ones include ORCSerDe, RegexSerDe, and JsonSerDe

Case 1

vi singer.txt
01||周杰伦||中国||台湾||男||七里香
02||刘德华||中国||香港||男||笨小孩
03||汪  峰||中国||北京||男||光明
04||朴  树||中国||北京||男||那些花儿
05||许  巍||中国||陕西||男||故乡

create table singer
(
    id       string,
    name     string,
    country  string,
    province string,
    gender   string,
    works    string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
        ("input.regex" = "([0-9]*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)");

load data local inpath '/root/singer.txt' into table singer;
select * from singer;

Case 2

192.168.88.134 [08/Nov/2020:10:44:32 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328
192.168.88.134 [08/Nov/2020:20:19:06 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:20:19:13 +0800] "GET /hpsk_sdk/demo4.html HTTP/1.1" 200 982
192.168.88.100 [08/Nov/2020:20:19:13 +0800] "GET /hpsk_sdk/js/analytics.js HTTP/1.1" 200 11095
192.168.88.100 [08/Nov/2020:20:19:23 +0800] "GET /hpsk_sdk/demo3.html HTTP/1.1" 200 1024
192.168.88.100 [08/Nov/2020:20:19:26 +0800] "GET /hpsk_sdk/demo2.html HTTP/1.1" 200 854
192.168.88.100 [08/Nov/2020:20:19:27 +0800] "GET /hpsk_sdk/demo.html HTTP/1.1" 200 485
192.168.88.134 [08/Nov/2020:20:26:51 +0800] "GET / HTTP/1.1" 404 951
192.168.88.134 [08/Nov/2020:20:29:08 +0800] "GET / HTTP/1.1" 404 951
create table t_log
(
    ip     string, -- IP address
    stime  string, -- timestamp
    method string, -- request method
    url    string, -- request path
    policy string, -- protocol
    stat   string, -- status code
    body   string  -- response size in bytes
)
    -- load the data with RegexSerDe
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
        -- the regex; each capture group maps to one column, in order
        WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^}]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([^ ]*)"
        );

load data local inpath '/root/a.log' into table t_log;

select * from t_log;
-- Convert the timestamp
select
       from_unixtime(unix_timestamp(stime, '[dd/MMM/yyyy:HH:mm:ss +0800]'),'yyyy-MM-dd HH:mm:ss')
from t_log;

3.8 User-defined Functions

Hive ships with built-in functions such as max/min, but their number is limited; you can conveniently extend Hive by writing your own UDFs.

In production, sensitive data such as phone numbers often needs masking; a common treatment replaces the middle 4 digits with *. Hive has no function that implements this directly; it could be pieced together by nesting various functions, but inefficiently, so the requirement is to implement it as a custom Hive function that:

  1. checks the input for null and for the correct number of digits;
  2. validates the phone number format and masks conforming numbers with *;
  3. returns non-conforming data unchanged.

Implementation steps:

  1. Write a Java class that extends UDF and overloads the evaluate method, implementing the function's business logic there;
  2. overloading means one Java class can implement several function variants;
  3. package the program as a jar and upload it to the HS2 server's local filesystem or to HDFS;
  4. in the client CLI, add the jar to Hive's classpath: hive> add JAR /xxxx/udf.jar;
  5. register it as a temporary function (i.e. give the UDF a name): create temporary function <name> as '<fully qualified UDF class name>';
  6. use the function in HQL.

Development environment setup:

Create a Maven project in IDEA
Add the pom dependencies
	<dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.1</version>
        </dependency>
    </dependencies>

Write the code

package com.doit.demo;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {
    public String evaluate(String phoneNum){
        // if the number is null or not 11 characters long, return it unchanged
        if(phoneNum == null || phoneNum.trim().length() != 11){
            return phoneNum;
        }

        // check whether the number matches the phone-number pattern
        String regex = "1[3-9][0-9]{9}";
        boolean matches = phoneNum.trim().matches(regex);
        // if it doesn't match, return it unchanged
        if(!matches){
            return phoneNum;
        }
        // if it matches, replace the middle 4 digits with ****
        String newNum = phoneNum.trim()
                .replaceAll("([0-9]{3})[0-9]{4}([0-9]{4})", "$1****$2");
        return newNum;
    }
}

Build the jar

Maven ----> package
then pick up the jar from the target directory

[Figure: the jar under the target directory (img/35_jar.png)]

Upload and add the jar

1. Upload it to the server where hiveserver2 runs
2. Or upload it to HDFS

Then run
		add jar <path to the jar>

Name the function

  create temporary function myphone_num as 'com.doit.demo.MyUDF';

Use the function

select myphone_num('13812341234');
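-- returns: 138****1234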