hive表生成函数explode、stack、lateral view和json_tuple、parse_url_tuple示例

最新推荐文章于 2024-01-18 08:45:00 发布

Data_IT_Farmer

最新推荐文章于 2024-01-18 08:45:00 发布

阅读量3.1k

点赞数 5

文章标签： hive

原文链接：https://www.modb.pro/db/114369

版权

在Hive中，所有的表生成函数，包括用户自定义的和内置的，都统称为用户自定义表生成函数（user defined table generating functions），简称udtf。本文只介绍Hive自带的内置表生成函数。

1、explode(array)

功能：返回n行，每行对应数组中的一个元素。

spark-sql> select explode(array(1,2,3,4));

col
1
2
3
4
Time taken: 0.048 seconds, Fetched 4 row(s)

2、explode(map)

功能：返回n行两列，每行对应每个map键-值，第一列是map的键，第二列是map的值。（不常用）

spark-sql> select explode(map(1,2,3,4));
key	value
1	2
3	4
Time taken: 0.039 seconds, Fetched 2 row(s)

3、posexplode(array)

功能：与explode类似，但除了数组元素本身（第二列），还返回各元素在数组中的位置（从0开始，第一列）。（不常用）

spark-sql> select posexplode(array(2,4,6,8));
pos	col
0	2
1	4
2	6
3	8
Time taken: 0.034 seconds, Fetched 4 row(s)

4、stack(int n, v_1, v_2, ..., v_k)

功能：把k列数据转换成n行，k/n列，其中n必须是正整数，后面的v_1到v_k必须是元素，不能是列名。（不常用）

4.1、n设为3，将后面6个元素按顺序分为3行2列

select stack(3,1,2,3,4,5,6);

spark-sql> select stack(3,1,2,3,4,5,6);
col0	col1
1	2
3	4
5	6
Time taken: 0.032 seconds, Fetched 3 row(s)

4.2、n设为2，将后面6个元素按顺序分为2行3列

select stack(2,1,2,3,4,5,6);

spark-sql> select stack(2,1,2,3,4,5,6);col0	col1	col2
1	2	3
4	5	6
Time taken: 0.039 seconds, Fetched 2 row(s)

4.3、n设为3，将后面7个元素按顺序分为3行3列

select stack(3,1,2,3,4,5,6,7);

spark-sql> select stack(3,1,2,3,4,5,6,7);
col0	col1	col2
1	2	3
4	5	6
7	NULL	NULL
Time taken: 0.035 seconds, Fetched 3 row(s)

4.4、n设为6，将后面6个元素转为为6行1列

select stack(6,1,2,3,4,5,6);

spark-sql> select stack(6,1,2,3,4,5,6);
col0
1
2
3
4
5
6
Time taken: 0.03 seconds, Fetched 6 row(s)

5、表生成函数必备：lateral view

udtf有一个很大的限制，在使用udtf时，select后面只能跟udtf，不能跟其他任何字段，否则会报错，如下：

hive 报错，spark-sql不报错，如下所示：

hive> select 1 as flag,explode(array(1,2,3,4));
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

spark-sql> select 1 as flag,explode(array(1,2,3,4));
flag	col
1	1
1	2
1	3
1	4
Time taken: 0.044 seconds, Fetched 4 row(s)
2022-11-07 21:15:01,476 INFO thriftserver.SparkSQLCLIDriver: Time taken: 0.044 seconds, Fetched 4 row(s)

1）语法:

select
列别名1[ ,列别名2,列别名3……]
from 表名 lateral view udtf(expression) 虚拟表别名 as 列别名1[ ,列别名2,列别名3……]

lateral view跟在from后面，然后跟要使用的udtf，为生成的虚拟表起一个表别名，不写会报错。然后跟as 列别名，有些udtf会产生多个列，所以有时要跟多个列别名。

2）应用1：行转列

原始数据：

select * 
from 
(
  select "a" as shop,"1,2,4" as uid_array
  union
  select "b" as shop,"4,5,6" as uid_array
)tb_final

a	1,2,4
b	4,5,6
Time taken: 5.345 seconds, Fetched 2 row(s)

首先使用split函数对uid_array进行切割，返回一个数组，然后使用lateral view explode进行行转列

select shop
      ,uid   --这里是下面生成的列别名
from 
( --准备原始数据
   select * 
   from 
   (
     select "a" as shop,"1,2,4" as uid_array
     union
     select "b" as shop,"4,5,6" as uid_array
   )tb_final
)temp_test6 
lateral view explode(split(uid_array, ',')) tb_a as uid;

a	1
a	2
a	4
b	4
b	5
b	6
Time taken: 1.479 seconds, Fetched 6 row(s)

3）应用2：求总聚合结果

举例：

此时想一句hql求出每个商店的来客数，以及两个商店去重后的来客数。由于uid为4的用户同时出现在两家商店，所以统计total维度时用户数为5。最终要得到如下结果：
a   3
b   3
总   5

explode实现，关键点在于构造一个array数组，将原本的聚合维度字段放入，然后任意自定义一个词例如‘total’作为总聚合的维度名称，再自定义一个别名（这里使用total_shop），代表包含total的字段名，聚合时使用新自定义的别名进行聚合，如下：

select total_shop
      ,count(distinct uid) as uid_num
from 
( --准备原始数据
   select shop
         ,uid   --这里是下面生成的列别名
   from 
   ( --准备原始数据
      select * 
      from 
      (
        select "a" as shop,"1,2,4" as uid_array
        union
        select "b" as shop,"4,5,6" as uid_array
      )tb_final
   )temp_test6 
   lateral view explode(split(uid_array, ',')) tb_a as uid
)temp_test7 
lateral view explode(array(shop, '总')) tb_a as total_shop
group by total_shop
order by total_shop;



a	3
b	3
总	5
Time taken: 2.14 seconds, Fetched 3 row(s)

这样的写法等价于：

select shop
      ,count(distinct uid) as uid_num
from 
( --准备原始数据
   select shop
         ,uid   --这里是下面生成的列别名
   from 
   ( --准备原始数据
      select * 
      from 
      (
        select "a" as shop,"1,2,4" as uid_array
        union
        select "b" as shop,"4,5,6" as uid_array
      )tb_final
   )temp_test6 
   lateral view explode(split(uid_array, ',')) tb_a as uid
)temp_test7 

group by shop

union all

select '总' as shop
      ,count(distinct uid) as uid_num
from 
( --准备原始数据
   select shop
         ,uid   --这里是下面生成的列别名
   from 
   ( --准备原始数据
      select * 
      from 
      (
        select "a" as shop,"1,2,4" as uid_array
        union
        select "b" as shop,"4,5,6" as uid_array
      )tb_final
   )temp_test6 
   lateral view explode(split(uid_array, ',')) tb_a as uid
)temp_test7 ;

a	3
b	3
总	5
Time taken: 2.715 seconds, Fetched 3 row(s)