hive中行转列，宽表转长表，explode和lateral view

最新推荐文章于 2024-06-21 16:47:31 发布

析杴

最新推荐文章于 2024-06-21 16:47:31 发布

阅读量219

点赞数

文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/weixin_67485772/article/details/135095304

版权

hive中行转列，宽表转长表，explode和lateral view

首先，explode中文意思为爆炸，所以，这篇文我们的主题为，数据爆炸。

explode是什么？

当我们搜索explode的用法时，会发现，explode使用时基本都会出现一个external view的语句。

我们从explode讲起。

explode是hive中的一个UDTF的函数，而UDTF函数作用都是输入一行数据，将该行数据拆分、并返回多行数据。

explode是将一行数据转换为列数据，常与Array类型和map类型搭配使用。

explode的用法

当explode和Array一起使用时，语法为

select explode(col_name) as new_col_name from table_name;

#col_name 是Array类型的数据列
#table_name是表名
#而new_col_name是采集数据时的新列名

当explode和map一起使用时，语法为

select explode(map_col) as(key_name , value_name) from tablename;

#map_col为map形式的列名，
#table_name是字段
#key_name是map的key值
#value_name是map的value值

由于map数据一般是k-v结构，是两列，所以一般是转换为两列，一列是key，一列是value。

explode和lateral view的关系

UDTF函数，如explode单独使用时，只允许对拆分的字段进行采集，而无法采集其他字段，而在工作过程中，我们经常需要拆分字段和表中的原字段同时采集，这个时候要使用到lateral view。

later view为侧视图，就是用来和UDTF函数如explode（）这种同时使用的，lateral view会将拆分结果放到一个虚拟表中，然后虚拟表和原表将会进行join连接以达到能够同时select原表数据和拆分数据的结果。

使用later view时需要指定视图别名和新的字段名。

可以通过一下例子进行理解

# 查看表数据
hive> select * from udtf_test;
OK
jim5    ["james5","datacloase"]
jim4    ["james4","datacloase"]
jim3    ["james3","datacloase"]
jim2    ["james2","datacloase"]
 
# 1）hive只允许对其拆分字段进行访问
hive> select explode(subordinates) from udtf_test;
OK
james5
datacloase
james4
datacloase
james3
datacloase
james2
datacloase
Time taken: 0.075 seconds, Fetched: 8 row(s)
 
#2）同时select 查询 explode作用字段及其它字段时，报错
hive> select explode(subordinates),name from udtf_test;
FAILED: SemanticException 1:29 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'name'
#3）借助lateral view，同时查询explode作用字段及其它字段
hive> select name,subordinate from udtf_test
    > lateral view explode(subordinates)sub as subordinate;
OK
jim5    james5
jim5    datacloase
jim4    james4
jim4    datacloase
jim3    james3
jim3    datacloase
jim2    james2
jim2    datacloase

案例练习和讲解

1.找出具有五位相同数字的电话号码
数据如下，txt文本数据

jimmhe  18191512076
xiaosong    18392988059
jingxianghua    18118818818
donghualing 17191919999

创造表

CREATE TABLE udtf_test1(
  name string, 
  phonenumber string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t'

加载数据，直接加载本地的数据

load data local inpath '/user/local/data/udtf_test1.txt' into table udtf_test1;

查看表的加载内容

select * from udtf_test1;

在这里插入图片描述
解题思路
用split将电话号码的每一个号码都分出来作为一列，此时的虚拟表字段为名字，电话号码，电码号码的某个数字
此时，如果电话号码的某个数字相同，三列数据就会一样。
然后通过group by分组，统计出现的数量留下出现次数大于5的情况。

select name,phonenumber
from(
	select name,phonenumber,phone_num
	from udtf_test1
	lateral view explode(split(phonenumber,"") view_number as phone_num)aa
)
group by name,phonenumber,phone_num
having count(*)>=5;

2.求一下每个学生最好的成绩和最差的成绩
首先，表数据，依旧是txt文件

zhangsan|Chinese:80,Math:60,English:90
lisi|Chinese:90,Math:80,English:70
wangwu|Chinese:88,Math:90,English:96
maliu|Chinese:99,Math:65,English:60

创建表

create table stu_score_test(name string,score map<String,string>)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';

导入数据

load data local inpath '/usr/local/data/stu_score_test.txt' into table stu_score_test;

查看数据

select * from stu_score_test;

在这里插入图片描述
解题思路
将成绩字段拆分为学科和成绩
用rank（）排名，按学生姓名分组排名
并提取出第一和最后的排名即可。

select 
    name,course,csorce
from(
    select 
        name
        ,course
        ,csorce 
        ,rank()over(partition by name order by csorce) last_rn
        ,rank()over(partition by name order by csorce desc) best_rn
    from stu_score_test
    lateral view  explode(score)  score_view as course,csorce
    )aa
where last_rn=1 or best_rn=1