Using Hive complex data types in a data warehouse (array, map, struct, and their combinations)

lianchaozhao


Article link: https://blog.csdn.net/weixin_40809627/article/details/90647049

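Background: when building wide tables, complex data types are often chosen so that a single row can store more information.

Complex data types: array, map, struct
1. array: all elements must be of the same type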

more hive_array.txt
zhangsan beijing,shanghai,tianjin,hangzhou
lisi changchun,chengdu,wuhan,beijing

Create the table
create table hive_array(name string, work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';

hive> desc formatted hive_array;

#col_name data_type comment
name string
work_locations array<string>

# Load the local file
load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table hive_array;

# Query the data
hive> select * from hive_array;
OK
zhangsan ["beijing","shanghai","tianjin","hangzhou"]
lisi ["changchun","chengdu","wuhan","beijing"]

hive> select name, size(work_locations) from hive_array;
zhangsan 4
lisi 4

hive> select name, work_locations[0] from hive_array;
zhangsan beijing
lisi changchun

hive> select * from hive_array where array_contains(work_locations, "tianjin");
zhangsan ["beijing","shanghai","tianjin","hangzhou"]
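
To make the role of the two delimiters concrete, here is a minimal Python model (not Hive code) of how one line of hive_array.txt turns into a name plus an array<string>, together with rough equivalents of size(), index access, and array_contains(). The function name parse_array_row is made up for this illustration, and the sketch assumes the file really is tab-separated as the DDL declares.

```python
# Python model of the text layout declared in the DDL above; not Hive code.
FIELD_DELIM = "\t"       # fields terminated by '\t'
COLLECTION_DELIM = ","   # collection items terminated by ','


def parse_array_row(line):
    """Split one raw line into (name, list of work locations)."""
    name, raw_locations = line.rstrip("\n").split(FIELD_DELIM)
    return name, raw_locations.split(COLLECTION_DELIM)


rows = [
    parse_array_row("zhangsan\tbeijing,shanghai,tianjin,hangzhou"),
    parse_array_row("lisi\tchangchun,chengdu,wuhan,beijing"),
]

for name, locations in rows:
    # size(work_locations), work_locations[0], array_contains(work_locations, 'tianjin')
    print(name, len(locations), locations[0], "tianjin" in locations)
```

Every element ends up as a string, which matches the rule that an array holds a single element type.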

2. map: key-value pairs, e.g. map('a', 1, 'b', 2)
more hive_map.txt
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

# Create the table
create table hive_map(id int, name string, family map<string,string>, age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

# Describe the table
hive> desc formatted hive_map;
OK
#col_name data_type comment

id int
name string
family map<string,string>
age int

# Load the local data
load data local inpath '/home/hadoop/data/hive_map.txt'
overwrite into table hive_map;

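Querying
Using map: a map holds key-value pairs, i.e. (key, value). Keys within one map should be unique; if a key repeats, only one value is kept for it.
Map layout: fieldName(k1:v1,k2:v2,…)
Access syntax: fieldName['key'], using square brackets ([]).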

hive> select * from hive_map;

1 zhangsan {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} 28
2 lisi {"father":"mayun","mother":"huangyi","brother":"guanyu"} 22
3 wangwu {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} 29
4 mayun {"father":"mayongzhen","mother":"angelababy"} 26

hive> select id,name,family['father'] as father, family['sister'] from hive_map;
OK
1 zhangsan xiaoming NULL
2 lisi mayun NULL
3 wangwu wangjianlin jingtian
4 mayun mayongzhen NULL

hive> select id,name,map_keys(family) from hive_map;
OK
1 zhangsan ["father","mother","brother"]
2 lisi ["father","mother","brother"]
3 wangwu ["father","mother","sister"]
4 mayun ["father","mother"]

hive> select id,name,map_values(family) from hive_map;
OK
1 zhangsan ["xiaoming","xiaohuang","xiaoxu"]
2 lisi ["mayun","huangyi","guanyu"]
3 wangwu ["wangjianlin","ruhua","jingtian"]
4 mayun ["mayongzhen","angelababy"]

hive> select id,name,size(family) from hive_map;
OK
1 zhangsan 3
2 lisi 3
3 wangwu 3
4 mayun 2

hive> select id,name,family['brother'] from hive_map where array_contains(map_keys(family),'brother');
OK
1 zhangsan xiaoxu
2 lisi guanyu
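
The three delimiters in the hive_map DDL can be pictured with a small Python model (again, not Hive code). parse_map_row is a hypothetical helper written for this sketch; it mirrors ',' for fields, '#' between map entries, and ':' between a key and its value, and it uses None where Hive would return NULL.

```python
# Python model of the hive_map text layout; not Hive code.
def parse_map_row(line):
    """Parse 'id,name,k1:v1#k2:v2,age' into (id, name, family dict, age)."""
    id_, name, raw_family, age = line.split(",")    # fields terminated by ','
    family = {}
    for entry in raw_family.split("#"):             # collection items terminated by '#'
        key, value = entry.split(":", 1)            # map keys terminated by ':'
        family[key] = value
    return int(id_), name, family, int(age)


rows = [
    parse_map_row("1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28"),
    parse_map_row("3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29"),
    parse_map_row("4,mayun,father:mayongzhen#mother:angelababy,26"),
]

for id_, name, family, age in rows:
    # family['sister'], map_keys(family), map_values(family), size(family)
    print(id_, name, family.get("sister"), list(family), list(family.values()), len(family))

# where array_contains(map_keys(family), 'brother')
with_brother = [(id_, name, family["brother"]) for id_, name, family, _ in rows if "brother" in family]
print(with_brother)
```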

3. struct
// Raw data
cat hive_struct.txt
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

// Create the table and load the data
create table hive_struct(ip string, userinfo struct<name:string,age:int>)
row format delimited fields terminated by '#'
collection items terminated by ':';

# Load the data
load data local inpath '/home/hadoop/data/hive_struct.txt'
overwrite into table hive_struct;

# Query the data
hive> select * from hive_struct;

192.168.1.1 {"name":"zhangsan","age":40}
192.168.1.2 {"name":"lisi","age":50}
192.168.1.3 {"name":"wangwu","age":60}
192.168.1.4 {"name":"zhaoliu","age":70}

// Accessing fields
Using struct: a struct is declared as fieldName struct<field1:type1,field2:type2,…>, meaning the column is composed of several named fields.
To read one field, use fieldName.field1, that is, dot (.) access.

hive> select ip,userinfo.name,userinfo.age from hive_struct;
OK
192.168.1.1 zhangsan 40
192.168.1.2 lisi 50
192.168.1.3 wangwu 60
192.168.1.4 zhaoliu 70
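
As a rough Python analogy (not Hive code), a struct behaves like a record with a fixed set of named, typed fields. In the DDL above the struct's fields are split by the collection items delimiter ':'. UserInfo and parse_struct_row are names invented for this sketch.

```python
# Python model of the hive_struct text layout; not Hive code.
from typing import NamedTuple


class UserInfo(NamedTuple):
    """Stands in for struct<name:string,age:int>."""
    name: str
    age: int


def parse_struct_row(line):
    """Parse 'ip#name:age' into (ip, UserInfo)."""
    ip, raw_userinfo = line.split("#")     # fields terminated by '#'
    name, age = raw_userinfo.split(":")    # collection items terminated by ':'
    return ip, UserInfo(name=name, age=int(age))


ip, userinfo = parse_struct_row("192.168.1.1#zhangsan:40")
# select ip, userinfo.name, userinfo.age
print(ip, userinfo.name, userinfo.age)
```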

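4. Combining map and struct
Table definition: [image in the original post, not recoverable here]
Take borrow_repay_record as an example: its key is phaseNumber and its value is a struct.
map<string,struct<…>>: the value of a map can itself be a complex type, which yields nested complex types. The syntax still follows the rules of each individual type. For example, to take the value stored under map key load and read its duedate field,
the syntax is: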

select borrow_repay_record['load'].duedate from dw_kuanbiao where dt='2019-02-12'
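
The nested access chains the two rules: bracket lookup on the map first, then dot access on the struct. A minimal Python model of that chain follows; the helper map_struct_field and the sample record are made up for illustration, and None stands in for the NULL one would expect from Hive when the key is absent.

```python
# Python model of map['key'].field on a map<string,struct<...>> column; not Hive code.
def map_struct_field(record, key, field):
    """Look up `key` in the map, then read `field` from the struct stored there."""
    struct_value = record.get(key)
    if struct_value is None:
        return None  # missing key: modelled as NULL
    return struct_value.get(field)


# Hypothetical sample value for borrow_repay_record.
borrow_repay_record = {
    "load": {"duedate": "2015-04-23 14:51:42", "repayoverduemgmtfee": None},
}

print(map_struct_field(borrow_repay_record, "load", "duedate"))
print(map_struct_field(borrow_repay_record, "missing", "duedate"))
```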

When the key is unknown (or irrelevant), how do you pick out the values you need? This is where exploding the map comes in (turning one row into multiple rows).
Take the borrow_repay_record of the record with user_id 100000 (note that a user_id can match multiple rows).
The result looks roughly like this:
{"19":{"duedate":"2015-04-23 14:51:42","repayoverduemgmtfee":null, …},
"18":{"duedate":"2015-03-23 14:51:42","repayoverduemgmtfee":null, …},
"15":{"duedate":"2014-12-23 14:51:42","repayoverduemgmtfee":null, …},
"14":{"duedate":"2014-11-23 14:51:42","repayoverduemgmtfee":null, …}}

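As you can see, this single row actually contains a great deal of information.
To reach any field in any entry without looking it up by key, the row has to be broken apart by key and value into multiple rows that can still be combined with the table's other columns. Hive provides functions for this.
explode() turns one row into multiple rows, but on its own it cannot combine the generated rows with the table's other columns.
LATERAL VIEW makes up for this limitation, so the two are usually used together.
For example: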

SELECT user_id,phaseNumber,value from dw_loan LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value where dt = '2019-01-29' and user_id = '100000'

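LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value splits the map into one row per entry and joins each of those rows back to the original row, producing multiple rows.
You can treat everything after from as a table, no different from an ordinary one.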
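
The effect of LATERAL VIEW explode() on a map column can be modelled in a few lines of Python (not Hive code): every map entry becomes its own output row, and the other columns of the source row are copied onto it. The function lateral_view_explode and the sample rows are hypothetical; the sketch assumes the plain (non-OUTER) form, where a row with an empty map produces no output.

```python
# Python model of LATERAL VIEW explode(map_column) AS key_alias, value_alias; not Hive code.
def lateral_view_explode(rows, map_column, key_alias, value_alias):
    """Yield one output row per map entry, carrying over the remaining columns."""
    for row in rows:
        entries = row.get(map_column) or {}   # empty or missing map: no output rows
        others = {k: v for k, v in row.items() if k != map_column}
        for key, value in entries.items():
            yield {**others, key_alias: key, value_alias: value}


# Hypothetical sample data shaped like the table in the query above.
dw_loan = [
    {"user_id": "100000",
     "borrow_repay_record": {"14": {"duedate": "2014-11-23"},
                             "15": {"duedate": "2014-12-23"}}},
    {"user_id": "100001", "borrow_repay_record": {}},
]

exploded = list(lateral_view_explode(dw_loan, "borrow_repay_record", "phaseNumber", "value"))
for r in exploded:
    if r["user_id"] == "100000":              # where user_id = '100000'
        print(r["user_id"], r["phaseNumber"], r["value"])
```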
So, to sum a field, just use it directly.

For example, to compute the total principal, where principal is a field of the map's value:

select sum(value.principal) from dw_kubiao LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value where user_id = '100000'
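
In Python terms (a model, not Hive code), this aggregation is a sum over every map value of every matching row. The sample rows and principal amounts below are invented for illustration; entries whose principal is None are skipped, in the way SUM ignores NULL values.

```python
# Python model of sum(value.principal) over the exploded map; not Hive code.
# Hypothetical sample rows; the principal amounts are made up.
dw_rows = [
    {"user_id": "100000",
     "borrow_repay_record": {"14": {"principal": 100},
                             "15": {"principal": 250},
                             "18": {"principal": None}}},
    {"user_id": "100001",
     "borrow_repay_record": {"1": {"principal": 999}}},
]

total_principal = sum(
    value["principal"]
    for row in dw_rows
    if row["user_id"] == "100000"                      # where user_id = '100000'
    for value in row["borrow_repay_record"].values()   # one "row" per map entry
    if value["principal"] is not None                  # NULLs do not contribute to the sum
)
print(total_principal)
```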

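Note: the wide table uses Hive complex data types such as map and struct, as well as nested complex types such as map<string, struct>.

Although complex types let a single row hold more information, they also make loading more complicated. To simplify loading tables that contain complex types, an intermediate table is used: the data is first loaded into the intermediate table, laid out like the final table, and a MapReduce job then cleans the intermediate table's data so that it meets the format the complex types require.
(That is, data goes into temp_kubiao first, and from there into dw_kubiao.)

The map<string,struct> column replaces what was a string column in the intermediate table.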

temp_kubiao table definition
…
COMMENT '标的信息表'  -- "loan listing information table"
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
…
dw_kubiao table definition (the spelling 'colelction.delim' is how older Hive releases write this property name)
…
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'colelction.delim'='|',
'field.delim'=',',
'mapkey.delim'=':',
'serialization.format'=','
)

In the string column, entries are separated by | into key:struct pairs. A separate MapReduce job then replaces the separator inside each struct, originally @, with \004.

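Note: Hive's DDL syntax cannot set both the separator between map entries and the separator between struct fields; only one of them can be changed. The remaining separators are taken in order from ASCII codes 1 to 8. Specifying '|' as the collection delimiter actually sets the separator between map entries, so the struct fields fall back to the default '\004'. It is therefore enough to rewrite the struct separator in the data to '\004'.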
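
The delimiter levels described in this note can be sketched in Python (a model, not Hive code, and not the author's MapReduce job): '|' between map entries, ':' between key and value, and ASCII 4 between struct fields. The helpers at_to_ctrl_d and parse_map_of_struct, the struct field names, and the sample string are all assumptions made for this example.

```python
# Python model of the delimiter levels for a map<string,struct<...>> column; not Hive code.
COLLECTION_DELIM = "|"   # between map entries ('colelction.delim')
MAPKEY_DELIM = ":"       # between a map key and its value ('mapkey.delim')
STRUCT_DELIM = "\x04"    # between struct fields inside a map value (ASCII 4)


def at_to_ctrl_d(raw):
    """Model of the cleanup step: replace the '@' struct separator with ASCII 4."""
    return raw.replace("@", STRUCT_DELIM)


def parse_map_of_struct(raw, field_names):
    """Parse 'k1:a\\x04b|k2:c\\x04d' into {key: {field: value}}; values stay strings."""
    result = {}
    for entry in raw.split(COLLECTION_DELIM):
        key, _, raw_struct = entry.partition(MAPKEY_DELIM)   # split at the first ':'
        result[key] = dict(zip(field_names, raw_struct.split(STRUCT_DELIM)))
    return result


# Hypothetical intermediate-table value: two phases, struct fields joined by '@'.
temp_value = "19:2015-04-23@100|18:2015-03-23@200"
cleaned = at_to_ctrl_d(temp_value)
parsed = parse_map_of_struct(cleaned, ["duedate", "principal"])
print(parsed)
```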

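Note: I have not yet found a statement that changes the map's delimiters on an existing table. The workaround was to modify the table metadata and then reload the Hive partitions.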
