1. 键-值对的形式存储数据
有时候,我们在hive中编写hql的时候,可以用键-值对的形式存储结果。比如有一些同学在一些课程中的学习行为,我们可以记录成如下形式:
student | class_name | key | value |
---|---|---|---|
A | english | answer_num | 9 |
A | chinese | answer_num | 15 |
B | english | answer_num | 12 |
A | english | homework_num | 5 |
B | english | homework_num | 7 |
B | chinese | homework_num | 10 |
这样做的好处有以下几点。一是可以避免hql中出现大量的join连接,使得hql的冗长;二是便于修改表,比如说,在有个同学的学习特征提取错误的时候,可以直接删除这个key,而不用再重写修改hql,重新跑hql。在将数据存储成这种形式后,我们需要将其转换成普通的形式。本文主要记录如何通过hive中的内置函数,来进行数据的转换。
假设上述的表格存储在tmp的表中。
create table tmp as
select 'A' as student, 'english' as class_name, 'answer_num' as key, 9 as value from dual
union all
select 'A' as student, 'chinese' as class_name, 'answer_num' as key, 15 as value from dual
union all
select 'B' as student, 'english' as class_name, 'answer_num' as key, 12 as value from dual
union all
select 'A' as student, 'english' as class_name, 'homework_num' as key, 5 as value from dual
union all
select 'B' as student, 'english' as class_name, 'homework_num' as key, 7 as value from dual
union all
select 'B' as student, 'chinese' as class_name, 'homework_num' as key, 10 as value from dual;
则转化步骤如下:
2. step1 –> map()函数
用map()函数来将key,value映射成<键,值>对的形式。
select student, class_name,
map(key,value) as group_map
from tmp
;
运行结果为:
A english {"answer_num":9}
A chinese {"answer_num":15}
B english {"answer_num":12}
A english {"homework_num":5}
B english {"homework_num":7}
B chinese {"homework_num":10}
2. step2 –> collect_list()函数
用collect_list() 函数将key值拆解出来作为字段,而相应的value值转换为list。
select t1.student, t1.class_name,
collect_list(t1.group_map['answer_num']) as answer_num,
collect_list(t1.group_map['homework_num']) as homework_num
from ( -- step1
select student, class_name,
map(key,value) as group_map
from tmp
) t1
group by t1.student, t1.class_name
;
运行结果为
A chinese [15] []
A english [9] [5]
B chinese [] [10]
B english [12] [7]
另外,也可以用 collect_set() 函数。
3. step3 –> 处理step2的结果
set hive.cli.print.header = true
;
select t1.student,
t1.class_name,
answer_num[0] as answer_num,
homework_num[0] as homework_num
from ( -- step2
select t1.student, t1.class_name,
collect_list(t1.group_map['answer_num']) as answer_num,
collect_list(t1.group_map['homework_num']) as homework_num
from ( -- step1
select student, class_name,
map(key,value) as group_map
from tmp
) t1
group by t1.student, t1.class_name
) t1
;
运行结果为
t1.student t1.class_name answer_num homework_num
A chinese 15 NULL
A english 9 5
B chinese NULL 10
B english 12 7