hive outline
hive json数据处理简介
Hive中为了实现JSON格式的数据解析,提供了两种解析JSON数据的方式
- 方式一:使用JSON函数进行处理
Hive中提供了两个专门用于解析JSON字符串的函数:get_json_object、json_tuple,这两个函数都可以实现将JSON数据中的每个字段独立解析出来,构建成表
- 方式二:使用Hive内置的JSON Serde加载数据
Hive中除了提供JSON的解析函数以外,还提供了一种专门用于加载JSON文件的Serde来实现对JSON文件中数据的解析,在创建表时指定Serde,加载文件到表中,会自动解析为对应的表格式
对以下json数据,建表
{"device":"device_30","deviceType":"kafka","signal":98.0,"time":1616817201390}
{"device":"device_32","deviceType":"kafka","signal":65.0,"time":1616817207131}
{"device":"device_32","deviceType":"kafka","signal":95.0,"time":1616817207714}
{"device":"device_71","deviceType":"bigdata","signal":45.0,"time":1616817207907}
hive get_json_object
语法:
get_json_object(string json_string, string path)
第一个参数:指定要解析的JSON字符串
第二个参数:指定要返回的字段,通过$.columnName的方式来指定path
特点:每次只能返回JSON对象中一列的值
data =
{
"store":
{
"fruit":[{"weight":8,"type":"apple"}, {"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"131778899@qq.com",
"owner":"amy"
}
get单层值
select get_json_object(data, '$.owner') from test;
结果:amy
get多层值
select get_json_object(data, '$.store.bicycle.price') from test;
结果:19.95
get数组值
select get_json_object(data, '$.store.fruit[0]') from test;
结果:{“weight”:8,“type”:“apple”}
hive json_tuple
语法:json_tuple(jsonStr, p1, p2, …, pn)
第一个参数:指定要解析的JSON字符串
第二个参数:指定要返回的第1个字段
……
第N+1个参数:指定要返回的第N个字段
特点:
- 功能类似于get_json_object,但是可以调用一次返回多列的值。属于UDTF类型函数
- 返回的每一列都是字符串类型
- 一般搭配lateral view使用
select a.json,
b.device,
b.deviceType,
b.signal,
b.stime
from tb_json_test1 a
lateral view json_tuple(json, "device", "deviceType", "signal", "time") b
as device, deviceType, signal, stime;
json | device | devicetype | signal | stime |
---|---|---|---|---|
{“device”:“device_30”,“deviceType”:“kafka”,“signal”:98.0,“time”:1616817201390} | device_30 | kafka | 98.0 | 1616817201390 |
{“device”:“device_32”,“deviceType”:“kafka”,“signal”:65.0,“time”:1616817207131} | device_32 | kafka | 65.0 | 1616817207131 |
{“device”:“device_32”,“deviceType”:“kafka”,“signal”:95.0,“time”:1616817207714} | device_32 | kafka | 95.0 | 1616817207714 |
{“device”:“device_71”,“deviceType”:“bigdata”,“signal”:45.0,“time”:1616817207907} | device_71 | bigdata | 45.0 | 1616817207907 |