hive如何使用json格式的数据

  1. 查看当前hive支持的json函数
show functions like "*json*";
OK
get_json_object
json_tuple
Time taken: 0.006 seconds, Fetched: 2 row(s)
  • get_json_object
 describe function get_json_object;
 OK
 get_json_object(json_txt, path) - Extract a json object from path 

解析json的字符串json_txt,返回path指定的内容。如果输入的json字符串无效,那么返回NULL。
例如:

select  get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net","owner":"amy" } ','$.ownes');
OK
NULL
Time taken: 0.219 seconds, Fetched: 1 row(s)

没有ownes,返回值是null

select  get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net","owner":"amy" } ','$.owner');
OK
amy

通过$.获取对应key的value 值

select  get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net","owner":"amy" } ','$.store.fruit');
OK
[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}]
select  get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net","owner":"amy" } ','$.store.fruit.weight');
OK
[8,9]

效果如上,如果是嵌套json,也通过.获取,如果并列结构获取的值将是list的形式

  • json_tuple
desc function json_tuple;
OK
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.

参数为一组键p1,p2……和JSON字符串,返回值的元组。该方法可以在一次调用中输入多个键

select json_tuple('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net","owner":"amy" }',"store","email","owner")
OK
{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}      amy@only_for_json_udf_test.net  amy

可以通过 “store”,“email”,“owner” 获取三个key 对应的value值,相比get_json_object显得更为高效率

select json_tuple('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net","owner":"amy" }',"store.fruit","email","owner")
               > ;
OK
NULL    amy@only_for_json_udf_test.net  amy

可以看出。不能通过.获取件套数据对应的值,如果key不存在,将返回null。

要获取json各列的数据的方法

raw001	{"age":20,"gender":"M","weight":"56"}  
raw002	{"age":23,"gender":"M","weight":"70"}  
raw003	{"age":34,"gender":"F","weight":"50"} 

---------------------

select t.cl1 , get_json_object(t.cl2json,'$.age'), get_json_object(t.cl2json,'$.gender') from tmp_json_test t ; 
select t1.cl1, t2.* from tmp_json_test t1 lateral view json_tuple(t1.cl2json, 'age', 'gender') t2 as c1, c2; 

上面这种方式不能处理复杂类型(如果hive表中字段为array,map等),并且处理上非常麻烦,要写好长的sql。
还有一种对于数据是整个json格式的情况,相比第一种来说更灵活,更通用。重要的是每行必须是一个完整的JSON,一个JSON不能跨越多行,也就是说,serde不会对多行的Json有效。 因为这是由Hadoop处理文件的工作方式决定,文件必须是可拆分的,基于次开源的序列化工具hive-json-serde使用起来就显得特别方便了
https://github.com/rcongiu/Hive-JSON-Serde
下载对应版本的jar包后,将其添加在hive的anxlib中就可以使用了

<property>
     <name>hive.aux.jars.path</name>
      <value>auxlib/json-serde-1.3.8-jar-with-dependencies.jar</value>
 </property>

注意之前版本有bug,在使用hiveserver2的时候解析会实效,有类型转化的错误。
注意建表的时候 需要指定序列化对应的方式org.openx.data.jsonserde.JsonSerDe

ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'ignore.malformed.json'='true', 
  'mapping.position'='where', 
  'mapping.ts'='timestamp') 

‘ignore.malformed.json’=‘true’ 忽略坏数据为null,避免查询报错。
‘mapping.position’=‘where’, 将where字段列名映射为position字段,避免关键词作为列名使用不方便的问题, ‘mapping.ts’=‘timestamp’ 将timestamp映射为ts,作用和where相同,注意hive是大小写不敏感的,但是列名称必须和json数据结构中的列名相同,只是忽略大小写,如果不存在,则返回null值。
具体请参考

对应map这种 在union的时候,如果字段为null 如何拼接呢
map<string,string>

str_to_map(text[, delimiter1, delimiter2])

Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ‘,’ for delimiter1 and ‘=’ for delimiter2.

select str_to_map(“a:a&b:b”, ‘&’, ‘:’);
OK
{“a”:“a”,“b”:“b”}
https://github.com/rcongiu/Hive-JSON-Serde

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值