hive如何解析json数组

欢迎各位关注我的个人微信公众号:赶令,将会定期更新大数据相关知识及个人心得

在hive的内置函数中,提供了两个函数用来解析json,分别是get_json_object和json_tuple

以下出现的hive函数不懂的请到 hive2.0中常用的内置函数大全 这篇博客去查看了解

首先来查看下get_json_object函数的详细信息

> desc function extended get_json_object;
OK
tab_name
get_json_object(json_txt, path) - Extract a json object from path 
Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input js
on string is invalid.A limited version of JSONPath supported:
  $   : Root object
  .   : Child operator
  []  : Subscript operator for array
  *   : Wildcard for []
Syntax not supported that is worth noticing:
  ''  : Zero length string as key
  ..  : Recursive descent
  @   : Current object/element
  ()  : Script expression
  ?() : Filter (script) expression.
  [,] : Union operator
  [start:end:step] : array slice operator

Function class:org.apache.hadoop.hive.ql.udf.UDFJson
Function type:BUILTIN
Time taken: 1.33 seconds, Fetched: 18 row(s)

语法: get_json_object(string json_string, string path)
返回值: string
说明:解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效,那么返回NULL。

select get_json_object('[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
, "$.[0].website"); -- 返回第一个

OK
_c0
https://blog.csdn.net/lz6363
Time taken: 0.448 seconds, Fetched: 1 row(s)

select get_json_object('[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
, "$.[0,1,2].website"); -- 返回第一、二、三个

OK
_c0
["https://blog.csdn.net/lz6363","https://blog.csdn.net/lz6363/article/details/86606379","https://blog.csdn.net/lz6363/article/details/82470278"]
Time taken: 0.585 seconds, Fetched: 1 row(s)

select get_json_object('[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
, "$.[0:2:1].website"); -- 返回从下标为0-2的,步长为1的

OK
_c0
["https://blog.csdn.net/lz6363","https://blog.csdn.net/lz6363/article/details/86606379","https://blog.csdn.net/lz6363/article/details/82470278"]
Time taken: 0.585 seconds, Fetched: 1 row(s)

json解析函数:json_tuple

语法:json_tuple(json_string, path1, path2, …, pathn)
返回值: tuple
说明:解析json的字符串json_string,返回path1-pathn指定的内容。如果输入的json字符串无效,那么返回NULL。

select json_tuple('{"website":"https://blog.csdn.net/lz6363/article/list/1","name":"浮云"}', 'website', 'name');
OK
c0	c1
https://blog.csdn.net/lz6363/article/list/1	浮云
Time taken: 0.788 seconds, Fetched: 1 row(s)

虽然get_json_object函数能通过下标解析指定位置的json,但当不知道json数组有多少个时,或者数组长度不断变化时,无法进行解析

我们可以通过以下办法解析json数组

先通过正则表达式将数组中的,用;代替

select regexp_replace(
'[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
);

OK
_c0
[{"website":"https://blog.csdn.net/lz6363","name":"浮云"};{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"};{"web
site":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]
Time taken: 0.458 seconds, Fetched: 1 row(s)

再通过正则表达式把数组中的[]去掉

select regexp_replace(
regexp_replace(
'[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
);

OK
_c0
{"website":"https://blog.csdn.net/lz6363","name":"浮云"};{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"};{"webs
ite":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}
Time taken: 0.521 seconds, Fetched: 1 row(s)

接下来用split函数按;进行分割

select 
split(
regexp_replace(
regexp_replace(
'[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
),'\\;'
);

OK
_c0
["{\"website\":\"https://blog.csdn.net/lz6363\",\"name\":\"浮云\"}","{\"website\":\"https://blog.csdn.net/lz6363/article/details/86606379\",\"name\":\"hive中
的开窗函数\"}","{\"website\":\"https://blog.csdn.net/lz6363/article/details/82470278\",\"name\":\"利用hive统计连续登陆的用户数\"}"]
Time taken: 0.434 seconds, Fetched: 1 row(s)

再利用explode函数将数组转换按列输出

select explode(
split(
regexp_replace(
regexp_replace(
'[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
),'\\;'
)
);

OK
col
{"website":"https://blog.csdn.net/lz6363","name":"浮云"}
{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"}
{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}
Time taken: 0.416 seconds, Fetched: 3 row(s)

最后用json_tuple函数将explode转换的json格式的列进行解析

select json_tuple(json, 'website', 'name') 
from 
(
select explode(
split(
regexp_replace(
regexp_replace(
'[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
),'\\;'
)
) as json
) fuyun;

OK
c0	c1
https://blog.csdn.net/lz6363	浮云
https://blog.csdn.net/lz6363/article/details/86606379	hive中的开窗函数
https://blog.csdn.net/lz6363/article/details/82470278	利用hive统计连续登陆的用户数
Time taken: 0.497 seconds, Fetched: 3 row(s)

还可以通过自定义UDF函数实现,继承UDF类,实现evaluate方法,在evaluate方法中写实现功能的代码,再打成jar包,将jar包添加到hive中,在创建函数进行使用
具体代码实现:

package com.jsonarray.udf.json;
 
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONException;
 
import java.util.ArrayList;
 
 
@Description(name = "json_array",
        value = "_FUNC_(array_string) - Convert a string of a JSON-encoded array to a Hive array of strings.")
public class UDFJsonAsArray extends UDF {
    public ArrayList<String> evaluate(String jsonString) {
        if (jsonString == null) {
            return null;
        }
        try {
            JSONArray extractObject = new JSONArray(jsonString);
            ArrayList<String> result = new ArrayList<String>();
            for (int ii = 0; ii < extractObject.length(); ++ii) {
                result.add(extractObject.get(ii).toString());
            }
            return result;
        } catch (JSONException e) {
            return null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}

将上面的代码进行编译打包,假设打包完的 jar 包名称为 jsonarray.jar,然后我们就可以如下使用这个函数了。
添加jar包

> add jar /opt/datas/ jsonarray.jar;

创建函数

> create temporary function json_array as 'com.jsonarray.udf.json.UDFJsonAsArray';

使用自定义UDF函数实现

select json_tuple(json, 'website', 'name') 
from
(
select explode(
json_array(
'[{"website":"https://blog.csdn.net/lz6363","name":"浮云"},{"website":"https://blog.csdn.net/lz6363/article/details/86606379","name":"hive中的开窗函数"},{"website":"https://blog.csdn.net/lz6363/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
) as json
)
) fuyun;

OK
c0	c1
https://blog.csdn.net/lz6363	浮云
https://blog.csdn.net/lz6363/article/details/86606379	hive中的开窗函数
https://blog.csdn.net/lz6363/article/details/82470278	利用hive统计连续登陆的用户数
Time taken: 0.098 seconds, Fetched: 3 row(s)
  • 8
    点赞
  • 50
    收藏
    觉得还不错? 一键收藏
  • 7
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 7
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值