hive 处理 json数据

最新推荐文章于 2022-12-19 16:26:38 发布

Jackie_ZHF

最新推荐文章于 2022-12-19 16:26:38 发布

阅读量430

点赞数 2

分类专栏： Hive 大数据 HDFS

大数据同时被 3 个专栏收录

18 篇文章 0 订阅

订阅专栏

HDFS

6 篇文章 0 订阅

订阅专栏

Hive

5 篇文章 0 订阅

订阅专栏

两种方式
1、将json以字符串的方式整个入Hive表，然后通过使用UDF函数解析已经导入到hive中的数据，比如使用LATERAL VIEW json_tuple的方法，获取所需要的列名。

2、在导入之前将json拆成各个字段，导入Hive表的数据是已经解析过得。这将需要使用第三方的SerDe。

测试数据
测试数据为新浪微博的评论数据，格式如下：

{
"appCode": "weibo",
"dataType": "comment",
"pageToken": null,
"data": [
{
"rating": null,
"commenterId": "2235850235",
"tags": null,
"commenterScreenName": "-快乐的猪头-",
"publishDateStr": "2017-05-22T02:27:52",
"publishDate": 1495420072,
"likeCount": null,
"commentCount": null,
"source": "iPhone 6",
"url": null,
"referId": "4110146290137390",
"content": "盲道上都是共享单车了，管一管吧",
"imageUrls": null,
"id": "4110152040668671"
},
{
"rating": null,
"commenterId": "1457994444",
"tags": null,
"commenterScreenName": "彳拓",
"publishDateStr": "2017-05-22T02:06:26",
"publishDate": 1495418786,
"likeCount": null,
"commentCount": null,
"source": "iPhone 6 Plus",
"url": null,
"referId": "4110146290137390",
"content": "如何界定那车是残疾人的？",
"imageUrls": null,
"id": "4110146646971555"
}
]
}

该数据采用json格式存储。

第一种：
导入数据

CREATE TABLE IF NOT EXISTS tmp_json_test (
json string
)
STORED AS textfile ;

load data local inpath ‘/opt/datas/weibotest.json’ overwrite into table tmp_json_test;

解析数据：

select get_json_object(t.json,’.id′),getjsonobject(t.json,′ .id'), get_json_object(t.json,'.id′),getjsonobject(t.json,′.total_number’) from tmp_json_test t ;

select t2.* from tmp_json_test t1 lateral view json_tuple(t1.json, ‘id’, ‘total_number’) t2 as c1, c2;

第二种：
第二种方式相比第一种更灵活，更通用。重要的是每行必须是一个完整的JSON，一个JSON不能跨越多行。

下载Jar，使用之前先下载jar：

http://www.congiu.net/hive-json-serde/
如果要想在Hive中使用JsonSerde，需要把jar添加到hive类路径中：

add jar json-serde-1.3.7-jar-with-dependencies.jar;

建表

create table if not exists temp_db.test_json_weibo(
appCode string
,dateType string
,pageToken string
,data string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"ignore.malformed.json"="true"
)
STORED AS TEXTFILE;

load data local inpath ‘/home/hadoop/test_json_weibo.txt’ into table temp_db.test_json_weibo;

查数据

select * from temp_db.test_json_weibo limit 1;
1

倒入之后就可以随便使用了

select * from tmp_json_array where array_contains(ids,‘2813165271’) or array_contains(ids,‘1419789200’);

需要注意的是当你的数据中包含有不符合json规范的行时，运行查询会报异常

测试可以增加配置用以跳过错误数据

ALTER TABLE weibo_json SET SERDEPROPERTIES ( “ignore.malformed.json” = “true”);
在运行查询不会报错，但是坏数据记录将变为NULL。

最后需要提醒的是当你的json数据中包含hive关键字时，导入的数据会有问题，此时 SerDe可以使用SerDe属性将hive列映射到名称不同的属性

如果ids是hive关键字的话，更改建表语句如下：

复制代码
CREATE TABLE tmp_json_array (
id string,
ids_alias array,
total_number int)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
WITH SERDEPROPERTIES (“mapping.ids_alias”=“ids”)
STORED AS TEXTFILE;
---------------------
作者：cuteximi_1995
来源：CSDN
原文：https://blog.csdn.net/qq_31975963/article/details/88657709
版权声明：本文为博主原创文章，转载请附上博文链接！