Importing JSON-Formatted Data into Hive Tables

purisuit_knowledge

Category: hive. Tags: hive, JSON data

Article link: https://blog.csdn.net/purisuit_knowledge/article/details/77852149

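Broadly speaking, there are two approaches:

1. Load each JSON document whole, as a string, into a Hive table, then use LATERAL VIEW json_tuple to extract the required fields.

2. Split the JSON into individual fields when loading it into the Hive table. This requires a third-party SerDe.

The drawback of the first approach is that it cannot handle complex types (for example, when the Hive table has columns of type array or map).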

Hands-on (approach 1):

1. Create the table

CREATE TABLE tmp_json_test (
    appkey string,
    json string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

2. Load the data:

load data local inpath '/home/jb-gongmingfeng/test_data.log' overwrite into table tmp_json_test;

Sample rows from the file:

appkey001|{"count":2,"usage":91273,"pkg":"com.example.gotest"}
appkey001|{"count":234,"usage":9876,"pkg":"com.example.gotest"}
appkey001|{"count":34,"usage":5432,"pkg":"com.example.msg"}

3. There are two ways to read the JSON data.

Method 1:

select t.appkey, get_json_object(t.json,'$.count'), get_json_object(t.json,'$.usage') from tmp_json_test t;

Method 2:

select t1.appkey, t2.* from tmp_json_test t1 lateral view json_tuple(t1.json, 'count', 'usage') t2 as c1, c2;

Both queries return the same result:
   
 appkey001   2   91273  
 appkey001   234 9876  
 appkey001   34  5432  
 appkey001   56  3454  
 appkey001   354 3557  
 appkey001   12  79090  
 appkey001   5   2145  
 appkey001   3   5673  
 appkey001   75  3457  
 appkey001   2   6879  

4. To summarize: method 1 uses the function get_json_object, while method 2 uses the function json_tuple.
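To make clear what both queries compute, here is a minimal Python sketch of the same extraction outside Hive. It is only an illustration of the semantics, not how Hive implements these functions. The helper name extract is hypothetical, and the sketch assumes the JSON itself contains no '|' character.

```python
import json

# Sample rows copied from the article: appkey, then a JSON string, separated by '|'.
ROWS = [
    'appkey001|{"count":2,"usage":91273,"pkg":"com.example.gotest"}',
    'appkey001|{"count":234,"usage":9876,"pkg":"com.example.gotest"}',
    'appkey001|{"count":34,"usage":5432,"pkg":"com.example.msg"}',
]


def extract(line, keys=("count", "usage")):
    """Split one row on the first '|' and pull the requested top-level JSON keys.

    Hypothetical helper for illustration; assumes no '|' inside the JSON.
    """
    appkey, raw = line.split("|", 1)
    obj = json.loads(raw)
    # A missing key becomes None here, standing in for a NULL column value.
    return (appkey,) + tuple(obj.get(k) for k in keys)


RESULT = [extract(line) for line in ROWS]
for row in RESULT:
    print(*row, sep="\t")
```

Both Hive functions only look at keys you name explicitly, which is why this approach stays simple for flat objects but becomes awkward for nested arrays and maps.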

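The second approach (a third-party SerDe) is more flexible and more general than the first. The important constraint is that every line must be one complete JSON object. A JSON object cannot span multiple lines, so the SerDe does not work on multi-line JSON. This follows from how Hadoop processes files: a file must be splittable, and Hadoop splits text files at line ends.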
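Because of the one-object-per-line constraint, it can help to check a file before loading it. The following is a minimal sketch of such a check in Python. The function name find_bad_lines and the sample records are assumptions made for illustration and are not part of the SerDe.

```python
import json


def find_bad_lines(lines):
    """Return the 1-based numbers of lines that are not a complete JSON object."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            # Truncated, empty, or otherwise unparseable line.
            bad.append(number)
            continue
        if not isinstance(value, dict):
            # Valid JSON, but not an object (for example a bare string or array).
            bad.append(number)
    return bad


sample = [
    '{"country":"China","languages":["chinese"]}',
    '{"country":"Switzerland",',           # an object split across two lines
    '"languages":["German","French"]}',
]
BAD = find_bad_lines(sample)
print(BAD)
```

In practice you would pass an open file object instead of the sample list, since iterating over a text file yields one line at a time.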

Hands-on (approach 2):

1. Download the jar

Download the jar before use:

http://www.congiu.net/hive-json-serde/

To use JsonSerde in Hive, add the jar to the Hive classpath:

add jar json-serde-1.3.7-jar-with-dependencies.jar;

2. Working with arrays

Source data:

{"country":"Switzerland","languages":["German","French","Italian"]}
{"country":"China","languages":["chinese"]}

Hive table:

CREATE TABLE tmp_json_array (
    country string,
    languages array<string> 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/xiaosi/a.txt' OVERWRITE INTO TABLE  tmp_json_array;

Usage:

hive> select languages[0] from tmp_json_array;
OK
German
chinese
Time taken: 0.096 seconds, Fetched: 2 row(s)

3. Nested structures

Source data:

{"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[6,7]}}
{"country":"China","languages":["chinese"],"religions":{"catholic":[10,20],"protestant":[40,50]}}

Hive table:

CREATE TABLE tmp_json_nested (
    country string,
    languages array<string>,
    religions map<string,array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/xiaosi/a.txt' OVERWRITE INTO TABLE  tmp_json_nested ;

Usage:

hive> select * from tmp_json_nested;
OK
Switzerland	["German","French","Italian"]	{"catholic":[6,7]}
China	["chinese"]	{"catholic":[10,20],"protestant":[40,50]}
Time taken: 0.113 seconds, Fetched: 2 row(s)
hive> select languages[0] from tmp_json_nested;
OK
German
chinese
Time taken: 0.122 seconds, Fetched: 2 row(s)
hive> select religions['catholic'][0] from tmp_json_nested;
OK
6
10
Time taken: 0.111 seconds, Fetched: 2 row(s)

4. Bad data

By default, malformed data raises an exception. For example, take this malformed JSON (the ':' after languages is missing):

{"country":"Italy","languages"["Italian"],"religions":{"protestant":[40,50]}}

Usage:

hive> LOAD DATA LOCAL INPATH '/home/xiaosi/a.txt' OVERWRITE INTO TABLE  tmp_json_nested ;
Loading data to table default.tmp_json_nested
OK
Time taken: 0.23 seconds
hive> select * from tmp_json_nested;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: 
Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 31 [character 32 line 1]
Time taken: 0.096 seconds

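Failing the whole query is not a good strategy, because real data will inevitably contain some bad records. The following statement makes the SerDe ignore malformed records: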

ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");

After changing the setting:

hive> ALTER TABLE tmp_json_nested SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
OK
Time taken: 0.122 seconds
hive> select * from tmp_json_nested;
OK
Switzerland	["German","French","Italian"]	{"catholic":[6,7]}
China	["chinese"]	{"catholic":[10,20],"protestant":[40,50]}
NULL	NULL	NULL
Time taken: 0.103 seconds, Fetched: 3 row(s)

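The query no longer fails, but the bad record comes back as NULL NULL NULL.

Note:

If the JSON is well-formed but does not match the Hive table schema, the record is not skipped and the query still fails: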

{"country":"Italy","languages":"Italian","religions":{"catholic":"90"}}

Usage:

hive> ALTER TABLE tmp_json_nested SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
OK
Time taken: 0.081 seconds
hive> select * from tmp_json_nested;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException:
java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONArray
Time taken: 0.097 seconds
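Since ignore.malformed.json does not cover well-formed records with the wrong types, one option is to screen the file before loading it. The sketch below is a strict Python check against the tmp_json_nested schema (country string, languages array<string>, religions map<string,array<int>>). The function names classify and is_list_of are hypothetical, and the check is my own assumption about what counts as a mismatch. It deliberately flags a scalar where an array is declared, in line with the error shown above.

```python
import json


def is_list_of(kind, value):
    """True if value is a list whose items are all exactly of type kind."""
    # type(v) is kind (rather than isinstance) keeps JSON booleans out of int lists.
    return isinstance(value, list) and all(type(v) is kind for v in value)


def classify(line):
    """Return 'ok', 'malformed' or 'schema_mismatch' for one input line."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return "malformed"
    if not isinstance(obj, dict):
        return "malformed"

    country = obj.get("country")
    languages = obj.get("languages")
    religions = obj.get("religions")

    # Missing attributes are accepted here; they would simply be absent values.
    ok = (
        (country is None or isinstance(country, str))
        and (languages is None or is_list_of(str, languages))
        and (religions is None or (
            isinstance(religions, dict)
            and all(is_list_of(int, v) for v in religions.values())
        ))
    )
    return "ok" if ok else "schema_mismatch"


GOOD = '{"country":"China","languages":["chinese"],"religions":{"catholic":[10,20],"protestant":[40,50]}}'
BROKEN = '{"country":"Italy","languages"["Italian"],"religions":{"protestant":[40,50]}}'
WRONG_TYPES = '{"country":"Italy","languages":"Italian","religions":{"catholic":"90"}}'

for record in (GOOD, BROKEN, WRONG_TYPES):
    print(classify(record))
```

Records classified as schema_mismatch can then be fixed or filtered out before LOAD DATA, so that a single bad row does not fail every query on the table.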

5. Promoting a scalar to an array

A common problem is a field that is sometimes a scalar and sometimes an array, for example:

{ field: "hello", .. }
{ field: [ "hello", "world" ], ...

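In this case, if the column is declared as array<string> and the SerDe finds a scalar, it returns a single-element array, effectively promoting the scalar to an array. The scalar must still be of the correct type.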
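The promotion rule described above can be pictured with a short Python sketch. This only illustrates the behaviour for a column declared array<string> and is not the SerDe's code. The function name as_string_array is hypothetical, and raising TypeError for a scalar of the wrong type is an assumption made for the example.

```python
import json


def as_string_array(value):
    """Mimic scalar-to-array promotion for a column declared array<string>."""
    if value is None:
        return None
    if isinstance(value, list):
        return value            # already an array: pass through unchanged
    if isinstance(value, str):
        return [value]          # scalar of the right type: wrap it
    # Assumption for illustration: a scalar of the wrong type is rejected.
    raise TypeError(f"cannot promote {type(value).__name__} to array<string>")


records = ['{"field": "hello"}', '{"field": ["hello", "world"]}']
PROMOTED = [as_string_array(json.loads(r)["field"]) for r in records]
print(PROMOTED)
```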

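6. Mapping Hive keywords

Sometimes the JSON data has an attribute whose name is a reserved word in Hive. For example, you may have a JSON attribute named "timestamp", which is a reserved word in Hive, so Hive fails when you issue CREATE TABLE. This SerDe can use SerDe properties to map a Hive column to an attribute with a different name.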

{"country":"Switzerland","exec_date":"2017-03-14 23:12:21"}
{"country":"China","exec_date":"2017-03-16 03:22:18"}

CREATE TABLE tmp_json_mapping (
    country string,
    dt string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.dt"="exec_date")
STORED AS TEXTFILE;

hive> select * from tmp_json_mapping;
OK
Switzerland	2017-03-14 23:12:21
China	2017-03-16 03:22:18
Time taken: 0.081 seconds, Fetched: 2 row(s)

"mapping.dt" means that the dt column reads the value of the JSON attribute exec_date.
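The effect of the mapping property is a rename applied at read time. A minimal Python sketch of that idea follows. The names MAPPING, COLUMNS and read_row are hypothetical and only mirror the table definition above, not the SerDe's internals.

```python
import json

# Hive column name -> JSON attribute name, mirroring "mapping.dt"="exec_date".
MAPPING = {"dt": "exec_date"}
# Columns of tmp_json_mapping, in declaration order.
COLUMNS = ["country", "dt"]


def read_row(line):
    """Read one JSON line into a tuple of column values, applying the mapping."""
    obj = json.loads(line)
    # Columns without a mapping entry are looked up under their own name.
    return tuple(obj.get(MAPPING.get(col, col)) for col in COLUMNS)


ROW = read_row('{"country":"Switzerland","exec_date":"2017-03-14 23:12:21"}')
print(ROW)
```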

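Source: https://github.com/rcongiu/Hive-JSON-Serde

Note: I once found this jar online (wget https://hive-json-serde.googlecode.com/files/hive-json-serde-0.2.jar). In my tests it did not support complex types well, and queries were very slow after the data was loaded into the Hive table.

Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe. I have not tried the method recommended there, because our Hive version is relatively old.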
