JsonSerde 的 github 地址:https://github.com/rcongiu/Hive-JSON-Serde
JsonSerde 的 jar下载地址:http://www.congiu.net/hive-json-serde/
-
基本使用
-
下载 json-serde-1.3.7-jar-with-dependencies.jar 并上传到 HiveServer2 所在服务器的 /etc/hive/auxlib 目录下
[root@cdh01 auxlib]# pwd /etc/hive/auxlib [root@cdh01 auxlib]# ll 总用量 84 -rw-r--r-- 1 root root 82101 6月 26 22:18 json-serde-1.3.7-jar-with-dependencies.jar
-
创建外部表
创建表sql:
create external table test_json_data ( id string, list array<struct<col:string>> ) common "测试Json表" row format serde 'org.openx.data.jsonserde.JsonSerDe' stored as textfile location '/user/root/json_data';
测试数据:
注意: 一行一条 json 数据
{"id": "1000","list": [{"col": "value1"}, {"col": "value2"}, {"col": "value3"}]}
-
查询表数据
select * from test_json_data;
select b.basiclist.col from test_json_data t LATERAL VIEW explode(t.list) b as basiclist;
-
查询复杂字段
测试数据 data.txt
{"one":true,"three":["red","yellow","orange"],"two":19.5,"four":"poop"} {"one":false,"three":["red","yellow","black"],"two":129.5,"four":"stars"} {"one":false,"three":["pink","gold"],"two":222.56,"four":"fiat"}
建表语句:
DROP TABLE test; CREATE TABLE test ( one boolean, three array<string>, two double, four string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
加载数据
LOAD DATA LOCAL INPATH '/root/tmp/data.txt' OVERWRITE INTO TABLE test;
hive> LOAD DATA LOCAL INPATH '/root/tmp/data.txt' OVERWRITE INTO TABLE test; Loading data to table default.test OK Time taken: 3.22 seconds
查询
select three[1] from test;
-
定义嵌套结构
测试数据 nesteddata.txt
{"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}
建表语句:
DROP TABLE json_nested_test; CREATE TABLE json_nested_test ( country string, languages array<string>, religions map<string,array<int>> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
加载数据
LOAD DATA LOCAL INPATH '/root/tmp/nesteddata.txt' OVERWRITE INTO TABLE json_nested_test ;
查询
select country,languages,languages[0],religions,religions['catholic'][0] from json_nested_test;
-
Arrays结构
测试数据 people.txt
["John", 26 ] ["Mary", 23 ]
建表语句:
DROP TABLE people; CREATE TABLE people ( name string, age int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
加载数据
LOAD DATA LOCAL INPATH '/root/tmp/people.txt' OVERWRITE INTO TABLE people;
查询
select * from people;
复杂 Arrays 结构
测试数据 complex_array.txt["John", { street:"10 green street", city:"Paris" }] ["Mary", { street:"20 red street", city:"Shanghai" }]
建表语句:
DROP TABLE complex_array ; CREATE TABLE complex_array ( name string, address struct<street:string,city:string> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
加载数据
LOAD DATA LOCAL INPATH '/root/tmp/complex_array .txt' OVERWRITE INTO TABLE complex_array ;
查询
select name,address,address.city from complex_array;
-
导入格式错误的数据
测试数据 complex_array.txt
{"country":"Italy","languages" "Italian","religions":{"catholic":"90"}}
上面 json 少了一个
:
建表语句:
DROP TABLE json_table; CREATE TABLE json_table ( country string, languages string, religions map<string,string> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
加载数据
LOAD DATA LOCAL INPATH '/root/tmp/json_table.txt' OVERWRITE INTO TABLE json_table ;
查询
select * from json_table;
报错信息
# Hive CLI hive> select * from json_table; OK Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] Time taken: 0.367 seconds # Beeline 0: jdbc:hive2://192.168.1.101:10000> select * from json_table; INFO : Compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:json_table.country, type:string, comment:null), FieldSchema(name:json_table.languages, type:string, comment:null), FieldSchema(name:json_table.religions, type:map<string,string>, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.083 seconds INFO : Executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table INFO : Completed executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.001 seconds INFO : OK Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] (state=,code=0)
设置忽略掉格式错误的数据
ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
再次查询,错误的数据被空行替代
# Hive CLI hive> select * from json_table; OK 19/06/27 21:31:38 WARN jsonserde.JsonSerDe: Ignoring malformed JSON: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] NULL NULL NULL Time taken: 0.088 seconds, Fetched: 1 row(s) # Beeline 0: jdbc:hive2://192.168.1.101:10000> select * from json_table; INFO : Compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:json_table.country, type:string, comment:null), FieldSchema(name:json_table.languages, type:string, comment:null), FieldSchema(name:json_table.religions, type:map<string,string>, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.083 seconds INFO : Executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table INFO : Completed executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.001 seconds INFO : OK Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] (state=,code=0) 0: jdbc:hive2://192.168.1.101:10000> select * from json_table; INFO : Compiling command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f): select * from json_table INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:json_table.country, type:string, comment:null), FieldSchema(name:json_table.languages, type:string, comment:null), FieldSchema(name:json_table.religions, type:map<string,string>, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f); Time taken: 0.086 seconds INFO : Executing command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f): select * from json_table INFO : Completed executing command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f); Time taken: 0.0 seconds INFO : OK +---------------------+-----------------------+-----------------------+ | json_table.country | json_table.languages | json_table.religions | +---------------------+-----------------------+-----------------------+ | NULL | NULL | NULL | +---------------------+-----------------------+-----------------------+ 1 row selected (0.131 seconds)
-
映射Hive关键字
测试数据 mytable.txt
{ "myfield" : "name","timestamp" : "2019-06-27 21:56:30"}
建表语句
CREATE TABLE mytable ( myfield string, ts string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" ) STORED AS TEXTFILE;
加载数据
LOAD DATA LOCAL INPATH '/root/tmp/mytable.txt' OVERWRITE INTO TABLE mytable;
查询
select * from mytable;
Apache自带的 org.apache.hive.hcatalog.data.JsonSerDe 不支持忽略格式错误数据和字段映射等。