Hive 利用 JsonSerde 处理 JSON 格式数据

JsonSerde 的 github 地址:https://github.com/rcongiu/Hive-JSON-Serde
JsonSerde 的 jar下载地址:http://www.congiu.net/hive-json-serde/

  1. 基本使用

    1. 配置 Hive

    2. 下载 json-serde-1.3.7-jar-with-dependencies.jar 并上传到 HiveServer2 所在服务器的 /etc/hive/auxlib 目录下

      [root@cdh01 auxlib]# pwd
      /etc/hive/auxlib
      [root@cdh01 auxlib]# ll
      总用量 84
      -rw-r--r-- 1 root root 82101 6月  26 22:18 json-serde-1.3.7-jar-with-dependencies.jar
      
    3. 创建外部表

      创建表sql:

      create external table test_json_data (
        id string,
        list array<struct<col:string>> 
      ) common "测试Json表"
      row format serde 'org.openx.data.jsonserde.JsonSerDe'
      stored as textfile
      location '/user/root/json_data';
      

      测试数据:

      注意: 一行一条 json 数据

      {"id": "1000","list": [{"col": "value1"}, {"col": "value2"}, {"col": "value3"}]}
      

      在这里插入图片描述

    4. 查询表数据

      select * from test_json_data;
      

      在这里插入图片描述

    select b.basiclist.col from test_json_data t 
    LATERAL VIEW explode(t.list) b as basiclist;
    

    在这里插入图片描述

  2. 查询复杂字段

    测试数据 data.txt

    {"one":true,"three":["red","yellow","orange"],"two":19.5,"four":"poop"}
    {"one":false,"three":["red","yellow","black"],"two":129.5,"four":"stars"}
    {"one":false,"three":["pink","gold"],"two":222.56,"four":"fiat"}
    

    建表语句:

    DROP TABLE test;
    CREATE TABLE test (
      one boolean,
      three array<string>,
      two double,
      four string 
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;
    

    加载数据

    LOAD DATA LOCAL INPATH '/root/tmp/data.txt' OVERWRITE INTO TABLE test;
    
    hive> LOAD DATA LOCAL INPATH '/root/tmp/data.txt' OVERWRITE INTO TABLE test;
    Loading data to table default.test
    OK
    Time taken: 3.22 seconds
    

    查询

    select three[1] from test;
    

    在这里插入图片描述

  3. 定义嵌套结构

    测试数据 nesteddata.txt

    {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}
    

    建表语句:

    DROP TABLE json_nested_test;
    CREATE TABLE json_nested_test (
      country string,
      languages array<string>,
      religions map<string,array<int>>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;
    

    加载数据

    LOAD DATA LOCAL INPATH '/root/tmp/nesteddata.txt' OVERWRITE INTO TABLE  json_nested_test ;
    

    查询

    select country,languages,languages[0],religions,religions['catholic'][0] from json_nested_test;
    

    在这里插入图片描述

  4. Arrays结构

    测试数据 people.txt

    ["John", 26 ]
    ["Mary", 23 ]
    

    建表语句:

    DROP TABLE people; 
    CREATE TABLE people (
    	name string, 
    	age int
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;
    

    加载数据

    LOAD DATA LOCAL INPATH '/root/tmp/people.txt' OVERWRITE INTO TABLE  people;
    

    查询

    select * from people;
    

    在这里插入图片描述

    复杂 Arrays 结构
    测试数据 complex_array.txt

    ["John", { street:"10 green street", city:"Paris" }]
    ["Mary", { street:"20 red street", city:"Shanghai" }]
    

    建表语句:

    DROP TABLE complex_array ; 
    CREATE TABLE complex_array (
    	name string, 
    	address struct<street:string,city:string>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;
    

    加载数据

    LOAD DATA LOCAL INPATH '/root/tmp/complex_array .txt' OVERWRITE INTO TABLE  complex_array ;
    

    查询

    select name,address,address.city from complex_array;
    

    在这里插入图片描述

  5. 导入格式错误的数据

    测试数据 complex_array.txt

    {"country":"Italy","languages" "Italian","religions":{"catholic":"90"}}
    

    上面 json 少了一个 :

    建表语句:

    DROP TABLE json_table;
    CREATE TABLE json_table (
      country string,
      languages string,
      religions map<string,string>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;
    

    加载数据

    LOAD DATA LOCAL INPATH '/root/tmp/json_table.txt' OVERWRITE INTO TABLE  json_table ;
    

    查询

    select * from json_table;
    

    报错信息

    # Hive CLI
    hive> select * from json_table;
    OK
    Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1]
    Time taken: 0.367 seconds
    
    # Beeline
    0: jdbc:hive2://192.168.1.101:10000> select * from json_table;
    INFO  : Compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:json_table.country, type:string, comment:null), FieldSchema(name:json_table.languages, type:string, comment:null), FieldSchema(name:json_table.religions, type:map<string,string>, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.083 seconds
    INFO  : Executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table
    INFO  : Completed executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.001 seconds
    INFO  : OK
    Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] (state=,code=0)
    
    

    设置忽略掉格式错误的数据

    ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
    

    再次查询,错误的数据被空行替代

    # Hive CLI
    hive> select * from json_table;
    OK
    19/06/27 21:31:38 WARN jsonserde.JsonSerDe: Ignoring malformed JSON: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1]
    NULL	NULL	NULL
    Time taken: 0.088 seconds, Fetched: 1 row(s)
    
    # Beeline
    0: jdbc:hive2://192.168.1.101:10000> select * from json_table;
    INFO  : Compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:json_table.country, type:string, comment:null), FieldSchema(name:json_table.languages, type:string, comment:null), FieldSchema(name:json_table.religions, type:map<string,string>, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.083 seconds
    INFO  : Executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e): select * from json_table
    INFO  : Completed executing command(queryId=hive_20190627212905_2ddbcd4f-65aa-4cf3-95a3-76e3caea230e); Time taken: 0.001 seconds
    INFO  : OK
    Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] (state=,code=0)
    0: jdbc:hive2://192.168.1.101:10000> select * from json_table;
    INFO  : Compiling command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f): select * from json_table
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:json_table.country, type:string, comment:null), FieldSchema(name:json_table.languages, type:string, comment:null), FieldSchema(name:json_table.religions, type:map<string,string>, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f); Time taken: 0.086 seconds
    INFO  : Executing command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f): select * from json_table
    INFO  : Completed executing command(queryId=hive_20190627213144_97ef602d-92f8-463c-93b4-113f3686922f); Time taken: 0.0 seconds
    INFO  : OK
    +---------------------+-----------------------+-----------------------+
    | json_table.country  | json_table.languages  | json_table.religions  |
    +---------------------+-----------------------+-----------------------+
    | NULL                | NULL                  | NULL                  |
    +---------------------+-----------------------+-----------------------+
    1 row selected (0.131 seconds)
    
  6. 映射Hive关键字

    测试数据 mytable.txt

    { "myfield" : "name","timestamp" : "2019-06-27 21:56:30"}
    

    建表语句

    CREATE TABLE mytable (
      myfield string, 
      ts string
    ) 
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
    STORED AS TEXTFILE;
    

    加载数据

    LOAD DATA LOCAL INPATH '/root/tmp/mytable.txt' OVERWRITE INTO TABLE mytable;
    

    查询

    select * from mytable;
    

    在这里插入图片描述

Apache自带的 org.apache.hive.hcatalog.data.JsonSerDe 不支持忽略格式错误数据和字段映射等。

  • 5
    点赞
  • 20
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值