A High-Performance JSON Parsing Approach for HiveSQL / SparkSQL

cxf_coding

Published 2024-08-16 09:58:13


Tags: json

Article link: https://blog.csdn.net/weixin_43451620/article/details/141251016


Old approach: get_json_object(), very poor performance

```sql
select get_json_object(properties, '$.client_version_name'),
       get_json_object(properties, '$.is_big'),
       get_json_object(properties, '$.network_type'),
       get_json_object(properties, '$.is_monkey'),
       get_json_object(properties, '$.launch_package'),
       get_json_object(properties, '$.launch_type'),
       get_json_object(properties, '$.country_code'),
       get_json_object(properties, '$.android_version'),
       get_json_object(...)
  from (select properties
          from schema_name.table_name
         where pt_d = '20240727'
           and pt_h = '18'
           and type = 'oper') t
;
```

New approach: json_tuple(), excellent performance

```sql
 select t.*,a.*
  from (select properties
          from  schema_name.table_name
         where pt_d = '20240727'
           and pt_h = '18'
           and type = 'oper') t
lateral view json_tuple(t.properties,
                        'client_version_name', 'is_big', 'network_type', 'is_monkey',
                        'launch_package', 'launch_type', 'country_code', 'android_version', '...') a
          as client_version_name, is_big, network_type, is_monkey,
             launch_package, ppt_launch_type, country_code, android_version, ...
;
```
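
To try json_tuple() without access to the table above, here is a minimal sketch that runs against an inline JSON literal. The sample JSON and its values are made up for illustration; the key names are taken from the query above. It assumes an engine that allows a subquery without a FROM clause, which recent Hive versions and Spark SQL do.

```sql
-- Inline sample row; the JSON literal and its values are invented for illustration.
select a.client_version_name,
       a.network_type,
       a.country_code   -- this key is absent from the sample JSON, so it comes back as NULL
  from (select '{"client_version_name":"1.2.3","network_type":"wifi"}' as properties) t
lateral view json_tuple(t.properties, 'client_version_name', 'network_type', 'country_code') a
          as client_version_name, network_type, country_code;
```

The aliases after `as` are positional: the first alias receives the value of the first requested key, and so on, so the two lists need to stay in the same order.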

Test results: old vs. new approach
[Image: test results for the old and new approaches]

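Conclusions
1. get_json_object gets slower the more fields it extracts, with run time growing linearly in the number of fields. json_tuple takes roughly the same time no matter how many fields it extracts, and it holds a large performance advantage.
2. Given json_tuple's large performance advantage, data warehouse modeling should lean toward extracting as many fields as possible: the tables become easier to use and the cost goes down.
Recommendations
The data warehouse currently uses get_json_object everywhere. It performs poorly, seriously delays job output times, and puts heavy pressure on cluster CPU resources, which wastes cloud spend. Recommendations:
For data development: roll this method out across the board. New development must use it, and existing warehouse code should be migrated when business conditions allow. This can save the company a very large amount of big-data compute cost.
For data querying: encourage data analysts to use this method when querying data. It shortens query time and reduces big-data compute cost.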
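
When migrating an existing query, one way to gain confidence is to run both functions side by side and compare the results. The following is only a sketch under assumptions: the inline JSON literal is made up, and `<=>` is the null-safe equality operator available in both Hive and Spark SQL. On real data the subquery would be replaced by the actual table and partition filter.

```sql
-- Compare the old and new extraction on the same row.
-- Each same_* column is expected to be true when both functions agree
-- (null-safe, so two NULLs also count as equal).
select get_json_object(t.properties, '$.network_type') <=> a.network_type as same_network_type,
       get_json_object(t.properties, '$.is_big')       <=> a.is_big       as same_is_big
  from (select '{"network_type":"wifi","is_big":false}' as properties) t
lateral view json_tuple(t.properties, 'network_type', 'is_big') a
          as network_type, is_big;
```

It is worth running such a check on a sample of production data before switching over, since values such as nested objects or numbers may be rendered differently depending on the engine and version.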

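TIPS:
1. json_tuple parses the JSON string once and pulls out all requested fields from that single parse, no matter how many there are. get_json_object starts a separate parse for every field, so extracting n fields with get_json_object takes roughly n times as long as with json_tuple.
2. json_tuple supports special characters: a key that contains a special character such as @ can still be extracted, whereas get_json_object does not support such keys.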
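
The second tip can be sketched as follows. The key name `@timestamp`, the alias `ts`, and the sample values are hypothetical, chosen only to show a key containing the @ character. Since json_tuple takes the key as a plain string rather than a path expression, the key is passed exactly as it appears in the JSON.

```sql
-- Extract a key that contains '@' by passing it verbatim to json_tuple.
-- The JSON literal, the '@timestamp' key, and the alias ts are made up for illustration.
select a.ts,
       a.network_type
  from (select '{"@timestamp":"2024-07-27 18:00:00","network_type":"wifi"}' as properties) t
lateral view json_tuple(t.properties, '@timestamp', 'network_type') a
          as ts, network_type;
```

The output column needs an ordinary alias (here `ts`), because the key itself is not a valid unquoted column name.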