Hive: Parsing a JsonArray with a UDTF

稳哥的哥

Category: Hive

Article link: https://blog.csdn.net/shufangreal/article/details/113098845

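This post shows how to write a custom Hive UDTF (user-defined table-generating function), ExplodeJsonArray, which splits a column stored as a JSON array string into multiple rows. It walks through initialization, argument validation, the declaration of the output type, and the process and close methods, and shows how each element of the JSON array string becomes a row of its own. In the example, a single column holding a JSON array string is expanded into one row per element, which is a common need when processing JSON data at scale.

Summary generated automatically by CSDN.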
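To make the transformation concrete before the Hive-specific code, here is a minimal plain-Java sketch of the row expansion: one input row whose last column holds a JSON array becomes one output row per array element. The class name `RowExpansionSketch` and the helpers `splitTopLevel` and `expand` are hypothetical and not part of the original post. The splitter is a simplified stand-in that tracks nesting depth and string quoting by hand so the sketch runs without any dependency; the UDTF itself delegates parsing to `org.json.JSONArray`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real UDTF parses with org.json.JSONArray.
class RowExpansionSketch {

    // Splits the text of a JSON array into the raw text of its top-level elements.
    // Simplified stand-in: assumes well-formed input and performs no validation.
    static List<String> splitTopLevel(String jsonArray) {
        List<String> elements = new ArrayList<>();
        String s = jsonArray.trim();
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            return elements; // not an array: emit nothing
        }
        int depth = 0;            // nesting depth of {} and [] inside the outer array
        boolean inString = false; // true while inside a JSON string literal
        int start = 1;            // start index of the current element
        for (int i = 1; i < s.length() - 1; i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;          // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                elements.add(s.substring(start, i).trim());
                start = i + 1;
            }
        }
        String last = s.substring(start, s.length() - 1).trim();
        if (!last.isEmpty()) {
            elements.add(last);
        }
        return elements;
    }

    // Pairs the leading columns of a row with each element of its JSON array column.
    static List<String[]> expand(String c1, String c2, String jsonLine) {
        List<String[]> rows = new ArrayList<>();
        for (String element : splitTopLevel(jsonLine)) {
            rows.add(new String[] {c1, c2, element});
        }
        return rows;
    }

    public static void main(String[] args) {
        String jsonLine =
            "[{\"id\":\"1001\",\"name\":\"superman\"},{\"id\":\"1002\",\"name\":\"spiderMan\"}]";
        for (String[] row : expand("red", "man", jsonLine)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

In Hive, the pairing done by `expand` is what `LATERAL VIEW` provides: the UDTF only emits the elements, and Hive joins each of them back to the columns of the originating row.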

package com.shufang.hive;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;
import java.util.List;

/**
 *  Requirement: a column stores a JSON array as a string. The table looks like this:
 *  c1              c2                  json_line
 *  red             man                 [{"id":"1001","name":"superman"},{"id":"1002","name":"spiderMan"}]
 *  green           woman               [{"id":"1001","name":"superman"},{"id":"1002","name":"spiderMan"}]
 *
 *  We want to combine LATERAL VIEW with a UDTF to explode the json_line column into multiple rows.
 *  The built-in explode() cannot do this directly: it expects an array or a map, and json_line is a string.
 *  c1              c2                  json_object
 *  red             man                 {"id":"1001","name":"superman"}
 *  red             man                 {"id":"1002","name":"spiderMan"}
 *
 *  green           woman               {"id":"1001","name":"superman"}
 *  green           woman               {"id":"1002","name":"spiderMan"}
 */
public class ExplodeJsonArray extends GenericUDTF {


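    /**
     * initialize() is called once, before any row is processed. It validates the input
     * argument types and declares the output column names and types that process() emits.
     * @param argOIs object inspectors describing the arguments passed to the function
     * @return an object inspector describing the rows this function emits
     * @throws UDFArgumentException if the argument count or type is wrong
     */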
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // 1. Validate the arguments: count and type
        List<? extends StructField> refs = argOIs.getAllStructFieldRefs();
        // 1.1 Check the argument count
        if (refs.size() != 1) {
            throw new UDFArgumentException("ExplodeJsonArray takes exactly one argument, but got " + refs.size());
        }
        // 1.2 Check the argument type; a function with several arguments would loop over refs and check each one
        StructField field = refs.get(0);
        String typeName = field.getFieldObjectInspector().getTypeName();
        if (!"string".equalsIgnoreCase(typeName)) {
            throw new UDFArgumentTypeException(0, "ExplodeJsonArray accepts only a single argument of type string, but got " + typeName);
        }


        // 2. Wrap the output column names and types in a struct object inspector and return it
        // PrimitiveObjectInspectorFactory only provides inspectors for primitive types.
        // The return value describes a struct (one output row), so it is built with ObjectInspectorFactory instead.
        ArrayList<String> structFieldNames = new ArrayList<String>();  // default output column names; callers can rename them with LATERAL VIEW fn() tbl AS c1, ..., cn
        // Each output row holds a single JSON string, so one column name is enough; the name itself is arbitrary
        structFieldNames.add("json_col");

        // The matching list of inspectors: one Java String inspector for the single output column
        ArrayList<ObjectInspector> objectInspector = new ArrayList<ObjectInspector>();
        objectInspector.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);


        // Return the struct object inspector that describes the output rows
        StructObjectInspector outSOI = ObjectInspectorFactory.getStandardStructObjectInspector(structFieldNames,objectInspector);
        return outSOI;
    }

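    /**
     * Called once for each input row.
     * @param objects the argument values for the current row
     * @throws HiveException if forwarding a row fails
     */
    @Override
    public void process(Object[] objects) throws HiveException {
        // A NULL argument produces no output rows
        if (objects == null || objects.length == 0 || objects[0] == null) {
            return;
        }
        // The argument is the JSON array as a string; parse it into a real JSONArray
        String jsonArrayStr = objects[0].toString();

        try {
            JSONArray jsonArray = new JSONArray(jsonArrayStr);

            // Emit one row per array element
            for (int i = 0; i < jsonArray.length(); i++) {
                // For [{},{},{}], take one element {} and serialize it back to a string.
                // get(i).toString() is used instead of getString(i), because some org.json
                // releases throw from getString(i) when the element is not a JSON string.
                String singlejson = jsonArray.get(i).toString();
                String[] jsons = new String[1];
                jsons[0] = singlejson;

                // forward() passes the row to the next operator
                forward(jsons);
            }
        } catch (JSONException e) {
            // Malformed JSON: print the error and emit no further rows for this input
            e.printStackTrace();
        }
    }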

    @Override
    public void close() throws HiveException {
        // Nothing to release: this function holds no resources
    }
}
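
Once the class is packaged into a jar, it still has to be registered and called through `LATERAL VIEW`. The sketch below builds the HiveQL statements involved and, optionally, runs them over JDBC. The jar path, the function name `explode_json_array`, the table name `heroes`, and the class `ExplodeJsonArrayUsage` are assumptions made for illustration; only the class name `com.shufang.hive.ExplodeJsonArray` and the columns `c1`, `c2`, `json_line`, and `json_object` come from the code above. Calling `queryAll` additionally assumes a reachable HiveServer2 and the Hive JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

// Hypothetical usage sketch; jar path, function name and table name are placeholders.
class ExplodeJsonArrayUsage {

    // Statements that make the UDTF available in the current session.
    static List<String> setupStatements(String jarPath, String functionName) {
        return Arrays.asList(
            "ADD JAR " + jarPath,
            "CREATE TEMPORARY FUNCTION " + functionName
                + " AS 'com.shufang.hive.ExplodeJsonArray'");
    }

    // Query that turns each element of json_line into its own row.
    static String explodeQuery(String functionName, String table) {
        return "SELECT c1, c2, json_object FROM " + table
            + " LATERAL VIEW " + functionName + "(json_line) tmp AS json_object";
    }

    // Runs the setup and the query on an existing Hive JDBC connection and prints the rows.
    static void queryAll(Connection conn, String jarPath, String functionName, String table)
            throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String sql : setupStatements(jarPath, functionName)) {
                stmt.execute(sql);
            }
            try (ResultSet rs = stmt.executeQuery(explodeQuery(functionName, table))) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2)
                        + "\t" + rs.getString(3));
                }
            }
        }
    }

    public static void main(String[] args) {
        // Prints the HiveQL only, so the sketch runs without a Hive cluster.
        for (String sql : setupStatements("/path/to/udtf.jar", "explode_json_array")) {
            System.out.println(sql + ";");
        }
        System.out.println(explodeQuery("explode_json_array", "heroes") + ";");
    }
}
```

The alias after `AS` in the `LATERAL VIEW` clause replaces the default column name `json_col` declared in `initialize()`, which is why the query can select `json_object`.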
