DataX HdfsReader: data loss when reading ORC-format files

I recently built a data synchronization job that copies data from a Hive warehouse to PostgreSQL. The Hive table holds more than 40 million rows, but DataX finished after syncing only about 2.8 million, without reporting any error.

Digging into the DataX source, I found the core ORC-reading method in the HdfsReader module's DFSUtil class:

public void orcFileStartRead(String sourceOrcFilePath, Configuration readerSliceConfig,
                             RecordSender recordSender, TaskPluginCollector taskPluginCollector) {
    LOG.info(String.format("Start Read orcfile [%s].", sourceOrcFilePath));
    List<ColumnEntry> column = UnstructuredStorageReaderUtil
            .getListColumnEntry(readerSliceConfig, com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN);
    String nullFormat = readerSliceConfig.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.NULL_FORMAT);
    StringBuilder allColumns = new StringBuilder();
    StringBuilder allColumnTypes = new StringBuilder();
    boolean isReadAllColumns = false;
    int columnIndexMax = -1;
    // Decide whether to read all columns
    if (null == column || column.size() == 0) {
        int allColumnsCount = getAllColumnsCount(sourceOrcFilePath);
        columnIndexMax = allColumnsCount - 1;
        isReadAllColumns = true;
    } else {
        columnIndexMax = getMaxIndex(column);
    }
    for (int i = 0; i <= columnIndexMax; i++) {
        allColumns.append("col");
        allColumnTypes.append("string");
        if (i != columnIndexMax) {
            allColumns.append(",");
            allColumnTypes.append(":");
        }
    }
    if (columnIndexMax >= 0) {
        JobConf conf = new JobConf(hadoopConf);
        Path orcFilePath = new Path(sourceOrcFilePath);
        Properties p = new Properties();
        p.setProperty("columns", allColumns.toString());
        p.setProperty("columns.types", allColumnTypes.toString());
        try {
            OrcSerde serde = new OrcSerde();
            serde.initialize(conf, p);
            StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector();
            InputFormat<?, ?> in = new OrcInputFormat();
            FileInputFormat.setInputPaths(conf, orcFilePath.toString());

            // If the network disconnects, retry 45 times with a 20-second interval each time
            // Each file is treated as a single split
            // TODO: multiple threads
            InputSplit[] splits = in.getSplits(conf, -1);

            // Note: only splits[0] is read here; any further splits of the file are ignored
            RecordReader reader = in.getRecordReader(splits[0], conf, Reporter.NULL);
            Object key = reader.createKey();
            Object value = reader.createValue();
            // Get the column metadata
            List<? extends StructField> fields = inspector.getAllStructFieldRefs();

            List<Object> recordFields;
            while (reader.next(key, value)) {
                recordFields = new ArrayList<Object>();

                for (int i = 0; i <= columnIndexMax; i++) {
                    Object field = inspector.getStructFieldData(value, fields.get(i));
                    recordFields.add(field);
                }
                transportOneRecord(column, recordFields, recordSender,
                        taskPluginCollector, isReadAllColumns, nullFormat);
            }
            reader.close();
        } catch (Exception e) {
            String message = String.format("从orcfile文件路径[%s]中读取数据发生异常,请联系系统管理员。"
                    , sourceOrcFilePath);
            LOG.error(message);
            throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message);
        }
    } else {
        String message = String.format("请确认您所读取的列配置正确!columnIndexMax 小于0,column:%s", JSON.toJSONString(column));
        throw DataXException.asDataXException(HdfsReaderErrorCode.BAD_CONFIG_VALUE, message);
    }

}

HDFS stores files as blocks. When a single file is larger than one block, it spans multiple blocks, and `getSplits` returns one split per block. The code above reads only `splits[0]`, so every record beyond the first split is silently dropped, which explains the partial data loss.
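The effect can be illustrated with a minimal, self-contained sketch (no Hadoop dependencies; the per-split record counts below are made-up numbers, not measured from the real job):

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the bug: one large file whose records are spread
// across several input splits (roughly one per HDFS block).
public class SplitLossDemo {

    // Mimics the buggy reader: consumes only the first split.
    static long readFirstSplitOnly(List<Long> recordsPerSplit) {
        return recordsPerSplit.get(0);
    }

    // Mimics the fixed reader: consumes every split.
    static long readAllSplits(List<Long> recordsPerSplit) {
        return recordsPerSplit.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Hypothetical split sizes for a 40M-row file.
        List<Long> splits = Arrays.asList(2_800_000L, 20_000_000L, 17_200_000L);
        System.out.println(readFirstSplitOnly(splits)); // 2800000
        System.out.println(readAllSplits(splits));      // 40000000
    }
}
```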

The fix is to loop over all splits and read each one:

 public void orcFileStartRead(String sourceOrcFilePath, Configuration readerSliceConfig,
                                 RecordSender recordSender, TaskPluginCollector taskPluginCollector) {
        LOG.info(String.format("Start Read orcfile [%s].", sourceOrcFilePath));
        List<ColumnEntry> column = UnstructuredStorageReaderUtil
                .getListColumnEntry(readerSliceConfig, com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN);
        String nullFormat = readerSliceConfig.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.NULL_FORMAT);
        StringBuilder allColumns = new StringBuilder();
        StringBuilder allColumnTypes = new StringBuilder();
        boolean isReadAllColumns = false;
        int columnIndexMax = -1;
        // Decide whether to read all columns
        if (null == column || column.size() == 0) {
            int allColumnsCount = getAllColumnsCount(sourceOrcFilePath);
            columnIndexMax = allColumnsCount - 1;
            isReadAllColumns = true;
        } else {
            columnIndexMax = getMaxIndex(column);
        }
        for (int i = 0; i <= columnIndexMax; i++) {
            allColumns.append("col");
            allColumnTypes.append("string");
            if (i != columnIndexMax) {
                allColumns.append(",");
                allColumnTypes.append(":");
            }
        }
        if (columnIndexMax >= 0) {
            JobConf conf = new JobConf(hadoopConf);
            Path orcFilePath = new Path(sourceOrcFilePath);
            Properties p = new Properties();
            p.setProperty("columns", allColumns.toString());
            p.setProperty("columns.types", allColumnTypes.toString());
            try {
                OrcSerde serde = new OrcSerde();
                serde.initialize(conf, p);
                StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector();
                InputFormat<?, ?> in = new OrcInputFormat();
                FileInputFormat.setInputPaths(conf, orcFilePath.toString());

                // If the network disconnects, retry 45 times with a 20-second interval each time
                // Each file is treated as a single split
                // TODO: multiple threads
                InputSplit[] splits = in.getSplits(conf, -1);
                for (InputSplit split : splits) {
                    RecordReader reader = in.getRecordReader(split, conf, Reporter.NULL);
                    Object key = reader.createKey();
                    Object value = reader.createValue();
                    // Get the column metadata
                    List<? extends StructField> fields = inspector.getAllStructFieldRefs();

                    List<Object> recordFields;
                    while (reader.next(key, value)) {
                        recordFields = new ArrayList<Object>();

                        for (int i = 0; i <= columnIndexMax; i++) {
                            Object field = inspector.getStructFieldData(value, fields.get(i));
                            recordFields.add(field);
                        }
                        transportOneRecord(column, recordFields, recordSender,
                                taskPluginCollector, isReadAllColumns, nullFormat);
                    }
                    reader.close();
                }
            } catch (Exception e) {
                String message = String.format("从orcfile文件路径[%s]中读取数据发生异常,请联系系统管理员。"
                        , sourceOrcFilePath);
                LOG.error(message);
                throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message);
            }
        } else {
            String message = String.format("请确认您所读取的列配置正确!columnIndexMax 小于0,column:%s", JSON.toJSONString(column));
            throw DataXException.asDataXException(HdfsReaderErrorCode.BAD_CONFIG_VALUE, message);
        }

    }

Then rebuild the module into a jar and replace hdfsreader-0.0.1-SNAPSHOT.jar in the DataX engine.
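For reference, a sketch of the rebuild-and-swap step (the source checkout path `~/DataX` and the engine install path `/opt/datax` are placeholders for your own environment):

```shell
cd ~/DataX/hdfsreader
# Build just this module, skipping tests.
mvn clean package -Dmaven.test.skip=true
# Replace the jar inside the deployed DataX engine.
cp target/hdfsreader-0.0.1-SNAPSHOT.jar /opt/datax/plugin/reader/hdfsreader/
```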

Note: during packaging, Maven may complain that it cannot find plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar in the repository.

Workaround: copy plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar out of the DataX engine's hdfsreader/lib directory, create a lib directory under the hdfsreader module, put the jar there, and reference it locally in the pom:

<dependency>
    <groupId>com.unstrucatured</groupId>
    <artifactId>unstrucatured</artifactId>
    <version>200</version>
    <scope>system</scope>
    <systemPath>${basedir}/src/main/lib/plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar</systemPath>
</dependency>
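An alternative to the system-scope dependency is to install the copied jar into the local Maven repository, after which a normal `<dependency>` resolves it. This is a sketch assuming Maven is on the PATH and the jar sits in the current directory; the groupId/artifactId coordinates below are my assumption of the module's coordinates and should be matched to whatever your pom references:

```shell
# Install the jar into the local repository (~/.m2) so a regular
# dependency works without system scope.
mvn install:install-file \
  -Dfile=plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar \
  -DgroupId=com.alibaba.datax \
  -DartifactId=plugin-unstructured-storage-util \
  -Dversion=0.0.1-SNAPSHOT \
  -Dpackaging=jar
```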

After the change, a real-world test confirmed the problem is completely solved!

Author: 今朝花落悲颜色