Writing ORC-format files to HDFS with Flink

1. Custom MyOrcVectorizer

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class MyOrcVectorizer extends Vectorizer<String> implements Serializable {

    private static final Logger logger = LoggerFactory.getLogger(MyOrcVectorizer.class);

    public MyOrcVectorizer(String schema) {
        super(schema);
    }

    @Override
    public void vectorize(String element, VectorizedRowBatch batch) throws IOException {
        // Constant.DELIMITER is a project-specific field separator; limit -1 keeps trailing empty fields
        String[] split = element.split(Constant.DELIMITER, -1);
        int rowId = batch.size++;
        // every column is declared as string in the schema, so each vector is a BytesColumnVector
        for (int i = 0; i < split.length; i++) {
            BytesColumnVector colVector = (BytesColumnVector) batch.cols[i];
            colVector.setVal(rowId, split[i].getBytes(StandardCharsets.UTF_8));
        }
    }
}
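For context, Flink's OrcBulkWriter drives vectorize roughly like this (paraphrased from flink-orc, not an exact copy), which is why the method above increments batch.size itself and never flushes:

// paraphrased from org.apache.flink.orc.writer.OrcBulkWriter
public void addElement(String element) throws IOException {
    vectorizer.vectorize(element, rowBatch);   // our vectorizer fills exactly one row
    if (rowBatch.size == rowBatch.getMaxSize()) {
        writer.addRowBatch(rowBatch);          // a full batch is handed to the ORC writer
        rowBatch.reset();
    }
}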

2. Custom OrcSink

import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.orc.writer.OrcBulkWriterFactory;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
import org.apache.hadoop.conf.Configuration;

import java.io.Serializable;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Properties;

public class OrcSink implements Serializable {

    // date pattern for the bucket id; "yyyyMMdd" is an assumption, the original project defines DT_SDF elsewhere
    private static final String DT_SDF = "yyyyMMdd";

    public static StreamingFileSink<String> orcSink(String basePath, String schema) {
        Configuration conf = new Configuration();
        Properties writerProperties = new Properties();
        writerProperties.setProperty("orc.compress", "snappy");       // compression codec
        writerProperties.setProperty("orc.compress.size", "5242880"); // 5 MB compression chunks
        writerProperties.setProperty("orc.stripe.size", "5242880");   // 5 MB stripes
        writerProperties.setProperty("orc.block.size", "52428800");   // 50 MB HDFS blocks

        final OrcBulkWriterFactory<String> writerFactory = new OrcBulkWriterFactory<>(
                new MyOrcVectorizer(schema), writerProperties, conf);
        final StreamingFileSink<String> orcSink = StreamingFileSink
                .forBulkFormat(new Path(basePath), writerFactory)
                .withBucketAssigner(new BucketAssigner<String, String>() {
                    @Override
                    public String getBucketId(String element, Context context) {
                        // bucket by processing-time date, producing Hive-style dt= partition directories
                        String dt = new SimpleDateFormat(DT_SDF).format(context.currentProcessingTime());
                        return "dt=" + dt;
                    }

                    @Override
                    public SimpleVersionedSerializer<String> getSerializer() {
                        return SimpleVersionedStringSerializer.INSTANCE;
                    }
                })
                // bulk formats can only roll part files on checkpoint
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                // prefix part files with the job start time (epoch seconds at UTC+8)
                .withOutputFileConfig(new OutputFileConfig(
                        String.valueOf(LocalDateTime.now().toEpochSecond(ZoneOffset.of("+8"))), ""))
                .build();

        return orcSink;
    }

    // builds an all-string schema such as struct<_col0:string,_col1:string,...> for the given column count
    public static String getSchema(int size) {
        StringBuilder schema = new StringBuilder("struct<_col0:string");
        for (int i = 1; i < size; i++) {
            schema.append(",_col").append(i).append(":string");
        }
        schema.append(">");
        return schema.toString();
    }
}

3. Usage

StreamingFileSink<String> orcSink = OrcSink.orcSink(basePath, OrcSink.getSchema(schemaSize));
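For context, a minimal job sketch around that call (the socket source, host/port, basePath value, schemaSize, and checkpoint interval are all placeholder assumptions). The key point is that OnCheckpointRollingPolicy only finalizes part files on checkpoints, so checkpointing must be enabled or files stay in-progress forever:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// bulk formats roll only on checkpoint, so checkpointing is mandatory here
env.enableCheckpointing(60_000);

String basePath = "hdfs:///warehouse/mydb.db/my_table"; // placeholder output path
int schemaSize = 3;                                     // number of delimited columns per record

env.socketTextStream("localhost", 9999)                 // placeholder source of delimited lines
   .addSink(OrcSink.orcSink(basePath, OrcSink.getSchema(schemaSize)));

env.execute("flink-orc-to-hdfs");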

4. Handling ORC version incompatibility between Flink and the HDFS/Hive cluster

Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
    at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:62)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:89)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat.getRecordReader(VectorizedOrcInputFormat.java:186)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createVectorizedReader(OrcInputFormat.java:1672)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1683)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
    ... 26 more

Both Flink (when writing ORC files) and Hive (when reading them) depend on the orc-core jar. If an old orc-core reads a file written by a newer one, it fails with the ArrayIndexOutOfBoundsException shown above. The root cause is that as orc-core evolves, the OrcFile class hard-codes CURRENT_WRITER to the latest WriterVersion, and that writer version is recorded in every ORC file. When an older orc-core reads such a file, its WriterVersion enum simply does not contain the newer version id, so the lookup overruns the array and throws.
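For intuition, the failing lookup in older orc-core versions looks roughly like this (paraphrased from the ORC source, not an exact copy):

// paraphrased from org.apache.orc.OrcFile.WriterVersion in an older orc-core
public static WriterVersion from(int val) {
    // 'values' is an array indexed by the writer-version ids known at compile time;
    // a file stamped with id 7 by a newer writer overruns it -> ArrayIndexOutOfBoundsException: 7
    return values[val];
}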

The fix: download the orc-core source, change CURRENT_WRITER in OrcFile to a writer version your Hive supports (I changed it to HIVE_13083), then mvn install the patched orc-core locally so it shadows the jar from the central repository, and rebuild your Flink project. From then on, Flink stamps its ORC files with the writer version you specified.
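The patch itself is a one-line change (sketched below; the exact declaration may differ slightly between orc-core versions):

// org/apache/orc/OrcFile.java in the orc-core source tree
// was: CURRENT_WRITER = WriterVersion.<latest version in this release>
public static final WriterVersion CURRENT_WRITER = WriterVersion.HIVE_13083;

After editing, mvn install the module so the patched artifact lands in your local repository, then rebuild the Flink job against it.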
