1、Custom MyOrcVectorizer
import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class MyOrcVectorizer extends Vectorizer<String> implements Serializable {

    private static final Logger logger = LoggerFactory.getLogger(MyOrcVectorizer.class);

    public MyOrcVectorizer(String schema) {
        super(schema);
    }

    @Override
    public void vectorize(String element, VectorizedRowBatch batch) throws IOException {
        // Split on the project-wide delimiter; the -1 limit keeps trailing empty fields
        String[] split = element.split(Constant.DELIMITER, -1);
        int rowId = batch.size++;
        // Every column in the schema is a string, so each vector is a BytesColumnVector
        for (int i = 0; i < split.length; i++) {
            BytesColumnVector colVector = (BytesColumnVector) batch.cols[i];
            colVector.setVal(rowId, split[i].getBytes(StandardCharsets.UTF_8));
        }
    }
}
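For a quick sanity check, the vectorizer can also be driven by hand outside Flink. A minimal sketch, assuming Constant.DELIMITER is "|" (inside a real job, Flink's OrcBulkWriter creates the batch from the same schema and flushes it to the writer once it is full):

import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.TypeDescription;

public class MyOrcVectorizerDemo {
    public static void main(String[] args) throws Exception {
        String schema = "struct<_col0:string,_col1:string,_col2:string>";
        // Flink's OrcBulkWriter builds the row batch from the same schema internally
        VectorizedRowBatch batch = TypeDescription.fromString(schema).createRowBatch();

        MyOrcVectorizer vectorizer = new MyOrcVectorizer(schema);
        vectorizer.vectorize("a|b|c", batch); // assumes Constant.DELIMITER is "|"

        System.out.println(batch.size); // 1 -- one row appended to the batch
    }
}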
2、Custom OrcSink
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.orc.writer.OrcBulkWriterFactory;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
import org.apache.hadoop.conf.Configuration;

import java.io.Serializable;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Properties;

public class OrcSink implements Serializable {

    // Date pattern for the dt= bucket id; DT_SDF was defined elsewhere in the
    // original project -- "yyyyMMdd" is an assumed value
    private static final String DT_SDF = "yyyyMMdd";

    public static StreamingFileSink<String> orcSink(String basePath, String schema) {
        Configuration conf = new Configuration();
        Properties writerProperties = new Properties();
        writerProperties.setProperty("orc.compress", "snappy");        // compression codec
        writerProperties.setProperty("orc.compress.size", "5242880");  // 5 MB compression chunks
        writerProperties.setProperty("orc.stripe.size", "5242880");    // 5 MB stripes
        writerProperties.setProperty("orc.block.size", "52428800");    // 50 MB HDFS blocks

        final OrcBulkWriterFactory<String> writerFactory = new OrcBulkWriterFactory<>(
                new MyOrcVectorizer(schema), writerProperties, conf);

        return StreamingFileSink
                .forBulkFormat(new Path(basePath), writerFactory)
                // Bucket by processing-time date so files land in Hive-style dt= partitions
                .withBucketAssigner(new BucketAssigner<String, String>() {
                    @Override
                    public String getBucketId(String element, Context context) {
                        String dt = new SimpleDateFormat(DT_SDF).format(context.currentProcessingTime());
                        return "dt=" + dt;
                    }

                    @Override
                    public SimpleVersionedSerializer<String> getSerializer() {
                        return SimpleVersionedStringSerializer.INSTANCE;
                    }
                })
                // Bulk formats can only roll on checkpoint
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                // Prefix part files with a UTC+8 epoch-second timestamp, no suffix
                .withOutputFileConfig(new OutputFileConfig(
                        String.valueOf(LocalDateTime.now().toEpochSecond(ZoneOffset.of("+8"))), ""))
                .build();
    }

    // Builds "struct<_col0:string,_col1:string,...>" with `size` string columns;
    // public so callers can derive the schema from a column count (see Usage below)
    public static String getSchema(int size) {
        StringBuilder schema = new StringBuilder("struct<_col0:string");
        for (int i = 1; i < size; i++) {
            schema.append(",_col").append(i).append(":string");
        }
        return schema.append(">").toString();
    }
}
3、Usage
StreamingFileSink<String> orcSink = OrcSink.orcSink(basePath, OrcSink.getSchema(schemaSize));
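For context, here is a minimal, self-contained job sketch showing where the sink plugs in. The path, schema size, and sample records are placeholders; note that OnCheckpointRollingPolicy only finalizes part files on checkpoints, so checkpointing must be enabled for data to become visible:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class OrcSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Required: part files are only committed when a checkpoint completes
        env.enableCheckpointing(60_000L);

        // Hypothetical input: delimited lines whose field count matches the schema
        DataStream<String> records = env.fromElements("a|b|c", "d|e|f");

        StreamingFileSink<String> orcSink =
                OrcSink.orcSink("hdfs:///tmp/orc_demo", OrcSink.getSchema(3));
        records.addSink(orcSink);

        env.execute("flink-orc-sink-demo");
    }
}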
4、Handling incompatibility between the ORC version and the HDFS cluster version
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
    at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:62)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:89)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat.getRecordReader(VectorizedOrcInputFormat.java:186)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createVectorizedReader(OrcInputFormat.java:1672)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1683)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
    ... 26 more
Both the Flink job that writes the ORC files and the Hive that reads them depend on the orc-core jar. If an old orc-core reads a file written by a newer one, it throws the ArrayIndexOutOfBoundsException above. The root cause: each orc-core release hard-codes CURRENT_WRITER in the OrcFile class to that release's newest writer version, and this version id is recorded in every ORC file it writes. When an older orc-core reads such a file, its WriterVersion lookup table has no entry for the newer id, so the read fails with the out-of-bounds error.
The fix: download the orc-core source, change CURRENT_WRITER in OrcFile to a version your Hive supports (I changed it to HIVE_13083), then install the patched orc-core into your local repository so it replaces the jar from Maven Central, and rebuild your Flink project. From then on, Flink writes files stamped with the ORC writer version you pinned.
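The change itself boils down to one line in org.apache.orc.OrcFile. A sketch of the before/after (the exact "before" constant depends on which orc-core release you start from):

// org.apache.orc.OrcFile -- before: hard-wired to the release's newest
// writer version, e.g. (constant varies by release):
public static final WriterVersion CURRENT_WRITER = WriterVersion.ORC_517;

// after: pinned to a writer version the cluster's Hive still knows about
public static final WriterVersion CURRENT_WRITER = WriterVersion.HIVE_13083;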