Flink Tutorial: Writing Streaming Data to Files in ORC Format with Flink 1.11



In Flink, StreamingFileSink is an important sink for writing streaming data to a file system. It supports row-encoded formats (such as JSON and CSV) as well as bulk, columnar formats (ORC and Parquet).
Hive is a widely used data store, and ORC, a columnar format specially optimized for Hive, plays an important role among Hive's storage formats. This post walks through using StreamingFileSink to write streaming data to the file system in ORC format, a feature available since Flink 1.11.

StreamingFileSink Overview

StreamingFileSink exposes two static methods for constructing a sink: forRowFormat builds a sink that writes row-encoded data, while forBulkFormat builds a sink that writes bulk (columnar) data.

Let's take a look at the forBulkFormat method.

    /**
     * Creates the builder for a {@link StreamingFileSink} with bulk-encoding format.
     *
     * @param basePath the base path where all the buckets are going to be created as
     *     sub-directories.
     * @param writerFactory the {@link BulkWriter.Factory} to be used when writing elements in the
     *     buckets.
     * @param <IN> the type of incoming elements
     * @return The builder where the remaining of the configuration parameters for the sink can be
     *     configured. In order to instantiate the sink, call {@link BulkFormatBuilder#build()}
     *     after specifying the desired parameters.
     */
    public static <IN> StreamingFileSink.DefaultBulkFormatBuilder<IN> forBulkFormat(
            final Path basePath, final BulkWriter.Factory<IN> writerFactory) {
        return new StreamingFileSink.DefaultBulkFormatBuilder<>(
                basePath, writerFactory, new DateTimeBucketAssigner<>());
    }

forBulkFormat takes two parameters: the base path under which the buckets are created, and a factory implementing the BulkWriter.Factory interface that is used to create the writers.
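
For contrast, a row-format sink only needs an Encoder that serializes each record as it arrives. A minimal sketch using org.apache.flink.api.common.serialization.SimpleStringEncoder (the output path is just an illustration):

	StreamingFileSink<String> rowSink = StreamingFileSink
			.forRowFormat(new Path("file:///tmp/row-output"), new SimpleStringEncoder<String>("UTF-8"))
			.build();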

The ORC Writer Factory

To use the ORC bulk writer, add the flink-orc dependency:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-orc_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
</dependency>

Flink provides OrcBulkWriterFactory for writing data in ORC format. Let's take a quick look at some of its fields and constructors.


@PublicEvolving
public class OrcBulkWriterFactory<T> implements BulkWriter.Factory<T> {

	private static final Path FIXED_PATH = new Path(".");

	private final Vectorizer<T> vectorizer;
	private final Properties writerProperties;
	private final Map<String, String> confMap;
	private OrcFile.WriterOptions writerOptions;
	
	public OrcBulkWriterFactory(Vectorizer<T> vectorizer) {
		this(vectorizer, new Configuration());
	}
	public OrcBulkWriterFactory(Vectorizer<T> vectorizer, Configuration configuration) {
		this(vectorizer, null, configuration);
	}
	public OrcBulkWriterFactory(Vectorizer<T> vectorizer, Properties writerProperties, Configuration configuration) {
		// ...
	}

	// ...
}

Vectorization

Flink uses Hive's VectorizedRowBatch to write ORC data, so the input records have to be converted into VectorizedRowBatch objects. This conversion is handled by the Vectorizer field of OrcBulkWriterFactory, an abstract class whose key method is org.apache.flink.orc.vector.Vectorizer#vectorize.

Flink ships with RowDataVectorizer, which accepts RowData input. Its vectorize method copies the incoming RowData record into the VectorizedRowBatch column by column, according to the field types.


	@Override
	public void vectorize(RowData row, VectorizedRowBatch batch) {
		int rowId = batch.size++;
		for (int i = 0; i < row.getArity(); ++i) {
			setColumn(rowId, batch.cols[i], fieldTypes[i], row, i);
		}
	}
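
If your stream carries your own POJO rather than RowData, you can subclass Vectorizer and fill the column vectors yourself. Below is a hypothetical sketch for a record with an int id and a double score; the Person class, the schema string, and the field mapping are assumptions for illustration (ORC int columns are backed by LongColumnVector, double columns by DoubleColumnVector):

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

import java.io.IOException;

// hypothetical POJO used only for this sketch
class Person {
	int id;
	double score;
}

public class PersonVectorizer extends Vectorizer<Person> {

	public PersonVectorizer() {
		// ORC schema matching the two fields of Person
		super("struct<id:int,score:double>");
	}

	@Override
	public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
		// each call appends one row to the current batch
		int rowId = batch.size++;
		((LongColumnVector) batch.cols[0]).vector[rowId] = element.id;
		((DoubleColumnVector) batch.cols[1]).vector[rowId] = element.score;
	}
}

Such a vectorizer could then be passed to OrcBulkWriterFactory in place of RowDataVectorizer.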

OrcBulkWriterFactory Constructors

The factory offers three constructors. The most complete one takes three arguments: the Vectorizer described above, a Properties object with ORC writer settings, and a Hadoop Configuration.

The writer settings are the ORC configuration options documented at https://orc.apache.org/docs/hive-config.html, such as the compression codec orc.compress used in the example below.
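
As a sketch, a few keys from that page that are commonly set (the values here are purely illustrative):

	Properties writerProps = new Properties();
	writerProps.setProperty("orc.compress", "SNAPPY");          // compression codec: NONE, ZLIB, SNAPPY, LZO, LZ4, ...
	writerProps.setProperty("orc.stripe.size", "67108864");     // target stripe size in bytes
	writerProps.setProperty("orc.bloom.filter.columns", "a1");  // columns for which bloom filters are written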

Worked Example

Building the Source

First we define a custom source that emits mock RowData records. It is deliberately simple: each record carries a random int and a random double.


	public static class MySource implements SourceFunction<RowData> {

		private volatile boolean isRunning = true;

		@Override
		public void run(SourceContext<RowData> sourceContext) throws Exception {
			while (isRunning) {
				// two fields: a random int and a random double
				GenericRowData rowData = new GenericRowData(2);
				rowData.setField(0, (int) (Math.random() * 100));
				rowData.setField(1, Math.random() * 100);
				sourceContext.collect(rowData);
				Thread.sleep(10);
			}
		}

		@Override
		public void cancel() {
			isRunning = false;
		}
	}

Constructing the OrcBulkWriterFactory

Next, define the parameters needed to construct the OrcBulkWriterFactory.


		// ORC writer properties
		final Properties writerProps = new Properties();
		writerProps.setProperty("orc.compress", "LZ4");

		// field types and field names
		LogicalType[] orcTypes = new LogicalType[]{new IntType(), new DoubleType()};
		String[] fields = new String[]{"a1", "b2"};
		TypeDescription typeDescription = OrcSplitReaderUtil.logicalTypeToOrcType(RowType.of(
				orcTypes,
				fields));

		// build the OrcBulkWriterFactory; typeDescription.toString() yields the ORC schema string
		final OrcBulkWriterFactory<RowData> factory = new OrcBulkWriterFactory<>(
				new RowDataVectorizer(typeDescription.toString(), orcTypes),
				writerProps,
				new Configuration());

Constructing the StreamingFileSink


		final StreamingFileSink<RowData> orcSink = StreamingFileSink
				.forBulkFormat(new Path("file:///tmp/aaaa"), factory)
				.build();
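
Putting it all together, here is a minimal sketch of the job wiring, reusing the MySource and orcSink defined above (the job name is arbitrary). Note that for bulk formats the sink rolls files on every checkpoint, so checkpointing must be enabled or the in-progress files will never be finalized:

		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		// bulk-format files are only finalized on checkpoints
		env.enableCheckpointing(10000);

		env.addSource(new MySource())
				.addSink(orcSink);

		env.execute("flink 1.11 orc sink demo");

After a few checkpoints, finished ORC files appear under file:///tmp/aaaa, grouped into buckets by the default DateTimeBucketAssigner seen in forBulkFormat above.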


