Getting Started with Flink: Using StreamingFileSink

今天上上签

Article link: https://blog.csdn.net/bradym/article/details/109578392

Requirement: use Flink to consume Kafka messages in real time and store them on HDFS.
Approach: use the StreamingFileSink that Flink provides.

StreamingFileSink

The forRowFormat method
The forBulkFormat method
- Parquet format
- Parquet format + Snappy compression
Custom bucketing strategy
Rolling policy
Optimization
References

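The forRowFormat method

This method is fairly simple: it writes the records it reads to HDFS in a row-oriented format. Let's look directly at the code from the official documentation (the bucketing strategy is covered later):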

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

val input: DataStream[String] = ...

val sink: StreamingFileSink[String] = StreamingFileSink
    .forRowFormat(new Path(outputPath), new SimpleStringEncoder[String]("UTF-8"))
    .withRollingPolicy(
        // Configure the rolling policy
        DefaultRollingPolicy.builder()
            // Roll to a new file once the current one has been open for 15 minutes
            .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
            // Roll to a new file if no data has arrived for 5 minutes
            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
            // Roll to a new file once the current one reaches 1 GiB
            .withMaxPartSize(1024L * 1024 * 1024)
            .build())
    // Configure the bucketing strategy
    .withBucketAssigner(dayAssigner)
    .build()

input.addSink(sink)

The code itself is enough to see how this sink is used.
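To make the three rolling conditions above more concrete, here is a minimal plain-Java sketch of the decision they describe. It does not use Flink: the class RollingDecisionSketch and its method shouldRoll are hypothetical names made up for illustration, and the exact comparisons Flink performs internally may differ slightly (for example strict versus non-strict inequality).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Flink: mirrors the three thresholds configured above. */
final class RollingDecisionSketch {
    static final long ROLLOVER_INTERVAL_MS = TimeUnit.MINUTES.toMillis(15);
    static final long INACTIVITY_INTERVAL_MS = TimeUnit.MINUTES.toMillis(5);
    static final long MAX_PART_SIZE_BYTES = 1024L * 1024 * 1024;

    /** Returns true if the in-progress part file should be closed and a new one started. */
    static boolean shouldRoll(long nowMs, long createdAtMs, long lastWriteMs, long sizeBytes) {
        boolean tooOld = nowMs - createdAtMs >= ROLLOVER_INTERVAL_MS;   // open for 15 minutes
        boolean idle = nowMs - lastWriteMs >= INACTIVITY_INTERVAL_MS;   // no data for 5 minutes
        boolean tooBig = sizeBytes >= MAX_PART_SIZE_BYTES;              // reached 1 GiB
        return tooOld || idle || tooBig;
    }

    public static void main(String[] args) {
        long minute = TimeUnit.MINUTES.toMillis(1);
        // File opened at t=0, last written at minute 3, checked at minute 4, 1 KiB so far.
        System.out.println(shouldRoll(4 * minute, 0, 3 * minute, 1024));
        // Same file checked at minute 10 with no writes since minute 4: idle for 6 minutes.
        System.out.println(shouldRoll(10 * minute, 0, 4 * minute, 1024));
    }
}
```

Any one of the three conditions is enough to trigger a roll, so in practice the smallest threshold that is hit first determines the file size and file count on HDFS.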

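The forBulkFormat method

Besides the row-oriented storage shown above, we often need other storage formats such as Parquet, Avro, or ORC, and we also need to compress the written files. For these cases we switch to the other method, which uses a different kind of encoder.

Parquet format

Parquet is used as the example here; the other formats work similarly.
First, add an extra dependency: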

> <dependency>   
> <groupId>org.apache.flink</groupId>  
> <artifactId>flink-parquet_2.11</artifactId>  
> <version>1.11.2</version> 
> </dependency>

Then the code:

import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters

val input: DataStream[DemoBean] = ...

val sink: StreamingFileSink[DemoBean] = StreamingFileSink
    .forBulkFormat(outputBasePath, ParquetAvroWriters.forReflectRecord(classOf[DemoBean]))
    .withBucketAssigner(assigner)
    .build()

input.addSink(sink)

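The encoder clearly changes from the row-format one:

new SimpleStringEncoder[String]("UTF-8")

to:

ParquetAvroWriters.forReflectRecord(classOf[DemoBean])

Note that this way of writing does not configure a rolling policy for the files; we will come back to that later. With this in place, the data can be written to HDFS in Parquet format.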
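For contrast with the bulk writer, the following plain-Java sketch shows what a row encoder conceptually does with each record: it turns the record into text and appends a newline, so every record becomes one line in the part file. This is an illustration only and not Flink's SimpleStringEncoder itself; the class RowEncoderSketch and its encode method are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of row-wise encoding, not a Flink class. */
final class RowEncoderSketch {

    /** Encodes one record as its string form followed by a newline. */
    static <T> byte[] encode(T element, Charset charset) {
        return (element.toString() + "\n").getBytes(charset);
    }

    public static void main(String[] args) {
        // Stand-in for an in-progress part file.
        ByteArrayOutputStream part = new ByteArrayOutputStream();
        for (String record : new String[] {"first record", "second record"}) {
            byte[] line = encode(record, StandardCharsets.UTF_8);
            part.write(line, 0, line.length);
        }
        // Each record ends up on its own line.
        System.out.print(new String(part.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

A bulk format such as Parquet cannot be written this way, because the writer buffers records and lays them out in column chunks, which is why forBulkFormat takes a writer factory instead of a per-record encoder.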

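Parquet format + Snappy compression

Attentive readers will have noticed that both approaches above can be found in the official documentation. In production, however, there is often one more requirement: the written files must be compressed, for example with Snappy, LZO, or Gzip. The official documentation does not provide an example of this, but we can implement it ourselves.
First, add the following extra dependencies: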

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet_2.11</artifactId>
    <version>1.9.1</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-avro</artifactId>
    <version>1.9.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.10.0</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.10.0</version>
</dependency>

Then take a look at the source of the AvroParquetWriter class:

/**
   * Create a new {@link AvroParquetWriter}.
   *
   * @param file The file name to write to.
   * @param writeSupport The schema to write with.
   * @param compressionCodecName Compression code to use, or CompressionCodecName.UNCOMPRESSED
   * @param blockSize the block size threshold.
   * @param pageSize See parquet write up. Blocks are subdivided into pages for alignment and other purposes.
   * @param enableDictionary Whether to use a dictionary to compress columns.
   * @param conf The Configuration to use.
   * @throws IOException
   */
  AvroParquetWriter(Path file, WriteSupport<T> writeSupport,
                           CompressionCodecName compressionCodecName,
                           int blockSize, int pageSize, boolean enableDictionary,
                           boolean enableValidation, WriterVersion writerVersion,
                           Configuration conf)
      throws IOException {
    super(file, writeSupport, compressionCodecName, blockSize, pageSize,
        pageSize, enableDictionary, enableValidation, writerVersion, conf);
  }

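As you can see, a CompressionCodecName parameter can be passed in, but the ParquetAvroWriters.forReflectRecord(classOf[DemoBean]) method we called above does not accept a compression codec. So we write our own version of this method and call it: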

StreamingFileSink.forBulkFormat(
      outputBasePath,
      ParquetAvroWriterSnappyScala.forReflectRecord(CompressionCodecName.SNAPPY, classOf[DemoBean])
      )
     .withBucketAssigner(dayAssigner)
     .build()

Scala implementation of the custom writer factory (the object is named ParquetAvroWriterSnappyScala in the code):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.reflect.ReflectData
import org.apache.flink.formats.parquet.{ParquetBuilder, ParquetWriterFactory}
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.io.OutputFile

object ParquetAvroWriterSnappyScala {
  // Builds a ParquetWriterFactory for DemoBean that compresses with the given codec.
  def forReflectRecord(compressionCodecName: CompressionCodecName, t1: Class[DemoBean]): ParquetWriterFactory[DemoBean] = {
    // Carry the schema as a String and re-parse it when the writer is created.
    val schemaString = ReflectData.get().getSchema(t1).toString()
    val builder = new MyParquet(schemaString, compressionCodecName)
    new ParquetWriterFactory(builder)
  }

  // Creates the actual Parquet writer for one output file.
  def createAvroParquetWriter(schemaString: String, dataModel: GenericData, out: OutputFile, compressionCodecName: CompressionCodecName): ParquetWriter[DemoBean] = {
    val schema = new Schema.Parser().parse(schemaString)
    AvroParquetWriter.builder[DemoBean](out)
      .withSchema(schema)
      .withDataModel(dataModel)
      .withCompressionCodec(compressionCodecName)
      .build()
  }
}

// ParquetBuilder implementation that creates a compressed writer for each output file.
private class MyParquet(schemaString: String, compressionCodecName: CompressionCodecName) extends ParquetBuilder[DemoBean] {
  override def createWriter(out: OutputFile): ParquetWriter[DemoBean] = {
    ParquetAvroWriterSnappyScala.createAvroParquetWriter(schemaString, ReflectData.get(), out, compressionCodecName)
  }
}

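Java implementation of the same factory (the class is named ParquetAvroWriterSnappyDemo in the code; note that its forReflectRecord takes the class first and the codec second):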

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.reflect.ReflectData;
import org.apache.flink.formats.parquet.ParquetBuilder;
import org.apache.flink.formats.parquet.ParquetWriterFactory;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.OutputFile;

import java.io.IOException;

public class ParquetAvroWriterSnappyDemo {

    public static <T> ParquetWriterFactory<T> forReflectRecord(Class<T> type, CompressionCodecName codecName) {
        final String schemaString = ReflectData.get().getSchema(type).toString();
        final ParquetBuilder<T> builder = (out) -> createAvroParquetWriter(schemaString, ReflectData.get(), out, codecName);
        return new ParquetWriterFactory<>(builder);
    }

    private static <T> ParquetWriter<T> createAvroParquetWriter(
            String schemaString,
            GenericData dataModel,
            OutputFile out,
            CompressionCodecName codecName) throws IOException {

        final Schema schema = new Schema.Parser().parse(schemaString);

        return AvroParquetWriter.<T>builder(out)
                .withSchema(schema)
                .withDataModel(dataModel)
                .withCompressionCodec(codecName)
                .build();
    }

    private ParquetAvroWriterSnappyDemo() {}
}

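With the custom compression-aware writer factory in place, we can pass in the codec to use. To verify the code, we wrote the same 100,000 records once as plain Parquet and once as Parquet with Snappy compression: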

886.0 K /tmp/test/2020-11-11-uncompress
449.8 K /tmp/test/2020-11-11-compress

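Although the data set is small, the compressed output is about half the size of the uncompressed one, so the compression setting clearly takes effect.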
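The saving can be worked out from the two sizes reported above. The snippet below is only a small arithmetic sketch; the class and method names are hypothetical, and the inputs are the two numbers from the listing (both in the same unit, K).

```java
import java.util.Locale;

/** Hypothetical helper that computes the fraction of space saved by compression. */
final class CompressionSavingSketch {

    /** Fraction of the uncompressed size that compression removed. */
    static double savedFraction(double uncompressedSize, double compressedSize) {
        return 1.0 - compressedSize / uncompressedSize;
    }

    public static void main(String[] args) {
        // Sizes taken from the listing above: 886.0 K uncompressed, 449.8 K with Snappy.
        double saved = savedFraction(886.0, 449.8);
        System.out.println(String.format(Locale.ROOT, "space saved: %.1f%%", saved * 100));
    }
}
```

For this test data Snappy removes roughly 49% of the file size; the saving on real data will depend on how repetitive the records are.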

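Custom bucketing strategy

Flink provides two bucket assigners:

BasePathBucketAssigner: no bucketing; all files are written to the base directory.
DateTimeBucketAssigner: buckets based on system time (yyyy-MM-dd--HH).

In production we usually adjust the bucketing strategy to the requirements at hand (put simply, we customize the name of the directory the files are written to). For example, a custom assigner that buckets by day: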

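import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.flink.core.io.SimpleVersionedSerializer
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer

@SerialVersionUID(10000L)
private class DayBucketAssigner(formatString: String, zoneId: ZoneId) extends BucketAssigner[DemoBean, String] {
    // Formats the current processing time into the bucket (directory) name.
    // Created lazily and marked @transient because DateTimeFormatter is not serializable.
    @transient private var dateTimeFormatter: DateTimeFormatter = null

    override def getBucketId(element: DemoBean, context: BucketAssigner.Context): String = {
      if (dateTimeFormatter == null) {
        dateTimeFormatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId)
      }
      dateTimeFormatter.format(Instant.ofEpochMilli(context.currentProcessingTime()))
    }

    override def getSerializer: SimpleVersionedSerializer[String] = {
      SimpleVersionedStringSerializer.INSTANCE
    }
}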
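The core of this assigner is the mapping from a processing-time timestamp to a directory name. The plain-Java sketch below isolates that step so the effect of the pattern and the time zone is easy to see. It does not depend on Flink; the class BucketIdSketch, its bucketId method, and the sample timestamp are assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Hypothetical illustration of how a timestamp becomes a bucket (directory) name. */
final class BucketIdSketch {

    /** Formats a processing-time timestamp (epoch milliseconds) with the given pattern and zone. */
    static String bucketId(long processingTimeMs, String pattern, ZoneId zone) {
        return DateTimeFormatter.ofPattern(pattern)
                .withZone(zone)
                .format(Instant.ofEpochMilli(processingTimeMs));
    }

    public static void main(String[] args) {
        // Sample timestamp chosen for illustration: 18:00 UTC on 2020-11-10.
        long ts = Instant.parse("2020-11-10T18:00:00Z").toEpochMilli();

        System.out.println(bucketId(ts, "yyyy-MM-dd", ZoneId.of("UTC")));           // 2020-11-10
        System.out.println(bucketId(ts, "yyyy-MM-dd", ZoneId.of("Asia/Shanghai"))); // 2020-11-11
        System.out.println(bucketId(ts, "yyyy-MM-dd--HH", ZoneId.of("UTC")));       // 2020-11-10--18
    }
}
```

The same instant lands in different day buckets depending on the zone, so the ZoneId passed to the assigner should match the time zone that downstream jobs use when they read a day's directory.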