Flink 1.9 Series - StreamingFileSink vs BucketingSink

After working through the following two articles, we can basically start writing our own Flink project code:
1. Flink 1.9 Series - Compiling the Source for the CDH Version
2. Flink 1.9 Series - Flink on Yarn Configuration

1. Flink Project Code Structure

Before we start, let's skim the official documentation (the Flink 1.9 docs). The programming-model section contains a simple Flink demo, much like the WordCount example in the Flink source. From that demo we can see that a Flink project breaks down into two basic parts (a minimal sketch follows the list below):

  1. source
  2. sink
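
As promised, here is a minimal sketch of that source-to-sink structure in Scala. The socket source, the host/port, and the print() sink are placeholders of my own, not taken from the official demo:

import org.apache.flink.streaming.api.scala._

object SimpleJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. source: read lines from a socket (placeholder source)
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999)

    // 2. sink: write the (possibly transformed) stream somewhere;
    //    print() stands in for a real sink such as BucketingSink or StreamingFileSink
    lines
      .map(line => (line, "1"))
      .print()

    env.execute("simple source -> sink job")
  }
}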

The StreamingFileSink and BucketingSink discussed here together form one of the pillars of the sink side. Why call two classes a single pillar? Because, historically, BucketingSink is the ancestor of StreamingFileSink, and StreamingFileSink is more like a child that is still growing up: plenty of problems, but a bright future.

Perhaps you have also hit the following error and did not know how to fix it:

java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer
	at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:57)
	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
	at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.<init>(Buckets.java:112)
	at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:242)
	at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:327)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:878)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:392)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
	at java.lang.Thread.run(Thread.java:748)

OK, on to the main topic.

2. BucketingSink

Let's start with a usage demo:

import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink

// KeyBucket and Tuple_2Writer are custom implementations of Bucketer and Writer (sketched below)
val bucketingsink = new BucketingSink[(String, String)](basePath)
bucketingsink.setBucketer(new KeyBucket())               // which bucket directory each element goes to
bucketingsink.setWriter(new Tuple_2Writer())             // how each element is written into the part file
bucketingsink.setBatchSize(1024 * 1024 * 20)             // roll the part file after 20 MB ...
bucketingsink.setBatchRolloverInterval(20 * 60 * 1000)   // ... or after 20 minutes
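
The demo refers to two custom classes, KeyBucket and Tuple_2Writer, whose code is not shown. Below is a minimal sketch of what they might look like; bucketing by the tuple's first field and writing tab-separated lines are my assumptions, not taken from the original:

import java.nio.charset.StandardCharsets

import org.apache.flink.streaming.connectors.fs.{Clock, StreamWriterBase, Writer}
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer
import org.apache.hadoop.fs.Path

// Bucketer: put every element into a sub-directory named after its first field (illustrative)
class KeyBucket extends Bucketer[(String, String)] {
  override def getBucketPath(clock: Clock, basePath: Path, element: (String, String)): Path =
    new Path(basePath, element._1)
}

// Writer: write each tuple as a tab-separated line (illustrative)
class Tuple_2Writer extends StreamWriterBase[(String, String)] {
  override def write(element: (String, String)): Unit = {
    val stream = getStream // the currently open part file, provided by StreamWriterBase
    stream.write(s"${element._1}\t${element._2}\n".getBytes(StandardCharsets.UTF_8))
  }

  override def duplicate(): Writer[(String, String)] = new Tuple_2Writer()
}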

Usage is simple. Now let's briefly skim the BucketingSink source.
The class lives in the package org.apache.flink.streaming.connectors.fs.bucketing, so it is part of the connectors module. It opens with class-level Javadoc describing the class, its type parameter, and the related classes:

/**
 * Sink that emits its input elements to {@link FileSystem} files within
 * buckets. This is integrated with the checkpointing mechanism to provide exactly once semantics.
 *
 *
 * <p>When creating the sink a {@code basePath} must be specified. The base directory contains
 * one directory for every bucket. The bucket directories themselves contain several part files,
 * one for each parallel subtask of the sink. These part files contain the actual output data.
 *
 *
 * <p>The sink uses a {@link Bucketer} to determine in which bucket directory each element should
 * be written to inside the base directory. The {@code Bucketer} can, for example, use time or
 * a property of the element to determine the bucket directory. The default {@code Bucketer} is a
 * {@link DateTimeBucketer} which will create one new bucket every hour. You can specify
 * a custom {@code Bucketer} using {@link #setBucketer(Bucketer)}. For example, use the
 * {@link BasePathBucketer} if you don't want to have buckets but still want to write part-files
 * in a fault-tolerant way.
 *
 *
 * <p>The filenames of the part files contain the part prefix, the parallel subtask index of the sink
 * and a rolling counter. For example the file {@code "part-1-17"} contains the data from
 * {@code subtask 1} of the sink and is the {@code 17th} bucket created by that subtask. Per default
 * the part prefix is {@code "part"} but this can be configured using {@link #setPartPrefix(String)}.
 * When a part file becomes bigger than the user-specified batch size or when the part file becomes older
 * than the user-specified roll over interval the current part file is closed, the part counter is increased
 * and a new part file is created. The batch size defaults to {@code 384MB}, this can be configured
 * using {@link #setBatchSize(long)}. The roll over interval defaults to {@code Long.MAX_VALUE} and
 * this can be configured using {@link #setBatchRolloverInterval(long)}.
 *
 *
 * <p>In some scenarios, the open buckets are required to change based on time. In these cases, the sink
 * needs to determine when a bucket has become inactive, in order to flush and close the part file.
 * To support this there are two configurable settings:
 * <ol>
 *     <li>the frequency to check for inactive buckets, configured by {@link #setInactiveBucketCheckInterval(long)},
 *     and</li>
 *     <li>the minimum amount of time a bucket has to not receive any data before it is considered inactive,
 *     configured by {@link #setInactiveBucketThreshold(long)}</li>
 * </ol>
 * Both of these parameters default to {@code 60, 000 ms}, or {@code 1 min}.
 *
 *
 * <p>Part files can be in one of three states: {@code in-progress}, {@code pending} or {@code finished}.
 * The reason for this is how the sink works together with the checkpointing mechanism to provide exactly-once
 * semantics and fault-tolerance. The part file that is currently being written to is {@code in-progress}. Once
 * a part file is closed for writing it becomes {@code pending}. When a checkpoint is successful the currently
 * pending files will be moved to {@code finished}.
 *
 *
 * <p>If case of a failure, and in order to guarantee exactly-once semantics, the sink should roll back to the state it
 * had when that last successful checkpoint occurred. To this end, when restoring, the restored files in {@code pending}
 * state are transferred into the {@code finished} state while any {@code in-progress} files are rolled back, so that
 * they do not contain data that arrived after the checkpoint from which we restore. If the {@code FileSystem} supports
 * the {@code truncate()} method this will be used to reset the file back to its previous state. If not, a special
 * file with the same name as the part file and the suffix {@code ".valid-length"} will be created that contains the
 * length up to which the file contains valid data. When reading the file, it must be ensured that it is only read up
 * to that point. The prefixes and suffixes for the different file states and valid-length files can be configured
 * using the adequate setter method, e.g. {@link #setPendingSuffix(String)}.
 *
 *
 * <p><b>NOTE:</b>
 * <ol>
 *     <li>
 *         If checkpointing is not enabled the pending files will never be moved to the finished state. In that case,
 *         the pending suffix/prefix can be set to {@code ""} to make the sink work in a non-fault-tolerant way but
 *         still provide output without prefixes and suffixes.
 *     </li>
 *     <li>
 *         The part files are written using an instance of {@link Writer}. By default, a
 *         {@link StringWriter} is used, which writes the result of {@code toString()} for
 *         every element, separated by newlines. You can configure the writer using the
 *         {@link #setWriter(Writer)}. For example, {@link SequenceFileWriter}
 *         can be used to write Hadoop {@code SequenceFiles}.
 *     </li>
 *     <li>
 *       	{@link #closePartFilesByTime(long)} closes buckets that have not been written to for
 *       	{@code inactiveBucketThreshold} or if they are older than {@code batchRolloverInterval}.
 *     </li>
 * </ol>
 *
 *
 * <p>Example:
 * <pre>{@code
 *     new BucketingSink<Tuple2<IntWritable, Text>>(outPath)
 *         .setWriter(new SequenceFileWriter<IntWritable, Text>())
 *         .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm")
 * }</pre>
 *
 * <p>This will create a sink that writes to {@code SequenceFiles} and rolls every minute.
 *
 * @see DateTimeBucketer
 * @see StringWriter
 * @see SequenceFileWriter
 *
 * @param <T> Type of the elements emitted by this sink
 *
 * @deprecated Please use the
 * {@link org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink StreamingFileSink}
 * instead.
 *
 */

Pay particular attention to these two passages of the comments:

If case of a failure, and in order to guarantee exactly-once semantics, the sink should roll back to the state it
 * had when that last successful checkpoint occurred. To this end, when restoring, the restored files in {@code pending}
 * state are transferred into the {@code finished} state while any {@code in-progress} files are rolled back, so that
 * they do not contain data that arrived after the checkpoint from which we restore. If the {@code FileSystem} supports
 * the {@code truncate()} method this will be used to reset the file back to its previous state. If not, a special
 * file with the same name as the part file and the suffix {@code ".valid-length"} will be created that contains the
 * length up to which the file contains valid data. When reading the file, it must be ensured that it is only read up
 * to that point. The prefixes and suffixes for the different file states and valid-length files can be configured
 * using the adequate setter method, e.g. {@link #setPendingSuffix(String)}

 * @deprecated Please use the
 * {@link org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink StreamingFileSink}
 * instead.

Roughly explained: to guarantee exactly-once semantics, the sink must be able to roll back to the state of the last successful checkpoint. If the file system you are writing to supports the truncate() operation, Flink uses it to cut the previously written file back to its state at that checkpoint. If it does not, Flink instead creates a companion file with the same name plus a suffix (".valid-length") that records the length up to which the data is valid, and readers must only read up to that point. This is very important, and it is also where the main difference between BucketingSink and StreamingFileSink lies.
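
For example, a downstream reader could honor that companion file along these lines. This is only a sketch: the helper is mine, and the assumption that the recorded length can be read back with readUTF() should be checked against your Flink version:

import org.apache.hadoop.fs.{FileSystem, Path}

// How many bytes of a part file are safe to read after a restore:
// if a ".valid-length" companion file exists (the FileSystem did not support
// truncate()), use the length recorded in it; otherwise read the whole file.
def safeReadLength(fs: FileSystem, partFile: Path): Long = {
  // Naming per the Javadoc above: same name plus the ".valid-length" suffix.
  // Both the prefix and the suffix are configurable on the sink.
  val validLengthFile = new Path(partFile.getParent, partFile.getName + ".valid-length")

  if (fs.exists(validLengthFile)) {
    val in = fs.open(validLengthFile)
    // Assumption: the length was written as a string via writeUTF, so readUTF
    // recovers it; adjust the decoding if your version stores it differently.
    try in.readUTF().toLong finally in.close()
  } else {
    fs.getFileStatus(partFile).getLen
  }
}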

The second passage marks the class as deprecated: newer Flink versions replace it with StreamingFileSink, which is where BucketingSink's child steps into the spotlight. (I have not dug into whether StreamingFileSink first appeared in Flink 1.6, 1.7, or some other version.) Since the official docs recommend it, let's look at StreamingFileSink next.

3. StreamingFileSink

In the same way, let's look at StreamingFileSink's source code and usage, starting with a simple demo:

import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink

// Tuple2Encoder and KeyBucketAssigner are custom implementations of Encoder and
// BucketAssigner (sketched below)
val streamingFileSink = StreamingFileSink
    .forRowFormat(new Path(basePath), new Tuple2Encoder())  // row format: one encoded record at a time
    .withBucketAssigner(new KeyBucketAssigner())             // which bucket each element goes to
    //  .withRollingPolicy(...)                              // optional; see the DefaultRollingPolicy sketch below
    .build()
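
As with the BucketingSink demo, Tuple2Encoder and KeyBucketAssigner are custom classes that are not shown. Here is a minimal sketch of what they might look like, plus the DefaultRollingPolicy that is commented out above; the tab-separated format, the key-based bucketing, and the 20 MB / 20 minute thresholds are my assumptions:

import java.io.OutputStream
import java.nio.charset.StandardCharsets

import org.apache.flink.api.common.serialization.Encoder
import org.apache.flink.core.io.SimpleVersionedSerializer
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

// Encoder: how each element is serialized into the in-progress part file (illustrative)
class Tuple2Encoder extends Encoder[(String, String)] {
  override def encode(element: (String, String), stream: OutputStream): Unit =
    stream.write(s"${element._1}\t${element._2}\n".getBytes(StandardCharsets.UTF_8))
}

// BucketAssigner: which bucket (sub-directory) each element belongs to (illustrative)
class KeyBucketAssigner extends BucketAssigner[(String, String), String] {
  override def getBucketId(element: (String, String), context: BucketAssigner.Context): String =
    element._1

  override def getSerializer(): SimpleVersionedSerializer[String] =
    SimpleVersionedStringSerializer.INSTANCE
}

// The rolling policy commented out in the demo could be built like this:
val rollingPolicy = DefaultRollingPolicy.create()
  .withMaxPartSize(1024 * 1024 * 20)       // roll after 20 MB ...
  .withRolloverInterval(20 * 60 * 1000)    // ... or after 20 minutes
  .build[(String, String), String]()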

Usage is also straightforward. The main methods to look at are these:

		@Override
		Buckets<IN, BucketID> createBuckets(int subtaskIndex) throws IOException {
			return new Buckets<>(
					basePath,
					bucketAssigner,
					bucketFactory,
					new RowWisePartWriter.Factory<>(encoder),
					rollingPolicy,
					subtaskIndex);
		}

createBuckets builds the directory and file hierarchy using the custom (or default) bucket assigner. Next, look at the Buckets constructor that it calls:

Buckets(
			final Path basePath,
			final BucketAssigner<IN, BucketID> bucketAssigner,
			final BucketFactory<IN, BucketID> bucketFactory,
			final PartFileWriter.PartFileFactory<IN, BucketID> partFileWriterFactory,
			final RollingPolicy<IN, BucketID> rollingPolicy,
			final int subtaskIndex) throws IOException {

		this.basePath = Preconditions.checkNotNull(basePath);
		this.bucketAssigner = Preconditions.checkNotNull(bucketAssigner);
		this.bucketFactory = Preconditions.checkNotNull(bucketFactory);
		this.partFileWriterFactory = Preconditions.checkNotNull(partFileWriterFactory);
		this.rollingPolicy = Preconditions.checkNotNull(rollingPolicy);
		this.subtaskIndex = subtaskIndex;

		this.activeBuckets = new HashMap<>();
		this.bucketerContext = new Buckets.BucketerContext();

		try {
			this.fsWriter = FileSystem.get(basePath.toUri()).createRecoverableWriter();
		} catch (IOException e) {
			LOG.error("Unable to create filesystem for path: {}", basePath);
			throw e;
		}

		this.bucketStateSerializer = new BucketStateSerializer<>(
				fsWriter.getResumeRecoverableSerializer(),
				fsWriter.getCommitRecoverableSerializer(),
				bucketAssigner.getSerializer()
		);

		this.maxPartCounter = 0L;
	}

Note this line of code:

this.fsWriter = FileSystem.get(basePath.toUri()).createRecoverableWriter();

It creates the recoverable writer for the target file system, HDFS in particular. Let's go one level deeper:

@Override
public RecoverableWriter createRecoverableWriter() throws IOException {
	// This writer is only supported on a subset of file systems, and on
	// specific versions. We check these schemes and versions eagerly for better error
	// messages in the constructor of the writer.
	return new HadoopRecoverableWriter(fs);
}

public HadoopRecoverableWriter(org.apache.hadoop.fs.FileSystem fs) {
	this.fs = checkNotNull(fs);

	// This writer is only supported on a subset of file systems, and on
	// specific versions. We check these schemes and versions eagerly for
	// better error messages.
	if (!"hdfs".equalsIgnoreCase(fs.getScheme()) || !HadoopUtils.isMinHadoopVersion(2, 7)) {
		throw new UnsupportedOperationException(
				"Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer");
	}
}

See the problem? When StreamingFileSink writes to HDFS, it requires Hadoop 2.7 or newer. Yet most of the stable open-source distributions currently on the market, Cloudera CDH included, still ship Hadoop 2.6. So if your Hadoop version is below 2.7, I suggest you stick with BucketingSink; it will not let you down, after all, it is the ancestor!
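
If you would rather fail fast, or fall back to BucketingSink, than hit the exception above at runtime, you can run essentially the same check yourself before building the sink. A rough sketch using Hadoop's VersionInfo; the fallback decision is illustrative:

import org.apache.hadoop.util.VersionInfo

// Same idea as the check in HadoopRecoverableWriter (HDFS scheme + Hadoop >= 2.7),
// but done up front so we can choose a sink instead of failing at runtime.
val Array(major, minor) = VersionInfo.getVersion.split("\\.").take(2).map(_.toInt)
val supportsRecoverableWriter = major > 2 || (major == 2 && minor >= 7)

if (supportsRecoverableWriter) {
  // safe to build a StreamingFileSink against hdfs://
} else {
  // stick with BucketingSink (or write to a file system that is supported)
}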
