Writing Data to HDFS with Flink 1.8.1
Versions
- Flink: 1.8.1
- Hadoop: 2.6.0 (CDH 5.14)
Maven dependencies
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
    <exclusions>
        <exclusion>
            <groupId>xml-apis</groupId>
            <artifactId>xml-apis</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
</dependency>
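The ${flink.version} and ${hadoop.version} placeholders are assumed to be defined in the pom's <properties> section; a sketch matching the versions listed above (the CDH version suffix is an assumption based on the usual Cloudera naming convention):

```xml
<properties>
    <!-- Versions from the "Versions" section above; adjust to your environment -->
    <flink.version>1.8.1</flink.version>
    <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
</properties>
```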
<!-- Maven assembly plugin (builds a fat jar) -->
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.allen.capturewebdata.Main</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
Add to Flink's lib directory
Pick the version that matches your environment; after adding the jar, restart the Flink cluster.
flink-shaded-hadoop-2-uber-2.6.5-8.0.jar
How to obtain the jar: if the matching jar cannot be found on the Flink website, declare it as a Maven dependency so it is downloaded into the local Maven repository, then remove the dependency from the pom afterwards.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-2-uber
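The deployment step can be sketched as follows, assuming FLINK_HOME points at your Flink installation and the jar has already been downloaded to the current directory (paths are hypothetical):

```shell
# Copy the shaded Hadoop uber jar into Flink's lib directory
cp flink-shaded-hadoop-2-uber-2.6.5-8.0.jar "$FLINK_HOME/lib/"
# Restart the (standalone) cluster so the new jar is picked up
"$FLINK_HOME/bin/stop-cluster.sh"
"$FLINK_HOME/bin/start-cluster.sh"
```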
Implementation code
// enable checkpointing
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000L)
// advanced options:
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// checkpoints have to complete within 10 seconds, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(10000)
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
// fail the task if an error occurs during its checkpoint procedure
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)
val sinkStream: DataStream[Tuple2[IntWritable, Text]] = ...
val sink = new BucketingSink[Tuple2[IntWritable, Text]]("hdfs://cdh1:8020/data")
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd", ZoneId.of("Asia/Shanghai")))
sink.setWriter(new SequenceFileWriter[IntWritable, Text])
// maximum part file size in bytes (here 400 MB)
sink.setBatchSize(1024 * 1024 * 400)
// rollover interval in milliseconds (here 20 minutes)
sink.setBatchRolloverInterval(20 * 60 * 1000)
// close a bucket after it has received no writes for this long
sink.setInactiveBucketThreshold(1000L)
// interval between checks for inactive buckets
sink.setInactiveBucketCheckInterval(1000L)
sinkStream.addSink(sink)
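Because SequenceFileWriter writes Hadoop key/value pairs, a stream of plain strings must first be mapped into Tuple2[IntWritable, Text] before it can feed this sink. A minimal sketch, assuming `lines` is an existing DataStream[String]; using the line length as the key is a hypothetical choice for illustration:

```scala
import org.apache.flink.api.java.tuple.Tuple2
import org.apache.flink.streaming.api.scala._
import org.apache.hadoop.io.{IntWritable, Text}

// Map each String into the (key, value) shape the SequenceFileWriter expects.
// The key here (line length) is arbitrary; substitute whatever key fits your data.
val pairs: DataStream[Tuple2[IntWritable, Text]] =
  lines.map(line => new Tuple2(new IntWritable(line.length), new Text(line)))

pairs.addSink(sink)
```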
Why checkpointing must be enabled:
Part files go through three states: in-progress, pending, and finished.
in-progress: the file is currently being written to
pending: writing is complete; the file moves from in-progress to pending
finished: after a successful checkpoint, the file moves from pending to finished
Official documentation
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/filesystem_sink.html
Packaging
In the target directory you will find:
XXX-1.0-SNAPSHOT-jar-with-dependencies.jar
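With the assembly plugin configured as above (no execution bound to a lifecycle phase), the assembly goal has to be invoked explicitly; a sketch:

```shell
# Build the project and produce the jar-with-dependencies under target/
mvn clean package assembly:single
```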