Writing Data to HDFS with Flink 1.8.1

Versions

  1. Flink: 1.8.1
  2. Hadoop: 2.6.0 (CDH 5.14)

Maven dependencies

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-filesystem_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
        <scope>provided</scope>
        <exclusions>
            <exclusion>
                <groupId>xml-apis</groupId>
                <artifactId>xml-apis</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
        <scope>provided</scope>
    </dependency>

    <!-- Maven assembly plugin (fat-jar packaging) -->
    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
            <archive>
                <manifest>
                    <mainClass>com.allen.capturewebdata.Main</mainClass>
                </manifest>
            </archive>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
        </configuration>
    </plugin>
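
The ${flink.version} and ${hadoop.version} properties are assumed to be defined elsewhere in the pom; a sketch matching the versions above (the CDH artifact version string is an assumption and requires the Cloudera Maven repository; adjust to your environment):

    <properties>
        <flink.version>1.8.1</flink.version>
        <!-- assumed CDH coordinate; plain 2.6.0 also works with matching repositories -->
        <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
    </properties>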

Add to Flink's lib directory

Pick the exact version to match your Hadoop distribution; after adding the jar, restart the Flink cluster.

flink-shaded-hadoop-2-uber-2.6.5-8.0.jar

How to obtain the jar: if a matching jar is not available on the Flink download page, it can be pulled from the Maven repository below; once the jar is in lib, remove the corresponding dependency from the pom.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-2-uber

Implementation
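
The snippets below are a minimal sketch against the Flink 1.8 Scala API: they assume the following imports and a StreamExecutionEnvironment named env (everything else, such as the cdh1 cluster address, comes from the article itself):

import java.time.ZoneId

import org.apache.flink.api.java.tuple.Tuple2
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.SequenceFileWriter
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
import org.apache.hadoop.io.{IntWritable, Text}

val env = StreamExecutionEnvironment.getExecutionEnvironment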

// configure checkpointing
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000L)
// advanced options:
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// checkpoints have to complete within ten seconds, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(10000)
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
// This determines if a task will be failed if an error occurs in the execution of the task’s checkpoint procedure.
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)
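
// Retained (externalized) checkpoints need a durable checkpoint location:
// either set state.checkpoints.dir in flink-conf.yaml or configure a state
// backend explicitly. The HDFS path below is an assumption for this setup.
env.setStateBackend(new FsStateBackend("hdfs://cdh1:8020/flink/checkpoints"))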

// the stream's element type must match the sink's (Flink's Java Tuple2 here)
val sinkStream: DataStream[Tuple2[IntWritable, Text]] = ...

val sink = new BucketingSink[Tuple2[IntWritable, Text]]("hdfs://cdh1:8020/data")
sink.setBucketer(new DateTimeBucketer[Tuple2[IntWritable, Text]]("yyyy-MM-dd", ZoneId.of("Asia/Shanghai")))
sink.setWriter(new SequenceFileWriter[IntWritable, Text]())
// maximum part-file size in bytes before rolling (400 MB here)
sink.setBatchSize(1024 * 1024 * 400)
// rollover interval in milliseconds (20 minutes here)
sink.setBatchRolloverInterval(20 * 60 * 1000)
// inactivity threshold: close a bucket after this long with no writes
sink.setInactiveBucketThreshold(1000L)
// how often to check for inactive buckets
sink.setInactiveBucketCheckInterval(1000L)

sinkStream.addSink(sink)
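
// finally, submit the job (the job name here is illustrative)
env.execute("Flink write to HDFS")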

Why checkpointing must be enabled:
A part file moves through three states: in-progress, pending, and finished.
in-progress: the file is currently being written
pending: writing is complete; the file moves from in-progress to pending
finished: once a checkpoint completes, the file moves from pending to finished; without checkpointing enabled, pending files never become finished
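
With the DateTimeBucketer above and BucketingSink's default prefixes/suffixes, a bucket directory looks roughly like this (illustrative paths; part files are numbered subtask-index, rolling-counter):

    hdfs://cdh1:8020/data/2019-08-01/part-0-0                (finished)
    hdfs://cdh1:8020/data/2019-08-01/_part-0-1.pending       (closed, awaiting checkpoint)
    hdfs://cdh1:8020/data/2019-08-01/_part-0-2.in-progress   (being written)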

Official documentation
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/filesystem_sink.html

https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.html

Packaging

With the assembly plugin configured as above, build the fat jar with, for example, mvn clean package assembly:single. It can then be found under the target directory:

XXX-1.0-SNAPSHOT-jar-with-dependencies.jar
