Consuming Kafka with Flink and writing the data to HDFS

一杯拿铁go

Last modified: 2023-04-13 17:44:47

First published: 2023-02-28 16:01:14

Article link: https://blog.csdn.net/w417950004/article/details/129263245

I. Code

1. Configure Kafka

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(intervalInteval);
        // Restart strategy: number of restart attempts and the delay between them
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(restartAttempts, delayBetweenAttempts));
        // Kafka settings for consuming the dispatched-click log
        Properties propertiesForSend = new Properties();
        // Build the Kafka properties; this works for clusters with or without security authentication
        propertiesForSend.put("bootstrap.servers", sendBlogBrokers);
        propertiesForSend.put("group.id", sendBlogGroupId);
        propertiesForSend.put("enable.auto.commit", "true");
        propertiesForSend.put("auto.commit.interval.ms", "3000");
        propertiesForSend.put("auto.offset.reset", "latest");
        propertiesForSend.put("session.timeout.ms", "30000");
        propertiesForSend.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        propertiesForSend.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

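Configure checkpointing
Checkpointing must be enabled. If it is not, the files written to HDFS stay in the .inprogress state forever and the final files are never produced. A file only moves from .inprogress to finished after a checkpoint has completed, and only then does the finished file appear.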

        env.enableCheckpointing(checkpointInteval);
        env.getCheckpointConfig().setCheckpointTimeout(checkpointTimeOut);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
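
The dependency on checkpoints described above can be pictured with a toy model. This is only an illustrative sketch in plain Java, not Flink's implementation: the class `PartFileLifecycle` and its method names are hypothetical. It assumes the three part-file states the Flink documentation describes (in-progress, pending, finished) and shows that a rolled file stays pending until a checkpoint completes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not Flink code) of the part-file life cycle.
final class PartFileLifecycle {
    enum State { IN_PROGRESS, PENDING, FINISHED }

    private final List<State> files = new ArrayList<>();

    // Open a new part file; it starts as in-progress.
    int open() {
        files.add(State.IN_PROGRESS);
        return files.size() - 1;
    }

    // The rolling policy closes a file: it becomes pending, not finished.
    void roll(int file) {
        if (files.get(file) == State.IN_PROGRESS) {
            files.set(file, State.PENDING);
        }
    }

    // A completed checkpoint finalizes every pending file.
    void checkpointCompleted() {
        for (int i = 0; i < files.size(); i++) {
            if (files.get(i) == State.PENDING) {
                files.set(i, State.FINISHED);
            }
        }
    }

    State stateOf(int file) {
        return files.get(file);
    }

    public static void main(String[] args) {
        PartFileLifecycle sink = new PartFileLifecycle();
        int f = sink.open();
        sink.roll(f);
        System.out.println(sink.stateOf(f)); // PENDING
        sink.checkpointCompleted();
        System.out.println(sink.stateOf(f)); // FINISHED
    }
}
```

In this model, a job that never calls `checkpointCompleted()` never produces a finished file, which matches the symptom of files that stay in .inprogress.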

2. Consume Kafka and process the data

        // Consume the Kafka data
        FlinkKafkaConsumer<String> flinkKafka = new FlinkKafkaConsumer<>(sendBlogTopic, new SimpleStringSchema(), propertiesForSend);
        flinkKafka.setStartFromLatest();
        DataStream<String> sendStream = env.addSource(flinkKafka)
                .setParallelism(sendParallel);
        // Filter the data with the custom Log2RedisProcessFunction class, which also writes it to Redis
        SingleOutputStreamOperator<String> Filterstream = sendStream
                .process(new Log2RedisProcessFunction())
                .setParallelism(12); // operator parallelism

3. Configure HDFS and write the files

3.1 Configure the rolling policy

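DefaultRollingPolicy<String, String> rollingPolicy = DefaultRollingPolicy
        .builder()
        .withMaxPartSize(1024L * 1024L * 120L) // maximum size of each part file in bytes; the default is 128 MB, here 120 MB
        .withRolloverInterval(120000L) // maximum time in ms a part file stays open before rolling; the default is 60 s
        .build();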

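With these settings, a new part file is started once the current one reaches 120 MB or has been open for 120 seconds.
There is one more parameter: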

.withInactivityInterval(TimeUnit.MINUTES.toMillis(5)) // the value is in milliseconds

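It rolls the part file when no new data has arrived for 5 minutes. Leave it out if you have no such requirement; the builder then falls back to its default inactivity interval of 60 seconds.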
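
The three rolling conditions can be sketched in plain Java. This is a hypothetical stand-in written for illustration (the class `RollDecision` and its method are not Flink API); it assumes a file rolls as soon as any one condition holds and that all intervals are expressed in milliseconds, which is why `TimeUnit.MINUTES.toMillis(5)` is the conversion that yields 5 minutes.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the rolling decision; not Flink's DefaultRollingPolicy.
final class RollDecision {
    static final long MAX_PART_SIZE = 1024L * 1024L * 120L;                // 120 MB in bytes
    static final long ROLLOVER_INTERVAL = TimeUnit.SECONDS.toMillis(120);  // 120 s in ms
    static final long INACTIVITY_INTERVAL = TimeUnit.MINUTES.toMillis(5);  // 5 min in ms

    // Roll when any one of the three conditions holds. All times are epoch milliseconds.
    static boolean shouldRoll(long sizeBytes, long createdAt, long lastWriteAt, long now) {
        return sizeBytes >= MAX_PART_SIZE
                || now - createdAt >= ROLLOVER_INTERVAL
                || now - lastWriteAt >= INACTIVITY_INTERVAL;
    }

    public static void main(String[] args) {
        // Small, recently written file that was opened 130 s ago: the rollover interval triggers.
        System.out.println(shouldRoll(1024, 0, 129_000, 130_000)); // true
        // The same file after 60 s: no condition holds yet.
        System.out.println(shouldRoll(1024, 0, 59_000, 60_000));   // false
        // toMinutes(5) is just 5, which the builder would read as 5 ms.
        System.out.println(TimeUnit.MINUTES.toMinutes(5)); // 5
        System.out.println(TimeUnit.MINUTES.toMillis(5));  // 300000
    }
}
```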

3.2 Configure the bucketing strategy

StreamingFileSink<String> sinkConf = StreamingFileSink
        .forRowFormat(new Path(localHdfsPath), new SimpleStringEncoder<String>("UTF-8"))
        .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd-HH"))
        .withRollingPolicy(rollingPolicy)
        .withBucketCheckInterval(10000L) // bucket check interval, set to 10 s here
        .withOutputFileConfig(config) // the OutputFileConfig with the file prefix and suffix
        .build();
// Write to HDFS
Filterstream.addSink(sinkConf);

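StreamingFileSink receives the stream and distributes the records into buckets. For bucketing you can use either the default DateTimeBucketAssigner or a single global bucket (BasePathBucketAssigner). This job uses the default.
Set the bucket format, for example one bucket per day, or down to the minute if you need finer granularity: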

.withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd"))
.withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd-HH-mm"))
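
To see what directory name each pattern produces, here is a small plain-Java sketch. It assumes the bucket id is simply the timestamp formatted with a `java.time` pattern in some time zone; the class `BucketIdDemo`, the fixed UTC zone, and the sample timestamp are choices made for this example, not something taken from the job above.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Illustration only: maps a timestamp to a bucket id for a given pattern.
final class BucketIdDemo {
    // Format a timestamp (epoch milliseconds) with the given pattern in the given zone.
    static String bucketId(String pattern, long epochMillis, ZoneId zone) {
        return DateTimeFormatter.ofPattern(pattern)
                .withZone(zone)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2023-02-28T16:01:14Z").toEpochMilli();
        ZoneId utc = ZoneId.of("UTC");
        System.out.println(bucketId("yyyy-MM-dd", ts, utc));       // 2023-02-28
        System.out.println(bucketId("yyyy-MM-dd-HH", ts, utc));    // 2023-02-28-16
        System.out.println(bucketId("yyyy-MM-dd-HH-mm", ts, utc)); // 2023-02-28-16-01
    }
}
```

A finer pattern means more directories: the minute-level pattern creates up to 60 buckets per hour where the hourly one creates a single bucket.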

The bucket check interval sets how often the sink checks the time-based rolling conditions:

.withBucketCheckInterval(10000L) // bucket check interval, set to 10 s here

You can also set a prefix and a suffix for the output files:

        // Prefix and suffix of the output files
        OutputFileConfig config = OutputFileConfig
                .builder()
                .withPartPrefix("wudl")
                .withPartSuffix(".txt")
                .build();
        // Pass config to withOutputFileConfig(...) when building the HDFS sink
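
As a rough picture of what the prefix and suffix do to the file names, the sketch below builds a finished part-file name. It assumes the `<prefix>-<subtaskIndex>-<partCounter><suffix>` layout used by the Flink 1.13 line; newer releases may name files differently, and the class `PartFileName` is hypothetical.

```java
// Illustration only: assumed name layout of a finished part file.
final class PartFileName {
    // Assumed layout: <prefix>-<subtaskIndex>-<partCounter><suffix>
    static String of(String prefix, int subtaskIndex, long partCounter, String suffix) {
        return prefix + "-" + subtaskIndex + "-" + partCounter + suffix;
    }

    public static void main(String[] args) {
        // Fourth file (counter 3) written by subtask 0 with the prefix and suffix from above.
        System.out.println(of("wudl", 0, 3, ".txt")); // wudl-0-3.txt
    }
}
```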

II. Required dependencies

    <!-- read and write HDFS -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-files</artifactId>
      <version>1.13.1</version>
    </dependency>
    
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-filesystem_2.12</artifactId>
      <version>1.11.3</version>
    </dependency>

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka_2.11</artifactId>
      <version>1.13.1</version>
    </dependency>

III. Problems encountered

Error message:

Caused by: java.io.IOException: Failed to create the parent directory: /user/wesh/hei/search_output/log/2023-02-28

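Cause: the HDFS path was not fully qualified. On the jump host, reads and writes work even without the full path, but when Flink writes to HDFS the path must be complete. A likely explanation is that Flink resolves a path without a scheme against its own default file system instead of the cluster's HDFS.

Fix: add the full HDFS prefix to the path: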

viewfs://c9/
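
A cheap guard against this mistake is to check for a scheme before the path is handed to the sink. The sketch below uses only `java.net.URI`; the class `PathSchemeCheck` and the idea of validating at job start-up are suggestions for illustration, not part of the original job.

```java
import java.net.URI;

// Illustration only: detect output paths that lack a file-system scheme.
final class PathSchemeCheck {
    // A path counts as fully qualified here when it carries a scheme such as viewfs or hdfs.
    static boolean isFullyQualified(String path) {
        return URI.create(path).getScheme() != null;
    }

    public static void main(String[] args) {
        System.out.println(isFullyQualified("/user/wesh/hei/search_output/log"));            // false
        System.out.println(isFullyQualified("viewfs://c9/user/wesh/hei/search_output/log")); // true
    }
}
```

Failing fast on a `false` result at start-up surfaces the problem before the first record is written.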

References:

https://www.cnblogs.com/chang09/p/15966282.html

https://blog.csdn.net/wudonglianga/article/details/122163911

https://blog.csdn.net/qq_20042935/article/details/123368281