Wrapping a utility class that connects Kafka directly to Flink for real-time data processing
These few lines are all it takes to pull data out of Kafka; the heavy lifting, of course, lives in a FlinkUtils.createKafkaStream utility method:
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;

public class RealtimeETL {
    public static void main(String[] args) throws Exception {
        // Utility for parsing the properties file whose path is passed as the first program argument
        ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);
        // Pull the data out of Kafka with Flink; cleaning, filtering, and reshaping happen on this stream
        DataStream<String> lines = FlinkUtils.createKafkaStream(parameters, SimpleStringSchema.class);
        lines.print();
        FlinkUtils.env.execute();
    }
}
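The demo only prints the stream, but the cleaning and filtering mentioned in the comment hook in at exactly this point. A minimal sketch, assuming purely for illustration that each record is a comma-separated line whose first field we want, with blank lines dropped (Types is org.apache.flink.api.common.typeinfo.Types):

// Hypothetical ETL step in place of lines.print():
DataStream<String> cleaned = lines
        .filter(line -> line != null && !line.trim().isEmpty()) // drop blank records
        .map(line -> line.split(",")[0].trim())                 // keep only the first field
        .returns(Types.STRING); // help Flink's lambda type extraction
cleaned.print();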
The utility class itself is already in fairly complete shape:
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class FlinkUtils {

    public static final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    public static <T> DataStream<T> createKafkaStream(ParameterTool parameters,
            Class<? extends DeserializationSchema<T>> clazz) throws Exception {
        // With checkpointing enabled, the Kafka offsets are stored in the checkpoint state itself
        env.enableCheckpointing(parameters.getLong("checkpoint.interval", 300000));
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        // Retain the checkpoint data even after the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        String checkPointPath = parameters.get("checkpoint.path");
        if (checkPointPath != null) {
            env.setStateBackend(new FsStateBackend(checkPointPath));
        }
        int restartAttempts = parameters.getInt("restart.attempts", 30);
        int delayBetweenAttempts = parameters.getInt("delay.between.attempts", 30000);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(restartAttempts, delayBetweenAttempts));
        Properties properties = parameters.getProperties();
        String topics = parameters.getRequired("kafka.topics");
        List<String> topicList = Arrays.asList(topics.split(","));
        FlinkKafkaConsumer<T> flinkKafkaConsumer = new FlinkKafkaConsumer<>(topicList, clazz.newInstance(), properties);
        // On each checkpoint the consumer can also commit offsets to Kafka's internal
        // __consumer_offsets topic (default: true); recovery only ever uses the offsets
        // stored in the checkpoint state, so committing back to Kafka is disabled here
        flinkKafkaConsumer.setCommitOffsetsOnCheckpoints(false);
        return env.addSource(flinkKafkaConsumer);
    }
}
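Because createKafkaStream instantiates the schema with its no-arg constructor, any DeserializationSchema<T> that has a public no-arg constructor can be plugged in, not just SimpleStringSchema. A minimal sketch of a hypothetical custom schema (the class name and trimming behavior are illustrative, not part of the original code):

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

// Hypothetical schema: decodes each Kafka record as UTF-8 and trims surrounding whitespace
public class TrimmedStringSchema implements DeserializationSchema<String> {
    @Override
    public String deserialize(byte[] message) {
        return new String(message, StandardCharsets.UTF_8).trim();
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false; // a Kafka stream is unbounded
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}

It plugs in exactly like the built-in schema: FlinkUtils.createKafkaStream(parameters, TrimmedStringSchema.class).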
Then pass the program the path to a config file as its argument, e.g. D:\test/conf.properties, and create a conf.properties file at that path with the following settings:
checkpoint.interval=30000
bootstrap.servers=doit01:9092,doit02:9092,doit03:9092
group.id=g10
auto.offset.reset=earliest
kafka.topics=wordcount2
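Note that parameters.getProperties() hands this entire file to the FlinkKafkaConsumer, so bootstrap.servers, group.id, and auto.offset.reset are picked up as ordinary Kafka consumer settings, while checkpoint.interval and kafka.topics are read by FlinkUtils itself. Since checkpoint.path is not set here, the job falls back to the default state backend.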
Then start the ZooKeeper and Kafka services on the virtual machines.
Finally, run the RealtimeETL program at the top: it pulls in every record belonging to the configured topic from Kafka, and the stream is ready for real-time processing.