Wrapping a utility class that connects Kafka directly to Flink for real-time data processing
These few lines are all it takes to pull data out of Kafka; the heavy lifting, of course, lives in a FlinkUtils.createKafkaStream utility method:
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;

public class RealtimeETL {
    public static void main(String[] args) throws Exception {
        // Utility for parsing the properties file whose path is passed as the first program argument
        ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);
        // Pull the data out of Kafka with Flink; cleaning, filtering, and reshaping happen on this stream
        DataStream<String> lines = FlinkUtils.createKafkaStream(parameters, SimpleStringSchema.class);
        lines.print();
        FlinkUtils.env.execute();
    }
}
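The demo only prints the stream, but the cleaning and filtering mentioned in the comment hook in at exactly this point. A minimal sketch, assuming purely for illustration that each record is a comma-separated line whose first field we want, with blank lines dropped (Types is org.apache.flink.api.common.typeinfo.Types):

// Hypothetical ETL step in place of lines.print():
DataStream<String> cleaned = lines
        .filter(line -> line != null && !line.trim().isEmpty()) // drop blank records
        .map(line -> line.split(",")[0].trim())                 // keep only the first field
        .returns(Types.STRING); // help Flink's lambda type extraction
cleaned.print();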
The utility class itself is already in fairly complete shape:
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class FlinkUtils {

    public static final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    public static <T> DataStream<T> createKafkaStream(ParameterTool parameters,
            Class<? extends DeserializationSchema<T>> clazz) throws Exception {
        // With checkpointing enabled, the Kafka offsets are stored in the checkpoint state itself
        env.enableCheckpointing(parameters.getLong("checkpoint.interval", 300000));
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        // Retain the checkpoint data even after the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        String checkPointPath = parameters.get("checkpoint.path");
        if (checkPointPath != null) {
            env.setStateBackend(new FsStateBackend(checkPointPath));
        }
        int restartAttempts = parameters.getInt("restart.attempts", 30);
        int delayBetweenAttempts = parameters.getInt("delay.between.attempts", 30000);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(restartAttempts, delayBetweenAttempts));
        Properties properties = parameters.getProperties();
        String topics = parameters.getRequired("kafka.topics");
        List<String> topicList = Arrays.asList(topics.split(","));
        FlinkKafkaConsumer<T> flinkKafkaConsumer = new FlinkKafkaConsumer<>(topicList, clazz.newInstance(), properties);
        // On each checkpoint the consumer can also commit offsets to Kafka's internal
        // __consumer_offsets topic (default: true); recovery only ever uses the offsets
        // stored in the checkpoint state, so committing back to Kafka is disabled here
        flinkKafkaConsumer.setCommitOffsetsOnCheckpoints(false);
        return env.addSource(flinkKafkaConsumer);
    }
}
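Because createKafkaStream instantiates the schema with its no-arg constructor, any DeserializationSchema<T> that has a public no-arg constructor can be plugged in, not just SimpleStringSchema. A minimal sketch of a hypothetical custom schema (the class name and trimming behavior are illustrative, not part of the original code):

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

// Hypothetical schema: decodes each Kafka record as UTF-8 and trims surrounding whitespace
public class TrimmedStringSchema implements DeserializationSchema<String> {
    @Override
    public String deserialize(byte[] message) {
        return new String(message, StandardCharsets.UTF_8).trim();
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false; // a Kafka stream is unbounded
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}

It plugs in exactly like the built-in schema: FlinkUtils.createKafkaStream(parameters, TrimmedStringSchema.class).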
Then pass the program the path to a config file as its argument, e.g. D:\test/conf.properties, and create a conf.properties file at that path with the following settings:
checkpoint.interval=30000
bootstrap.servers=doit01:9092,doit02:9092,doit03:9092
group.id=g10
auto.offset.reset=earliest
kafka.topics=wordcount2
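Note that parameters.getProperties() hands this entire file to the FlinkKafkaConsumer, so bootstrap.servers, group.id, and auto.offset.reset are picked up as ordinary Kafka consumer settings, while checkpoint.interval and kafka.topics are read by FlinkUtils itself. Since checkpoint.path is not set here, the job falls back to the default state backend.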
Then start the ZooKeeper and Kafka services on the virtual machines.
Finally, run the RealtimeETL program at the top: it pulls in every record belonging to the configured topic from Kafka, and the stream is ready for real-time processing.