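This article mainly shows how to manage offsets yourself: with a custom deserialization schema on the Kafka consumer you can get all the metadata of a Kafka message, including topic, partition, and offset.
Some background first:
We need to write a Kafka consumer that reads data from a Kafka topic through the Flink engine. In Flink this can be done with FlinkKafkaConsumer08, a class that provides a mechanism for reading one or more Kafka topics. Its constructor takes the following arguments:
1. The topic name, either a String (to read one topic) or a List (to read several topics);
2. A DeserializationSchema / KeyedDeserializationSchema used to deserialize the byte arrays read from Kafka;
3. The Kafka consumer configuration, in which we must set the bootstrap.servers, zookeeper.connect (required only for Kafka 0.8), and group.id properties.
Now let's use FlinkKafkaConsumer08. It is initialized as follows: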
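import java.util.Properties
// SimpleStringSchema lives in this package from Flink 1.4 on
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08

val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "www.iteblog.com:9092")
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "www.iteblog.com:2181")
properties.setProperty("group.id", "iteblog")
val stream = env.addSource(new FlinkKafkaConsumer08[String]("iteblog",
new SimpleStringSchema(), properties))
stream.print()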
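The example above uses SimpleStringSchema to deserialize each message. This class implements the DeserializationSchema interface and overrides its T deserialize(byte[] message) method. DeserializationSchema only exposes the message value, so if we also need the key we must use an implementation of KeyedDeserializationSchema. That interface provides the method T deserialize(byte[] messageKey, byte[] message, String topic, int partition, long offset), which can deserialize both the value and the key of a Kafka message and also receives the topic, partition, and offset.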
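To make the parameter list of the keyed deserialize method more concrete, here is a minimal sketch in plain Java. It deliberately does not implement the Flink interface, so it runs without Flink on the classpath; the class name KeyedDecodeSketch and the helper describe are hypothetical names chosen for illustration. It assumes keys and values are UTF-8 text, and it handles a missing key or value, since either byte array can be null.

```java
import java.nio.charset.StandardCharsets;

public class KeyedDecodeSketch {
    /** Hypothetical helper that mirrors the keyed deserialize parameter list. */
    public static String describe(byte[] messageKey, byte[] message,
                                  String topic, int partition, long offset) {
        // A record may have no key, and a tombstone record has no value.
        String key = messageKey == null ? null : new String(messageKey, StandardCharsets.UTF_8);
        String value = message == null ? null : new String(message, StandardCharsets.UTF_8);
        // Combine the payload with the metadata that a value-only schema never sees.
        return topic + "-" + partition + "@" + offset + " key=" + key + " value=" + value;
    }

    public static void main(String[] args) {
        byte[] key = "k1".getBytes(StandardCharsets.UTF_8);
        byte[] value = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(describe(key, value, "iteblog", 0, 42L));
        // prints: iteblog-0@42 key=k1 value=hello
    }
}
```

Note that the payload is decoded with an explicit charset; calling toString() on a byte array would only return the array's identity string, not its content.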
For convenience, Flink ships a set of ready-made schemas: TypeInformationSerializationSchema and TypeInformationKeyValueSerializationSchema, which build the schema from Flink's TypeInformation.
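If Flink checkpointing is enabled, the Flink Kafka consumer consumes messages from the given topic and periodically checkpoints its Kafka offsets together with the state of the other operators. If the job fails, Flink restores from the latest checkpoint and resumes reading from Kafka at the offsets stored in that checkpoint.
Checkpointing has to be enabled on the execution environment.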
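Enabling checkpointing on the execution environment can look roughly like the sketch below. It needs the Flink streaming dependency on the classpath; the 5000 ms interval and the class name EnableCheckpointingSketch are assumptions chosen for illustration, not values taken from the original setup.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableCheckpointingSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Trigger a checkpoint every 5000 ms (assumed interval; tune it for your job).
        env.enableCheckpointing(5000);
        // Ask for exactly-once state consistency for the checkpoints.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // Sources, transformations, and env.execute(...) would follow here.
    }
}
```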
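Note that Flink can only restore from a checkpoint if enough processing slots are available. Flink on YARN supports dynamically restarting lost YARN containers.
If checkpointing is not enabled, the Flink Kafka consumer periodically commits offsets to ZooKeeper.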
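Kafka consumer offset commit settings
1. Checkpointing enabled (offsets managed by checkpoints)
After each checkpoint completes, the offsets are synced back to Kafka.
2. Checkpointing disabled (offsets managed by Kafka)
enable.auto.commit: true
auto.commit.interval.ms: 1500
3. Managing offsets yourself
With a custom deserialization schema on the Kafka consumer you can get all the metadata of a Kafka message, including topic, partition, and offset, and then choose where to save the offsets yourself.
This example uses ZooKeeper as the store and updates the offset node in real time as records are read. After the job dies, it reads the saved offsets back and resumes from them, so no records are skipped. True exactly-once processing, however, still depends on the business logic, for example on transactions.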
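For case 2 above, the two settings are ordinary Kafka consumer properties passed to the consumer constructor. A minimal sketch using only java.util.Properties follows; the class and method names and the localhost:9092 address are placeholders chosen for illustration.

```java
import java.util.Properties;

public class AutoCommitConfigSketch {
    /** Builds consumer properties for the "Kafka manages the offsets" case. */
    public static Properties autoCommitProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Let the Kafka client commit offsets periodically on its own.
        props.setProperty("enable.auto.commit", "true");
        // Commit interval in milliseconds, the value used in this article.
        props.setProperty("auto.commit.interval.ms", "1500");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" and "test" are placeholder values.
        Properties props = autoCommitProperties("localhost:9092", "test");
        System.out.println(props.getProperty("auto.commit.interval.ms")); // prints: 1500
    }
}
```

The resulting Properties object would be handed to the Flink Kafka consumer in place of the properties built by hand.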
Enough talk; here is an example I implemented locally. The Maven dependencies:
<dependency>
<groupId>com.wehotel</groupId>
<artifactId>commons</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.3</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-pool2</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
</dependency>
<!-- Logging based on scala-logging and logback; logback is a more efficient logging framework than log4j and is gradually replacing it -->
<dependency>
<groupId>com.typesafe.scala-logging</groupId>
<artifactId>scala-logging_2.11</artifactId>
<version>3.7.2</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.2.3</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.58</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.curator/curator-framework -->
<dependency>
<groupId>org.apache.curator</groupId>
<artifactId>curator-framework</artifactId>
<version>2.12.0</version>
</dependency>
<dependency>
<groupId>org.apache.curator</groupId>
<artifactId>curator-recipes</artifactId>
<version>2.12.0</version>
</dependency>
<dependency>
<groupId>org.apache.curator</groupId>
<artifactId>curator-client</artifactId>
<version>2.12.0</version>
</dependency>
Once the Maven project has imported these dependencies successfully, the code below can be implemented. The imports:
import org.apache.curator.RetryPolicy;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;
import org.apache.zookeeper.CreateMode;
import scala.Tuple2;
import java.io.IOException;
import java.util.*;
We use Apache Curator to talk to ZooKeeper. First, a simple bean class for the Kafka records:
public class KafkaSource {
private String topic;
private String message;
private Integer partition;
private Long offset;
public String getTopic() {
return topic;
}
public void setTopic(String topic) {
this.topic = topic;
}
public String getMessage() {
return message;
}
public void setMessage(String message) {
this.message = message;
}
public Integer getPartition() {
return partition;
}
public void setPartition(Integer partition) {
this.partition = partition;
}
public Long getOffset() {
return offset;
}
public void setOffset(Long offset) {
this.offset = offset;
}
}
The main class logic follows. I have hidden the Kafka and ZooKeeper addresses here; fill in your own.
/**
* <Description>
*
* @author enjian
* @version 1.0
* @taskId:
* @createDate 2020/03/31 11:35
* @see ""
*/
public class ZKUtils {
// session timeout
private static final int SESSION_TIMEOUT = 30 * 1000;
// connection timeout
private static final int CONNECTION_TIMEOUT = 3 * 1000;
// ZooKeeper connect string
private static final String CONNECT_ADDR = "xxxxx";
// shared Curator client
private static CuratorFramework client ;
public static void main(String[] args) throws Exception {
// 1. Retry policy: 1 s base sleep time, at most 10 retries
RetryPolicy retryPolicy = new ExponentialBackoffRetry(1000, 10);
// 2. Build the client through the factory
client = CuratorFrameworkFactory.builder()
.connectString(CONNECT_ADDR).connectionTimeoutMs(CONNECTION_TIMEOUT)
.sessionTimeoutMs(SESSION_TIMEOUT)
.retryPolicy(retryPolicy)
// .namespace("super") // namespace
.build();
// 3. Open the connection
client.start();
StreamExecutionEnvironment flinkEnv = changeEnv();
Tuple2<HashMap<KafkaTopicPartition, Long>, Boolean> kafkaOffset = getFromOffsets("tripGoodsCA00001", "test");
FlinkKafkaConsumer011<KafkaSource> ds = createKafkaSource("tripGoodsCA00001", "test");
// Default: start from the latest record
FlinkKafkaConsumerBase<KafkaSource> flinkKafkaConsumerBase = ds.setStartFromLatest();
// If offsets were found in ZooKeeper, start from them instead
if (kafkaOffset != null && kafkaOffset._2){
System.out.println("----------------------zookeeper manager offsets-----------------------------------");
Map<KafkaTopicPartition, Long> specificStartOffsets = kafkaOffset._1;
flinkKafkaConsumerBase = ds.setStartFromSpecificOffsets(specificStartOffsets);
}
DataStreamSource<KafkaSource> tetsds = flinkEnv.addSource(flinkKafkaConsumerBase);
tetsds.print();
flinkEnv.execute("test");
}
public static void ensureZKExists(String zkTopicPath) {
try {
if (client.checkExists().forPath(zkTopicPath) == null) { // nothing written to ZK yet: create the node, recursively creating its parents
client.create().creatingParentsIfNeeded()
.withMode(CreateMode.PERSISTENT) // node type
.forPath(zkTopicPath);
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void storeOffsets(HashMap<String, Long> offsetRange, String topic, String group) {
String zkTopicPath = String.format("/offsets/%s/%s", topic,group);
Iterator<Map.Entry<String, Long>> setoffsetrange = offsetRange.entrySet().iterator();
while (setoffsetrange.hasNext()) {
Map.Entry<String, Long> offsethas = setoffsetrange.next();
//partition
String path = String.format("%s/%s", zkTopicPath, offsethas.getKey());
ensureZKExists(path);
try {
client.setData().forPath(path, (offsethas.getValue() + "").getBytes());
} catch (Exception e) {
e.printStackTrace();
}
}
}
/**
* Read the Kafka offsets stored in ZooKeeper.
* @param topic Kafka topic
* @param group consumer group
* @return the stored offsets per partition, and a flag that is true when at least one offset was found
*/
public static Tuple2<HashMap<KafkaTopicPartition, Long>, Boolean> getFromOffsets(String topic, String group) {
// ZooKeeper layout: /offsets/<topic>/<group>/<partition>
String zkTopicPath = String.format("/offsets/%s/%s", topic,group);
ensureZKExists(zkTopicPath);
HashMap<KafkaTopicPartition, Long> offsets = new HashMap<KafkaTopicPartition, Long>();
try {
List<String> partitions = client.getChildren().forPath(zkTopicPath);
for (String partition : partitions) {
Long offset = Long.valueOf(new String(client.getData().forPath(String.format("%s/%s", zkTopicPath,partition))));
KafkaTopicPartition topicPartition = new KafkaTopicPartition(topic, Integer.valueOf(partition));
offsets.put(topicPartition, offset);
}
} catch (Exception e) {
e.printStackTrace();
}
// Never return null: the flag tells the caller whether any stored offset was read
return new Tuple2<>(offsets, !offsets.isEmpty());
}
public static Properties getKafkaProperties(String groupId) {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "");
// properties.setProperty("zookeeper.connect", getStrValue(new Constants().KAFKA_ZOOKEEPER_LIST));
properties.setProperty("group.id", groupId);
return properties;
}
public static Properties getProduceKafkaProperties() {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "");
properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
return properties;
}
public static StreamExecutionEnvironment changeEnv(){
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().enableForceKryo();
// Enable checkpointing with an interval of 600000 ms (10 minutes)
// env.setStateBackend(new RocksDBStateBackend(chkPointPath));
env.enableCheckpointing(600000);
// Use exactly-once checkpointing mode
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Checkpoint timeout: a checkpoint that takes longer than 60000 ms is discarded (the default is 10 minutes)
env.getCheckpointConfig().setCheckpointTimeout(60000);
// Keep the job running when a checkpoint fails
env.getCheckpointConfig().setFailOnCheckpointingErrors(false);
// Allow at most one checkpoint in flight at a time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
return env;
}
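/**
* Create the Kafka source.
* @param topic Kafka topic to read
* @param groupid consumer group id
* @return the configured consumer
*/
public static FlinkKafkaConsumer011<KafkaSource> createKafkaSource(String topic, String groupid){
// Kafka consumer with a custom keyed deserialization schema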
FlinkKafkaConsumer011<KafkaSource> dataStream = new FlinkKafkaConsumer011<KafkaSource>(topic, new KeyedDeserializationSchema<KafkaSource>() {
@Override
public TypeInformation<KafkaSource> getProducedType() {
return TypeInformation.of(new TypeHint<KafkaSource>() {
});
}
@Override
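public KafkaSource deserialize(byte[] messageKey, byte[] message, String topic, int partition, long offset) throws IOException {
KafkaSource kafkasource = new KafkaSource();
kafkasource.setTopic(topic);
// Decode the payload as UTF-8; byte[].toString() would only return the array identity, not the content
kafkasource.setMessage(message == null ? null : new String(message, java.nio.charset.StandardCharsets.UTF_8));
kafkasource.setPartition(partition);
kafkasource.setOffset(offset);
// Save this record's offset to ZooKeeper. This uses the static client created in main(),
// so it relies on the job running in the same JVM, as in this local example.
HashMap<String,Long> partitionAndOffset = new HashMap<>();
partitionAndOffset.put(String.valueOf(partition),offset);
storeOffsets(partitionAndOffset,topic,groupid);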
return kafkasource;
}
@Override
public boolean isEndOfStream(KafkaSource s) {
return false;
}
}, getKafkaProperties(groupid));
// Default start position: begin from the latest record
dataStream.setStartFromLatest();
// Optionally commit offsets back to Kafka on checkpoints
// dataStream.setCommitOffsetsOnCheckpoints(true);
return dataStream;
}
}
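The ZooKeeper layout used by storeOffsets and getFromOffsets (one node per partition under /offsets/<topic>/<group>, holding the offset as text) can be tried out without a ZooKeeper server. The sketch below is only an illustration: an in-memory map stands in for the ZooKeeper nodes, and the names OffsetLayoutSketch, partitionPath, encode, and decode are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OffsetLayoutSketch {
    /** Path of the node that holds the offset of one partition. */
    public static String partitionPath(String topic, String group, int partition) {
        return String.format("/offsets/%s/%s/%d", topic, group, partition);
    }

    /** The offset is stored as decimal text, as in storeOffsets. */
    public static byte[] encode(long offset) {
        return Long.toString(offset).getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the node data back into an offset, as in getFromOffsets. */
    public static long decode(byte[] data) {
        return Long.parseLong(new String(data, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // In-memory stand-in for ZooKeeper node data: path -> bytes.
        Map<String, byte[]> fakeZk = new HashMap<>();
        fakeZk.put(partitionPath("tripGoodsCA00001", "test", 0), encode(128L));

        // On restart, the saved offset is read back from the same path.
        long restored = decode(fakeZk.get("/offsets/tripGoodsCA00001/test/0"));
        System.out.println(restored); // prints: 128
    }
}
```

Using an explicit charset on both sides keeps the stored bytes readable regardless of the platform default encoding.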