Apache Kafka Basics
1. Overview
Apache Kafka is a distributed streaming platform. This carries three meanings:
- Messaging system (MQ): publish and subscribe to streams of records
- Stream processing (Streaming): build stream-processing applications on top of Kafka to process records in real time
- Stream storage (Store): store streams of records in a secure, distributed, replicated, fault-tolerant way
Typical Apache Kafka use cases:
- Building real-time streaming data pipelines that reliably move data between applications and systems
- Building real-time streaming applications that transform or react to streams of data
- Serving as the data source for stream-processing frameworks (Storm, Spark, Flink, etc.)
Architecture
[architecture diagram]
Anatomy of a Kafka Topic
Read/write positions
- Producers add data by appending records to partitions; every written record has a position marker, the offset (starting at 0 and incrementing by 1 per record)
- Each consumer maintains its own consumption offset; as long as the data is within the retention period, the read offset can be adjusted freely (rewind to the past, jump ahead)
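The append-and-offset model above can be sketched as a tiny in-memory log (an illustrative model only, not Kafka code):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of a partition: an append-only log where each record
// gets the next offset, and each consumer keeps its own read position.
public class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // append a record and return the offset it was assigned
    public long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // a consumer can read from any retained offset (rewind or skip ahead)
    public String read(long offset) {
        return records.get((int) offset);
    }
}
```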
Consumer groups
Used to organize and manage consumers. In short:
different groups broadcast, the same group load-balances
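"Same group load-balances" means a topic's partitions are divided among the group's consumers. A simplified round-robin assignment sketch (real Kafka uses pluggable assignors such as range, round-robin, or sticky):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: split numPartitions across the consumers of one group.
public class AssignmentSketch {
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // round-robin: partition p goes to consumer (p mod group size)
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }
}
```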
2. Environment Setup
Fully distributed Kafka cluster
Prerequisites
- Three nodes
- JDK 8+
- Cluster clocks synchronized
- A healthy ZooKeeper ensemble
Installation
Upload the installation package
[root@HadoopNode01 ~]# scp kafka_2.11-2.2.0.tgz root@HadoopNode02:~
kafka_2.11-2.2.0.tgz 100% 61MB 30.5MB/s 00:02
[root@HadoopNode01 ~]# scp kafka_2.11-2.2.0.tgz root@HadoopNode03:~
kafka_2.11-2.2.0.tgz
Extract the archive
[root@HadoopNode0X ~]# tar -zxf kafka_2.11-2.2.0.tgz -C /usr
Configuration
[root@HadoopNode0X kafka_2.11-2.2.0]# vi config/server.properties
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0    # must be unique per broker: 0 on HadoopNode01, 1 on HadoopNode02, 2 on HadoopNode03
listeners=PLAINTEXT://HadoopNode01:9092    # use the local hostname on each node
log.dirs=/data/kafka
zookeeper.connect=HadoopNode01:2181,HadoopNode02:2181,HadoopNode03:2181
Startup
[root@HadoopNode0X kafka_2.11-2.2.0]# bin/kafka-server-start.sh -daemon config/server.properties
[root@HadoopNode0X kafka_2.11-2.2.0]# jps
7585 Jps
9346 DFSZKFailoverController
8885 NameNode
9004 DataNode
1500 QuorumPeerMain
7357 Kafka
9197 JournalNode
3. Usage
Command-line operations
Topic commands
Create
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --create --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic t1 --partitions 3 --replication-factor 3
This creates topic t1 with 3 partitions (each led by a leader); besides the leader itself, each partition has two redundant replicas.
List all
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --list --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
t1
Delete
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --delete --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic t1
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --list --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
Alter
Increase a topic's partition count (note: --alter only changes the number of partitions; to change the replication factor, use partition reassignment as shown next)
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --alter --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic t1 --partitions 5
Change the replication factor
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-reassign-partitions.sh --zookeeper HadoopNode01:2181,HadoopNode02:2181,HadoopNode03:2181 --reassignment-json-file /usr/kafka_2.11-2.2.0/config/change-replication-factor.json --execute
// The JSON file looks like this:
{
"partitions":
[
{
"topic": "t2",
"partition": 0,
"replicas": [1,2,0]
},
{
"topic": "t2",
"partition": 1,
"replicas": [0,2,1]
},
{
"topic": "t2",
"partition": 2,
"replicas": [0,1,2]
}
],
"version":1
}
Describe
Show detailed information about a topic
- Note that the Leader, Replicas, and Isr values are all broker IDs
- Partition is the partition's index within the topic
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --describe --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic t1
Topic:t1 PartitionCount:5 ReplicationFactor:3 Configs:segment.bytes=1073741824
Topic: t1 Partition: 0 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
Topic: t1 Partition: 1 Leader: 2 Replicas: 2,1,0 Isr: 2,1,0
Topic: t1 Partition: 2 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
Topic: t1 Partition: 3 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
Topic: t1 Partition: 4 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --describe --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic t1
Topic:t1 PartitionCount:5 ReplicationFactor:3 Configs:segment.bytes=1073741824
Topic: t1 Partition: 0 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
Topic: t1 Partition: 1 Leader: 1 Replicas: 2,1,0 Isr: 1,0
Topic: t1 Partition: 2 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
Topic: t1 Partition: 3 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1
Topic: t1 Partition: 4 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
Observations:
- Partition 0 of topic t1 is managed by broker id=0 (HadoopNode01)
- Partition 1 of topic t1 is managed by broker id=2 (HadoopNode03)
For example:
- If the broker with id=2 is killed, partition 1 (p1) fails over: one of the replicas on broker id=1 or id=0 is promoted to leader, as the second describe output above shows
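The failover behavior can be modeled in a few lines (a toy model of ISR-based leader election, not broker code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: when the leader broker dies, promote one of the surviving
// in-sync replicas (ISR) to leader.
public class FailoverSketch {
    public static int electLeader(List<Integer> isr, int deadBroker) {
        List<Integer> alive = new ArrayList<>(isr);
        alive.remove(Integer.valueOf(deadBroker));
        // pick the first surviving ISR member; -1 means the partition is offline
        return alive.isEmpty() ? -1 : alive.get(0);
    }
}
```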
Publish & Subscribe
Subscribe
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-console-consumer.sh --topic t1 --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --property print.key=true --property print.timestamp=true
CreateTime:1577092478538 null Hello Kafka
Publish
[root@HadoopNode02 kafka_2.11-2.2.0]# bin/kafka-console-producer.sh --broker-list HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic t1
>Hello Kafka
>
With the console producer, the record key is null.
Java API
Add the dependency
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.2.0</version>
</dependency>
Producer API
package com.baizhi.basic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
/**
 * Producer
 */
public class ProducerDemo {
public static void main(String[] args) {
// Producer configuration
Properties prop = new Properties();
prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
// record key/value generic types
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(prop);
// Publish a message through the producer
ProducerRecord<String, String> record = new ProducerRecord<String, String>("t2", "user002", "xz");
producer.send(record);
// Release resources
producer.flush();
producer.close();
}
}
Consumer API
package com.baizhi.basic;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
/**
 * Consumer
 */
public class ConsumerDemo {
public static void main(String[] args) {
// Configuration
Properties prop = new Properties();
prop.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
prop.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
// prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g1"); // consumer group: different groups broadcast, same group load-balances
prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g2"); // consumer group: different groups broadcast, same group load-balances
// Consumer object
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(prop);
// Subscribe to the topic
consumer.subscribe(Arrays.asList("t2"));
// Continuously poll for new records in topic t2
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
records.forEach(record -> {
System.out.println(
record.key()
+ "\t"
+ record.value()
+ "\t"
+ record.timestamp()
+ "\t"
+ record.offset()
+ "\t"
+ record.partition()
+ "\t"
+ record.topic()
);
});
}
}
}
Topic API
package com.baizhi.basic;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.KafkaFuture;
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ExecutionException;
/**
 * Admin API
 */
public class AdminDemo {
public static void main(String[] args) throws ExecutionException, InterruptedException {
Properties prop = new Properties();
prop.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
AdminClient adminClient = KafkaAdminClient.create(prop);
// Create a topic
/*
adminClient.createTopics(Arrays.asList(new NewTopic("t4", 3, (short) 3)));
*/
// Delete a topic
/*
adminClient.deleteTopics(Arrays.asList("t4"));
*/
/*
// List all (only user-created topics are shown)
ListTopicsResult topics = adminClient.listTopics();
KafkaFuture<Set<String>> names = topics.names();
Set<String> tNames = names.get();
tNames.forEach(name -> System.out.println(name));
*/
// Describe a topic
/*
t2 (name=t2, internal=false, partitions=
(partition=0, leader=HadoopNode02:9092 (id: 1 rack: null), replicas=HadoopNode02:9092 (id: 1 rack: null), HadoopNode03:9092 (id: 2 rack: null), HadoopNode01:9092 (id: 0 rack: null), isr=HadoopNode02:9092 (id: 1 rack: null), HadoopNode03:9092 (id: 2 rack: null), HadoopNode01:9092 (id: 0 rack: null)),
(partition=1, leader=HadoopNode01:9092 (id: 0 rack: null), replicas=HadoopNode01:9092 (id: 0 rack: null), HadoopNode03:9092 (id: 2 rack: null), HadoopNode02:9092 (id: 1 rack: null), isr=HadoopNode01:9092 (id: 0 rack: null), HadoopNode03:9092 (id: 2 rack: null), HadoopNode02:9092 (id: 1 rack: null)),
(partition=2, leader=HadoopNode01:9092 (id: 0 rack: null), replicas=HadoopNode01:9092 (id: 0 rack: null), HadoopNode02:9092 (id: 1 rack: null), HadoopNode03:9092 (id: 2 rack: null), isr=HadoopNode01:9092 (id: 0 rack: null), HadoopNode02:9092 (id: 1 rack: null), HadoopNode03:9092 (id: 2 rack: null)))
*/
DescribeTopicsResult result = adminClient.describeTopics(Arrays.asList("t2"));
Map<String, KafkaFuture<TopicDescription>> map = result.values();
map.forEach((k, v) -> {
try {
System.out.println(k + "\t" + v.get());
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
});
adminClient.close();
}
}
4. Advanced Features
Consumer groups
A mechanism for organizing and managing consumers: the same group load-balances, different groups broadcast.
Load balancing within a group
How to test: run multiple consumer instances with the same consumer group
// Run multiple consumer service instances
prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g1"); // consumer group: different groups broadcast, same group load-balances
Conclusions:
- Same group, load balancing: each consumer in the group handles one or more partitions, and (normally) partitions map to consumers without overlap
- When some consumers in a group fail, the partitions they were handling are automatically reassigned to the remaining live consumers (fault tolerance)
Broadcast across groups
The consumers belong to different consumer groups
prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g2"); // consumer group: different groups broadcast, same group load-balances
Conclusions:
- Topic data is delivered to every consumer group, but within each group only one consumer processes any given record
Producer record partitioning strategies
Kafka producers generate records and send them to the Kafka cluster for durable storage. The partitioning strategies are:
- If the record key is non-null, hash the key modulo the partition count (key.hashCode % numPartitions = partition index)
- If the record key is null, distribute records across partitions round-robin
- Manually specify the partition index for the record
// Test producer partitioning strategies
// key != null
/*
ProducerRecord<String, String> record1 = new ProducerRecord<String, String>("t2", "user006", "xh1");
ProducerRecord<String, String> record2 = new ProducerRecord<String, String>("t2", "user006", "xh2");
ProducerRecord<String, String> record3 = new ProducerRecord<String, String>("t2", "user006", "xh3");
producer.send(record1);
producer.send(record2);
producer.send(record3);
*/
// key == null
/*
ProducerRecord<String, String> record1 = new ProducerRecord<String, String>("t2", "xh1");
ProducerRecord<String, String> record2 = new ProducerRecord<String, String>("t2", "xh2");
ProducerRecord<String, String> record3 = new ProducerRecord<String, String>("t2", "xh3");
producer.send(record1);
producer.send(record2);
producer.send(record3);
*/
// Manually specify the partition index
ProducerRecord<String, String> record1 = new ProducerRecord<String, String>("t2", 0, "user007", "xh4");
producer.send(record1);
ProducerRecord<String, String> record2 = new ProducerRecord<String, String>("t2", 2, "user007", "xh5");
producer.send(record2);
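The key-hash rule above boils down to the following (a simplified sketch: the real DefaultPartitioner hashes the serialized key with murmur2; String.hashCode is used here purely for illustration):

```java
// Simplified sketch of key-based partition selection.
public class PartitionFor {
    public static int partitionFor(String key, int numPartitions) {
        // mask the sign bit so negative hash codes still yield a valid index
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

This is why records with the same key always land in the same partition, and therefore stay in order relative to each other.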
Consumption modes
A Kafka consumer subscribes to one or more topics of interest; when new data arrives in those topics, the consumer pulls the latest records and runs its business logic on them.
There are three consumption modes:
- subscribe: subscribe to all partitions of 1..N topics
- assign: consume only a specific partition of one topic
- assign + seek: manually set the consumption position; each consumer maintains consumption metadata (the read offset), which can be reset by hand to re-consume already-processed data or to skip uninteresting data
// Subscribe to topics
// consumer.subscribe(Arrays.asList("t2"));
// Assign a specific partition: only process partition 0 of topic t2
// consumer.assign(Arrays.asList(new TopicPartition("t2",0)));
// Manually set the consumption position
consumer.assign(Arrays.asList(new TopicPartition("t2",0)));
consumer.seek(new TopicPartition("t2",0),31);
Offset reset on first subscription
When a consumer group subscribes to a topic for the first time, the default offset reset policy is latest; the alternative is earliest.
Automatic offset reset: auto.offset.reset = latest
Conclusions:
- latest: if the partition has a committed offset, consume from just after it; if there is no committed offset, consume only data produced from now on (the tail of the log)
- earliest: if the partition has a committed offset, consume from just after it; if there is no committed offset, consume from the beginning of the partition
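The two policies differ only when no committed offset exists, which the following illustrative sketch captures in one function:

```java
// Sketch: auto.offset.reset is consulted only when the group has no
// committed offset for the partition.
public class ResetPolicy {
    public static long startingOffset(Long committed, long earliest, long latest, String policy) {
        if (committed != null) {
            return committed; // a committed offset always wins
        }
        return "earliest".equals(policy) ? earliest : latest;
    }
}
```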
Note:
- Consumption positions are managed per consumer group, and Kafka records them in a special system topic, __consumer_offsets, which tracks each group's position in every topic. __consumer_offsets has 50 partitions and (in this cluster) a replication factor of 1.
// latest
// prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g1"); // consumer group: different groups broadcast, same group load-balances
// prop.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"latest");
// earliest
prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g2"); // consumer group: different groups broadcast, same group load-balances
prop.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"earliest");
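Which of the 50 __consumer_offsets partitions holds a given group's positions is determined by hashing the group id. A sketch of the groupId.hashCode % numPartitions scheme (treat the exact formula as an assumption about Kafka internals):

```java
// Sketch: map a consumer group id to a partition of __consumer_offsets.
public class OffsetsPartition {
    public static int partitionFor(String groupId, int numPartitions) {
        // Math.abs is fine for a sketch (edge case: Integer.MIN_VALUE)
        return Math.abs(groupId.hashCode()) % numPartitions;
    }
}
```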
Consumer offset control [important]
A consumer's position can be committed in two ways: automatically (the default) or manually.
A commit saves the group's current consumption offset into __consumer_offsets.
Automatic commit
# Every 5 seconds, write the group's current consumption positions to __consumer_offsets
enable.auto.commit = true
auto.commit.interval.ms = 5000
Manual commit
Commit the read offset only when the business logic has completed successfully; on failure, do not commit, so the next poll delivers the unprocessed records again.
enable.auto.commit = false
prop.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
// Manually commit the position once the business logic has completed
// Asynchronous commit
consumer.commitAsync();
// The position can be committed asynchronously or synchronously; asynchronous is recommended (more efficient)
// consumer.commitAsync();
HashMap<TopicPartition, OffsetAndMetadata> map = new HashMap<>();
// Commit record.offset() + 1 so the next fetch starts at the first unread record
map.put(new TopicPartition(record.topic(),record.partition()),new OffsetAndMetadata(record.offset() + 1));
consumer.commitSync(map);
Transferring custom object types
When producing: the custom type needs a Serializer<T> implementation
When consuming: the custom type needs a Deserializer<T> implementation
Serialization strategies: JSON, JDK serialization, frameworks, etc.
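The JDK strategy used below (via commons-lang3 SerializationUtils) is just an ObjectOutputStream/ObjectInputStream round trip; a stdlib-only sketch of what it does:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// What SerializationUtils.serialize/deserialize boil down to.
public class JdkRoundTrip {
    public static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T> T deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) ois.readObject();
        }
    }
}
```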
Add the dependencies
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.6</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.56</version>
</dependency>
package com.baizhi.transform;
import java.io.Serializable;
import java.util.Date;
/**
 * Custom object - must implement the Serializable interface
 */
public class User implements Serializable {
private Integer id;
private String name;
private Boolean sex;
// excluded from serialization
transient private Date birthday;
// getters/setters omitted
}
//---------------------------------------------------------------------------------
package com.baizhi.transform;
import org.apache.commons.lang3.SerializationUtils;
import org.apache.kafka.common.serialization.Serializer;
import java.util.Map;
/**
 * Serializer
 */
public class UserToByteArray implements Serializer<User> {
public void configure(Map<String, ?> map, boolean b) {
}
/**
 * Serialization: User ---> bytes
 *
 * @param topic
 * @param user
 * @return
 */
public byte[] serialize(String topic, User user) {
return SerializationUtils.serialize(user);
}
public void close() {
}
}
//---------------------------------------------------------------------------------
package com.baizhi.transform;
import org.apache.commons.lang3.SerializationUtils;
import org.apache.kafka.common.serialization.Deserializer;
import java.util.Map;
/**
 * Deserializer
 */
public class ByteArrayToUser implements Deserializer<User> {
public void configure(Map<String, ?> map, boolean b) {
}
/**
 * Deserialization: bytes ---> User
 * @param topic
 * @param bytes
 * @return
 */
public User deserialize(String topic, byte[] bytes) {
return SerializationUtils.deserialize(bytes);
}
public void close() {
}
}
//---------------------------------------------------------------------------------
package com.baizhi.transform;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Date;
import java.util.Properties;
/**
 * Producer
 */
public class ProducerDemo {
public static void main(String[] args) {
// Producer configuration
Properties prop = new Properties();
prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, UserToByteArray.class);
// record key/value generic types
KafkaProducer<String, User> producer = new KafkaProducer<String, User>(prop);
// Publish a message through the producer
ProducerRecord<String, User> record = new ProducerRecord<String, User>("t4", "user001", new User(1, "zs", true, new Date()));
producer.send(record);
// Release resources
producer.flush();
producer.close();
}
}
//---------------------------------------------------------------------------------
package com.baizhi.transform;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Properties;
/**
 * Consumer
 */
public class ConsumerDemo {
public static void main(String[] args) {
// Configuration
Properties prop = new Properties();
prop.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
prop.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayToUser.class);
prop.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g1");
prop.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"earliest");
// Consumer object
KafkaConsumer<String, User> consumer = new KafkaConsumer<String, User>(prop);
// Subscribe to the topic
consumer.subscribe(Arrays.asList("t4"));
// Continuously poll for new records in topic t4
while (true) {
ConsumerRecords<String, User> records = consumer.poll(Duration.ofSeconds(5));
records.forEach(record -> {
System.out.println(
record.key()
+ "\t"
+ record.value()
+ "\t"
+ record.timestamp()
+ "\t"
+ record.offset()
+ "\t"
+ record.partition()
+ "\t"
+ record.topic()
);
// The position can be committed asynchronously or synchronously; asynchronous is recommended (more efficient)
// consumer.commitAsync();
HashMap<TopicPartition, OffsetAndMetadata> map = new HashMap<>();
// Commit record.offset() + 1 so the next fetch starts at the first unread record
map.put(new TopicPartition(record.topic(),record.partition()),new OffsetAndMetadata(record.offset() + 1));
consumer.commitSync(map);
});
}
}
}
Producer batching
Records generated by a Kafka producer are first buffered; the buffered records are then written to the cluster in one batch, either periodically or once the buffer is nearly full.
Batching is a common Kafka write optimization: it uses resources more efficiently, at the cost of some latency.
batch.size = 4096
linger.ms = 5000
prop.put(ProducerConfig.BATCH_SIZE_CONFIG, 4096); // 4096 bytes = 4 KB batch buffer size
prop.put(ProducerConfig.LINGER_MS_CONFIG, 2000);  // how long a batch may linger before it is sent
// A batch is sent as soon as either condition is met
Tip: enable log4j logging to watch the batching behavior in detail.
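The two trigger conditions can be expressed as one predicate (an illustrative sketch; the real producer also flushes on flush(), close(), etc.):

```java
// Sketch: a batch is sent when it reaches batch.size bytes OR when
// linger.ms has elapsed since the first record, whichever comes first.
public class BatchTrigger {
    public static boolean shouldFlush(int batchBytes, int batchSizeLimit,
                                      long msSinceFirstRecord, long lingerMs) {
        return batchBytes >= batchSizeLimit || msSinceFirstRecord >= lingerMs;
    }
}
```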
Ack & Retries
To ensure data is written to the cluster correctly, Kafka provides an acknowledgment (ack) mechanism.
Ack policies:
- acks = 0: no acknowledgment required
- acks = 1: acknowledge as soon as the data is written to the leader partition
- acks = all (or -1): acknowledge only after the data is written to the leader and synchronized to the replicas
Because of the ack mechanism, if a record published by a producer does not receive an ack within the timeout, the producer retries the write until the data is written successfully.
// Producer defaults
acks = 1
request.timeout.ms = 30000
retries = 2147483647
prop.put(ProducerConfig.ACKS_CONFIG, "all"); // ack only after leader and replicas have the data
prop.put(ProducerConfig.RETRIES_CONFIG, 10); // number of retries
prop.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 1); // 1 ms request timeout, only to simulate retries
Note:
Because of the retry mechanism, the cluster may end up storing the same record more than once. To keep exactly one copy of each record, enable Kafka's idempotent writes.
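The retry loop that produces those duplicates can be sketched as follows (a hypothetical model, not the client's internals):

```java
// Sketch: resend until an ack arrives or retries are exhausted.
// When an ack is merely *lost* (the write succeeded but the response
// did not arrive), the retry creates a duplicate record.
public class RetrySketch {
    public interface Broker {
        boolean tryWrite(String record); // true = ack received
    }

    public static int sendWithRetries(Broker broker, String record, int retries) {
        for (int attempt = 1; attempt <= retries + 1; attempt++) {
            if (broker.tryWrite(record)) {
                return attempt; // number of attempts it took
            }
        }
        return -1; // gave up after all retries
    }
}
```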
Idempotent writes
Idempotent: performing an operation once or many times produces the same result.
prop.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
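How idempotence cancels retry duplicates can be modeled with per-producer sequence numbers (a toy model of the broker-side dedup; the real protocol tracks producer id, epoch, and sequence per partition):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: the broker remembers the last sequence number seen from each
// producer and drops any record whose sequence is not newer.
public class IdempotentLog {
    private final Map<Long, Integer> lastSeq = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    // returns true if appended, false if recognized as a retry duplicate
    public boolean append(long producerId, int seq, String value) {
        Integer last = lastSeq.get(producerId);
        if (last != null && seq <= last) {
            return false; // duplicate caused by a retry
        }
        lastSeq.put(producerId, seq);
        log.add(value);
        return true;
    }

    public int size() {
        return log.size();
    }
}
```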
Spring Boot Integration
Create a Spring Boot project
Add the Kafka configuration
#====================== kafka =========================
spring.kafka.bootstrap-servers=HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.acks=all
spring.kafka.producer.retries=10
spring.kafka.producer.batch-size=4096
spring.kafka.consumer.group-id=g1
spring.kafka.consumer.key-deserializer=org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer=org.apache.kafka.common.serialization.StringDeserializer
Producer demo
package com.baizhi.kafkasb;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.util.concurrent.FailureCallback;
import org.springframework.util.concurrent.ListenableFuture;
import org.springframework.util.concurrent.SuccessCallback;
@SpringBootTest
class KafkaSbApplicationTests {
/**
 * Producer demo
 */
@Autowired
private KafkaTemplate<String, String> kafkaTemplate;
@Test
void test1() {
ListenableFuture<SendResult<String, String>> future = kafkaTemplate.send("t3", "user00210", "xh210");
// Result object for asynchronous handling
/*
future.addCallback(
new SuccessCallback<SendResult<String, String>>() {
@Override
public void onSuccess(SendResult<String, String> stringStringSendResult) {
System.out.println("Send succeeded!");
}
}, new FailureCallback() {
@Override
public void onFailure(Throwable throwable) {
System.out.println("Send failed!");
throwable.printStackTrace();
}
});
*/
// Functional (lambda) style
future.addCallback(
(stringStringSendResult) -> {
System.out.println("Send succeeded!");
},
(t) -> {
System.out.println("Send failed!");
t.printStackTrace();
}
);
}
}
Consumer demo
/**
 * Consumer demo
 */
@KafkaListener(topics = "t3", groupId = "g1")
public void receive(ConsumerRecord<String, String> record) {
System.out.println(
record.key()
+ "\t"
+ record.value()
+ "\t"
+ record.timestamp()
+ "\t"
+ record.offset()
+ "\t"
+ record.partition()
+ "\t"
+ record.topic()
);
}
Kafka Transactions
What is a transaction? A transaction is a coherent, indivisible set of operations that either all succeed or all fail.
Kafka transactions are similar to database transactions; there are only two isolation levels: read_uncommitted (the default) and read_committed.
// Initialize transactions
producer.initTransactions()
// Begin a transaction
producer.beginTransaction()
// Commit the transaction
producer.commitTransaction()
// Abort the transaction
producer.abortTransaction()
// Send consumed offsets within the transaction
producer.sendOffsetsToTransaction
Producer transactions
When publishing, a producer can place multiple records in the same transaction as one atomic, indivisible unit: they are either all published or all rolled back.
package com.baizhi.transaction;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
import java.util.UUID;
/**
 * Producer transaction
 */
public class ProducerDemo {
public static void main(String[] args) {
// Producer configuration
Properties prop = new Properties();
prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
prop.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // transactions require idempotent writes
prop.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString()); // transactions require a unique transactional id
// record key/value generic types
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(prop);
// Initialize the transaction environment
producer.initTransactions();
// Begin the transaction
producer.beginTransaction();
try {
// Publish messages through the producer
for (int i = 220; i < 240; i++) {
ProducerRecord<String, String> record = new ProducerRecord<String, String>("t3", "user00" + i, "xh" + i);
/*
if (i == 235) {
// simulate a business error
int m = 1 / 0;
}
*/
producer.send(record);
}
// Commit the transaction
producer.commitTransaction();
}catch (Exception e){
// Abort the transaction
producer.abortTransaction();
e.printStackTrace();
}
// Release resources
producer.flush();
producer.close();
}
}
Consume-and-produce transactions
Also known as consume-transform-produce;
package com.baizhi.transaction.transfer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.time.Duration;
import java.util.*;
/**
 * Consume-transform-produce transaction
 */
public class ConsumeTransferProduceTransaction {
public static void main(String[] args) {
KafkaProducer<String, String> kafkaProducer = bulidKafkaProducer();
KafkaConsumer<String, String> kafkaConsumer = bulidKafkaConsumer();
kafkaConsumer.subscribe(Arrays.asList("t5"));
kafkaProducer.initTransactions();
while (true) {
kafkaProducer.beginTransaction();
ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(5));
Map<TopicPartition, OffsetAndMetadata> map = new HashMap<>();
try {
consumerRecords.forEach(record -> {
System.out.println(record.key() + "\t" + record.value() + "\t" + record.offset());
// simulate a business error
/*
if ("xz".equals(record.value())) {
int m = 1 / 0;
}
*/
kafkaProducer.send(new ProducerRecord<String, String>("t6", record.key(), record.value() + "?"));
// Note: in a consume-transform-produce transaction, the consumer's offsets must be committed via the producer's sendOffsetsToTransaction method
// Record the consumer's positions in the map
TopicPartition key = new TopicPartition(record.topic(), record.partition());
OffsetAndMetadata value = new OffsetAndMetadata(record.offset() + 1);
map.put(key, value);
});
kafkaProducer.sendOffsetsToTransaction(map, "g1");
kafkaProducer.commitTransaction();
} catch (Exception e) {
kafkaProducer.abortTransaction();
}
}
}
public static KafkaProducer<String, String> bulidKafkaProducer() {
// Producer configuration
Properties prop = new Properties();
prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
prop.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // transactions require idempotent writes
prop.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString());
KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(prop);
return kafkaProducer;
}
public static KafkaConsumer<String, String> bulidKafkaConsumer() {
// Configuration
Properties prop = new Properties();
prop.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092");
prop.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
prop.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
prop.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // offsets must be committed manually
prop.put(ConsumerConfig.GROUP_ID_CONFIG, "g1");
prop.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // read_committed isolation avoids dirty reads
prop.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// Consumer object
KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<String, String>(prop);
return kafkaConsumer;
}
}
5. Integrating Flume with Kafka
Kafka Source
Reads data from Kafka as the source of a Flume data-collection pipeline
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
a1.sources.r1.kafka.topics = t3
a1.sources.r1.kafka.consumer.group.id = g1
Kafka Channel
Buffers the data collected by Flume in Kafka
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
a1.channels.c1.kafka.topic = flume_channel
a1.channels.c1.kafka.consumer.group.id = g1
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-console-consumer.sh --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --topic flume_channel --from-beginning
11
22
33
--from-beginning is equivalent to earliest
Kafka Sink [important]
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy