1. Kafka Overview
1. Kafka Architecture
2. Installing Kafka
- Install ZooKeeper
  Reference: https://blog.csdn.net/qq_55906442/article/details/121764357?spm=1001.2014.3001.5501
- Install Kafka
  Reference: https://blog.csdn.net/qq_55906442/article/details/121769929?spm=1001.2014.3001.5502
- Edit server.properties under config:
# Globally unique broker ID; must not be duplicated
broker.id=0
# Enable topic deletion; in production this is usually false
delete.topic.enable=true
# Number of threads handling network requests
num.network.threads=3
# Number of threads handling disk I/O
num.io.threads=8
# Send socket buffer size
socket.send.buffer.bytes=102400
# Receive socket buffer size
socket.receive.buffer.bytes=102400
# Maximum request socket size
socket.request.max.bytes=104857600
# Path where Kafka run logs and data are stored
log.dirs=/training/kafka-2.3.1/logs
# Default number of partitions per topic on this broker
num.partitions=1
# Number of threads used to recover and clean data under each data dir
num.recovery.threads.per.data.dir=1
# Maximum time a segment file is retained; expired segments are deleted
log.retention.hours=24
# ZooKeeper cluster connection address
zookeeper.connect=hadoop001:2181
- Start Kafka:
bin/kafka-server-start.sh -daemon config/server.properties
3. Kafka Command-Line Operations
- Create a topic
bin/kafka-topics.sh --bootstrap-server 10.211.55.3:9092 --topic first --create --partitions 1 --replication-factor 1
- List all topics
bin/kafka-topics.sh --bootstrap-server 10.211.55.3:9092 --list
# Describe the partitions of a topic
bin/kafka-topics.sh --bootstrap-server 10.211.55.3:9092 --describe --topic abc
- Send and consume messages
./kafka-console-producer.sh --bootstrap-server 10.211.55.3:9092 --topic first
./kafka-console-consumer.sh --bootstrap-server 10.211.55.3:9092 --topic first
# Read historical messages from the beginning
./kafka-console-consumer.sh --bootstrap-server 10.211.55.3:9092 --topic first --from-beginning
2. Kafka Producer
1. Producer Internals
- The producer sends records via send(). A record first passes through the interceptors (optional; they can be skipped), then the serializer (Kafka uses its own serializers because Java's built-in serialization is too heavyweight, carrying extra metadata for safe transmission), and then the partitioner, which appends it to a buffer (the RecordAccumulator; 32 MB total by default, with 16 KB batches). A Sender thread pulls batches from the buffer once a batch reaches 16 KB or the linger time expires, wraps them into requests keyed by broker with the batch data as the value, and hands them to the NetworkClient; a Selector then handles the actual I/O streams with the broker. The broker replies with an acknowledgement; on success, the in-flight request in the NetworkClient and the corresponding data in the buffer are removed. A sketch of a custom interceptor follows.
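The interceptor stage mentioned above can be customized. Below is a minimal sketch using Kafka's ProducerInterceptor API; the class name LoggingInterceptor and the logging behavior are illustrative assumptions, not something from these notes.

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LoggingInterceptor implements ProducerInterceptor<String, String> {
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Runs before serialization and partitioning; may return a modified record
        System.out.println("sending: " + record.value());
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Runs when the broker acknowledges the record, or when the send fails
        if (exception != null) {
            exception.printStackTrace();
        }
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

It would be registered through the producer property interceptor.classes (ProducerConfig.INTERCEPTOR_CLASSES_CONFIG).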
2. Spring Boot Integration
<dependency>
<groupId>org.springframework.kafka</groupId>
<artifactId>spring-kafka</artifactId>
<version>2.8.7</version>
</dependency>
spring:
  application:
    name: kafka
  kafka:
    bootstrap-servers: 10.211.55.3:9092
    producer: # producer settings
      retries: 0 # number of retries
      acks: 1 # ack level: how many partition replicas must confirm the write before the producer receives an ack (0, 1, all/-1)
      batch-size: 16384 # batch size in bytes
      buffer-memory: 33554432 # producer-side buffer size in bytes
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      # value-serializer: com.itheima.demo.config.MySerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
    consumer: # consumer settings
      group-id: javagroup # default consumer group ID
      enable-auto-commit: true # whether to auto-commit offsets
      auto-commit-interval: 100 # auto-commit interval (how long after a message is received the offset is committed)
      # earliest: if a committed offset exists for a partition, consume from it; otherwise consume from the beginning
      # latest: if a committed offset exists for a partition, consume from it; otherwise consume only newly produced data
      # none: if committed offsets exist for all partitions, consume from them; if any partition lacks one, throw an exception
      auto-offset-reset: latest
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      # value-deserializer: com.itheima.demo.config.MyDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
- Synchronous send
@RestController
@RequestMapping("/producer")
@Slf4j
public class KafKaProducer {

    @Autowired
    private KafkaTemplate<String, Object> kafkaTemplate;

    @RequestMapping("/syncSend")
    public void syncSend() throws ExecutionException, InterruptedException, TimeoutException {
        for (int i = 0; i < 3; i++) {
            ListenableFuture<SendResult<String, Object>> future = kafkaTemplate.send("first", "ade");
            // Note: a wait timeout can be set; once it expires we stop waiting for the result
            SendResult<String, Object> result = future.get(3, TimeUnit.SECONDS);
            log.info("send result:{}", result.getProducerRecord().value());
        }
    }
}
- Consumer receiving messages
@Slf4j
@Component
public class KafkaConsumer {

    @KafkaListener(topics = "first")
    public void receiveString(String message) {
        // Use SLF4J placeholders rather than "%s" string concatenation
        log.info("Message: {}", message);
    }
}
- Asynchronous send
@RestController
@RequestMapping("/producer")
@Slf4j
public class KafKaProducer {

    @Autowired
    private KafkaTemplate<String, Object> kafkaTemplate;

    @RequestMapping("/asyncSend")
    public void asyncSend() {
        for (int i = 0; i < 3; i++) {
            ListenableFuture<SendResult<String, Object>> future = kafkaTemplate.send("first", "ade");
            future.addCallback(new ListenableFutureCallback<SendResult<String, Object>>() {
                @Override
                public void onSuccess(SendResult<String, Object> result) {
                    System.out.println("send success, result = " + result);
                }

                @Override
                public void onFailure(Throwable ex) {
                    System.out.println("send failed");
                    ex.printStackTrace();
                }
            });
        }
    }
}
- Parameters for improving producer throughput (a sketch follows)
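Since the notes list no values here, the following is a minimal sketch of the usual throughput-related settings; the concrete numbers are illustrative, with the defaults noted in the comments.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ThroughputConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        // batch.size: default 16 KB; larger batches improve throughput
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
        // linger.ms: default 0; waiting a few ms lets batches fill up
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        // compression.type: default none; snappy is a common choice
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        // buffer.memory: RecordAccumulator size, default 32 MB
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432L);
        return props;
    }
}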
3. Data Reliability
- The Kafka producer guarantees data reliability through the ack mechanism (a config sketch follows this list).
- acks=0: the producer does not wait for the data to be flushed to disk before getting a response.
- acks=1: the leader responds as soon as it has received the data.
- acks=-1 (all): the leader responds only after the leader and all nodes in the ISR have received the data.
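A minimal sketch of the matching producer configuration, assuming the broker address used earlier in these notes; pairing acks=all with the broker/topic setting min.insync.replicas >= 2 is common practice, stated here as an assumption rather than taken from these notes.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ReliableProducerConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        // Wait for the leader plus the entire ISR before a send is considered successful
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures instead of dropping data
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return props;
    }
}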
4. Data Deduplication
- Idempotence
- Idempotence only guarantees no duplicates within a single partition and a single session, so producer transactions are needed for stronger guarantees.
- Enabling transactions requires idempotence to be enabled (setting a transactional.id enables it automatically), as in the example below.
// Wrapped in a class with main() and imports so the example is runnable as-is
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerDemo {
    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Setting a transactional.id implicitly enables idempotence
        properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transactionalId");
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // Initialize the transaction
        producer.initTransactions();
        // Begin the transaction
        producer.beginTransaction();
        try {
            // Business logic
            ProducerRecord<String, String> record1 = new ProducerRecord<>("kafka-demo", "msg1");
            producer.send(record1);
            ProducerRecord<String, String> record2 = new ProducerRecord<>("kafka-demo", "msg2");
            producer.send(record2);
            ProducerRecord<String, String> record3 = new ProducerRecord<>("kafka-demo", "msg3");
            producer.send(record3);
            // Commit the transaction
            producer.commitTransaction();
        } catch (Exception e) {
            e.printStackTrace();
            // Abort the transaction
            producer.abortTransaction();
        } finally {
            producer.close();
        }
    }
}
5. Data Ordering
- Kafka guarantees order within a single partition (under conditions; a sketch follows). Across partitions there is no ordering guarantee.
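A minimal sketch of the usual "conditions": with idempotence enabled, the broker preserves ordering for up to 5 in-flight requests per connection; without it, max.in.flight.requests.per.connection would have to be 1. This pairing is common practice, stated here as an assumption.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class OrderedProducerConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // With idempotence on, ordering is preserved as long as this stays <= 5
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
        return props;
    }
}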
3. Broker
1. Broker Workflow
1. Information Stored in ZooKeeper
- Client tool: https://github.com/vran-dev/PrettyZoo/releases
2. Overall Broker Workflow
3. Commissioning and Decommissioning Nodes
- Commissioning a new node
- Create a file describing the topics to rebalance
vim topics-to-move.json
{
  "topics": [
    {"topic": "first"}
  ],
  "version": 1
}
- Generate a rebalance plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --topics-to-move-json-file topics-to-move.json --broker-list "0,1,2,3" --generate
Current partition replica assignment
{"version":1,"partitions":[{"topic":"first","partition":0,"replicas":[0,2,1],"log_dirs":["any","any","any"]},{"topic":"first","partition":1,"replicas":[2,1,0],"log_dirs":["any","any","any"]},{"topic":"first","partition":2,"replicas":[1,0,2],"log_dirs":["any","any","any"]}]}
Proposed partition reassignment configuration
{"version":1,"partitions":[{"topic":"first","partition":0,"replicas":[2,3,0],"log_dirs":["any","any","any"]},{"topic":"first","partition":1,"replicas":[3,0,1],"log_dirs":["any","any","any"]},{"topic":"first","partition":2,"replicas":[0,1,2],"log_dirs":["any","any","any"]}]}
- Create the replica storage plan (all replicas stored on broker0, broker1, broker2, broker3)
vim increase-replication-factor.json
# Paste the following content:
{"version":1,"partitions":[{"topic":"first","partition":0,"replicas":[2,3,0],"log_dirs":["any","any","any"]},{"topic":"first","partition":1,"replicas":[3,0,1],"log_dirs":["any","any","any"]},{"topic":"first","partition":2,"replicas":[0,1,2],"log_dirs":["any","any","any"]}]}
- Execute the replica storage plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --reassignment-json-file increase-replication-factor.json --execute
- Verify the replica storage plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --reassignment-json-file increase-replication-factor.json --verify
Status of partition reassignment:
Reassignment of partition first-0 is complete.
Reassignment of partition first-1 is complete.
Reassignment of partition first-2 is complete.
Clearing broker-level throttles on brokers 0,1,2,3
Clearing topic-level throttles on topic first
- Decommissioning an old node
- Create a file describing the topics to rebalance
vim topics-to-move.json
{
  "topics": [
    {"topic": "first"}
  ],
  "version": 1
}
- Generate the execution plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --topics-to-move-json-file topics-to-move.json --broker-list "0,1,2" --generate
Current partition replica assignment
{"version":1,"partitions":[{"topic":"first","partition":0,"replicas":[2,0,1],"log_dirs":["any","any","any"]},{"topic":"first","partition":1,"replicas":[3,1,2],"log_dirs":["any","any","any"]},{"topic":"first","partition":2,"replicas":[0,2,3],"log_dirs":["any","any","any"]}]}
Proposed partition reassignment configuration
{"version":1,"partitions":[{"topic":"first","partition":0,"replicas":[2,0,1],"log_dirs":["any","any","any"]},{"topic":"first","partition":1,"replicas":[0,1,2],"log_dirs":["any","any","any"]},{"topic":"first","partition":2,"replicas":[1,2,0],"log_dirs":["any","any","any"]}]}
- Create the replica storage plan (all replicas stored on broker0, broker1, broker2)
vim increase-replication-factor.json
{"version":1,"partitions":[{"topic":"first","partition":0,"replicas":[2,0,1],"log_dirs":["any","any","any"]},{"topic":"first","partition":1,"replicas":[0,1,2],"log_dirs":["any","any","any"]},{"topic":"first","partition":2,"replicas":[1,2,0],"log_dirs":["any","any","any"]}]}
- Execute the replica storage plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --reassignment-json-file increase-replication-factor.json --execute
- Verify the replica storage plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --reassignment-json-file increase-replication-factor.json --verify
2. Kafka Replicas
1. Replica Basics
2. Handling Leader and Follower Failures
- How a follower failure is handled
- How a leader failure is handled
3. Automatic Leader Partition Balancing
- In practice it is recommended to turn auto.leader.rebalance.enable off, because rebalancing noticeably hurts performance (the related broker settings are sketched below).
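A sketch of the related settings in server.properties; to my knowledge the defaults are true, 10, and 300 respectively, noted in the comments as assumptions.

# Default true; set to false in production if rebalancing costs too much
auto.leader.rebalance.enable=false
# A broker counts as imbalanced when its ratio of non-preferred leaders exceeds this percentage (default 10)
leader.imbalance.per.broker.percentage=10
# How often the controller checks for imbalance (default 300 seconds)
leader.imbalance.check.interval.seconds=300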
4. Increasing the Replication Factor
- In production, when a topic's importance rises we may want more replicas. Increasing the replica count requires drawing up a plan first and then executing it.
- Create the topic
bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --create --partitions 3 --replication-factor 1 --topic four
- Manually add replica storage
# Create the replica storage plan (all replicas stored on broker0, broker1, broker2)
vim increase-replication-factor.json
# Paste the following content:
{"version":1,"partitions":[{"topic":"four","partition":0,"replicas":[0,1,2]},{"topic":"four","partition":1,"replicas":[0,1,2]},{"topic":"four","partition":2,"replicas":[0,1,2]}]}
- Execute the replica storage plan
bin/kafka-reassign-partitions.sh --bootstrap-server hadoop102:9092 --reassignment-json-file increase-replication-factor.json --execute
3. File Storage
1. File Storage Mechanism
- Where exactly is topic data stored?
- Start a producer and send a message
bin/kafka-console-producer.sh --bootstrap-server hadoop102:9092 --topic first
>hello world
- Inspect the files under /opt/module/kafka/datas/first-1 (or first-0, first-2) on hadoop102 (or hadoop103, hadoop104)
ls
00000000000000000092.index
00000000000000000092.log
00000000000000000092.snapshot
00000000000000000092.timeindex
leader-epoch-checkpoint
partition.metadata
- Viewing the .log file directly shows unreadable binary
cat 00000000000000000092.log
- Inspect the index and log files with Kafka's dump tool
kafka-run-class.sh kafka.tools.DumpLogSegments --files ./00000000000000000000.index
Dumping ./00000000000000000000.index
offset: 3 position: 152
kafka-run-class.sh kafka.tools.DumpLogSegments --files ./00000000000000000000.log
- Details of the index and log files
- Note: log storage parameter configuration
2. File Cleanup Policy
- The default log retention time in Kafka is 7 days; it can be adjusted with the following parameters:
- log.retention.hours: hours, lowest priority; defaults to 7 days (168)
- log.retention.minutes: minutes
- log.retention.ms: milliseconds, highest priority
- log.retention.check.interval.ms: how often expiration is checked; defaults to 5 minutes
- Once a log exceeds the configured retention time, Kafka offers two cleanup policies: delete and compact.
- compact (log compaction)
Log compaction: for different values with the same key, only the latest version is kept.
log.cleanup.policy = compact enables the compaction policy for all data.
After compaction the offsets may be discontinuous; for example, if offset 6 has been compacted away, a consumer requesting it will get the message at the next larger offset (7) and continue consuming from there.
This policy only suits special scenarios, e.g. the message key is a user ID and the value is the user's profile; with compaction, the log then holds the latest profile of every user.
- delete (log deletion): removes expired data
log.cleanup.policy = delete enables the deletion policy for all data.
(1) Time-based: enabled by default. A segment's timestamp is the largest record timestamp it contains.
(2) Size-based: disabled by default. When the total log size exceeds the limit, the earliest segments are deleted.
log.retention.bytes: defaults to -1, meaning unlimited.
If part of a segment has expired and part has not, the segment is judged by its latest timestamp.
4. Efficient Reads and Writes
- Kafka itself is a distributed cluster and uses partitioning, so parallelism is high.
- Reads use a sparse index, which makes it fast to locate the data to consume.
- Writes are sequential: producer data is always appended to the end of the log file.
- Kafka uses the page cache plus zero-copy technology.
4. Kafka Consumer
1. Consumption Model
- Kafka supports two consumption models: pull (the one Kafka adopts) and push.
2. Consumption Workflow
- A consumer can consume multiple partitions, but within a group a partition can be consumed by only one consumer. When a consumer recovers from a crash, it resumes from the last committed offset stored in Kafka's internal offsets topic (__consumer_offsets). A basic poll loop is sketched below.
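A minimal sketch of a plain (non-Spring) poll loop illustrating the pull model, assuming the broker address, topic, and group ID used earlier in these notes.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "javagroup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            while (true) {
                // Pull model: the consumer fetches batches from the broker
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.offset() + ": " + record.value());
                }
            }
        }
    }
}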
3. Consumer Group Initialization
4. Consumption Flow
5. Partition Assignment Strategies
- Range
- RoundRobin
- Sticky
- Suppose there are 7 partitions and 3 consumers: Sticky assigns them 3/2/2, but unlike Range, which consumer receives each group can differ on every rebalance. For example, the first time consumer1 gets 3, consumer2 gets 2, consumer3 gets 2; the next time consumer1 gets 2, consumer2 gets 3, consumer3 gets 2. The strategy is selected as sketched below.
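A minimal sketch of choosing the assignment strategy with the plain Java client; the broker address and group ID are the ones used earlier in these notes.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.StickyAssignor;

public class AssignorConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "javagroup");
        // Other built-in assignors: RangeAssignor (the traditional default), RoundRobinAssignor, CooperativeStickyAssignor
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, StickyAssignor.class.getName());
        return props;
    }
}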
6. Offsets
1. Where Offsets Are Kept
2. Automatic Commit
3. Manual Commit (a sketch follows)
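A minimal sketch of manual committing, assuming auto-commit is turned off; commitSync() is used here, commitAsync() being the non-blocking alternative.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "javagroup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // disable auto-commit
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.value()));
                // Blocks until the offsets of this poll are committed
                consumer.commitSync();
            }
        }
    }
}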
4. Consuming from a Specified Offset (a sketch follows)
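A minimal sketch of consuming from a specified offset; the offset value 100 is an illustrative assumption. seek() can only be called once partitions are actually assigned, hence the wait loop.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.3:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "javagroup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            // Wait until partitions have actually been assigned
            Set<TopicPartition> assignment = consumer.assignment();
            while (assignment.isEmpty()) {
                consumer.poll(Duration.ofMillis(100));
                assignment = consumer.assignment();
            }
            // Jump to offset 100 on every assigned partition (illustrative value)
            for (TopicPartition tp : assignment) {
                consumer.seek(tp, 100);
            }
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
            }
        }
    }
}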
7. Consumer Transactions
- Repeated consumption and missed consumption
- Consumer transactions