Kafka Learning Notes

Preface

Newer releases require Java 8 or above, and my company is still on Java 7, so I downloaded version 0.10.0.0 and worked from the official documentation for that release.
Main reference: http://kafka.apache.org/0100/documentation.html

After skimming the opening sections of the documentation, you can follow the quick start in section 1.3 to get hands-on with Kafka. The rest of the docs are much easier to follow once you have played with it.

A few points worth noting

ReplicationFactor and PartitionCount

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic   PartitionCount:1    ReplicationFactor:3 Configs:
    Topic: my-replicated-topic  Partition: 0    Leader: 1   Replicas: 1,2,0 Isr: 1,2,0

Here is an explanation of output.
The first line gives a summary of all the partitions, each additional line gives information about one partition. Since we have only one partition for this topic there is only one line.
* “leader” is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
* “replicas” is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.
* “isr” is the set of “in-sync” replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.

The replication-factor must be less than or equal to the number of brokers, i.e. the number of Kafka instances started in the cluster.
The partition count determines how many consumers in the same group can consume a topic's messages concurrently.
For example:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 4 --partitions 3 --topic my-replicated-topic4
Error while executing topic command : replication factor: 4 larger than available brokers: 3
[2018-01-11 11:47:43,914] ERROR kafka.admin.AdminOperationException: replication factor: 4 larger than available brokers: 3
    at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
 (kafka.admin.TopicCommand$)

When the number of consumers (threads) in a group exceeds the number of partitions, a warning is logged:

> bin/kafka-console-consumer.sh --consumer.config config/consumer-1.properties --topic my-replicated-topic --zookeeper localhost:2181
[2018-01-11 11:56:50,152] WARN No broker partitions consumed by consumer thread test-consumer-group-1_zhangweiwendeMacBook-Pro.local-1515643009993-5d9f164e-0 for topic my-replicated-topic (kafka.consumer.RangeAssignor)

The config file is passed because consumer-1.properties defines the group id:

#consumer group id
group.id=test-consumer-group-1

Altering a topic

Changing the PartitionCount (it can only be increased, never reduced below its current value):

> bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-replicated-topic --partitions 4

The replication factor controls how many servers will replicate each message that is written. If you have a replication factor of 3 then up to 2 servers can fail before you will lose access to your data. We recommend you use a replication factor of 2 or 3 so that you can transparently bounce machines without interrupting data consumption.
The partition count controls how many logs the topic will be sharded into. There are several impacts of the partition count. First each partition must fit entirely on a single server. So if you have 20 partitions the full data set (and read and write load) will be handled by no more than 20 servers (not counting replicas). Finally the partition count impacts the maximum parallelism of your consumers. This is discussed in greater detail in the concepts section.


groupid

Viewing Kafka consumer information with the ZooKeeper shell:

> bin/zookeeper-shell.sh localhost:2181
Connecting to localhost:2181
Welcome to ZooKeeper!

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
ls /consumers
[console-consumer-31081, console-consumer-11444, console-consumer-38177, test-consumer-group-1]

All of these entries were created by my shell-based consumers. When I consumed with the Java client, however, nothing was recorded here. (Most likely this is because the new Java consumer in 0.10.x commits offsets to Kafka's internal __consumer_offsets topic instead of registering under /consumers in ZooKeeper the way the old consumer does.)

When consuming with the shell tool without specifying a group.id, each new terminal window running bin/kafka-console-consumer.sh creates a fresh console-consumer-* node under /consumers in ZooKeeper.
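
For comparison, here is a minimal sketch of a consumer using the new Java client API from 0.10.x (the broker address and topic name are placeholders; the group.id mirrors consumer-1.properties). This client commits offsets to __consumer_offsets rather than registering under /consumers:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "test-consumer-group-1");    // same group id as consumer-1.properties
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-replicated-topic"));
        try {
            while (true) {
                // poll() drives the group protocol: joining, rebalancing, heartbeats
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}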

On the topic name length limit

Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
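
The arithmetic: a folder name can be at most 255 characters, and the appended "-" plus an up-to-5-digit partition id (partitions are assumed to stay below 100,000) takes 6, leaving 249. A throwaway pre-flight check (the helper is hypothetical; only the 249 limit comes from the docs):

class TopicNames {
    // Hypothetical helper: rejects names that would overflow the 255-char folder name
    // once "-<partitionId>" is appended by the broker.
    static void validate(String topic) {
        final int MAX = 255 - 1 - 5;  // folder limit, minus dash, minus 5-digit partition id = 249
        if (topic.length() > MAX) {
            throw new IllegalArgumentException("topic name exceeds " + MAX + " chars: " + topic);
        }
    }
}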

A second look at the concepts

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

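A minimal producer sketch against the 0.10.x Java API (broker address and topic are placeholders): with a key, the default partitioner hashes it to pick the partition; without a key, partitions are chosen round-robin to balance load.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            // keyed record: the default partitioner hashes the key, so every
            // record with key "user-42" lands in the same partition
            producer.send(new ProducerRecord<>("my-replicated-topic", "user-42", "hello"));
            // unkeyed record: partitions are picked round-robin to balance load
            producer.send(new ProducerRecord<>("my-replicated-topic", "no key, balanced"));
        } finally {
            producer.close();
        }
    }
}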

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.


The consumer model

(Figure from the docs) A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each “logical subscriber”. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a “fair share” of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

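Why keys give per-key ordering: the producer maps a key deterministically to one partition, and records within a partition are totally ordered. A simplified sketch of that mapping (illustrative only; Kafka's DefaultPartitioner actually uses a murmur2 hash of the serialized key, not hashCode):

class KeyPartitioning {
    // Illustrative key -> partition mapping: the same key always yields the same partition
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;  // mask the sign bit
    }
}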

Storage layout in ZooKeeper

View the paths with a ZooKeeper client or the zkui tool:

> bin/zookeeper-shell.sh localhost:2181
Connecting to localhost:2181
Welcome to ZooKeeper!

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
ls /
[consumers, config, controller, isr_change_notification, admin, brokers, zookeeper, controller_epoch]

The tree on my local ZooKeeper:

/
 - consumers, 
    - console-consumer-31081, 
    - console-consumer-11444, 
    - console-consumer-38177, 
    - test-consumer-group-1
        - offsets, 
            - test, 
                - 1 
                    - 856
                - 0
                    - 895
            - my-replicated-topic
                - 0 
                    - 4028
                - 1 
                    - 3705
                - 2 
                    - 3623
        - owners, 
            - test, 
                - 1, 
                    - test-consumer-group-1_zhangweiwendeMacBook-Pro.local-1515643003538-f1fc6049-0
                - 0
                    - test-consumer-group-1_zhangweiwendeMacBook-Pro.local-1515642984413-377bdad2-0
            - my-replicated-topic
        - ids
            - test-consumer-group-1_zhangweiwendeMacBook-Pro.local-1515642984413-377bdad2
                - {"version":1,"subscription":{"test":1},"pattern":"white_list","timestamp":"1515713737228"} 
            - test-consumer-group-1_zhangweiwendeMacBook-Pro.local-1515643003538-f1fc6049

 - config,
    - topics,
        - my-replicated-topic3, 
            - {"version":1,"config":{}}
        - __consumer_offsets, 
            - {"version":1,"config":{"segment.bytes":"104857600","compression.type":"uncompressed","cleanup.policy":"compact"}}
        - test, 
            - {"version":1,"config":{}}
        - my-replicated-topic, 
            - {"version":1,"config":{}}
        - replication-test 
    - clients, 
    - changes
 - controller, 
 - isr_change_notification, 
 - admin, 
    - delete_topics 
        - my-replicated-topic
 - brokers, 
    - seqid, 
    - topics, 
        - my-replicated-topic3, 
            - partitions
                - 3, 
                    - state
                        - {"controller_epoch":285,"leader":2,"version":1,"leader_epoch":15,"isr":[2,0,1]}
                - 2, 
                    - state
                        - {"controller_epoch":285,"leader":1,"version":1,"leader_epoch":22,"isr":[1,2,0]}
                - 1, 
                    - state
                        - {"controller_epoch":285,"leader":0,"version":1,"leader_epoch":14,"isr":[0,2,1]}
                - 0
                    - state
                        - {"controller_epoch":285,"leader":2,"version":1,"leader_epoch":15,"isr":[2,0,1]}
        - __consumer_offsets, 
        - my-replicated-topic, 
        - test, 
        - replication-test
    - ids
        - 0
            - {"jmx_port":-1,"timestamp":"1515739523425","endpoints":["PLAINTEXT://172.24.108.121:9092"],"host":"172.24.108.121","version":3,"port":9092}
        - 1
            - {"jmx_port":-1,"timestamp":"1515737302356","endpoints":["PLAINTEXT://172.24.108.121:9093"],"host":"172.24.108.121","version":3,"port":9093}
        - 2
            - {"jmx_port":-1,"timestamp":"1515737302012","endpoints":["PLAINTEXT://172.24.108.121:9094"],"host":"172.24.108.121","version":3,"port":9094}
 - zookeeper, 
 - controller_epoch
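
The same tree can be read programmatically. A minimal sketch using the plain ZooKeeper Java client (connection string is a placeholder), listing the live brokers the way "ls /brokers/ids" does:

import org.apache.zookeeper.ZooKeeper;

public class BrokerLister {
    public static void main(String[] args) throws Exception {
        // 30s session timeout, no watcher needed for a one-shot read
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
        try {
            // each child of /brokers/ids is one live broker's id;
            // the node data is the broker's registration JSON
            for (String id : zk.getChildren("/brokers/ids", false)) {
                byte[] data = zk.getData("/brokers/ids/" + id, false, null);
                System.out.println("broker " + id + " -> " + new String(data, "UTF-8"));
            }
        } finally {
            zk.close();
        }
    }
}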