Kafka Overview and Cluster Configuration

huangbiao56

Category: Big Data Platform. Tags: Kafka cluster setup

Article link: https://blog.csdn.net/qq_39022311/article/details/98350488

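I. Kafka Overview

Message Queue

Message
   Data passed between two computers or two communication devices on a network, for example text, music, or video.
Queue
   A special kind of linear list (its elements are linked one after another) in which elements may only be removed at the head and appended at the tail. The two operations are enqueue and dequeue.
Message queue (MQ)
   Message + queue: a queue that holds messages. It is the container for messages while they are in transit, and it mainly exposes produce and consume interfaces that external callers use to store and fetch data.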
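
The queue behaviour described above can be sketched in a few lines of Python. This is only an in-memory illustration; the class name MessageQueue and its produce/consume methods are made up for this example and are not part of any Kafka API.

```python
from collections import deque


class MessageQueue:
    """A minimal in-memory FIFO message queue (illustrative only)."""

    def __init__(self):
        self._items = deque()

    def produce(self, message):
        # Enqueue: new messages are appended at the tail.
        self._items.append(message)

    def consume(self):
        # Dequeue: messages are removed from the head, oldest first.
        # Returns None when the queue is empty.
        if not self._items:
            return None
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


mq = MessageQueue()
for text in ("login", "click", "share"):
    mq.produce(text)

first = mq.consume()   # oldest message
second = mq.consume()  # next oldest message
remaining = len(mq)    # messages still waiting in the queue
```

Messages come out in the order they went in, which is the first-in, first-out property the definition above relies on.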

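MQ Categories

Message queues fall into two main models: point-to-point (P2P) and publish/subscribe (Pub/Sub).
In common:
   A producer sends messages to a queue, and a consumer then reads and consumes them from that queue.
Differences:
   The P2P model consists of a message queue (Queue), a sender (Sender), and a receiver (Receiver).
   Each message has exactly one consumer: once it has been consumed it is no longer in the queue. A phone call is an example.

   The Pub/Sub model consists of a message queue (Queue), a topic (Topic), publishers (Publisher), and subscribers (Subscriber).
   Each message can have multiple consumers that do not affect one another. For example, when I publish a Weibo post, everyone who follows me can see it.
   In the big data field, where data volumes keep growing, there is a distributed, durable, and stable product that can handle producing and consuming messages at the million-message scale: Kafka.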
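
To make the difference between the two models concrete, here is a small in-memory sketch. The class names PointToPointQueue and PubSubTopic are hypothetical and exist only for this illustration; real brokers implement these models very differently.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one receiver, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None once every message has been taken by some receiver.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Each message is stored once; every subscriber reads it independently."""

    def __init__(self):
        self._messages = []
        self._positions = {}  # subscriber name -> index of next unread message

    def subscribe(self, name):
        self._positions[name] = 0

    def publish(self, message):
        self._messages.append(message)

    def poll(self, name):
        # Return everything this subscriber has not seen yet.
        start = self._positions[name]
        self._positions[name] = len(self._messages)
        return self._messages[start:]


p2p = PointToPointQueue()
p2p.send("call")
p2p_first = p2p.receive()    # the first receiver gets the message
p2p_second = p2p.receive()   # nothing is left for a second receiver

topic = PubSubTopic()
topic.subscribe("alice")
topic.subscribe("bob")
topic.publish("new post")
alice_sees = topic.poll("alice")
bob_sees = topic.poll("bob")
```

In the P2P case the second receiver gets nothing, while in the Pub/Sub case both subscribers see the same post because each keeps its own read position.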

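Kafka Introduction

   Kafka is a distributed publish/subscribe messaging system. It was originally developed at LinkedIn, is written in Scala and Java, was open-sourced in early 2011, and later graduated to a top-level Apache project.
   Kafka is a high-throughput, durable, distributed publish/subscribe messaging system.
   It is mainly used to process activity data, that is, data generated by user behaviour such as logins, page views, clicks, shares, and likes.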

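Three key characteristics:
High throughput
   A cluster can sustain producing and consuming on the order of millions of messages per second (high QPS on both the producer and the consumer side).
Durability
   A well-designed storage mechanism persists messages efficiently and safely; Kafka acts as the intermediate store.
Distributed
   Scaling and fault tolerance are built on a distributed design. Kafka replicates its data to several servers, so when one of them fails, producers and consumers switch to the others. This makes the system as a whole robust.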

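Kafka Components

What parts does an MQ need? Production, consumption, message categories, storage, and so on.
  A Kafka service is like a large reservoir: messages of every category are continuously produced, stored, and consumed. So what is Kafka made of?
> Kafka service:
  > Topic: a category of messages handled by Kafka.
  > Broker: a message broker. Each Kafka server node in a cluster is called a broker. It mainly stores message data, on disk.
  > Partition: the physical grouping of a topic. Every topic is partitioned: it is split into one or more partitions across the brokers, and the partition count is specified when the topic is created.
  > Message: the basic unit of communication. Every message belongs to one partition.
> Related to the Kafka service:
  > Producer: produces messages and data, publishing them to a Kafka topic.
  > Consumer: consumes messages and data, subscribing to a topic and processing the messages published to it.
  > Zookeeper: coordinates the Kafka cluster so that it runs correctly.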

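Broker

Broker: configuration file server.properties
   1. To reduce the number of disk writes, the broker does not force every message to disk immediately. Messages are buffered (in the OS page cache) and flushed to disk once the number of messages reaches a threshold or a time interval has elapsed, which reduces the number of disk I/O calls.
     Configuration: Log Flush Policy
     #log.flush.interval.messages=10000   message-count threshold for one partition
     #log.flush.interval.ms=1000          maximum time in ms before a flush is forced
   2. Kafka deletes messages after they have been retained for a period of time (7 days, i.e. 168 hours, in the default configuration).
     Configuration: Log Retention Policy
     log.retention.hours=168
     #log.retention.bytes=1073741824
     log.retention.check.interval.ms=300000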
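
The "flush after N messages or after T milliseconds, whichever comes first" rule can be sketched as follows. This is a simplified model of the idea behind log.flush.interval.messages and log.flush.interval.ms, not Kafka's actual implementation; the FlushPolicy class and the timestamps are invented for the example.

```python
class FlushPolicy:
    """Decides when buffered messages should be forced to disk.

    Models the idea behind log.flush.interval.messages and
    log.flush.interval.ms. It is a sketch, not Kafka's real code.
    """

    def __init__(self, interval_messages, interval_ms):
        self.interval_messages = interval_messages
        self.interval_ms = interval_ms
        self.unflushed = 0       # messages appended since the last flush
        self.last_flush_ms = 0   # time of the last flush

    def on_append(self, now_ms):
        """Record one appended message; return True if a flush is triggered."""
        self.unflushed += 1
        too_many = self.unflushed >= self.interval_messages
        too_old = now_ms - self.last_flush_ms >= self.interval_ms
        if too_many or too_old:
            self.unflushed = 0
            self.last_flush_ms = now_ms
            return True
        return False


policy = FlushPolicy(interval_messages=3, interval_ms=1000)
# Hypothetical arrival times of five messages, in milliseconds.
arrivals = [10, 20, 30, 40, 1500]
flushes = [policy.on_append(t) for t in arrivals]
```

With these made-up numbers the third message triggers a flush because the count threshold is reached, and the fifth triggers one because more than 1000 ms have passed since the previous flush.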

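Producer

Producer: configuration file producer.properties
    1. Custom partitioning
    The producer can use a user-supplied algorithm that computes, from the message key, which partition a message is written to: partitioner.class
    2. Synchronous or asynchronous sending
    Configuration item: producer.type
    Synchronous: after sending data, the sender waits for the receiver's response before sending the next item.
    Asynchronous: after sending data, the sender goes on to send the next item without waiting for the receiver's response.
    3. Batch sending can improve sending efficiency considerably.
    The producer's asynchronous mode allows batching: messages are first buffered in memory and then sent together in a single request.
    The relevant settings are queue.buffering.max.ms and queue.buffering.max.messages, whose defaults are 5000 and 10000 respectively.
    Note: producer.type and the queue.buffering.* settings belong to the legacy Scala producer. The Java producer used with Kafka 2.3.0 always sends asynchronously and controls batching with batch.size and linger.ms.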
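
Key-based partitioning boils down to hashing the key and taking the result modulo the partition count. The sketch below uses zlib.crc32 as a stand-in hash so that the result is stable across runs; Kafka's own default partitioner uses a murmur2 hash, so the partition numbers produced here will not match a real cluster. The function name choose_partition is hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    # zlib.crc32 is a stand-in hash chosen for this sketch because it is
    # deterministic across runs; it is not the hash Kafka itself uses.
    return zlib.crc32(key) % num_partitions


# Hypothetical keys; a topic with 6 partitions is assumed.
keys = [b"user-1", b"user-2", b"user-1", b"user-3"]
assignments = [choose_partition(k, 6) for k in keys]
```

The property that matters is that the same key always lands in the same partition, which is what preserves per-key ordering.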

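Consumer

Consumer: configuration file consumer.properties
1. Every consumer belongs to a consumer group, whose id can be set with group.id.
2. Consumption model:
   Within a group: the consumers in a group share one copy of the data. At any time a given partition of a topic is consumed by only one consumer in the group, while one consumer may consume several partitions.
     So for one topic, a group should not run more consumers at the same time than the topic has partitions; otherwise some consumers will receive no messages.
   Across groups: every consumer group consumes the same data, independently of the others.
3. When one consumer runs multiple threads, each thread counts as a consumer.
   For example, with 3 partitions, if one consumer starts 3 consuming threads, another consumer that joins the same group later may have nothing to consume, because there are already as many consuming threads as partitions.

(This is how Kafka implements both broadcast (deliver to all consumers) and unicast (deliver to one consumer) of a topic's messages.
A topic can have multiple consumer groups. For broadcast, give every consumer its own group.
For unicast, put all consumers in the same group. Consumer groups also let you group consumers freely without sending the same message to several topics.)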
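
The rule "one partition is read by at most one consumer of a group" can be illustrated with a simple round-robin assignment. This is a simplified sketch: Kafka ships several assignment strategies and its default differs from this one, and the function name assign_partitions and the consumer names are made up for the example.

```python
def assign_partitions(partitions, consumers):
    """Distribute partitions round-robin over the consumers of one group.

    Every partition gets exactly one owner; a consumer may own several
    partitions, or none if there are more consumers than partitions.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# A topic with 3 partitions, consumed by a group of 2 consumers.
two_consumers = assign_partitions([0, 1, 2], ["c1", "c2"])

# The same topic consumed by a group of 4: one consumer stays idle.
four_consumers = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
idle = [c for c, owned in four_consumers.items() if not owned]
```

With four consumers and three partitions, one consumer ends up with an empty list, which is why running more consumers than partitions in a single group brings no benefit.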

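Topic, partition, message

1. At the storage level each partition is an append-only log file. New messages are appended directly to the end of the log, and a message's position in the log is called its offset.
2. Each message has the following three attributes:
  1) offset (type: long): the sequence number of the message within its partition. The offset can be regarded as the id of the message within the partition.
  2) MessageSize (type: int32): the size of the message in bytes.
  3) data: the actual content of the message.
3. More partitions can accommodate more consumers, which increases the capacity for concurrent consumption.
4. In short: add topics to separate business domains, and add partitions to handle larger data volumes.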
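
An append-only log with offsets can be sketched like this. It keeps records in a Python list instead of a file, and the class name PartitionLog is hypothetical, but it shows how the offset, size, and data attributes listed above relate to each other.

```python
class PartitionLog:
    """In-memory model of one partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, data: bytes) -> int:
        # The offset of a new record is simply its position in the log.
        offset = len(self._records)
        self._records.append(data)
        return offset

    def read(self, offset: int, max_records: int = 10):
        # Return (offset, message_size, data) tuples starting at `offset`.
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, len(data), data) for i, data in enumerate(chunk)]


log = PartitionLog()
offsets = [log.append(payload) for payload in (b"a", b"bb", b"ccc")]
from_one = log.read(1)  # a consumer resuming at offset 1
```

Because records are never modified in place, a consumer only has to remember one number, its offset, to resume where it left off.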

The overview above is based on: https://blog.51cto.com/xpleaf/2090847

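II. Cluster Configuration

If this is your first time, start with a single node; the official quickstart is very detailed: http://kafka.apache.org/quickstart
Here is a diagram of a Kafka cluster:
[Figure: Kafka cluster diagram]
I prepared three virtual machines and built the Kafka cluster on them.
My three virtual machines: master (192.168.1.3), slave1 (192.168.1.4), slave2 (192.168.1.5)

1. Install ZooKeeper and set up a ZooKeeper cluster
I won't go into detail here. You can refer to my other blog posts, and there are plenty of guides online.
Once it is installed, add hosts entries for the three machines listed above so that they can be addressed by name.
Then configure the environment variables on all three VMs and run zkServer.sh start on each of them to start the service.
Next, run zkServer.sh status to check the state of each node:
If you see Mode: leader or Mode: follower, the ZooKeeper cluster is working.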

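2. Install Kafka
Download: https://www.apache.org/dyn/closer.cgi?path=/kafka/2.3.0/kafka_2.11-2.3.0.tgz
I downloaded kafka_2.11-2.3.0.tgz.
I recommend downloading the latest version from the Kafka website and then extracting it. I extracted it under /usr/local/ (the paths below assume the extracted directory is named kafka211_230).
1) Configure the environment variables: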

export KAFKA_HOME=/usr/local/kafka211_230
export PATH=$PATH:$KAFKA_HOME/bin

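2) Edit server.properties in the /usr/local/kafka211_230/config directory
Four parameters matter most here:
broker.id: identifies this broker
listeners: the address (IP or host name) and port this broker listens on
log.dirs: the directory where Kafka stores the messages it receives
zookeeper.connect: the address of the ZooKeeper cluster to connect to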

#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=3

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092
listeners=PLAINTEXT://master:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600



# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=master:2181,slave1:2181,slave2:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0

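Once one VM has been configured (in my case 192.168.1.3), copy the Kafka installation directory to the other two VMs:

scp -r kafka211_230 slave1:/usr/local    # slave1 = 192.168.1.4
scp -r kafka211_230 slave2:/usr/local    # slave2 = 192.168.1.5

After copying, configure the environment variables on those two machines,
then change the following in each server.properties:
broker.id identifies the broker. On my three VMs the values are 3, 4, and 5.
listeners is the address and port of the local machine. On my three VMs the values are:
listeners=PLAINTEXT://master:9092
listeners=PLAINTEXT://slave1:9092
listeners=PLAINTEXT://slave2:9092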
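
Since only broker.id and listeners differ between the three nodes, the per-node files can also be derived from one base file instead of being edited by hand. The following is a rough sketch under that assumption; the function render_broker_config is hypothetical, and the base text is a shortened stand-in for the real server.properties.

```python
import re


def render_broker_config(base_text, broker_id, host, port=9092):
    """Return a copy of base_text with broker.id and listeners replaced.

    Only uncommented lines are touched; commented examples such as
    "#listeners=..." are left as they are.
    """
    text = re.sub(r"(?m)^broker\.id=.*$", f"broker.id={broker_id}", base_text)
    text = re.sub(
        r"(?m)^listeners=.*$", f"listeners=PLAINTEXT://{host}:{port}", text
    )
    return text


# Shortened stand-in for the server.properties configured on master.
base = """broker.id=3
#listeners=PLAINTEXT://:9092
listeners=PLAINTEXT://master:9092
zookeeper.connect=master:2181,slave1:2181,slave2:2181
"""

# Host name -> broker.id, as used in this article.
brokers = {"master": 3, "slave1": 4, "slave2": 5}
configs = {
    host: render_broker_config(base, broker_id, host)
    for host, broker_id in brokers.items()
}
print(configs["slave1"])
```

Each rendered text could then be written to the config directory of the matching host. The zookeeper.connect line is identical on all nodes, so it is left untouched.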

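3. Start the Kafka cluster

Run the following command on each of the three nodes to start the cluster:

kafka-server-start.sh -daemon /usr/local/kafka211_230/config/server.properties
# -daemon runs the broker in the background, so the terminal stays free for further commands; a trailing & is not needed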
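Basic operations:

1) Create a topic

kafka-topics.sh --create --zookeeper master:2181,slave1:2181,slave2:2181 --replication-factor 3 --partitions 6 --topic kfk_test
# --replication-factor 3 keeps three replicas of each partition, --partitions 6 creates 6 partitions, and the topic is named kfk_test

2) List the existing topics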

kafka-topics.sh --list --zookeeper master:2181,slave1:2181,slave2:2181

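3) Produce data

kafka-console-producer.sh --broker-list master:9092,slave1:9092,slave2:9092 --topic kfk_test
# The brokers after --broker-list are the ones the producer connects to first in order to discover the cluster; in practice you do not have to list every node for a topic

4) Consume the produced data

Since Kafka 0.9, consuming through ZooKeeper is no longer recommended; use --bootstrap-server instead.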
kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --from-beginning --topic kfk_test

5) Describe a specific topic

kafka-topics.sh --describe --zookeeper master:2181,slave1:2181,slave2:2181 --topic kfk_test

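The output looks like this:
[Figure: output of kafka-topics.sh --describe for kfk_test]
You can see 6 partitions, each with 3 replicas.
Partition: the partition id
Leader: the id of the broker currently responsible for reads and writes of this partition, i.e. the broker.id from server.properties
Replicas: the list of all brokers that hold a replica of this partition
Isr: the subset of Replicas that is currently alive and in sync; brokers that are offline or down are not in this list

6) Delete a specific topic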

kafka-topics.sh --delete --zookeeper master:2181,slave1:2181,slave2:2181 --topic kfk_test
