Kafka 0.8: Introduction, Installation, and Common Commands

Article link: https://blog.csdn.net/Alian_W/article/details/109752483

Hi everyone, I'm W.

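Today I'll cover what Kafka is, how to install it, and its most common commands. I only started learning Kafka recently, while studying Spark Streaming, so my understanding is not deep and I can't offer anything very advanced, but I hope this post is useful. The order is: an introduction to Kafka, installing Kafka, and Kafka commands.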

1. Introduction to Kafka

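Kafka is a distributed messaging system developed at LinkedIn. It is coordinated through ZooKeeper, supports partitioning and multiple replicas, and provides high-throughput publish/subscribe messaging. It is written in Scala and Java. LinkedIn open-sourced it and contributed it to the Apache Software Foundation in 2011, and it later became a top-level Apache project.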

Kafka's defining strength is processing large data streams in real time across a wide range of scenarios. Its main features are:

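High throughput: Kafka can produce roughly 250,000 messages per second (50 MB) and process roughly 550,000 messages per second (110 MB).
Durable storage: data is persisted to disk, so it can also be consumed in batches.
Distributed and easy to scale: there can be multiple producers, brokers, and consumers, and the cluster can be expanded without downtime.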
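
As a quick sanity check on the throughput figures above, dividing bytes per second by messages per second gives the implied average message size. The sketch below assumes that "M" means one million bytes; that reading is my assumption, and the helper name avg_msg_bytes is made up for illustration.

```shell
# Implied average message size = bytes per second / messages per second.
# Assumption: "50M" and "110M" mean 50 and 110 million bytes per second.
avg_msg_bytes() {
    mb_per_sec="$1"
    msgs_per_sec="$2"
    echo $(( mb_per_sec * 1000000 / msgs_per_sec ))
}

produce_avg=$(avg_msg_bytes 50 250000)
process_avg=$(avg_msg_bytes 110 550000)
echo "produce: ${produce_avg} bytes/message"
echo "process: ${process_avg} bytes/message"
```

Both figures work out to the same average message size, so the two throughput numbers are at least consistent with each other.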

Kafka official website: https://kafka.apache.org/

2. Installing Kafka (CentOS 6.x)

2.1 Environment

ZooKeeper 3.4.6 cluster (node-02, node-03, node-04)
Scala version: 2.11
kafka_2.11-0.8.2.2: built for Scala 2.11, Kafka version 0.8.2.2

2.2 Installing Kafka

Below I walk through installing Kafka 0.8 on CentOS 6.10. Before installing Kafka, make sure a ZooKeeper cluster is already configured on your machines:

2.2.1 Upload the package

Use Xftp to upload the kafka_2.11-0.8.2.2.tgz package to the Linux machine. Here I put it in /root.

2.2.2 Extract the archive

Run the following command to extract it into the target directory:

tar -zxvf kafka_2.11-0.8.2.2.tgz -C /root/apps/
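
Note that tar -C fails if the target directory does not exist yet. A minimal sketch that creates the directory first; the helper name extract_to is hypothetical, not part of Kafka or tar:

```shell
# Hypothetical helper: make sure the target directory exists, then extract into it.
extract_to() {
    archive="$1"
    target="$2"
    mkdir -p "$target" && tar -zxf "$archive" -C "$target"
}
# Usage with the paths from this article:
#   extract_to /root/kafka_2.11-0.8.2.2.tgz /root/apps/
```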

2.2.3 Edit the configuration file

First, go into the Kafka directory and look at its layout. Running ls shows:

ls
bin  config  libs  LICENSE  NOTICE

The layout is simple: LICENSE and NOTICE can be ignored, libs holds the dependency JARs, config holds the configuration files, and bin holds the scripts.

Go into config and edit server.properties:

cd /root/apps/kafka_2.11-0.8.2.2/config
vi server.properties

You will see a configuration file like this (no need to read it closely; the key settings are explained below):

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

############################# Socket Server Settings #############################

# The port the socket server listens on
port=9092

# Hostname the broker will bind to. If not set, the server will bind to all interfaces
#host.name=localhost

# Hostname the broker will advertise to producers and consumers. If not set, it uses the
# value for "host.name" if configured.  Otherwise, it will use the value returned from
# java.net.InetAddress.getCanonicalHostName().
#advertised.host.name=<hostname routable by clients>

# The port to publish to ZooKeeper for clients to use. If this is not set,
# it will publish the same port that the broker binds to.
#advertised.port=<port accessible by clients>

# The number of threads handling network requests
num.network.threads=3

# The number of threads doing disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma seperated list of directories under which to store log files
log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

# By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181

Most of this file is comments (the lines starting with #), so I will only explain some of the parameters:

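# Must be set when configuring a cluster; every machine needs a different id.
broker.id=0
# The port Kafka listens on for client connections.
port=9092
# The hostname or IP address this broker binds to.
host.name=localhost
# Where Kafka persists message data, which it stores as log files. This is the data path, not the path of Kafka's own application logs.
log.dirs=/tmp/kafka-logs
# Default number of partitions per topic. Data is persisted as files and the partition count determines the number of files: more partitions allow more parallel consumption, but too many partitions also means many more files.
num.partitions=1
# Maximum time to keep data, in hours; data older than this is deleted.
log.retention.hours=168
# ZooKeeper connection addresses; separate multiple addresses with commas.
zookeeper.connect=localhost:2181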
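
Because most of server.properties is comments, it can help to print only the settings that are actually in effect. A minimal sketch using grep; the helper name active_props is hypothetical:

```shell
# Print only the active settings of a properties file,
# skipping blank lines and lines that start with '#'.
active_props() {
    grep -v '^[[:space:]]*#' "$1" | grep -v '^[[:space:]]*$'
}
# Usage with the path from this article:
#   active_props /root/apps/kafka_2.11-0.8.2.2/config/server.properties
```

Settings that are commented out by default, such as host.name, will not appear in the output until you uncomment them.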

With those explained, the parameters we actually need to set are:

broker.id=0
host.name=localhost
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181

I have explained what each parameter does and how to set it. Below is my configuration; adjust it to your own environment:

broker.id=0
host.name=192.168.120.21
log.dirs=/root/apps/kafka-logs
zookeeper.connect=192.168.120.21:2181,192.168.120.22:2181,192.168.120.23:2181
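
Instead of editing the file by hand in vi, the same changes can be scripted. The sketch below is one possible approach, not something Kafka ships: set_prop is a hypothetical helper that replaces an existing key (even one that is commented out, like host.name in the default file) or appends it if it is missing.

```shell
# Hypothetical helper: set KEY=VALUE in a properties file.
# Replaces an existing line for KEY (even if commented out with '#'),
# otherwise appends a new line. Values must not contain '|' or '&'.
set_prop() {
    file="$1"; key="$2"; value="$3"
    # Escape dots so "host.name" only matches a literal "host.name".
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^#\{0,1\}${pattern}=" "$file"; then
        sed "s|^#\{0,1\}${pattern}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
# Usage with values from this article:
#   cfg=/root/apps/kafka_2.11-0.8.2.2/config/server.properties
#   set_prop "$cfg" broker.id 0
#   set_prop "$cfg" host.name 192.168.120.21
```

The pattern is anchored at the start of the line, so setting host.name leaves the separate advertised.host.name line untouched.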

2.2.4 Cluster configuration

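Next, copy the installation directory, including the modified configuration file, to the other machines. If you need a tutorial on passwordless SSH, see: Network and Passwordless Login Configuration on Linux (CentOS 6.10).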

scp -r /root/apps/kafka_2.11-0.8.2.2/ node-02:/root/apps/

Repeat the command for each remaining machine, changing the hostname each time.
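
The repetition can be scripted with a loop. The sketch below only prints the scp commands (a dry run) so they can be reviewed first; the helper name print_copy_cmds is hypothetical, and node-03 and node-04 are assumed host names based on the environment list in section 2.1.

```shell
# Print one scp command per host given as an argument (a dry run).
# Pipe the output to sh to actually execute the commands.
print_copy_cmds() {
    for host in "$@"; do
        echo "scp -r /root/apps/kafka_2.11-0.8.2.2/ ${host}:/root/apps/"
    done
}
# Review the commands:  print_copy_cmds node-03 node-04
# Run them:             print_copy_cmds node-03 node-04 | sh
```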

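Then, on each machine, change the following parameters in its configuration file so that broker.id is unique and host.name is that machine's own address: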

broker.id=0
host.name=localhost
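
One way to keep broker ids unique is to derive them from the host name. The sketch below assumes host names end in a number (node-02, node-03, ...) and uses that number as the id; this convention and the helper name broker_id_for are my assumptions, and any scheme works as long as every broker gets a different integer.

```shell
# Assumption: host names end in "-<number>", and that number becomes the broker id.
broker_id_for() {
    suffix="${1##*-}"                             # "node-02" -> "02"
    id=$(printf '%s' "$suffix" | sed 's/^0*//')   # strip leading zeros: "02" -> "2"
    echo "${id:-0}"                               # an all-zero suffix becomes "0"
}

# Print the two per-machine settings for each (assumed) host.
for host in node-02 node-03 node-04; do
    echo "${host}: broker.id=$(broker_id_for "$host") host.name=${host}"
done
```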

2.3 Starting Kafka

Start the ZooKeeper cluster before starting Kafka. Then run the following command on every machine (from the bin directory, or using the script's full path):

sh kafka-server-start.sh -daemon /root/apps/kafka_2.11-0.8.2.2/config/server.properties

The Kafka process is now running; you can confirm it with jps.
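
In jps output the broker appears as a process whose main class is Kafka. A minimal sketch for checking this from a script; the helper name has_kafka is hypothetical, and it reads a jps listing from standard input:

```shell
# Succeeds if the jps listing on stdin contains a process whose main class is "Kafka".
has_kafka() {
    awk '$2 == "Kafka" { found = 1 } END { exit !found }'
}
# Usage on a broker machine:
#   jps | has_kafka && echo "Kafka is running"
```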

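3. Kafka commands and operations

The commands below use the path /bigdata/kafka_2.11-0.10.2.1 and hostnames that differ from the installation above; substitute your own installation path, hostnames, and topic names.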

Start Kafka

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-start.sh -daemon /bigdata/kafka_2.11-0.10.2.1/config/server.properties

Stop Kafka

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-stop.sh

Create a topic

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --create --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --replication-factor 3 --partitions 3 --topic my-topic
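
The ZooKeeper connection string is repeated in most of the commands in this section, so it can be built once and reused. A minimal sketch; the helper name zk_connect and the ZK_PORT variable are hypothetical, and the default port 2181 matches the commands shown here.

```shell
# Join host names into a ZooKeeper connection string, appending the port to each.
zk_connect() {
    port="${ZK_PORT:-2181}"
    result=""
    for host in "$@"; do
        result="${result:+${result},}${host}:${port}"
    done
    echo "$result"
}

ZK=$(zk_connect node-1.xiaoniu.com node-2.xiaoniu.com node-3.xiaoniu.com)
echo "$ZK"
```

The resulting variable can then be passed wherever a command takes --zookeeper, for example --zookeeper "$ZK".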

List all topics

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --list --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181

Describe a topic

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --describe --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic

Start a console producer

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-producer.sh --broker-list spark-02:9092,spark-03:9092,spark-04:9092 --topic djm2

Start a console consumer

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic --from-beginning

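Start a console consumer that connects directly to the brokers (the --bootstrap-server option belongs to the newer consumer in later Kafka releases; the 0.8.2.2 console consumer takes --zookeeper instead)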

sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092 --topic xiaoniu --from-beginning