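Kafka 0.8: Introduction, Installation, and Common Commands
Hello everyone, I'm W.
Today I'm covering an introduction to Kafka, how to install it, and its common commands. I only started learning Kafka recently, so my understanding is not deep and I can't offer anything more advanced yet; I hope this post is still useful. I picked up Kafka while studying Spark Streaming. The order below is: introduction to Kafka, installing Kafka, and Kafka commands.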
1. Introduction to Kafka
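Kafka is a distributed messaging system developed at LinkedIn. It is coordinated through ZooKeeper, supports partitioning and multiple replicas, and works as a high-throughput publish-subscribe messaging system. It is written in Scala and Java. LinkedIn open-sourced it in early 2011 and contributed it to the Apache Software Foundation, where it later became a top-level Apache project.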
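Kafka's main strength is that it can process large-scale data streams in real time to serve a wide range of scenarios. Its key features are:
- High throughput: commonly cited figures are about 250,000 messages per second (50 MB) on the producing side and about 550,000 messages per second (110 MB) on the consuming side.
- Persistent storage: messages are persisted to disk, so they can also be consumed in batches.
- Easy to scale as a distributed system: there can be multiple producers, brokers, and consumers, and the cluster can be expanded without downtime.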
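2. Installing Kafka (CentOS 6.x)
2.1 Environment
- zookeeper-3.4.6 cluster (node-02, node-03, node-04)
- Scala version: 2.11
- kafka_2.11-0.8.2.2: built for Scala 2.11, Kafka version 0.8.2.2
2.2 Installing Kafka
Below I walk through installing Kafka 0.8 on CentOS 6.10. Before installing Kafka, make sure a ZooKeeper cluster is already set up on your machines:
2.2.1 Upload the package
Upload the kafka_2.11-0.8.2.2.tgz package to the Linux machine with Xftp; here I put it under /root:
2.2.2 Extract the archive
Run the following command to extract it into the target directory: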
tar -zxvf kafka_2.11-0.8.2.2.tgz -C /root/apps/
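Note that tar's -C option expects the target directory to exist already. A minimal sketch, assuming the archive and target paths used above; the helper name extract_kafka is hypothetical and only for illustration:

```shell
#!/bin/sh
# extract_kafka ARCHIVE TARGET_DIR
# Hypothetical helper: creates TARGET_DIR if needed, then extracts ARCHIVE into it.
extract_kafka() {
    archive="$1"
    target="$2"
    mkdir -p "$target" || return 1   # tar -C fails if the directory is missing
    tar -zxf "$archive" -C "$target"
}

# Usage with the paths from this post:
# extract_kafka /root/kafka_2.11-0.8.2.2.tgz /root/apps/
```

The usage line is commented out so that sourcing the snippet has no side effects.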
2.2.3 Edit the configuration file
First, go into the Kafka directory and look at its structure. Running ls shows:
ls
bin config libs LICENSE NOTICE
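Kafka's directory layout is simple: LICENSE and NOTICE can be ignored, libs holds the dependency jars, config holds the configuration files, and bin holds the scripts.
Go into config and edit server.properties:
cd /root/apps/kafka_2.11-0.8.2.2/config
vi server.properties
You will see a configuration file like the following (no need to read it closely; the key settings are explained below):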
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults
############################# Server Basics #############################
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0
############################# Socket Server Settings #############################
# The port the socket server listens on
port=9092
# Hostname the broker will bind to. If not set, the server will bind to all interfaces
#host.name=localhost
# Hostname the broker will advertise to producers and consumers. If not set, it uses the
# value for "host.name" if configured. Otherwise, it will use the value returned from
# java.net.InetAddress.getCanonicalHostName().
#advertised.host.name=<hostname routable by clients>
# The port to publish to ZooKeeper for clients to use. If this is not set,
# it will publish the same port that the broker binds to.
#advertised.port=<port accessible by clients>
# The number of threads handling network requests
num.network.threads=3
# The number of threads doing disk I/O
num.io.threads=8
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400
# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600
############################# Log Basics #############################
# A comma seperated list of directories under which to store log files
log.dirs=/tmp/kafka-logs
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1
############################# Log Flush Policy #############################
# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
# 1. Durability: Unflushed data may be lost if you are not using replication.
# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000
# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000
############################# Log Retention Policy #############################
# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#log.retention.bytes=1073741824
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000
# By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false
############################# Zookeeper #############################
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181
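Most of this file is comments (the lines starting with #), so I will only explain some of the parameters:
- broker.id=0 # In a cluster, every broker must have a different broker.id.
- port=9092 # The port on which Kafka serves clients.
- host.name=localhost # The hostname or IP address this broker binds to.
- log.dirs=/tmp/kafka-logs # Kafka persists messages to disk as a log; this is the directory for that persisted message data, not for Kafka's own application logs.
- num.partitions=1 # The default number of partitions per topic. Data is persisted as files per partition, so more partitions allow more parallel consumption but also mean more files.
- log.retention.hours=168 # The maximum time (in hours) that data is retained; older data is deleted.
- zookeeper.connect=localhost:2181 # The ZooKeeper connection string; separate multiple addresses with commas.
With that covered, the parameters we actually need to configure are: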
- broker.id=0
- host.name=localhost
- log.dirs=/tmp/kafka-logs
- zookeeper.connect=localhost:2181
Now that each parameter's purpose is clear, here is my configuration; adjust it to your own environment:
- broker.id=0
- host.name=192.168.120.21
- log.dirs=/root/apps/kafka-logs
- zookeeper.connect=192.168.120.21:2181,192.168.120.22:2181,192.168.120.23:2181
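Instead of editing the file by hand in vi, the same settings can be applied with a small script. This is only a sketch: set_prop is a hypothetical helper (not part of Kafka), it assumes GNU sed (as shipped with CentOS), and the values in the usage lines are illustrative addresses in this post's 192.168.120.x range.

```shell
#!/bin/sh
# set_prop FILE KEY VALUE
# Hypothetical helper: rewrites an existing "KEY=..." line (or a commented
# "#KEY=..." line) in FILE to "KEY=VALUE"; appends the line if none exists.
# Assumes VALUE contains no "|" or "&" characters.
set_prop() {
    file="$1"; key="$2"; value="$3"
    if grep -q "^#\{0,1\}${key}=" "$file"; then
        sed -i "s|^#\{0,1\}${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (illustrative values):
# CONF=/root/apps/kafka_2.11-0.8.2.2/config/server.properties
# set_prop "$CONF" broker.id 0
# set_prop "$CONF" host.name 192.168.120.21
# set_prop "$CONF" log.dirs /root/apps/kafka-logs
# set_prop "$CONF" zookeeper.connect 192.168.120.21:2181,192.168.120.22:2181,192.168.120.23:2181
```

Because the pattern is anchored at the start of the line, setting host.name leaves the commented advertised.host.name line untouched.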
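2.2.4 Cluster configuration
Next, copy the installation directory, including the modified files, to the other machines. If you need a tutorial on passwordless SSH, see my post "Linux (CentOS 6.10) network configuration and passwordless login configuration".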
scp -r /root/apps/kafka_2.11-0.8.2.2/ node-02:/root/apps/
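Repeat the command above for each machine, changing the hostname each time.
Then, on each machine, change the following parameters in the configuration file, giving every broker its own unique broker.id and its own address as host.name: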
- broker.id=0
- host.name=localhost
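The copy-and-edit steps can be sketched as a dry run that prints the commands for each remaining machine. The assumptions here are mine, not the original setup: the first machine keeps broker.id=0, the other brokers get 1, 2, ... in argument order, host.name is already uncommented in the copied file, and each hostname resolves on the network (use IP addresses otherwise). print_broker_commands is a hypothetical helper name.

```shell
#!/bin/sh
# print_broker_commands HOST...
# Hypothetical helper: prints (does not run) one scp and one ssh command per
# host. broker.id starts at 1 because the first machine is assumed to keep 0.
KAFKA_HOME=/root/apps/kafka_2.11-0.8.2.2

print_broker_commands() {
    id=1
    for host in "$@"; do
        printf '%s\n' "scp -r $KAFKA_HOME/ $host:/root/apps/"
        printf '%s\n' "ssh $host \"sed -i -e 's|^broker.id=.*|broker.id=$id|' -e 's|^host.name=.*|host.name=$host|' $KAFKA_HOME/config/server.properties\""
        id=$((id + 1))
    done
}

# Usage: review the output first, then pipe it to sh to execute it.
# print_broker_commands node-02 node-03
# print_broker_commands node-02 node-03 | sh
```

Printing the commands before running them makes it easy to check that no two brokers end up with the same broker.id.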
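2.3 Starting Kafka
Start the ZooKeeper cluster before starting Kafka, then run the following command on every machine (from the bin directory, or using the full path to the script):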
sh kafka-server-start.sh -daemon /root/apps/kafka_2.11-0.8.2.2/config/server.properties
The Kafka process is now started; you can confirm it with jps.
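In jps output the broker appears as a process named Kafka (and a ZooKeeper server as QuorumPeerMain). A minimal sketch of an automated check; kafka_in_jps is a hypothetical helper name:

```shell
#!/bin/sh
# kafka_in_jps
# Hypothetical helper: reads `jps` output on stdin and succeeds (exit 0) only
# if some line has "Kafka" as the process name in the second column.
kafka_in_jps() {
    awk '$2 == "Kafka" { found = 1 } END { exit !found }'
}

# Usage on a broker machine:
# if jps | kafka_in_jps; then echo "broker is running"; else echo "broker is NOT running"; fi
```

Matching the second column exactly avoids false positives from other Java processes whose names merely contain the word Kafka.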
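3. Kafka Commands and Operations
Start Kafka (the commands in this section use the paths, hostnames, and topic names of a kafka_2.11-0.10.2.1 installation under /bigdata; adjust them to your own environment):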
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-start.sh -daemon /bigdata/kafka_2.11-0.10.2.1/config/server.properties
Stop Kafka
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-stop.sh
Create a topic
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --create --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --replication-factor 3 --partitions 3 --topic my-topic
List all topics
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --list --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181
Describe a topic
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --describe --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic
Start a console producer
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-producer.sh --broker-list spark-02:9092,spark-03:9092,spark-04:9092 --topic djm2
Start a console consumer
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic --from-beginning
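Start a console consumer that connects directly to the brokers (the --bootstrap-server option belongs to newer releases such as the 0.10.2.1 in these paths; the 0.8.2.2 console consumer connects through --zookeeper)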
sh /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092 --topic xiaoniu --from-beginning
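Most of the commands above repeat the same long comma-separated host:port list. A small sketch that builds the list once and reuses it; join_hosts is a hypothetical helper, and the hostnames in the usage lines are the ZooKeeper nodes from section 2.1:

```shell
#!/bin/sh
# join_hosts PORT HOST...
# Hypothetical helper: prints "HOST1:PORT,HOST2:PORT,..." for use with
# --zookeeper (port 2181) or --broker-list (port 9092).
join_hosts() {
    port="$1"; shift
    out=""
    for h in "$@"; do
        out="${out:+$out,}$h:$port"   # add a comma only between entries
    done
    printf '%s\n' "$out"
}

# Usage:
# ZK=$(join_hosts 2181 node-02 node-03 node-04)
# sh kafka-topics.sh --list --zookeeper "$ZK"
```

With those arguments the helper prints node-02:2181,node-03:2181,node-04:2181.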
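Summary
Kafka is fairly simple to use; if you follow the steps above one by one, you should not run into problems. I will keep studying Kafka in more depth and bring you more advanced material later. I wish you all success in your studies and careers!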