Version Selection
Component | Version |
---|---|
Scala | 2.11.x |
Hadoop | 2.6.0-cdh5.7.0 |
Kafka | kafka_2.11-0.10.2.2 (Apache) |
Flume | 1.6.0-cdh5.7.0 |
Zookeeper | 3.4.5-cdh5.7.0 |
Technology Choices
Single-Node Setup (Standalone)
zookeeper
- Edit ZooKeeper's configuration file, zoo.cfg
"""zoo.cfg"""
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
"""改这里"""
dataDir=/opt/module/zookeeper-3.4.5-cdh5.7.0/zkData
# the port at which the clients will connect
clientPort=2181
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
"""改这里"""
server.1=hadoop04:2888:3888
- Add the myid file
mkdir -p /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData
# matches server.1 in zoo.cfg
echo 1 > /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData/myid
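The pairing between the two files matters: the integer in myid must equal the N of a server.N entry in zoo.cfg. A minimal shell sketch of that check, using a throwaway directory rather than the real install:

```shell
# Recreate the relationship in a temp directory: the integer stored in
# myid must match the N of a server.N line in zoo.cfg.
ZKDATA=$(mktemp -d)
echo "server.1=hadoop04:2888:3888" > "$ZKDATA/zoo.cfg"
echo 1 > "$ZKDATA/myid"

MYID=$(cat "$ZKDATA/myid")
if grep -q "^server\.${MYID}=" "$ZKDATA/zoo.cfg"; then
    echo "myid ${MYID} matches server.${MYID}"
fi
```

If the two disagree, the server fails to identify itself in a quorum; in pure standalone mode the server.1 line is optional, but keeping it consistent costs nothing.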
kafka
- Edit server.properties in Kafka's config directory
"""server.properties"""
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
"""中间的不用修改,省略掉了"""
...
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://hadoop04:9092
host.name=hadoop04
port=9092
# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured. Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
advertised.listeners=PLAINTEXT://hadoop04:9092
advertised.host.name=hadoop04
advertised.port=9092
"""中间的不用修改,省略掉了"""
...
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop04:2181
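A side note on the properties above: host.name/port and advertised.host.name/advertised.port are legacy settings from pre-0.10 releases, and on 0.10.x they are only consulted when the listeners entries are absent. A minimal modern equivalent (a sketch; the extra legacy lines are harmless but redundant) would be just:

```properties
# listeners supersedes host.name/port;
# advertised.listeners supersedes advertised.host.name/advertised.port
listeners=PLAINTEXT://hadoop04:9092
advertised.listeners=PLAINTEXT://hadoop04:9092
```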
flume
- Create taildir-file-hdfs-kafka.conf under $FLUME_HOME/conf and configure it
touch $FLUME_HOME/conf/taildir-file-hdfs-kafka.conf
- Name the source, channels, and sinks
"""
taildir-file-hdfs-kafka.conf
name the source, channels, and sinks
"""
a1.sources = tailDirSource
a1.channels = hdfsChannel kafkaChannel
a1.sinks = hdfsSink kafkaSink
- Configure the source
"""
taildir-file-hdfs-kafka.conf
configure the source
taildir source
"""
# the source type is TAILDIR
a1.sources.tailDirSource.type = TAILDIR
# this file records, as JSON, the absolute path and last read position of each tailed file
a1.sources.tailDirSource.positionFile = ../log/inodes/taildir_position.json
# define the file group
a1.sources.tailDirSource.filegroups = f1
# the file monitored by file group f1
# relative paths resolve from $FLUME_HOME/conf, the working directory
a1.sources.tailDirSource.filegroups.f1 = ../log/tmp/taildir-file-hdfs-kafka.log
# an optional header key/value pair marks which file group an event came from
a1.sources.tailDirSource.headers.f1.headerKey1 = value1
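For reference, the positionFile that TAILDIR maintains is a small JSON array, one entry per tailed file, roughly of this shape (the inode, pos, and path values below are made-up illustrations):

```json
[{"inode": 523339, "pos": 128, "file": "/absolute/path/to/taildir-file-hdfs-kafka.log"}]
```

Deleting this file makes the source re-read monitored files from the beginning, which is handy when re-running the test below.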
- Configure the channels
"""
taildir-file-hdfs-kafka.conf
configure the channels
two file channels
relative paths resolve from $FLUME_HOME/conf
"""
# hdfs channel
# buffers events in files on the local filesystem
a1.channels.hdfsChannel.type = file
a1.channels.hdfsChannel.checkpointDir = ../log/checkpoint/hdfsChannel
a1.channels.hdfsChannel.dataDirs = ../log/data/hdfsChannel
# kafka channel
a1.channels.kafkaChannel.type = file
a1.channels.kafkaChannel.checkpointDir = ../log/checkpoint/kafkaChannel
a1.channels.kafkaChannel.dataDirs = ../log/data/kafkaChannel
- Configure the sinks
"""
taildir-file-hdfs-kafka.conf
configure the sinks
hdfsSink and kafkaSink
relative paths resolve from $FLUME_HOME/conf
"""
# hdfs sink
# sink type
a1.sinks.hdfsSink.type = hdfs
# HDFS path, partitioned by year-month-day
a1.sinks.hdfsSink.hdfs.path = /flume/events/%y-%m-%d
# file name prefix
a1.sinks.hdfsSink.hdfs.filePrefix = events
# record format; note the default hdfs.fileType is SequenceFile, so
# hdfs.fileType = DataStream is also needed if raw text files are wanted
a1.sinks.hdfsSink.hdfs.writeFormat = Text
# round the timestamp down before substituting it into the path
a1.sinks.hdfsSink.hdfs.round = true
a1.sinks.hdfsSink.hdfs.roundValue = 10
a1.sinks.hdfsSink.hdfs.roundUnit = minute
# take the timestamp from the local clock instead of an event header
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# kafka sink
# sink type
a1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
# publish events to the kafkaTestSink1 topic
a1.sinks.kafkaSink.topic = kafkaTestSink1
# single-broker list
a1.sinks.kafkaSink.brokerList = hadoop04:9092
a1.sinks.kafkaSink.requiredAcks = 1
a1.sinks.kafkaSink.batchSize = 20
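The effect of the hdfs.round settings above is to floor the event timestamp to a 10-minute bucket before it is substituted into the escape sequences of hdfs.path. A quick shell illustration of the arithmetic (GNU date, pinned to UTC so the result is deterministic; the 2024 timestamp is just an example):

```shell
# An event stamped 10:52 falls into the 10:50 bucket when rounding
# down to 10-minute (600-second) intervals.
EPOCH=$(date -u -d "2024-01-15 10:52:00" +%s)
BUCKET=$(( EPOCH - EPOCH % 600 ))
date -u -d "@$BUCKET" "+%Y-%m-%d %H:%M"   # -> 2024-01-15 10:50
```

Since the path here only uses %y-%m-%d, the 10-minute rounding has no visible effect on directory names; it would matter if %H or %M were added to hdfs.path.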
- Wire up the source, channels, and sinks
"""
taildir-file-hdfs-kafka.conf
wire up the source, channels, and sinks
"""
# attach the source to both channels
a1.sources.tailDirSource.channels = hdfsChannel kafkaChannel
# each sink reads from exactly one channel
a1.sinks.hdfsSink.channel = hdfsChannel
a1.sinks.kafkaSink.channel = kafkaChannel
- Create the matching directories
mkdir -p $FLUME_HOME/log/data
mkdir -p $FLUME_HOME/log/tmp
mkdir -p $FLUME_HOME/log/inodes
mkdir -p $FLUME_HOME/log/checkpoint
Startup
hadoop
# start the cluster
sh $HADOOP_HOME/sbin/start-all.sh
# check that every daemon came up
jps
18880 NodeManager
18453 DataNode
18342 NameNode
32107 Jps
18636 SecondaryNameNode
18780 ResourceManager
zookeeper
# start zk
sh $ZOOKEEPER_HOME/sbin/zkServer.sh start
# check zk status
sh $ZOOKEEPER_HOME/sbin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper-3.4.5-cdh5.7.0/sbin/../conf/zoo.cfg
Mode: standalone
kafka
- Start the Kafka server in the background
nohup sh $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
- Create a new topic
sh $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper hadoop04:2181 --replication-factor 1 --partitions 1 --topic kafkaTestSink1
# confirm it was created
sh $KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper hadoop04:2181
- In a new terminal, start a console consumer for kafkaTestSink1
# no producer is sending data yet, so leave it running off to the side
sh $KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server hadoop04:9092 --from-beginning --topic kafkaTestSink1
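The pipeline can also be exercised without Flume by pushing events in by hand with the console producer that ships with Kafka (this is the tool the test section below refers to); lines typed at its prompt should appear in the consumer window:

```shell
# --broker-list is the 0.10.x flag for the producer's bootstrap brokers
sh $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list hadoop04:9092 --topic kafkaTestSink1
```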
flume
"""
在$FLUME_HOME/conf/下执行,否则会有错
这是由于配置文件中配置的相对路径导致的
之后解决了再改一下文章
"""
$FLUME_HOME/bin/flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir-file-hdfs-kafka.conf \
-Dflume.root.logger=INFO,console
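One way to lift the "must run from $FLUME_HOME/conf" restriction noted above is to replace the relative paths in the agent config with absolute ones (Flume 1.6 does not expand environment variables in property values). A sketch, assuming Flume is installed at /opt/module/flume; adjust to the real install path:

```properties
a1.sources.tailDirSource.positionFile = /opt/module/flume/log/inodes/taildir_position.json
a1.sources.tailDirSource.filegroups.f1 = /opt/module/flume/log/tmp/taildir-file-hdfs-kafka.log
a1.channels.hdfsChannel.checkpointDir = /opt/module/flume/log/checkpoint/hdfsChannel
a1.channels.hdfsChannel.dataDirs = /opt/module/flume/log/data/hdfsChannel
a1.channels.kafkaChannel.checkpointDir = /opt/module/flume/log/checkpoint/kafkaChannel
a1.channels.kafkaChannel.dataDirs = /opt/module/flume/log/data/kafkaChannel
```

With these six paths absolute, the agent starts correctly from any working directory.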
Testing
1. Create taildir-file-hdfs-kafka.log (the name must match the source configuration) and place it under $FLUME_HOME/log/tmp/
# vim
TERMS
AND
CONDITIONS
FOR USE,
REPRODUCTION
,
AND
DISTRIBUTION
2. Append more lines to taildir-file-hdfs-kafka.log
Release Notes - Flume - Version v1.6.0
** Sub-task
* [FLUME-2250] - Add support for Kafka Source
* [FLUME-2251] - Add support for Kafka Sink
* [FLUME-2677] - Update versions in 1.6.0 branch
* [FLUME-2686] - Update KEYS file for 1.6 release
3. Output on the Kafka consumer side (the console left running earlier)
The blank first line and the "a" on the second line came from my own test with a console producer.
4. Result in HDFS
Test passed.