Version Selection
Component | Version |
---|---|
Scala | 2.11.x |
Hadoop | 2.6.0-cdh5.7.0 |
Kafka | kafka_2.11-0.10.2.2 (Apache) |
Flume | 1.6.0-cdh5.7.0 |
Zookeeper | 3.4.5-cdh5.7.0 |
Technology Choices
Single-Node Setup (Standalone)
zookeeper
- Edit ZooKeeper's configuration file, zoo.cfg
"""zoo.cfg"""
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
"""改这里"""
dataDir=/opt/module/zookeeper-3.4.5-cdh5.7.0/zkData
# the port at which the clients will connect
clientPort=2181
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
"""改这里"""
server.1=hadoop04:2888:3888
- Add the myid file
mkdir -p /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData
# matches server.1 in zoo.cfg
echo 1 > /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData/myid
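The pairing between the two files matters: the integer in myid must equal the N of a server.N entry in zoo.cfg. A minimal shell sketch of that check, using a throwaway directory rather than the real install:

```shell
# Recreate the relationship in a temp directory: the integer stored in
# myid must match the N of a server.N line in zoo.cfg.
ZKDATA=$(mktemp -d)
echo "server.1=hadoop04:2888:3888" > "$ZKDATA/zoo.cfg"
echo 1 > "$ZKDATA/myid"

MYID=$(cat "$ZKDATA/myid")
if grep -q "^server\.${MYID}=" "$ZKDATA/zoo.cfg"; then
    echo "myid ${MYID} matches server.${MYID}"
fi
```

If the two disagree, the server fails to identify itself in a quorum; in pure standalone mode the server.1 line is optional, but keeping it consistent costs nothing.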
kafka
- Edit server.properties in Kafka's config directory
"""server.properties"""
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
"""中间的不用修改,省略掉了"""
...
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://hadoop04:9092
host.name=hadoop04
port=9092
# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured. Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
advertised.listeners=PLAINTEXT://hadoop04:9092
advertised.host.name=hadoop04
advertised.port=9092
"""中间的不用修改,省略掉了"""
...
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop04:2181
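A side note on the properties above: host.name/port and advertised.host.name/advertised.port are legacy settings from pre-0.10 releases, and on 0.10.x they are only consulted when the listeners entries are absent. A minimal modern equivalent (a sketch; the extra legacy lines are harmless but redundant) would be just:

```properties
# listeners supersedes host.name/port;
# advertised.listeners supersedes advertised.host.name/advertised.port
listeners=PLAINTEXT://hadoop04:9092
advertised.listeners=PLAINTEXT://hadoop04:9092
```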
flume
- Create taildir-file-hdfs-kafka.conf under $FLUME_HOME/conf and configure it
touch $FLUME_HOME/conf/taildir-file-hdfs-kafka.conf
- Name the source, channels, and sinks
"""
taildir-file-hdfs-kafka.conf
name the source, channels, and sinks
"""
a1.sources = tailDirSource
a1.channels = hdfsChannel kafkaChannel
a1.sinks = hdfsSink kafkaSink
- Configure the source
"""
taildir-file-hdfs-kafka.conf
configure the source
taildir source
"""
# the source type is TAILDIR
a1.sources.tailDirSource.type = TAILDIR
# this file records, as JSON, the absolute path and last read position of each tailed file
a1.sources.tailDirSource.positionFile = ../log/inodes/taildir_position.json
# define the file group
a1.sources.tailDirSource.filegroups = f1
# the file monitored by file group f1
# relative paths resolve from $FLUME_HOME/conf, the working directory
a1.sources.tailDirSource.filegroups.f1 = ../log/tmp/taildir-file-hdfs-kafka.log
# an optional header key/value pair marks which file group an event came from
a1.sources.tailDirSource.headers.f1.headerKey1 = value1
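For reference, the positionFile that TAILDIR maintains is a small JSON array, one entry per tailed file, roughly of this shape (the inode, pos, and path values below are made-up illustrations):

```json
[{"inode": 523339, "pos": 128, "file": "/absolute/path/to/taildir-file-hdfs-kafka.log"}]
```

Deleting this file makes the source re-read monitored files from the beginning, which is handy when re-running the test below.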
- Configure the channels
"""
taildir-file-hdfs-kafka.conf
configure the channels
two file channels
relative paths resolve from $FLUME_HOME/conf
"""
# hdfs channel
# buffers events in files on the local filesystem
a1.channels.hdfsChannel.type = file
a1.channels.hdfsChannel.checkpointDir = ../log/checkpoint/hdfsChannel
a1.channels.hdfsChannel.dataDirs = ../log/data/hdfsChannel
# kafka channel
a1.channels.kafkaChannel.type = file
a1.channels.kafkaChannel.checkpointDir = ../log/checkpoint/kafkaChannel
a1.channels.kafkaChannel.dataDirs = ../log/data/kafkaChannel
- Configure the sinks
"""
taildir-file-hdfs-kafka.conf
configure the sinks
hdfsSink and kafkaSink
relative paths resolve from $FLUME_HOME/conf
"""
# hdfs sink
# sink type
a1.sinks.hdfsSink.type = hdfs
# HDFS path, partitioned by year-month-day
a1.sinks.hdfsSink.hdfs.path = /flume/events/%y-%m-%d
# file name prefix
a1.sinks.hdfsSink.hdfs.filePrefix = events
# record format; note the default hdfs.fileType is SequenceFile, so
# hdfs.fileType = DataStream is also needed if raw text files are wanted
a1.sinks.hdfsSink.hdfs.writeFormat = Text
# round the timestamp down before substituting it into the path
a1.sinks.hdfsSink.hdfs.round = true
a1.sinks.hdfsSink.hdfs.roundValue = 10
a1.sinks.hdfsSink.hdfs.roundUnit = minute
# take the timestamp from the local clock instead of an event header
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# kafka sink
# sink type
a1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
# publish events to the kafkaTestSink1 topic
a1.sinks.kafkaSink.topic = kafkaTestSink1
# single-broker list
a1.sinks.kafkaSink.brokerList = hadoop04:9092
a1.sinks.kafkaSink.requiredAcks = 1
a1.sinks.kafkaSink.batchSize = 20
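The effect of the hdfs.round settings above is to floor the event timestamp to a 10-minute bucket before it is substituted into the escape sequences of hdfs.path. A quick shell illustration of the arithmetic (GNU date, pinned to UTC so the result is deterministic; the 2024 timestamp is just an example):

```shell
# An event stamped 10:52 falls into the 10:50 bucket when rounding
# down to 10-minute (600-second) intervals.
EPOCH=$(date -u -d "2024-01-15 10:52:00" +%s)
BUCKET=$(( EPOCH - EPOCH % 600 ))
date -u -d "@$BUCKET" "+%Y-%m-%d %H:%M"   # -> 2024-01-15 10:50
```

Since the path here only uses %y-%m-%d, the 10-minute rounding has no visible effect on directory names; it would matter if %H or %M were added to hdfs.path.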
- Wire up the source, channels, and sinks
"""
taildir-file-hdfs-kafka.conf
wire up the source, channels, and sinks
"""
# attach the source to both channels
a1.sources.tailDirSource.channels = hdfsChannel kafkaChannel
# each sink reads from exactly one channel
a1.sinks.hdfsSink.channel = hdfsChannel
a1.sinks.kafkaSink.channel = kafkaChannel
- Create the matching directories
mkdir -p $FLUME_HOME/log/data
mkdir -p $FLUME_HOME/log/tmp
mkdir -p $FLUME_HOME/log/inodes
mkdir -p $FLUME_HOME/log/checkpoint
Startup
hadoop
# start the cluster
sh $HADOOP_HOME/sbin/start-all.sh
# check that every daemon came up
jps
18880 NodeManager
18453 DataNode
18342 NameNode
32107 Jps
18636 SecondaryNameNode
18780 ResourceManager
zookeeper
# start zk
sh $ZOOKEEPER_HOME/sbin/zkServer.sh start
# check zk status
sh $ZOOKEEPER_HOME/sbin/zkServer.sh status
JMX enabled by default
Using config: /opt/module/zookeeper-3.4.5-cdh5.7.0/sbin/../conf/zoo.cfg
Mode: standalone
kafka
- Start the Kafka server in the background
nohup sh $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
- Create a new topic
sh $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper hadoop04:2181 --replication-factor 1 --partitions 1 --topic kafkaTestSink1
# confirm it was created
sh $KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper hadoop04:2181
- In a new terminal, start a console consumer for kafkaTestSink1
# no producer is sending data yet, so leave it running off to the side
sh $KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server hadoop04:9092 --from-beginning --topic kafkaTestSink1
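The pipeline can also be exercised without Flume by pushing events in by hand with the console producer that ships with Kafka (this is the tool the test section below refers to); lines typed at its prompt should appear in the consumer window:

```shell
# --broker-list is the 0.10.x flag for the producer's bootstrap brokers
sh $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list hadoop04:9092 --topic kafkaTestSink1
```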
flume
"""
在$FLUME_HOME/conf/下执行,否则会有错
这是由于配置文件中配置的相对路径导致的
之后解决了再改一下文章
"""
$FLUME_HOME/bin/flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir-file-hdfs-kafka.conf \
-Dflume.root.logger=INFO,console
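One way to lift the "must run from $FLUME_HOME/conf" restriction noted above is to replace the relative paths in the agent config with absolute ones (Flume 1.6 does not expand environment variables in property values). A sketch, assuming Flume is installed at /opt/module/flume; adjust to the real install path:

```properties
a1.sources.tailDirSource.positionFile = /opt/module/flume/log/inodes/taildir_position.json
a1.sources.tailDirSource.filegroups.f1 = /opt/module/flume/log/tmp/taildir-file-hdfs-kafka.log
a1.channels.hdfsChannel.checkpointDir = /opt/module/flume/log/checkpoint/hdfsChannel
a1.channels.hdfsChannel.dataDirs = /opt/module/flume/log/data/hdfsChannel
a1.channels.kafkaChannel.checkpointDir = /opt/module/flume/log/checkpoint/kafkaChannel
a1.channels.kafkaChannel.dataDirs = /opt/module/flume/log/data/kafkaChannel
```

With these six paths absolute, the agent starts correctly from any working directory.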
Testing
1. Create taildir-file-hdfs-kafka.log (the name must match the source configuration) and place it under $FLUME_HOME/log/tmp/
# vim
TERMS
AND
CONDITIONS
FOR USE,
REPRODUCTION
,
AND
DISTRIBUTION
2. Append more lines to taildir-file-hdfs-kafka.log
Release Notes - Flume - Version v1.6.0
** Sub-task
* [FLUME-2250] - Add support for Kafka Source
* [FLUME-2251] - Add support for Kafka Sink
* [FLUME-2677] - Update versions in 1.6.0 branch
* [FLUME-2686] - Update KEYS file for 1.6 release
3. Output on the Kafka consumer side (the console left running earlier)
The blank first line and the "a" on the second line came from my own test with a console producer.
4. Result in HDFS
Test passed.