Standalone Test: Flume with TailDirSource, FileChannel, HDFSSink, and KafkaSink

Version Selection

Component        Version
Scala            2.11.x
Hadoop           2.6.0-cdh5.7.0
Kafka (Apache)   2.11-0.10.2.2
Flume            1.6.0-cdh5.7.0
Zookeeper        3.4.5-cdh5.7.0

Technology Selection

source          channelA        channelB        sinkA       sinkB
tailDirSource   fileChannelA    fileChannelB    HDFSSink    KafkaSink

Standalone Configuration

zookeeper

  1. Edit ZooKeeper's configuration file, zoo.cfg
"""zoo.cfg"""
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
"""Change this"""
dataDir=/opt/module/zookeeper-3.4.5-cdh5.7.0/zkData
# the port at which the clients will connect
clientPort=2181
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
"""Change this"""
server.1=hadoop04:2888:3888
  2. Add the myid file
mkdir -p /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData
# Corresponds to server.1 in zoo.cfg
echo 1 > /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData/myid
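A quick check that the id and the server entry line up:

cat /opt/module/zookeeper-3.4.5-cdh5.7.0/zkData/myid
# 1
grep server.1 /opt/module/zookeeper-3.4.5-cdh5.7.0/conf/zoo.cfg
# server.1=hadoop04:2888:3888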

kafka

  • Edit server.properties in Kafka's config directory
"""server.properties"""
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
"""Settings in between are unchanged and omitted"""
...
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://hadoop04:9092
host.name=hadoop04
port=9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
advertised.listeners=PLAINTEXT://hadoop04:9092
advertised.host.name=hadoop04
advertised.port=9092
"""Settings in between are unchanged and omitted"""
...
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop04:2181
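In Kafka 0.10.x, listeners and advertised.listeners take precedence over the older host.name/port and advertised.host.name/advertised.port settings, which are kept here only for compatibility. All of them assume hadoop04 resolves on this machine; a quick check (the address shown is illustrative):

grep hadoop04 /etc/hosts
# 192.168.1.104 hadoop04    (illustrative entry)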

flume

  1. Create taildir-file-hdfs-kafka.conf under $FLUME_HOME/conf to hold the agent configuration
touch $FLUME_HOME/conf/taildir-file-hdfs-kafka.conf
  2. Name the sources, channels, and sinks
"""
taildir-file-hdfs-kafka.conf
Names for the sources, channels, and sinks
"""
a1.sources = tailDirSource
a1.channels = hdfsChannel kafkaChannel
a1.sinks = hdfsSink kafkaSink
  3. Configure the source
"""
taildir-file-hdfs-kafka.conf
Configure the source
(a taildir source)
"""
# The source type is TAILDIR
a1.sources.tailDirSource.type = TAILDIR
# This file records, as JSON, the absolute path and last read position of each tailed file
a1.sources.tailDirSource.positionFile = ../log/inodes/taildir_position.json
# Define the file groups
a1.sources.tailDirSource.filegroups = f1
# The file monitored by file group f1
# (relative paths resolve from $FLUME_HOME/conf, where the agent is launched)
a1.sources.tailDirSource.filegroups.f1 = ../log/tmp/taildir-file-hdfs-kafka.log
# An optional header key/value can mark events from a specific file group
a1.sources.tailDirSource.headers.f1.headerKey1 = value1
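Once the agent has tailed some data, the position file records per-file progress; a roughly illustrative entry (inode, pos, and the absolute path depend on your machine):

cat $FLUME_HOME/log/inodes/taildir_position.json
# [{"inode":401938,"pos":87,"file":"/opt/module/flume/log/tmp/taildir-file-hdfs-kafka.log"}]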
  4. Configure the channels
"""
taildir-file-hdfs-kafka.conf
Configure the channels
(two file channels; relative paths resolve from $FLUME_HOME/conf)
"""
# hdfs channel
# Buffers events on local disk as files
a1.channels.hdfsChannel.type = file
a1.channels.hdfsChannel.checkpointDir = ../log/checkpoint/hdfsChannel
a1.channels.hdfsChannel.dataDirs = ../log/data/hdfsChannel

# kafka channel
a1.channels.kafkaChannel.type = file
a1.channels.kafkaChannel.checkpointDir = ../log/checkpoint/kafkaChannel
a1.channels.kafkaChannel.dataDirs = ../log/data/kafkaChannel
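After the agent has run, each file channel leaves its state on disk; the layout typically looks like this (file names illustrative):

ls $FLUME_HOME/log/data/hdfsChannel
# log-1  log-1.meta  ...
ls $FLUME_HOME/log/checkpoint/hdfsChannel
# checkpoint  inflightputs  inflighttakes  ...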
  5. Configure the sinks
"""
taildir-file-hdfs-kafka.conf
Configure the sinks
(hdfsSink and kafkaSink; relative paths resolve from $FLUME_HOME/conf)
"""
# hdfs sink
# Sink type
a1.sinks.hdfsSink.type = hdfs
# HDFS destination path, partitioned by year-month-day
a1.sinks.hdfsSink.hdfs.path = /flume/events/%y-%m-%d
# File name prefix
a1.sinks.hdfsSink.hdfs.filePrefix = events
# Write events as plain text rather than serialized output
a1.sinks.hdfsSink.hdfs.writeFormat = Text
# Round down the timestamp used in the path escapes
a1.sinks.hdfsSink.hdfs.round = true
a1.sinks.hdfsSink.hdfs.roundValue = 10
a1.sinks.hdfsSink.hdfs.roundUnit = minute
# Use the local time rather than a timestamp from the event header
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true


# kafka sink
# Sink type
a1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
# Publish events to the Kafka topic kafkaTestSink1
a1.sinks.kafkaSink.topic = kafkaTestSink1
# Single-broker list
a1.sinks.kafkaSink.brokerList = hadoop04:9092
a1.sinks.kafkaSink.requiredAcks = 1
a1.sinks.kafkaSink.batchSize = 20
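With useLocalTimeStamp enabled, the %y-%m-%d escapes are filled from the local clock, so a run on 2018-06-01 (an illustrative date) writes under /flume/events/18-06-01; files still being written carry the default .tmp suffix:

hdfs dfs -ls /flume/events/18-06-01
# /flume/events/18-06-01/events.1527818127000.tmp    (illustrative file name)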
  6. Wire up the sources, channels, and sinks
"""
taildir-file-hdfs-kafka.conf
Wire up the sources, channels, and sinks
"""
# Fan out from the source into both channels
a1.sources.tailDirSource.channels = hdfsChannel kafkaChannel
# Each sink drains exactly one channel
a1.sinks.hdfsSink.channel = hdfsChannel
a1.sinks.kafkaSink.channel = kafkaChannel
  7. Create the directories the relative paths above point to
mkdir -p $FLUME_HOME/log/data
mkdir -p $FLUME_HOME/log/tmp
mkdir -p $FLUME_HOME/log/inodes
mkdir -p $FLUME_HOME/log/checkpoint

Starting the Services

hadoop

# Start the cluster
sh $HADOOP_HOME/sbin/start-all.sh
# Check that each process came up
jps

18880 NodeManager
18453 DataNode
18342 NameNode
32107 Jps
18636 SecondaryNameNode
18780 ResourceManager
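An optional sanity check that HDFS is answering before the sink writes to it:

hdfs dfs -ls /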

zookeeper

# Start ZooKeeper
sh $ZOOKEEPER_HOME/sbin/zkServer.sh start
# Check ZooKeeper's status
sh $ZOOKEEPER_HOME/sbin/zkServer.sh status

JMX enabled by default
Using config: /opt/module/zookeeper-3.4.5-cdh5.7.0/sbin/../conf/zoo.cfg
Mode: standalone
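ZooKeeper's four-letter commands give another quick liveness probe (assumes nc is installed):

echo ruok | nc hadoop04 2181
# imok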

kafka

  1. Start the Kafka server and push it into the background
nohup sh $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
  2. Create a new topic
sh $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper hadoop04:2181 --replication-factor 1 --partitions 1 --topic kafkaTestSink1

# Check that it was created
sh $KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper hadoop04:2181
  3. In a new window, start a console consumer to read from kafkaTestSink1
# No producer is sending data yet, so leave this running off to the side for now
sh $KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server hadoop04:9092 --from-beginning --topic kafkaTestSink1
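To exercise the topic independently of Flume, a console producer can be started in yet another window; the blank line and the "a" mentioned in the test results below were sent this way:

sh $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list hadoop04:9092 --topic kafkaTestSink1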

flume

"""
Run this from $FLUME_HOME/conf/, or it will fail:
the config file uses relative paths that resolve against the working directory.
I'll update the article once this is fixed properly.
"""
cd $FLUME_HOME/conf
$FLUME_HOME/bin/flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir-file-hdfs-kafka.conf \
-Dflume.root.logger=INFO,console

Testing

1. Create taildir-file-hdfs-kafka.log (the name must match the one configured in the source) and put it in $FLUME_HOME/log/tmp/

# vim
TERMS
AND
CONDITIONS
FOR USE,
REPRODUCTION
,
AND

DISTRIBUTION

2. Modify taildir-file-hdfs-kafka.log by appending the following

Release Notes - Flume - Version v1.6.0

** Sub-task
    * [FLUME-2250] - Add support for Kafka Source
    * [FLUME-2251] - Add support for Kafka Sink
    * [FLUME-2677] - Update versions in 1.6.0 branch
    * [FLUME-2686] - Update KEYS file for 1.6 release
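Appending from the shell works just as well; TAILDIR picks up new lines as they are written:

echo 'one more appended line' >> $FLUME_HOME/log/tmp/taildir-file-hdfs-kafka.log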

3. Output on the Kafka consumer side (the console left running earlier).
The blank first line and the "a" on the second line come from a test I ran with the console producer.
[Screenshot: console consumer output]
4. Output in HDFS
[Screenshot: HDFS directory listing]
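The same result can be checked from the command line (the date directory matches the day of the test; hdfs dfs expands the glob itself):

hdfs dfs -ls /flume/events/
hdfs dfs -cat '/flume/events/*/events.*'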
The test succeeded.
