Building a highly available Flume NG setup with load balancing to write data to Hadoop and Kafka

Our scenario has multiple agents pushing local log data to Hadoop. Because the agents and the Hadoop cluster sit on different network segments, heavy traffic can put significant pressure on the network, so we deployed two Flume collector machines on the Hadoop side. The agents send their data to the collectors, which split it into two streams and load it into Hadoop. The data flow is shown below:


[Data flow diagram: multiple agents → collector1/collector2 → HDFS and Kafka]
The diagram shows only three agents; in the actual deployment there are more, but there are still only two collectors.

We need to distribute the agents' data evenly across the two collector machines. The agent configuration is as follows:

# Name the components on this agent: declare the source, channel, and sink names
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source: listen for syslog traffic on local TCP port 5140
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Define the sink group: k1 and k2 form one group, processed in load-balancing mode
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin

# Define sink 1: both sinks forward events over Avro to the two collector machines
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 10.0.3.82
a1.sinks.k1.port = 5150

# Define sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = 10.0.3.83
a1.sinks.k2.port = 5150


# Use a channel that buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
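
To start the agent with this configuration, the standard flume-ng launcher can be used (the config file name agent.conf below is an assumption; substitute your own path):

flume-ng agent --conf conf --conf-file conf/agent.conf --name a1 -Dflume.root.logger=INFO,console
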
While collector1 and collector2 are both healthy, the agent distributes data across the two machines in round-robin fashion (per the selector above); when either collector fails, the agent sends its data to the remaining healthy machine.
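
If strict active/standby behavior is preferred over spreading load, the same sink group can instead use Flume's failover processor, which always sends to the highest-priority live sink. A minimal sketch (the priority and penalty values are assumptions):

# Define the sink group with a failover processor instead of load_balance
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# higher priority wins: k1 is active, k2 takes over when k1 fails
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# milliseconds a failed sink is penalized before being retried
a1.sinkgroups.g1.processor.maxpenalty = 10000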


Configuration for collector1:

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'collector1'

collector1.sources = r1
collector1.channels = c1 c2
collector1.sinks = k1 k2

# Describe the source
collector1.sources.r1.type = avro
collector1.sources.r1.port = 5150
collector1.sources.r1.bind = 0.0.0.0
collector1.sources.r1.channels = c1 c2


# Describe the channels: c1 is a durable file channel, c2 buffers events in memory
collector1.channels.c1.type = file
collector1.channels.c1.checkpointDir = /usr/local/apache-flume-1.6.0-bin/fileChannel/checkpoint
collector1.channels.c1.dataDir = /usr/local/apache-flume-1.6.0-bin/fileChannel/data

collector1.channels.c2.type = memory
collector1.channels.c2.capacity = 1000
collector1.channels.c2.transactionCapacity = 100

# Describe the sink k1 to hadoop
collector1.sinks.k1.type = hdfs
collector1.sinks.k1.channel = c1
collector1.sinks.k1.hdfs.path = /quantone/flume
collector1.sinks.k1.hdfs.fileType = DataStream
collector1.sinks.k1.hdfs.writeFormat = TEXT
collector1.sinks.k1.hdfs.rollInterval = 300
collector1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
collector1.sinks.k1.hdfs.round = true
collector1.sinks.k1.hdfs.roundValue = 5
collector1.sinks.k1.hdfs.roundUnit = minute
collector1.sinks.k1.hdfs.useLocalTimeStamp = true

# Describe the sink k2 to kafka
collector1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
collector1.sinks.k2.topic = mytopic
collector1.sinks.k2.channel = c2
collector1.sinks.k2.brokerList = 10.0.3.178:9092,10.0.3.179:9092
collector1.sinks.k2.requiredAcks = 1
collector1.sinks.k2.batchSize = 20
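
The round/roundValue/roundUnit settings control how the time escapes in hdfs.filePrefix (and hdfs.path) are rounded. A common variant is to partition the HDFS path itself by time rather than only prefixing file names; a sketch, where the directory layout is an assumption:

collector1.sinks.k1.hdfs.path = /quantone/flume/%Y-%m-%d/%H%M
collector1.sinks.k1.hdfs.round = true
collector1.sinks.k1.hdfs.roundValue = 5
collector1.sinks.k1.hdfs.roundUnit = minute

With this layout, %H%M is rounded down to the nearest 5 minutes, so events land in directories such as .../2016-01-01/1005.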


Configuration for collector2:

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'collector2'

collector2.sources = r1
collector2.channels = c1 c2
collector2.sinks = k1 k2

# Describe the source
collector2.sources.r1.type = avro
collector2.sources.r1.port = 5150
collector2.sources.r1.bind = 0.0.0.0
collector2.sources.r1.channels = c1 c2

# Describe the channels: c1 is a durable file channel, c2 buffers events in memory
collector2.channels.c1.type = file
collector2.channels.c1.checkpointDir = /usr/local/apache-flume-1.6.0-bin/fileChannel/checkpoint
collector2.channels.c1.dataDir = /usr/local/apache-flume-1.6.0-bin/fileChannel/data

collector2.channels.c2.type = memory
collector2.channels.c2.capacity = 1000
collector2.channels.c2.transactionCapacity = 100

# Describe the sink k1 to hadoop
collector2.sinks.k1.type = hdfs
collector2.sinks.k1.channel = c1
collector2.sinks.k1.hdfs.path = /quantone/flume
collector2.sinks.k1.hdfs.fileType = DataStream
collector2.sinks.k1.hdfs.writeFormat = TEXT
collector2.sinks.k1.hdfs.rollInterval = 300
collector2.sinks.k1.hdfs.filePrefix = %Y-%m-%d
collector2.sinks.k1.hdfs.round = true
collector2.sinks.k1.hdfs.roundValue = 5
collector2.sinks.k1.hdfs.roundUnit = minute
collector2.sinks.k1.hdfs.useLocalTimeStamp = true

# Describe the sink k2 to kafka
collector2.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
collector2.sinks.k2.topic = mytopic
collector2.sinks.k2.channel = c2
collector2.sinks.k2.brokerList = 10.0.3.178:9092,10.0.3.179:9092
collector2.sinks.k2.requiredAcks = 1
collector2.sinks.k2.batchSize = 20

The channel feeding the Hadoop sink is a file channel. When the corresponding sink fails to deliver, a file channel persists the events to its configured directories and resumes sending once the network recovers. Compared with a memory channel, this type suits transfers where the data volume is moderate but the reliability requirements are high.
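
The file channel's durability comes at the cost of disk I/O, and its defaults are fairly conservative, so it is often tuned alongside the sinks. A sketch of commonly adjusted parameters (the values are assumptions to size for your own load):

# maximum number of events the channel can hold (Flume 1.6 default: 1000000)
collector1.channels.c1.capacity = 1000000
# events per transaction between source/sink and channel (default: 10000)
collector1.channels.c1.transactionCapacity = 10000
# refuse writes when the data disk has less than this many bytes free
collector1.channels.c1.minimumRequiredSpace = 524288000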

Note: we use collector2.sinks.k1.hdfs.filePrefix = %Y-%m-%d to set the prefix of the file names written into Hadoop. If the event headers carry no timestamp field, this configuration makes delivery fail, so we add collector2.sinks.k1.hdfs.useLocalTimeStamp = true to resolve %Y-%m-%d with the collector's current time. Keep in mind that this time is not the real time at which the log was generated on the agent.
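
If you want %Y-%m-%d to reflect when the event passed through the agent rather than the collector's clock, an alternative is to stamp the timestamp header on the agent side with Flume's timestamp interceptor. A minimal sketch on agent a1:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

With the header present on every event, useLocalTimeStamp on the collectors is no longer needed. This is still the time the agent received the event, not necessarily the time the log line was originally written, but it is closer to the source.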

If you want different agents' data written to different Kafka topics, the collector1.sinks.k2.topic = mytopic setting on the collector's Kafka sink can be left out, and each agent's source can instead carry a static interceptor, for example:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = topic
a1.sources.r1.interceptors.i1.value = mytopic
This way each agent stamps its events with its own topic name, and each agent's data ends up in the corresponding topic.
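
This works because the Flume 1.6 KafkaSink lets a topic event header override the topic configured on the sink, so each agent controls its own destination. A second agent would simply set a different value (othertopic is a hypothetical name):

a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = topic
a2.sources.r1.interceptors.i1.value = othertopic

If an agent also uses the timestamp interceptor sketched earlier, list both interceptors on the source, e.g. a1.sources.r1.interceptors = i1 i2.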
