Flume supports the following ways of reading log stream data:
- Avro
- Thrift
- Syslog
- Netcat
The Avro source and Avro sink are essential to Flume's flexible topologies.
- Avro source: accepts events sent by an Avro client or by another agent's Avro sink. It is designed as a scalable RPC server that receives data into a Flume agent, either from the Avro sink of another Flume agent or from a client application sending data with the Flume SDK.
- Avro sink: takes events from the channel, wraps them as Avro events, and sends them to the configured hostname and port.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
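The Avro sink snippet above has a natural counterpart on the receiving side. A minimal Avro source sketch (agent name, channel name, and bind address are illustrative):

```properties
# Minimal Avro source: listen on a port and feed events into channel c1
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4545
a1.sources.r1.channels = c1
```

The sink's hostname/port must match the source's bind/port for the two agents to connect.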
Chaining multiple Flume agents
If the data flow needs to pass through multiple Flume agents, connect them with Avro.
Example:
Monitor a log for changes on machine master, write the events through an Avro sink to the Avro source on machine slaver01, and print them to the console with a logger sink.
Machine A, master (the client):
Configuration file:
agent11.sources = r1
agent11.sinks = k1
agent11.channels = c1
# Describe/configure the source
agent11.sources.r1.type = TAILDIR
agent11.sources.r1.positionFile = /opt/flume/tail_dir_connection.json
agent11.sources.r1.filegroups = f1
agent11.sources.r1.filegroups.f1 = /opt/flume/files/zhangxu.*
# Describe the sink
agent11.sinks.k1.type = avro
agent11.sinks.k1.hostname = slaver01
agent11.sinks.k1.port = 40444
# Use a channel which buffers events in memory
agent11.channels.c1.type = memory
agent11.channels.c1.capacity = 1000
agent11.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
agent11.sources.r1.channels = c1
agent11.sinks.k1.channel = c1
Start:
./bin/flume-ng agent -c conf/ -n agent11 -f ./job/conf/flume-avro-connection.conf -Dflume.root.logger=INFO,console
Machine B, slaver01 (the server):
agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1
# Describe/configure the source
agent1.sources.r1.type = avro
agent1.sources.r1.bind = slaver01
agent1.sources.r1.port = 40444
# Describe the sink
agent1.sinks.k1.type = logger
# Use a channel which buffers events in memory
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
Start:
/opt/flume/bin/flume-ng agent -c /opt/flume/conf/ -n agent1 -f flume-avro-connection.conf -Dflume.root.logger=INFO,console
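Start machine B first so the Avro source is already listening when machine A's Avro sink connects. Once both agents are running, the link can also be smoke-tested with Flume's built-in Avro client (the file path here is illustrative):

```shell
# Send each line of a local file as an event to the Avro source on slaver01:40444
/opt/flume/bin/flume-ng avro-client -H slaver01 -p 40444 -F /tmp/test-events.txt
```

If the connection works, the events appear in machine B's logger output on the console.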
Aggregating multiple Flume agents
A typical scenario: many clients produce logs, and Flume aggregates them onto a single agent in front of the storage system for unified collection and processing. Each first-tier agent is configured with an Avro sink, and all of them point at the Avro source of the aggregating agent.
Example:
Three machines: master, slaver01, and slaver02. slaver01 and slaver02 collect logs with a Taildir source and an exec source respectively, then aggregate to master through Avro sinks; master prints the events with a logger sink:
- master:
Configuration file:
agent11.sources = r1
agent11.sinks = k1
agent11.channels = c1
# Describe/configure the source
agent11.sources.r1.type = avro
agent11.sources.r1.bind = master
agent11.sources.r1.port = 40444
# Describe the sink
agent11.sinks.k1.type = logger
# Use a channel which buffers events in memory
agent11.channels.c1.type = memory
agent11.channels.c1.capacity = 1000
agent11.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
agent11.sources.r1.channels = c1
agent11.sinks.k1.channel = c1
Start:
./bin/flume-ng agent -c conf/ -n agent11 -f job/conf/flume-avro-connection.conf -Dflume.root.logger=INFO,console
- slaver01
agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1
# Describe/configure the source
agent1.sources.r1.type = TAILDIR
agent1.sources.r1.positionFile = /opt/flume/reduce_tail_reduce_dir.json
agent1.sources.r1.filegroups = f1
agent1.sources.r1.filegroups.f1 = /opt/flume/files/xiaomao.*
# Describe the sink
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = master
agent1.sinks.k1.port = 40444
# Use a channel which buffers events in memory
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
Start:
./bin/flume-ng agent -c conf/ -n agent1 -f job/conf/flume-reduce-tail.conf -Dflume.root.logger=INFO,console
- slaver02
agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1
# Describe/configure the source
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -f /opt/flume/files/xiaomao.txt
agent1.sources.r1.shell = /bin/bash -c
# Describe the sink
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = master
agent1.sinks.k1.port = 40444
# Use a channel which buffers events in memory
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
Start:
./bin/flume-ng agent -c conf/ -n agent1 -f job/conf/flume-reduce-exec.conf -Dflume.root.logger=INFO,console
Multiplexing and load balancing (failover) require a look at Flume's internals first.
Replicating and multiplexing
One source can send events to multiple channels, and sinks bound to different channels can then process the events differently.
The channel selector is configured through the Flume Channel Selectors property selector.type.
If it is not configured, the default is replicating.
A minimal example:
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3
Events are sent to c1, c2, and c3 at the same time, but c3 is optional: if writing to c3 fails, the put transaction is not rolled back, whereas a failure on c1 or c2 triggers a rollback.
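The section title also mentions multiplexing. With selector.type = multiplexing, events are routed to channels based on the value of an event header instead of being replicated. A sketch, assuming events carry a "state" header (the header name and its values are illustrative):

```properties
# Route events by the value of the "state" header
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3
```

Events whose "state" header matches no mapping go to the default channel c3.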
Example:
Flume1 monitors a log through an exec source and writes the events to two channels. Each channel feeds an Avro sink that forwards to another Flume agent; one of those agents writes to HDFS and the other to the local file system.
Only Flume1's configuration file is given here.
Configuration file:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to every channel (this is the default)
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# On the sink side, avro acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
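The downstream agents are ordinary Avro-source agents. A sketch of the HDFS-writing agent on hadoop102 (the agent name, channel sizing, and the HDFS path/namenode port are assumptions, not from the original):

```properties
# Receive from Flume1's k1 sink and write to HDFS
a2.sources = r1
a2.sinks = k1
a2.channels = c1
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d
# Use the agent's local time to resolve %Y%m%d (no timestamp header required)
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
```

The file-system agent is the same shape with a file_roll sink in place of the hdfs sink.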
Load balancing and failover
Agent1's events are sent via Avro to Agent2, Agent3, and Agent4. The Flume Sink Processors properties control load balancing and failover: several sinks are grouped into a sink group, and the group's processor then provides either the load-balancing or the failover behavior.
By default, a sink processor passes events to a single sink. processor.type needs to be set to default, failover, or load_balance, for example:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
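For load_balance, two further processor properties are worth knowing: selector picks the distribution strategy (round_robin, the default, or random), and backoff temporarily blacklists a sink that fails so it is skipped for a while instead of being retried immediately:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = random
a1.sinkgroups.g1.processor.backoff = true
```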
Example:
A failover setup. Only Flume1's configuration file is given here.
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
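With this configuration, k2 (priority 10) receives all events while it is healthy; if its downstream agent dies, traffic fails over to k1 (priority 5), and maxpenalty caps the backoff applied to the failed sink. To exercise the netcat source, a line can be sent from the same host (assumes the agent is running and nc is installed):

```shell
# Push one test event into the netcat source listening on localhost:44444
echo "hello failover" | nc localhost 44444
```

Stopping the agent behind k2 and sending another line should show the event arriving through k1 instead.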