Flume Topology Structures

1. Agent Implementation Principles

1.1 Flume Stream Processing


Put transactions

  • doPut: write the incoming batch of events into the temporary buffer putList
  • doCommit: check whether the channel's memory queue has enough room to merge the batch; if so, move putList into the queue
  • doRollback: if the channel's memory queue does not have enough room, roll the data back

Take transactions

  • doTake: pull events into the temporary buffer takeList
  • doCommit: if all events are sent successfully, clear the temporary buffer takeList
  • doRollback: if an exception occurs while sending, return the events in takeList to the channel's memory queue
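The Put and Take flows above can be sketched as a toy Python model (an illustrative simplification, not Flume's actual Java implementation; the class and method names below are made up to mirror the steps above):

```python
from collections import deque

class MemoryChannel:
    """Toy model of Flume's memory channel transaction semantics."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()   # the channel's in-memory event queue
        self.put_list = []     # temporary buffer filled by doPut
        self.take_list = []    # temporary buffer filled by doTake

    # --- Put transaction (source side) ---
    def do_put(self, event):
        self.put_list.append(event)

    def put_commit(self):
        # doCommit: only merge if the queue has room for the whole batch
        if len(self.queue) + len(self.put_list) > self.capacity:
            self.put_rollback()
            return False
        self.queue.extend(self.put_list)
        self.put_list.clear()
        return True

    def put_rollback(self):
        # doRollback: discard the staged events; the source may retry
        self.put_list.clear()

    # --- Take transaction (sink side) ---
    def do_take(self):
        event = self.queue.popleft()
        self.take_list.append(event)
        return event

    def take_commit(self):
        # doCommit: all events were delivered, drop the temporary buffer
        self.take_list.clear()

    def take_rollback(self):
        # doRollback: delivery failed, return buffered events to the queue
        for event in reversed(self.take_list):
            self.queue.appendleft(event)
        self.take_list.clear()
```

The key point the sketch shows: events only become visible in the channel queue on put-commit, and only leave it for good on take-commit; a rollback on either side leaves the queue as if the transaction never happened.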

1.2 Key Agent Components

Besides the Source, Channel, and Sink, an Agent also has Channel Selectors, Sink Processors, and Interceptors.

1.2.1 Channel Selectors

Channel Selector: decides which Channel(s) an Event will be sent to.

There are three types: Replicating, Multiplexing, and Custom.

  • Replicating: sends each Event to every configured Channel; this is the default
  • Multiplexing: routes different Events to different Channels according to a configured rule (a header value)
  • Custom: implement the ChannelSelector interface yourself

Example:

# Replicating Channel
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3
# Multiplexing Channel 
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
# Custom Channel Selector
a1.sources = r1
a1.channels = c1
a1.sources.r1.selector.type = org.example.MyChannelSelector
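The multiplexing rules in the a1 example above amount to a header lookup; a hypothetical Python sketch of the selector's routing decision (an illustrative model, not Flume's real API):

```python
def select_channels(event_headers, mapping, default, header="state"):
    """Pick channels the way a multiplexing selector does: look up the
    configured header's value in the mapping, else fall back to default."""
    value = event_headers.get(header)
    return mapping.get(value, default)

# Mirrors the a1 multiplexing configuration above
mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}
default = ["c4"]
```

An event with header `state=US` is fanned out to c2 and c3; any unmapped or missing state falls through to c4.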

1.2.2 Sink Processors

There are four types of Sink Processor: the Default Sink Processor, the Failover Sink Processor, the Load Balancing Sink Processor, and custom Sink Processors.

  • Default Sink Processor: accepts only a single Sink
  • Failover Sink Processor: maintains a priority-ordered list of Sinks, guaranteeing that as long as one Sink is available, events will be processed (delivered)
  • Load Balancing Sink Processor: provides the ability to load-balance the flow across multiple Sinks

Example:

# Failover Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Load balancing Sink Processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

The Load Balancing Sink Processor and Failover Sink Processor both operate on a Sink Group.
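The failover processor's core decision — deliver via the highest-priority sink that is still up — can be sketched as follows (an illustrative model; the real processor additionally applies an exponential backoff capped by maxpenalty to failed sinks):

```python
def pick_sink(priorities, failed):
    """Choose the live sink with the highest priority, the way a
    failover sink processor does. `priorities` maps sink name to
    priority; `failed` is the set of sinks currently considered down."""
    live = {s: p for s, p in priorities.items() if s not in failed}
    if not live:
        raise RuntimeError("no live sinks in the group")
    return max(live, key=live.get)

# Mirrors the g1 group above: k2 (priority 10) beats k1 (priority 5)
priorities = {"k1": 5, "k2": 10}
```

With both sinks healthy, all events go to k2; only when k2 fails does traffic move to k1.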

1.2.3 Interceptors

Interceptors: Flume can modify or drop events in flight, and this is done with interceptors. The list of events returned by one interceptor is passed on to the next interceptor in the chain.

Commonly used interceptors:

  • Timestamp Interceptor: inserts into the event headers the time, in milliseconds, at which it processed the event
  • Host Interceptor: inserts the hostname or IP address of the host the agent is running on

Example:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = timestamp
a1.sinks.k1.filePrefix = FlumeData.%{hostname}.%Y-%m-%d
a1.sinks.k1.channel = c1
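What these two interceptors do to an event's headers can be modeled as below (a hypothetical sketch; real Flume interceptors are Java classes implementing the Interceptor interface):

```python
import socket
import time

def host_interceptor(event, host_header="hostname", preserve_existing=False):
    """Insert this host's name into the headers under host_header,
    mirroring the i1 configuration above."""
    if preserve_existing and host_header in event["headers"]:
        return event
    event["headers"][host_header] = socket.gethostname()
    return event

def timestamp_interceptor(event):
    """Insert the processing time, in milliseconds, under 'timestamp'."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

# Events pass through the chain in order: i1 (host), then i2 (timestamp)
event = {"headers": {}, "body": b"hello"}
event = timestamp_interceptor(host_interceptor(event))
```

After the chain runs, a sink can reference the headers via escape sequences such as %{hostname}, as the filePrefix line above does.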

2. Topology Structures

2.1 Serial (Chained) Mode

2.1.1 Model Structure

This mode chains multiple Flume agents together. To flow data across multiple agents or hops, the Sink of the previous agent and the Source of the current hop must both be of type avro, with the Sink pointing at the hostname (or IP address) and port of the Source.

The number of chained agents affects the transfer rate, and if any node goes down, the whole pipeline is affected.

2.1.2 Serial Implementation

1️⃣ Step 1: On host node1, add the configuration file taildir2avro.conf

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1
taildir2avro.channels = c1

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2
taildir2avro.sources.r1.filegroups.g1 = /root/log.txt
taildir2avro.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = node2
taildir2avro.sinks.k1.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 30000
taildir2avro.channels.c1.transactionCapacity = 3000

# Bind the source and sink to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1

2️⃣ Step 2: On host node2, add the configuration file avro2hdfs.conf

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = node2
avro2hdfs.sources.r1.port = 12345


# Describe the sink
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/%y-%m-%d-%H-%M
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 130000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0


# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 30000
avro2hdfs.channels.c1.transactionCapacity = 3000

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1

3️⃣ Step 3: Start Flume on node2

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/avro2hdfs.conf -n avro2hdfs -Dflume.root.logger=INFO,console

4️⃣ Step 4: Check that port 12345 on node2 is being listened on, and confirm it is in the LISTEN state
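This check can also be scripted. A small Python helper that attempts a TCP connection (the hostname node2 and port 12345 come from the configuration above; this is just one way to check, equivalent to inspecting netstat/ss output on node2):

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds,
    i.e. something (here, the avro source) is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On this cluster you would call: is_listening("node2", 12345)
```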


5️⃣ Step 5: Run node1's printer.sh script for a while, then start Flume on node1

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2avro.conf -n taildir2avro -Dflume.root.logger=INFO,console

6️⃣ Step 6: Check node2's console output; node1 has successfully sent data to port 12345 on node2


7️⃣ Step 7: Check in the web UI whether data has been committed


2.2 Replicating and Multiplexing

2.2.1 Model Structure

A fan-out flow using a (multiplexing) channel selector

Flume supports fanning an event flow out to one or more destinations. This mode can replicate the same source data into multiple channels, or distribute different data into different channels, with each sink delivering to a different destination.

2.2.2 Model Implementation

Implementation: node1 runs two channel/sink pairs; one uploads the data to HDFS, the other sends it to node2, and node2 uploads its own copy.

1️⃣ Step 1: On node1, add the configuration file taildir2hdfs_avro.conf

# Name the components on this agent
taildir2hdfs_avro.sources = r1
taildir2hdfs_avro.sinks = k1 k2
taildir2hdfs_avro.channels = c1 c2

# Describe/configure the source
taildir2hdfs_avro.sources.r1.type = TAILDIR
taildir2hdfs_avro.sources.r1.filegroups = g1 g2
taildir2hdfs_avro.sources.r1.filegroups.g1 = /root/log.txt
taildir2hdfs_avro.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2hdfs_avro.sinks.k1.type = hdfs
taildir2hdfs_avro.sinks.k1.hdfs.path = hdfs://node1:8020/flume/node1/%y-%m-%d-%H-%M
taildir2hdfs_avro.sinks.k1.hdfs.fileType = DataStream
taildir2hdfs_avro.sinks.k1.hdfs.writeFormat = Text
taildir2hdfs_avro.sinks.k1.hdfs.useLocalTimeStamp = true
taildir2hdfs_avro.sinks.k1.hdfs.rollInterval = 300
taildir2hdfs_avro.sinks.k1.hdfs.rollSize = 30000000
taildir2hdfs_avro.sinks.k1.hdfs.rollCount = 0

taildir2hdfs_avro.sinks.k2.type = avro
taildir2hdfs_avro.sinks.k2.hostname = node2
taildir2hdfs_avro.sinks.k2.port = 12345

# Use a channel which buffers events in memory
taildir2hdfs_avro.channels.c1.type = memory
taildir2hdfs_avro.channels.c1.capacity = 10000
taildir2hdfs_avro.channels.c1.transactionCapacity = 1000

taildir2hdfs_avro.channels.c2.type = memory
taildir2hdfs_avro.channels.c2.capacity = 10000
taildir2hdfs_avro.channels.c2.transactionCapacity = 1000

# Bind the source and sink to the channel
taildir2hdfs_avro.sources.r1.channels = c1 c2
taildir2hdfs_avro.sinks.k1.channel = c1
taildir2hdfs_avro.sinks.k2.channel = c2

2️⃣ Step 2: On node2, add the configuration file avro2hdfs.conf

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = node2
avro2hdfs.sources.r1.port = 12345


# Describe the sink
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/node2/%y-%m-%d-%H-%M
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 30000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0


# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 10000
avro2hdfs.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1

3️⃣ Step 3: Start Flume on node2

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/avro2hdfs.conf -n avro2hdfs -Dflume.root.logger=INFO,console

4️⃣ Step 4: Start Flume on node1

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2hdfs_avro.conf -n taildir2hdfs_avro -Dflume.root.logger=INFO,console

5️⃣ Step 5: Check in the web UI whether files have been uploaded


2.3 Load Balancing and Failover

2.3.1 Model Structure

2.3.2 Model Implementation

Implementation: node1 sends data to node2 and node3, which upload the data to HDFS

1️⃣ Step 1: On node1, add the configuration file taildir2avro.conf

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1 k2
taildir2avro.channels = c1

# Sink group: load-balance events across k1 and k2
taildir2avro.sinkgroups = g1
taildir2avro.sinkgroups.g1.sinks = k1 k2
taildir2avro.sinkgroups.g1.processor.type = load_balance
taildir2avro.sinkgroups.g1.processor.backoff = true
taildir2avro.sinkgroups.g1.processor.selector = round_robin

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2
taildir2avro.sources.r1.filegroups.g1 = /root/log.txt
taildir2avro.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sinks
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = node2
taildir2avro.sinks.k1.port = 12345

taildir2avro.sinks.k2.type = avro
taildir2avro.sinks.k2.hostname = node3
taildir2avro.sinks.k2.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 10000
taildir2avro.channels.c1.transactionCapacity = 1000

# Bind the source and sinks to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1
taildir2avro.sinks.k2.channel = c1

2️⃣ Step 2: On node2, add the configuration file avro2hdfs.conf

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = node2
avro2hdfs.sources.r1.port = 12345


# Describe the sink
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/node2/%y-%m-%d-%H-%M
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 30000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0


# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 10000
avro2hdfs.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1

3️⃣ Step 3: On node3, add the configuration file avro2hdfs.conf

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = node3
avro2hdfs.sources.r1.port = 12345


# Describe the sink
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/node3/%y-%m-%d-%H-%M
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 30000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0


# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 10000
avro2hdfs.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1

4️⃣ Step 4: Start Flume on node2 and node3

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/avro2hdfs.conf -n avro2hdfs -Dflume.root.logger=INFO,console

5️⃣ Step 5: Start Flume on node1

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2avro.conf -n taildir2avro -Dflume.root.logger=INFO,console

6️⃣ Step 6: Check the console log output


7️⃣ Step 7: Check in the web UI whether files have been uploaded


2.4 Aggregation Mode

2.4.1 Model Structure

A fan-in flow using Avro RPC to consolidate events in one place

This is a very practical structure. It is achieved in Flume by configuring a number of first-tier agents with avro sinks, all pointing at the avro source of a single agent (you can also use thrift sources/sinks/clients in this scenario). The source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink into its final destination.

Web applications are typically spread across hundreds of servers, which makes the logs they produce troublesome to process. This Flume arrangement solves the problem well: each server runs a Flume agent that collects its logs and sends them to one central log-collecting Flume agent, which then uploads them to HDFS, Hive, HBase, and so on for log analysis.

2.4.2 Model Implementation

Implementation: node1 and node2 both send data to node3, and node3 uploads the data to HDFS

1️⃣ Step 1: On node1, add the configuration file taildir2avro.conf

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1
taildir2avro.channels = c1

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2
taildir2avro.sources.r1.filegroups.g1 = /root/log.txt
taildir2avro.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = node3
taildir2avro.sinks.k1.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 10000
taildir2avro.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1

2️⃣ Step 2: On node2, add the configuration file taildir2avro.conf

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1
taildir2avro.channels = c1

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2
taildir2avro.sources.r1.filegroups.g1 = /root/log.txt
taildir2avro.sources.r1.filegroups.g2 = /root/log/.*.txt

# Describe the sink
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = node3
taildir2avro.sinks.k1.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 10000
taildir2avro.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1

3️⃣ Step 3: On node3, add the configuration file avro2hdfs.conf

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = node3
avro2hdfs.sources.r1.port = 12345


# Describe the sink
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://node1:8020/flume/%y-%m-%d-%H-%M
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 30000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0


# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 10000
avro2hdfs.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1

4️⃣ Step 4: Start Flume on node3

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/avro2hdfs.conf -n avro2hdfs -Dflume.root.logger=INFO,console

5️⃣ Step 5: Start Flume on node1 and node2

flume-ng agent -c /opt/flume-1.9.0/conf/ -f /opt/flume-1.9.0/agents/taildir2avro.conf -n taildir2avro -Dflume.root.logger=INFO,console

6️⃣ Step 6: Check the console log output


7️⃣ Step 7: Check in the web UI whether files have been uploaded


3. Final Notes

The configuration files are all much the same. Before each test, write something into the monitored directory or file first; after each test, clear out the files on HDFS.
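The printer.sh script used in the tests is not shown; a minimal Python stand-in that appends numbered lines to the tailed file (the path /root/log.txt comes from the TAILDIR configuration above; the function name is made up) might look like:

```python
import time

def append_test_lines(path, n=10, interval=0.0):
    """Append n numbered test lines to the tailed file so the
    TAILDIR source has something to pick up."""
    with open(path, "a") as f:
        for i in range(n):
            f.write(f"test line {i}\n")
            f.flush()
            if interval:
                time.sleep(interval)

# On node1 you might run: append_test_lines("/root/log.txt", n=100, interval=0.1)
```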

 


❤️ END ❤️