Flume Installation and Component Configuration

Latest recommended article published on 2023-12-18 09:46:44

sq0723

Categories: flume, big data development. Tags: flume, log collection system

Article link: https://blog.csdn.net/sunqingok/article/details/101350272

1. Flume installation
Extract the archive, copy the template configuration inside the conf directory, then run a test from the installation directory.

   tar -zxvf apache-flume-1.8.0-bin.tar.gz
   cp flume-conf.properties.template flume-conf-sequence.properties
   bin/flume-ng agent -n agent1 -c conf -f conf/flume-conf-sequence.properties -Dflume.root.logger=INFO,console
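
The post does not show what goes into flume-conf-sequence.properties. Judging by the file name, a sequence-generator source is a plausible choice for a smoke test; the sketch below is an assumption along those lines, with r1, c1 and k1 as illustrative component names. The agent name must match the -n agent1 argument.

```properties
# Hypothetical contents of conf/flume-conf-sequence.properties (not from the original post)
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Sequence generator source: emits an incrementing counter, useful for testing
agent1.sources.r1.type = seq
agent1.sources.r1.channels = c1

# In-memory channel between the source and the sink
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# Logger sink: writes events to the console when flume.root.logger=INFO,console
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```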

2. Source component exercises
1) Avro source
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/avro-memory-logger.properties -Dflume.root.logger=INFO,console

Send data with the avro-client: bin/flume-ng avro-client -c ./conf -H master -p 4141 -F djt.txt -Dflume.root.logger=INFO,console
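
The contents of avro-memory-logger.properties are not given. A minimal sketch, assuming the usual Avro source, memory channel and logger sink layout the file name suggests; port 4141 comes from the avro-client command above, while the bind address and component names are assumptions.

```properties
# Hypothetical conf/avro-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Avro source: listens for Avro RPC events, e.g. from "flume-ng avro-client"
agent1.sources.r1.type = avro
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 4141
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```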
2) Thrift source
Listens on a port and accepts the data sent to that port.
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/thrift-memory-logger.properties -Dflume.root.logger=INFO,console
Send data: from a Java client
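
A possible thrift-memory-logger.properties, sketched under the assumption that it mirrors the Avro setup. The port number 4142 is purely illustrative; the post does not state which port the Java client connects to.

```properties
# Hypothetical conf/thrift-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Thrift source: listens for events sent over Thrift RPC
agent1.sources.r1.type = thrift
agent1.sources.r1.bind = 0.0.0.0
# Port is an assumption, not taken from the original post
agent1.sources.r1.port = 4142
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```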
3) NetCat TCP source
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/netcat-memory-logger.properties -Dflume.root.logger=INFO,console
Send data with telnet: telnet bigdata11 6666
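
A sketch of what netcat-memory-logger.properties might look like. Port 6666 matches the telnet command above; binding to 0.0.0.0 is an assumption so that the host name bigdata11 resolves to a listening interface.

```properties
# Hypothetical conf/netcat-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# NetCat TCP source: each line of text received becomes one event
agent1.sources.r1.type = netcat
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 6666
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```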
4) HTTP source
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/http-memory-logger.properties -Dflume.root.logger=INFO,console
Send data: curl -X POST -d '[{"headers":{"key1":"flume","key2":"kafka"},"body":"hello flume"}]' http://192.168.137.11:5140
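
The curl request posts a JSON array of events, which is the format Flume's default JSONHandler accepts. A minimal sketch of a matching http-memory-logger.properties, assuming port 5140 from the URL above; the bind address and component names are assumptions.

```properties
# Hypothetical conf/http-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# HTTP source: accepts events via HTTP POST
agent1.sources.r1.type = http
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 5140
# JSONHandler is the default handler; it expects a JSON array of objects with "headers" and "body"
agent1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```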
5) Exec source
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/exec-memory-logger.properties -Dflume.root.logger=INFO,console
Send data by appending to big.txt: echo "11111" >> big.txt
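
An Exec source typically runs tail -F on the file being appended to. The sketch below assumes that approach; the path to big.txt is a placeholder, since the post does not say where the file lives.

```properties
# Hypothetical conf/exec-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Exec source: runs a command and turns each line of its output into an event
agent1.sources.r1.type = exec
# Placeholder path; point it at the real location of big.txt
agent1.sources.r1.command = tail -F /path/to/big.txt
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```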
6) Spooling Directory source
Watches a directory for newly added files.
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/SpoolingDirectory-memory-logger.properties -Dflume.root.logger=INFO,console
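
A minimal sketch of SpoolingDirectory-memory-logger.properties. The spool directory path is a placeholder, as the post does not name the monitored directory.

```properties
# Hypothetical conf/SpoolingDirectory-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Spooling Directory source: ingests files dropped into spoolDir
agent1.sources.r1.type = spooldir
# Placeholder path for the monitored directory
agent1.sources.r1.spoolDir = /path/to/spool
# Adds a header holding the absolute path of the originating file
agent1.sources.r1.fileHeader = true
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```

Files placed in the spool directory should be complete and should not be modified afterwards; the source marks a file as finished once it has been fully read.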
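7) Taildir source
Records the position it has read up to, so that after Flume is stopped and restarted, collection resumes from the recorded position.
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/TaildirSource-memory-logger.properties -Dflume.root.logger=INFO,console
Send data by appending to big.txt: echo "11111" >> big.txt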
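
The resume-from-position behaviour comes from the position file that the Taildir source maintains. A sketch under assumed paths; both the position file and the big.txt location are placeholders.

```properties
# Hypothetical conf/TaildirSource-memory-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Taildir source: tails files and persists the last read offset
agent1.sources.r1.type = TAILDIR
# JSON file where the read offsets are stored (placeholder path)
agent1.sources.r1.positionFile = /path/to/taildir_position.json
# One file group named f1 that matches the file being appended to
agent1.sources.r1.filegroups = f1
agent1.sources.r1.filegroups.f1 = /path/to/big.txt
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```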
3. Channel components
1) Memory channel
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/avro-memory-logger.properties -Dflume.root.logger=INFO,console
Send data with the avro-client: bin/flume-ng avro-client -c ./conf -H master -p 4141 -F djt.txt -Dflume.root.logger=INFO,console
2) File channel
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/SpoolingDirectory-file-logger.properties -Dflume.root.logger=INFO,console
Send data: copy a file into the monitored spooling directory.
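
Unlike the memory channel, the file channel persists events to disk, so they survive an agent restart. A sketch of the channel part of SpoolingDirectory-file-logger.properties; all three paths are placeholders.

```properties
# Hypothetical conf/SpoolingDirectory-file-logger.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

agent1.sources.r1.type = spooldir
agent1.sources.r1.spoolDir = /path/to/spool
agent1.sources.r1.channels = c1

# File channel: durable, backed by a checkpoint directory and one or more data directories
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /path/to/flume/checkpoint
agent1.channels.c1.dataDirs = /path/to/flume/data

agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```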
4. Interceptors
1) Timestamp interceptor
Builds the HDFS directory path from the event timestamp.
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/SpoolingDirectory-timestampIntercepter-file-hdfssink.properties -Dflume.root.logger=INFO,console
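
The HDFS sink resolves escape sequences such as %Y-%m-%d from the timestamp header, which is why the interceptor is needed. A sketch of how the pieces might fit together; every path here is a placeholder, and the HDFS path relies on the default file system from the Hadoop configuration on the classpath.

```properties
# Hypothetical conf/SpoolingDirectory-timestampIntercepter-file-hdfssink.properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

agent1.sources.r1.type = spooldir
agent1.sources.r1.spoolDir = /path/to/spool
agent1.sources.r1.channels = c1
# Timestamp interceptor: adds a "timestamp" header (epoch milliseconds) to each event
agent1.sources.r1.interceptors = i1
agent1.sources.r1.interceptors.i1.type = timestamp

agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /path/to/flume/checkpoint
agent1.channels.c1.dataDirs = /path/to/flume/data

agent1.sinks.k1.type = hdfs
agent1.sinks.k1.channel = c1
# One directory per day, derived from the timestamp header
agent1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
# Write plain event bodies instead of the default SequenceFile format
agent1.sinks.k1.hdfs.fileType = DataStream
```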
2) Host interceptor
Uses the host as the file name prefix.
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/SpoolingDirectory-hosttampIntercepter-file-hdfssink.properties -Dflume.root.logger=INFO,console
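
A fragment showing how the host header can end up in the HDFS file name. It is meant to be merged into an agent with a source r1 and an HDFS sink k1, both of which are assumed names; the HDFS path is a placeholder.

```properties
# Fragment only: assumes agent1 already defines source r1, a channel, and HDFS sink k1

# Host interceptor: adds a header with the agent's host name or IP
agent1.sources.r1.interceptors = i1
agent1.sources.r1.interceptors.i1.type = host
# false stores the host name, true (the default) stores the IP address
agent1.sources.r1.interceptors.i1.useIP = false
# Name of the header to write; "host" is also the default
agent1.sources.r1.interceptors.i1.hostHeader = host

agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = /flume/events
# %{host} is replaced by the value of the "host" header
agent1.sinks.k1.hdfs.filePrefix = %{host}
```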
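3) Static interceptor
Defines a static header (a fixed key and value) that can then be used directly.
Start the agent: bin/flume-ng agent -n agent1 -c conf -f conf/SpoolingDirectory-staticIntercepter-file-hdfssink.properties -Dflume.root.logger=INFO,console
Send data: copy a file into the monitored spooling directory.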
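
A fragment illustrating the idea. The key datacenter and value NEW_YORK are example values chosen here, not taken from the post, and r1 and k1 are assumed component names.

```properties
# Fragment only: assumes agent1 already defines source r1, a channel, and HDFS sink k1

# Static interceptor: stamps every event with the same header key and value
agent1.sources.r1.interceptors = i1
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = datacenter
agent1.sources.r1.interceptors.i1.value = NEW_YORK

agent1.sinks.k1.type = hdfs
# %{datacenter} is replaced by the header value set above
agent1.sinks.k1.hdfs.path = /flume/events/%{datacenter}
```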
4) Remove Header interceptor (to be confirmed)
An interceptor added in version 1.8; it removes headers from events.
Start the agents:
Master: bin/flume-ng agent -n agent1 -c conf -f conf/avro-removeHeaderIntercepter-memory-logger.properties -Dflume.root.logger=INFO,console
Slave1: bin/flume-ng agent -n agent1 -c conf -f conf/exec-staticIntercepter-memory-avro.properties -Dflume.root.logger=INFO,console
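
Going by the file names, Slave1 adds a header with a static interceptor and Master strips it again. A sketch of the Master side under that assumption; the header name datacenter and port 4141 are illustrative, not taken from the post.

```properties
# Fragment for the master agent: assumes agent1 defines channel c1 and a logger sink

agent1.sources.r1.type = avro
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 4141
agent1.sources.r1.channels = c1

# Remove Header interceptor: drops the named header from each event
agent1.sources.r1.interceptors = i1
agent1.sources.r1.interceptors.i1.type = remove_header
# Name of the single header to remove (assumed to be the one the slave's static interceptor adds)
agent1.sources.r1.interceptors.i1.withName = datacenter
```

With the logger sink on the master, the printed event headers should no longer contain the removed key, which is a simple way to confirm the interceptor is active.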
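5. Multi-agent flow
Slave1: configure an avro sink
Slave2: configure an avro source and an avro sink
Master: configure an avro source and a logger sink
Start commands:
Slave1: bin/flume-ng agent -n agent1 -c conf -f conf/avro-memory-slave1avro-salve1.properties -Dflume.root.logger=INFO,console
Slave2: bin/flume-ng agent -n agent1 -c conf -f conf/avro-memory-slave1avro.properties -Dflume.root.logger=INFO,console
Master: bin/flume-ng agent -n agent1 -c conf -f conf/slave1avro-memory-logger.properties -Dflume.root.logger=INFO,console
Files in the monitored directory on slave1 are passed to slave2 and then on to master. Start the agents in this order: master first, then slave2, and slave1 last, so that each downstream avro source is already listening when the upstream avro sink connects.
A consolidation (aggregation) flow is set up in much the same way as the multi-agent flow.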
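
The middle tier is the part that differs most from the single-agent exercises, since it both receives and forwards Avro events. A sketch of a slave2 configuration; the host name master and port 4141 on both hops are assumptions.

```properties
# Hypothetical configuration for the slave2 agent (middle tier)
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Receives events from slave1's avro sink
agent1.sources.r1.type = avro
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 4141
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

# Forwards events to the avro source on master
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = master
agent1.sinks.k1.port = 4141
agent1.sinks.k1.channel = c1
```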
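6. Fan-out (multiplexing) flows
The replicating channel selector copies every event to all channels listed in the source's channels parameter. The selector can mark one set of channels as required and another as optional. A failed write to an optional channel is ignored; a failed write to a required channel makes the source throw an exception and request a retry.
Example configuration:
a1.sources.r1.channels = c1 c2 c3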
a1.sources.r1.selector.optional = c3
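
For context, the two lines above might sit in an agent like the following sketch. The netcat source, its port, and the logger sinks are assumptions made only to keep the example complete.

```properties
# Hypothetical agent using the replicating selector with one optional channel
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = k1 k2 k3

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
# replicating is the default selector type; stated here for clarity
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
# Write failures on c3 are ignored; failures on c1 or c2 fail the transaction
a1.sources.r1.selector.optional = c3

a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.channels.c3.type = memory

# One sink per channel, so every copy of the event is consumed
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c2
a1.sinks.k3.type = logger
a1.sinks.k3.channel = c3
```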
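The multiplexing channel selector is designed for dynamic routing: it decides which channel an event is written to based on the value of one specific event header, whose name is given in the selector configuration.
For each event, the selector looks up the header named by the header parameter and checks whether its value matches one of the mapping entries. If it does, the event is written to the channels mapped to that value. If nothing matches, or the header is missing, the event is written to the channels given by the default parameter.
Example configuration:
a1.sources = r1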
a1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1
Here the header key is assumed to be state; logs carrying different values for that key are routed to different channels.
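
The state header has to be set somewhere before the selector sees it. One way, sketched here as an assumption rather than the author's setup, is a static interceptor; interceptors run before the channel selector, so the header is available for routing.

```properties
# Fragment only: assumes agent a1 with source r1 and the multiplexing selector shown above

# Static interceptor stamps every event from this source with state=US
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = state
a1.sources.r1.interceptors.i1.value = US
```

With this fragment every event from r1 carries state=US and therefore lands in c2. In a multi-agent setup the header would more typically be set on the upstream agents, each with its own value, so that the receiving agent can separate their traffic.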
7. Load balancing and failover
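The sink runner drives a sink group; it is simply the thread that asks the group to process the next batch of events. The sink processor chooses which sink in the group handles that batch. When the sink runner asks the sink group for a sink to pull events from the channel and write them to the next stage (another agent or a storage system), the sink processor is the component that actually makes the choice.
Why is a load-balancing sink processor needed? In a tiered flow, everything works as long as nothing fails. When an agent in the second tier fails, all sinks writing to it stop until the failed agent is back online. This causes two problems. First, the failed sinks keep trying to reconnect, which ties up the agent's threads and wastes CPU. Second, with a File Channel, the sinks send nothing while the source keeps writing, which costs I/O and disk space.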
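To avoid these problems, the sink group can use the load-balancing sink processor, which picks one sink from the group to process events from the channel. The selection order can be set to random or round_robin.
- With random, a sink is chosen at random from the group; it reads from its own channel and sends the data on.
- With round_robin, sinks are chosen from the group in rotation; each reads from its own channel and sends the data on.
- If a sink writes to a failed or overly slow agent and times out, the sink processor picks another sink to write the data. The failed sink is blacklisted, with a backoff period that grows exponentially up to an upper limit. This keeps the same sink from being retried in a loop and wasting resources until its backoff period expires.
An example load-balancing sink processor configuration:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
- This sink group uses the load-balancing sink processor and picks either k1 or k2 at random.
- If a sink fails, it is blacklisted. The backoff period starts at 250 ms and grows exponentially until it reaches 10 seconds. After that, each failed write backs the sink off for another 10 seconds until a write succeeds, at which point the backoff period is reset to 0.
- If selector is set to round_robin, k1 processes data first, then k2, then k1 again.
- At any given moment, only one sink per agent is writing data.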
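
The sink group settings only make sense once both sinks drain the same channel and point at different downstream agents. A sketch of that wiring, using round_robin selection; the host names slave1 and slave2, port 4141, and the netcat source are assumptions.

```properties
# Hypothetical first-tier agent that balances load across two downstream agents
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# Both sinks read from the same channel but target different hosts
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave1
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = slave2
a1.sinks.k2.port = 4141
a1.sinks.k2.channel = c1

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily blacklist failed sinks with exponential backoff
a1.sinkgroups.g1.processor.backoff = true
# Note the underscore: the accepted values are round_robin and random
a1.sinkgroups.g1.processor.selector = round_robin
# Upper bound for the backoff period, in milliseconds
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
```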
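The failover sink processor selects sinks from the group in priority order. The sink with the highest priority writes data until it fails (for an RPC sink, the failure may be caused by the downstream agent hanging or crashing), and only then is the sink with the next highest priority chosen. A different sink is selected only when the current sink fails to write, which ensures that while no sink has failed, only one sink on each machine writes to the second-tier agents.
An example failover sink processor configuration:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
- In this configuration the two sinks use failover, and k2 has the highest priority.
- If some sinks have no priority specified, the first such sink is assigned priority 0, the next -1, the next -2, and so on.
- If two sinks have the same priority, only the one listed first in the sink group is activated.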
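
As with load balancing, the failover group needs two sinks on the same channel that point at different downstream agents. A sketch of a complete agent under assumed host names (slave1 as standby, slave2 as primary), port 4141, and a netcat source.

```properties
# Hypothetical first-tier agent with a primary and a standby downstream agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# Standby sink (lower priority)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave1
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

# Primary sink (higher priority)
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = slave2
a1.sinks.k2.port = 4141
a1.sinks.k2.channel = c1

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# A larger number means a higher priority, so k2 is used first
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Maximum backoff for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

One way to try it out is to stop the agent on slave2 while data is flowing; events should then start arriving at slave1.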