- 2.1 Single agent with multiple sources
- 2.2 Multiplexing Channel Selector
- 2.3 Consolidation
- 2.4 Flume Sink Processors
- 2.5 Custom sources
Chapter 1: Review of the previous lesson
https://blog.csdn.net/zhikanjiani/article/details/100135799
Chapter 2: Multiple agents
So far we have only worked with a single agent; in production you will certainly use multiple agents.
- Multiple agents mean an upstream and a downstream: the output of one agent becomes the input of the next, so data has to travel over the network. Multi-agent flows use Avro for this hop.
- Open http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.7.0/, select Documentation, and click Flume User Guide.
Explanation:
Suppose there are two machines, node1 and node2, each running a source, channel, and sink. On the first machine we read a file, monitoring the log at the source end; the events are shipped over the network and finally written out on the second machine.
Agent technology selection:
- node1: source --> channel --> sink ==> node2: source --> channel --> sink
- node1: exec --> memory --> avro ==> node2: avro --> memory --> logger
How to implement this?
1. The first configuration file, named:
>>> flume-avro-sink.conf
a1 --> flume-avro-sink (agent name)
r1 --> exec-source
c1 --> avro-memory-channel
flume-avro-sink.sources = exec-source
flume-avro-sink.channels = avro-memory-channel
flume-avro-sink.sinks = avro-sink
flume-avro-sink.sources.exec-source.type = exec
flume-avro-sink.sources.exec-source.command = tail -F /home/hadoop/data/avro_access.data
flume-avro-sink.channels.avro-memory-channel.type = memory
flume-avro-sink.sinks.avro-sink.type = avro
flume-avro-sink.sinks.avro-sink.hostname = localhost
flume-avro-sink.sinks.avro-sink.port = 44444
flume-avro-sink.sources.exec-source.channels = avro-memory-channel
flume-avro-sink.sinks.avro-sink.channel = avro-memory-channel
2. The second configuration file, flume-avro-source.conf:
flume-avro-source.sources = avro-source
flume-avro-source.channels = avro-memory-channel
flume-avro-source.sinks = logger-sink
flume-avro-source.sources.avro-source.type = avro
flume-avro-source.sources.avro-source.bind = localhost
flume-avro-source.sources.avro-source.port = 44444
flume-avro-source.channels.avro-memory-channel.type = memory
flume-avro-source.sinks.logger-sink.type = logger
flume-avro-source.sources.avro-source.channels = avro-memory-channel
flume-avro-source.sinks.logger-sink.channel = avro-memory-channel
When starting, start node2 first, then node1:
1. The first agent (flume-avro-sink, on node1):
flume-ng agent \
--name flume-avro-sink \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-avro-sink.conf \
-Dflume.root.logger=INFO,console
2. The second agent (flume-avro-source, on node2; this one must already be running):
flume-ng agent \
--name flume-avro-source \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-avro-source.conf \
-Dflume.root.logger=INFO,console
3. Verification:
[hadoop@hadoop ~]$ cd data
[hadoop@hadoop data]$ ll
total 8
-rw-rw-r-- 1 hadoop hadoop 2 Oct 20 11:03 avro_access.data
-rw-rw-r-- 1 hadoop hadoop 42 Oct 18 13:49 ruozeinput.txt
[hadoop@hadoop data]$ echo ruozedata >> avro_access.data
[hadoop@hadoop data]$ echo ruoze >> avro_access.data
[hadoop@hadoop data]$ echo jepson >> avro_access.data
Confirm that all the data came through:
Summary:
- In production you only need to swap the source (replacing exec) and change the sink to hdfs, and the data flows over the network into HDFS.
- Note: since we configured both agents on one machine, the Avro sink's hostname/port must match the Avro source's bind/port; and always start the node2 agent before the node1 agent.
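As the summary says, turning this demo into the production flow mostly means replacing the second agent's logger sink with an HDFS sink. A minimal sketch of that change (the HDFS path and namenode address are placeholders, not from the course environment):

```properties
# flume-avro-source.conf with the logger sink swapped for an HDFS sink
flume-avro-source.sinks = hdfs-sink
flume-avro-source.sinks.hdfs-sink.type = hdfs
flume-avro-source.sinks.hdfs-sink.hdfs.path = hdfs://hadoop:8020/flume/avro-events/%Y%m%d
flume-avro-source.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
flume-avro-source.sinks.hdfs-sink.hdfs.fileType = DataStream
flume-avro-source.sinks.hdfs-sink.channel = avro-memory-channel
```

The source and channel definitions stay exactly as in flume-avro-source.conf above; only the sink section changes.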
2.1 Single-agent multi-source configuration
This is actually one source fanning out to multiple channels and sinks.
Suppose the data lives on a fleet of external servers, and in production the offline and real-time pipelines consume the same data: one copy is written to HDFS, while another copy flows on to Kafka.
This introduces the concept of a Channel Selector.
If the type is not specified, then defaults to "replicating".
- With the replicating selector, the source writes a copy of each event to one or more channels.
1. Multiplexing the flow:
- Example: the source holds 3 records; channel1, channel2, and channel3 each receive one of them.
2. "Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels."
1. The approach used here:
netcat --> channel --> hdfs
       --> channel --> logger
>>> channel-replicating-selector.conf
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# Define the netcat source:
a1.sources.r1.type = netcat
a1.sources.r1.bind = 172.17.0.5
a1.sources.r1.port = 44444
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
# Configure the channels:
a1.channels.c1.type = memory
a1.channels.c2.type = memory
# Configure the sinks:
a1.sinks.k2.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# The HDFS sink:
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/g6-events/%y%m%d%H%M
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
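The round/roundValue/roundUnit settings above mean the event timestamp is rounded down to the nearest 10-minute mark before the %y%m%d%H%M escapes in hdfs.path are substituted. A small Python sketch modelling that bucketing (the path mirrors the config above; this imitates Flume's behavior, it is not Flume code):

```python
from datetime import datetime

def hdfs_bucket(ts, round_value=10):
    # Mimic hdfs.round=true with roundUnit=minute, roundValue=10:
    # round the timestamp DOWN to the nearest 10-minute boundary,
    # then substitute the %y%m%d%H%M escape sequences.
    rounded = ts.replace(minute=ts.minute - ts.minute % round_value,
                         second=0, microsecond=0)
    return rounded.strftime("/flume/g6-events/%y%m%d%H%M")

# An event at 14:32 lands in the 14:30 bucket:
print(hdfs_bucket(datetime(2019, 10, 22, 14, 32)))  # /flume/g6-events/1910221430
```

So all events arriving between 14:30:00 and 14:39:59 are written under the same directory, which matches the per-batch directories seen in the verification below.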
Launch command:
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/channel-replicating-selector.conf \
-Dflume.root.logger=INFO,console
Testing that it works:
- Figure 1 (screenshot omitted)
- Figure 2 (screenshot omitted)
- Figure 3 (screenshot omitted): files are rolled in one-minute batches
Checking the result:
[hadoop@hadoop apache-flume-1.6.0-cdh5.7.0-bin]$ hdfs dfs -text /flume/g6-events/1910221432/events-.1571725926033
19/10/22 14:36:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
jepson
ruoze
ruozedata
2.2 Multiplexing Channel Selector
Diagram walkthrough:
- The events entering the source carry one of several states, say US, CN, and CA. The requirement is to route events with different states into different channels, and from each channel into its own sink; the data is no longer copied to every channel, it is routed by condition.
- Which channel an event lands in is decided by its state header. From the official documentation:
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
# events are routed by the value of the 'state' header
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
To set the header, use the static interceptor:
- This scenario needs at least 4 configuration files.
Production use case: collected logs contain many business log types mixed together, and they must be told apart; the multiplexing selector writes each business log type to its own directory, so downstream jobs simply read the matching HDFS directory for processing.
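Putting the two pieces together, a sketch of how the static interceptor tags events and the multiplexing selector routes them (agent, source, and channel names here are illustrative, not from a real project):

```properties
# Sketch: tag every event from this source with a 'state' header...
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = state
a1.sources.r1.interceptors.i1.value = US
# ...and route on that header with the multiplexing selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.US = c1
a1.sources.r1.selector.default = c2
```

Each upstream agent tags its own business log type with a different value, and the downstream agent's selector fans the events out accordingly.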
2.3 Consolidation
1. "A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster."
- In other words: many log-producing machines fan in to a small number of consumer agents that sit in front of the storage system.
- Configure avro-source and avro-sink on the various machines, and all we have to do is start them.
Problems with writing straight to HDFS:
1. These machines are deployed on the customer's side and cannot write directly into our data platform;
2. If 10,000 machines write to HDFS, HDFS cannot withstand that many machines writing continuously; the throughput is not enough;
3. If one machine in the cluster dies and cannot keep up, the blocked channel will eventually fail.
This leads to the idea of a staging (spool-to-disk) tier:
- Data is first collected into a local channel on each machine, then shipped via an avro-sink to a small tier of aggregation agents; only those agents need gateway access, and they write the data on to HDFS.
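A sketch of such a first-tier (client-side) agent: buffer on local disk with a file channel, then forward over Avro to an aggregation agent. The log path, checkpoint/data directories, hostname, and port are placeholders:

```properties
# First-tier agent: durable local buffering, then avro to the aggregator
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
# file channel: events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume/data
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = aggregator-host
a1.sinks.k1.port = 44444
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The aggregation agents then look like the flume-avro-source agent from Chapter 2, with an HDFS sink instead of a logger.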
2.4 Flume Sink Processors
1. "Sink groups allow users to group multiple sinks into one entity. Sink processors can be used to provide load balancing capabilities over all sinks inside the group or to achieve fail over from one sink to another in case of temporal failure."
Failover Sink Processor, illustrated:
- The channel sends data to a sink. What if sink3 dies? We need failover: when sink3 goes down, events can be redirected to a standby machine. In production this architecture is a must.
Concept:
"Failover Sink Processor maintains a prioritized list of sinks, guaranteeing that so long as one is available events will be processed (delivered)."
Example:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
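The other processor mentioned above is load balancing. A sketch of the same sink group with a load_balance processor instead (following the pattern of the failover example; round_robin and random are the two built-in selectors):

```properties
# Spread events across the group's sinks instead of failing over
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
```

With backoff enabled, a failed sink is temporarily blacklisted, so load balancing also gives a degree of failover.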
2.5 Custom sources
1. "A custom source is your own implementation of the Source interface. A custom source's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom source is its FQCN."
- Once the source is written, package it and place the jar under $FLUME_HOME/lib.
- At this point, turn to the Flume Developer Guide:
Open IDEA and add the following dependency to the project's pom.xml:
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>${flume.version}</version>
</dependency>
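The ${flume.version} placeholder assumes a matching entry in the pom's <properties> block; for the CDH build used in this course it would look like:

```xml
<properties>
  <flume.version>1.6.0-cdh5.7.0</flume.version>
</properties>
```

(CDH artifacts also require the Cloudera Maven repository to be configured in the pom.)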
Suppose we want a source that reads data from MySQL:
- the source itself: reads from MySQL;
- configure(): initialization, reading the relevant connection settings.
If you would rather not build it yourself, search GitHub first:
- https://github.com/keedio/flume-ng-sql-source
The rows read from MySQL are assembled into events and then delivered downstream.
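Wiring a custom source into an agent then just means using its FQCN as the type, as the quote above says. A sketch (the class name and the query property are made up for illustration; a real custom source defines its own properties via configure()):

```properties
# Custom source: type is the fully qualified class name of your implementation
a1.sources = r1
a1.sources.r1.type = com.example.flume.MySqlSource
a1.sources.r1.channels = c1
```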
2.6 A complete flow interacting with the data platform
- External web servers send to relay agents, which write one copy to the offline HDFS cluster (processed with Spark) and another to the real-time Kafka cluster (processed with Spark Streaming, Flink, or Storm).
Homework for this lesson:
Use: taildir source --> file channel --> avro sink
- avro source --> file channel --> HDFS
-                               --> Kafka