Seven Cases to Take You from Zero to One with Flume

我很ruo

Tags: big data, flume

Article link: https://blog.csdn.net/weixin_44512041/article/details/104854886

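1. Introduction to Flume

1.1 Overview

Flume is a framework for efficiently collecting, aggregating, and transporting log data.

1.2 Basic Unit

Flume is written in Java, so using it means starting a Java process, which Flume calls an Agent. An Agent consists of three parts: a source, a channel, and a sink, as shown in the architecture diagram on the official site.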

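1.3 Core Concepts

Agent: a Flume Java process.

Source: connects to the data source. Different kinds of data sources call for different source types.

Sink: writes the data it reads to a given destination. Different destinations call for different sink types.

Channel: the buffer between a source and a sink.

Event: the basic unit of data transfer in Flume, made up of a header (a map) and a body (a byte[]).

Interceptors: intercept and process events on their way from the source to the channel.

Channel Selectors: when one source feeds several channels, the selector decides which channel or channels each event goes to.

Sink Processors: pick one sink out of a sink group to read data from the channel.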
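To make this vocabulary concrete, here is a minimal Python sketch of the data model. It is only an illustration, not Flume's Java API: the names `Event`, `add_header`, and `run_once` are hypothetical, and the "channel" is just an in-memory deque.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy stand-in for a Flume event: a header map plus a byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)


def add_header(event, key, value):
    """Toy interceptor: tags the event on its way from source to channel."""
    event.headers[key] = value
    return event


def run_once(lines):
    """Pushes lines through source -> interceptor -> channel -> sink once."""
    channel = deque()                       # the buffer between source and sink
    for line in lines:                      # "source": one event per line
        event = Event(body=line.encode("utf-8"))
        channel.append(add_header(event, "origin", "demo"))
    delivered = []
    while channel:                          # "sink": drains the channel in order
        delivered.append(channel.popleft())
    return delivered


events = run_once(["hello", "flume"])
print(events[0].headers, events[0].body)
```

The sketch keeps the same division of labour as an agent: the source only produces events, the interceptor only touches headers, and the sink only consumes what the channel hands over.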

1.4 Ground Rules

A source can feed multiple channels, but a sink can read from only one channel. The complex cases later in this article build on this rule.

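1.5 Installation

Version: at the time of writing, Flume NG 1.7 is the most widely used release.

Because Flume is written in Java, set the JAVA_HOME environment variable first, then simply unpack the archive.

Verify the installation: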

bin/flume-ng version

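1.6 Usage

① Setting up an agent

Prepare a configuration file for the agent. The file follows the Java properties file format.

One file can configure one or more agents. It also sets the properties of every component those agents use and describes how the components are wired together to form a data flow.

② Configuring individual components

In the agent configuration file, set the properties of each required component individually.

③ Wiring the pieces together

Connect the components to one another.

④ Starting an agent

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

-n: the agent name, which must match the agent name used in the configuration file

-c: the directory that holds Flume's global configuration files

-f: the agent configuration file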

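2. Case 1: Network Port -> Console

2.1 Requirement

Listen on a given port of a machine and print everything received to the console.

2.2 Required Components

2.2.1 NetCat TCP Source

This source works much like nc -k -l [host] [port]: it opens a port, listens for incoming data, and wraps each line it receives in an event.

Install nc:

sudo yum -y install nc

Running nc -k -l [host] [port]:

On hadoop102, run:

nc -k -l hadoop102 4444

On hadoop103, open a TCP connection to hadoop102:4444:

nc hadoop102 4444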

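2.3 Configuration Reference

2.3.1 NetCat TCP Source

Property	Default	Description
type	–	Component type name, must be `netcat`
bind	–	IP address or host name to bind to
port	–	Port to bind to

2.3.2 Logger Sink

Logs events at INFO level through the logger, so they end up wherever the log is configured to go (a file or the console).

Property	Default	Description
type	–	Must be `logger`
maxBytesToLog (reported as having no effect)	16	Maximum number of bytes of the Event body to log

2.3.3 Memory Channel

Stores events in an in-memory queue. It suits high-throughput scenarios, but events still held in the channel are lost if the agent fails.

Property	Default	Description
type	–	The component type name, needs to be `memory`
capacity	100	Maximum number of events the channel can hold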

2.4 Configuration File

# Name each component; a1 is the name of the agent
# a1.sources lists the sources configured in a1, separated by spaces
# a1.sinks lists the sinks configured in a1, separated by spaces
# a1.channels lists the channels configured in a1, separated by spaces
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 44444

# Configure the sink
a1.sinks.k1.type = logger
a1.sinks.k1.maxBytesToLog = 100

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2.5 Start Command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/netcatSource-loggersink.conf -Dflume.root.logger=DEBUG,console

Notes on the start command:

For demonstration purposes, -Dflume.root.logger=DEBUG,console prints the log to the console. To capture that output in a file instead, run the agent in the background with nohup, which redirects what would have gone to the console into nohup.out in the flume directory:

nohup bin/flume-ng agent -c conf/ -n a1 -f flumeagents/netcatSource-loggersink.conf -Dflume.root.logger=DEBUG,console &

Dropping -Dflume.root.logger=DEBUG,console sends the output that would have gone to the console into the log file instead:

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/netcatSource-loggersink.conf

The log location is configured in flume/conf/log4j.properties and defaults to the flume/logs directory:

#flume.root.logger=DEBUG,console
flume.root.logger=INFO,LOGFILE
flume.log.dir=./logs
flume.log.file=flume.log

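3. Case 2: Real-Time Log -> HDFS

3.1 Requirement

Monitor the Hive log in real time and upload it to HDFS.

3.2 Required Components

3.2.1 Exec Source

On startup, Exec Source runs a Linux command and expects it to keep producing data on standard output. If the command exits, the source exits too and produces no further data. Commands that emit a continuous stream, such as cat or tail -f, are therefore a good fit, whereas a command like date, which prints a single line and exits, is not.

Property	Default	Description
type	–	Must be `exec`
command	–	The command to run

The problem with Exec Source:

Like other asynchronous sources, Exec Source cannot guarantee that an event reaches the channel when something fails, nor can it tell the client about the failure. If an asynchronous source cannot buffer the events it has read from the client, data can be lost under error conditions. Spooling Directory Source or Taildir Source is therefore recommended in place of Exec Source.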

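3.2.2 HDFS Sink

HDFS Sink writes data to HDFS.

It currently supports creating text files and SequenceFiles.
Both file types can be compressed.
Files can be rolled periodically based on elapsed time, on file size, or on the number of events.
Data can be bucketed or partitioned by the timestamp at which it was produced or by host name.
The upload path may contain formatting escape sequences, which are replaced when the file or directory is actually written.
To use this sink, Hadoop must already be installed so that Flume can use the Hadoop JARs to communicate with HDFS.

Required properties:

Property	Default	Description
type	–	Must be `hdfs`
hdfs.path	–	Upload path

Optional properties:

File rolling policy (a value of 0 disables that trigger):

Property	Default	Description
hdfs.rollInterval	30	Number of seconds between file rolls
hdfs.rollSize	1024	File size, in bytes, that triggers a roll
hdfs.rollCount	10	Number of events written to a file that triggers a roll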
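The three roll triggers work independently: whichever enabled one is reached first rolls the file. The Python sketch below only illustrates that decision rule. `should_roll` is a hypothetical helper, not part of Flume, and its defaults simply mirror the defaults in the table above.

```python
def should_roll(age_seconds, size_bytes, event_count,
                roll_interval=30, roll_size=1024, roll_count=10):
    """Returns True when any enabled trigger is reached; 0 disables a trigger."""
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count


# Settings used by the cases in this article: roll every 30 s or at roughly
# 128 MB, and never roll on event count (rollCount = 0).
print(should_roll(5, 100, 500, roll_interval=30, roll_size=134217700, roll_count=0))
print(should_roll(31, 100, 500, roll_interval=30, roll_size=134217700, roll_count=0))
```

With the default `rollCount` of 10 left in place, a busy source would produce a new small file every ten events, which is why the cases in this article set it to 0.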

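File type and compression:

Property	Default	Description
hdfs.codeC	–	Compression codec, one of: gzip, bzip2, lzo, lzop, snappy
hdfs.fileType	SequenceFile	File format, currently `SequenceFile`, `DataStream`, or `CompressedStream`. `DataStream` writes uncompressed output (plain text); `CompressedStream` writes compressed output

Directory rolling policy:

Property	Default	Description
hdfs.round	false	Whether the timestamp should be rounded down. If true, this affects all time-based escape sequences except %t
hdfs.roundValue	1	Rounds the timestamp down to the highest multiple of this value (in the unit set by hdfs.roundUnit) that is less than or equal to the current time
hdfs.roundUnit	second	Unit of the round-down value: `second`, `minute`, or `hour`

The most important property:

Property	Default	Description
hdfs.useLocalTimeStamp	false	Use the local time of the machine running the Flume process, rather than the timestamp in the event header, when replacing escape sequences

Note: every time-related escape sequence requires a header named timestamp on the event, whose value is a timestamp. The exception is hdfs.useLocalTimeStamp=true, in which case the local time of the server is used in place of the timestamp header. Alternatively, a TimestampInterceptor can add the timestamp key to the header.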
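The interaction between rounding and the path escape sequences is easier to see with numbers. The following Python sketch is a simplified model, not Flume's implementation: `round_down` and `resolve_path` are hypothetical helpers, the rounding is plain arithmetic on epoch milliseconds, and UTC is used only to keep the result reproducible.

```python
from datetime import datetime, timezone

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def round_down(ts_millis, round_value, round_unit):
    """Rounds a millisecond timestamp down to a multiple of round_value units."""
    step = round_value * UNIT_SECONDS[round_unit] * 1000
    return ts_millis - ts_millis % step


def resolve_path(template, ts_millis):
    """Expands %Y %m %d %H %M in the template from the given timestamp (UTC)."""
    moment = datetime.fromtimestamp(ts_millis // 1000, tz=timezone.utc)
    return moment.strftime(template)


# An event stamped 2020-03-14 10:17:45 UTC.
ts = int(datetime(2020, 3, 14, 10, 17, 45, tzinfo=timezone.utc).timestamp() * 1000)

print(resolve_path("/flume/%Y%m%d/%H%M", round_down(ts, 1, "minute")))
print(resolve_path("/flume/%Y%m%d/%H%M", round_down(ts, 10, "minute")))
```

With roundValue = 1 and roundUnit = minute the event lands in a per-minute directory (`/flume/20200314/1017`), while roundValue = 10 groups ten minutes of data into one directory (`/flume/20200314/1010`).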

3.3 Configuration File

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command=tail -f /tmp/atguigu/hive.log


# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix of the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Directory rolling: start a new directory every minute
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File rolling
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
# Store the data as a plain data stream
a1.sinks.k1.hdfs.fileType=DataStream

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3.4 Start Command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/execsource-hdfssink.conf -Dflume.root.logger=INFO,console

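4. Case 3: Offline Logs -> HDFS

4.1 Requirement

Monitor a directory for newly added log files and upload their contents to HDFS.

4.2 Required Components

4.2.1 Spooling Directory Source

Use it when a large volume of offline logs has already been produced in a directory and those logs will no longer be written to or modified.

Spooling Directory Source watches a directory for newly placed files and wraps their data in events as soon as it finds them. Files that have been fully transferred are renamed or deleted to mark them as complete.

Spooling Directory Source requires files placed in the directory to be immutable (never modified afterwards) and to have unique names.

If a file is written to after being placed in the directory, or a file name is reused, Flume logs an error and stops processing.

Required properties:

Property	Default	Description
type	–	Must be `spooldir`
spoolDir	–	Directory to monitor

Optional properties:

Property	Default	Description
fileSuffix	.COMPLETED	Suffix appended to files that have been read completely
deletePolicy	never	Whether to delete a file once it has been read: `never` or `immediate`
fileHeader	false	Whether to store the absolute path of the file in a header
fileHeaderKey	file	Header key under which the absolute path is stored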
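The "read once, then mark as done" behaviour can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how Flume is implemented: `drain_spool_dir` is a hypothetical function, it makes a single pass instead of watching the directory, and it does no checking for modified or duplicate files.

```python
import os
import tempfile


def drain_spool_dir(spool_dir, suffix=".COMPLETED"):
    """Reads every not-yet-completed file once, then renames it with the suffix."""
    lines = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if name.endswith(suffix) or not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
        os.rename(path, path + suffix)       # marks the file as fully ingested
    return lines


spool_dir = tempfile.mkdtemp()
with open(os.path.join(spool_dir, "app.log"), "w", encoding="utf-8") as handle:
    handle.write("first\nsecond\n")

first_pass = drain_spool_dir(spool_dir)
second_pass = drain_spool_dir(spool_dir)     # nothing left: file is .COMPLETED
print(first_pass, second_pass, os.listdir(spool_dir))
```

The rename is what makes a second pass skip the file, which also shows why reusing a file name in the spooling directory causes trouble: the renamed target may already exist.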

4.3 Configuration File

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir=/home/jeffery/flume

# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix of the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Directory rolling: start a new directory every minute
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File rolling
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
# Store the data as a plain data stream
a1.sinks.k1.hdfs.fileType=DataStream

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

4.4 Start Command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/execsource-hdfssink.conf -Dflume.root.logger=INFO,console

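5. Case 4: Multiple Real-Time Logs -> Console

5.1 Requirement

Monitor several files in real time without using Exec Source, and print the data to the console with Logger Sink.

5.2 Required Components

5.2.1 Taildir Source

Taildir Source watches files for newly appended lines in near real time and records the tail position of each file in a JSON file. Even if the agent dies, the source resumes tailing from the last recorded position after a restart. Editing the values in the position file changes where the source resumes reading. If the position file is lost, the source starts over from the first line of every file, so data is read again.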

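Required properties:

Property	Default	Description
type	–	Must be `TAILDIR`
filegroups	–	Space-separated list of file group names
filegroups.<filegroupName>	–	Path of the files in the group; one group can cover several files

Optional properties:

Property	Default	Description
positionFile	~/.flume/taildir_position.json	Path of the position file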
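The resume-from-position idea can be sketched as follows. This Python example is only an approximation of the mechanism: the helper names and the reduced JSON layout (`file` and `pos` only) are assumptions made for illustration, and Flume's real position file also tracks inodes, which the sketch leaves out.

```python
import json
import os
import tempfile


def read_new_lines(path, positions):
    """Returns lines appended since the recorded offset and updates the offset."""
    with open(path, "rb") as handle:
        handle.seek(positions.get(path, 0))
        lines = [raw.decode("utf-8").rstrip("\n") for raw in handle.readlines()]
        positions[path] = handle.tell()
    return lines


def save_positions(position_file, positions):
    """Persists offsets as JSON so a restarted reader can resume."""
    with open(position_file, "w", encoding="utf-8") as handle:
        json.dump([{"file": p, "pos": o} for p, o in positions.items()], handle)


def load_positions(position_file):
    """Loads saved offsets; a missing file means reading starts from the top."""
    if not os.path.exists(position_file):
        return {}
    with open(position_file, encoding="utf-8") as handle:
        return {entry["file"]: entry["pos"] for entry in json.load(handle)}


workdir = tempfile.mkdtemp()
log = os.path.join(workdir, "a.txt")
pos_file = os.path.join(workdir, "taildir_position.json")

with open(log, "wb") as handle:
    handle.write(b"line1\n")
positions = load_positions(pos_file)          # no position file yet
first = read_new_lines(log, positions)
save_positions(pos_file, positions)

with open(log, "ab") as handle:
    handle.write(b"line2\n")
second = read_new_lines(log, load_positions(pos_file))   # simulated restart
replay = read_new_lines(log, {})                          # position file lost
print(first, second, replay)
```

The last call shows the duplicate-read case from the text: without the saved offsets, both lines are read again from the start.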

5.3 Configuration File

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/atguigu/a.txt
a1.sources.r1.filegroups.f2 = /home/atguigu/b.txt
a1.sources.r1.positionFile=/home/atguigu/taildir_position.json
# Configure the sink
a1.sinks.k1.type = logger

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.4 Start Command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/taildirSource-loggersink.conf -Dflume.root.logger=INFO,console

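6. Complex Case 1: Real-Time Log -> Local Disk + HDFS

6.1 Requirement

Agent1: Exec Source -> 2 Memory Channels (each receiving the same data) -> 2 Avro Sinks

Agent2: Avro Source -> Memory Channel -> HDFS Sink

Agent3: Avro Source -> Memory Channel -> File Roll Sink

The topology resembles the corresponding flow diagram on the official site.

6.2 Required Components

Avro Sink and Avro Source are used as a pair.

6.2.1 Avro Sink

Avro Sink sends events, serialized in Avro format, to a given process on another machine.

Property	Default	Description
type	–	Must be `avro`
hostname	–	Host name that the receiving source is bound to
port	–	Port that the receiving source is bound to

6.2.2 Avro Source

Avro Source reads Avro-formatted data and deserializes it into event objects. When it starts, it binds an RPC port on which it accepts the data sent by an Avro Sink.

Property	Default	Description
type	–	Must be `avro`
bind	–	Host name or IP address to bind to
port	–	Port to bind to

6.2.3 File Roll Sink

Writes events to the local disk. Once data is being written to the directory, the files are rolled automatically.

Property	Default	Description
type	–	Must be `file_roll`
sink.directory	–	Directory the data is written to

Optional:

sink.rollInterval=30: how often, in seconds, the file is rolled.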

6.3 Configuration

6.3.1 Agent3

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 1234

# Configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory=/home/atguigu/flume
a1.sinks.k1.sink.rollInterval=600

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command (hadoop104; start Agent3 first)

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example1-agent3.conf -Dflume.root.logger=INFO,console

6.3.2 Agent2

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 12345

# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H%M
# Prefix of the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Directory rolling: start a new directory every minute
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File rolling
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
# Store the data as a plain data stream
a1.sinks.k1.hdfs.fileType=DataStream

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command (hadoop102)

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example1-agent2.conf -Dflume.root.logger=INFO,console

6.3.3 Agent1

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Use the replicating channel selector, which selects every channel and copies each event to all of them.
# It is the default, so this line can be omitted.
#a1.sources.r1.selector.type = replicating
# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command=tail -f /tmp/atguigu/hive.log

# Configure the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=hadoop102
a1.sinks.k1.port=12345

a1.sinks.k2.type = avro
a1.sinks.k2.hostname=hadoop104
a1.sinks.k2.port=1234

# Configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Start command (hadoop103)

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example1-agent1.conf -Dflume.root.logger=INFO,console

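7. Complex Case 2: Using Sink Processors

7.1 Requirement

Sink processors apply when several sinks pull data from the same channel.

Agent1: NetCat Source -> Memory Channel -> 2 Avro Sinks (hadoop102)

Agent2: Avro Source -> Memory Channel -> Logger Sink (hadoop103)

Agent3: Avro Source -> Memory Channel -> Logger Sink (hadoop104)

7.2 Required Components

7.2.1 Default Sink Processor

If an agent has only one sink, the Default Sink Processor is used, and there is no need to configure a sink processor or a sink group explicitly.

7.2.2 Failover Sink Processor

The Failover Sink Processor maintains a prioritized group of sinks and by default hands the data to the sink with the highest priority (the larger the value, the higher the priority). A failed sink is put into a pool to cool down for a while. Once it recovers it rejoins the live pool, and the highest-priority sink in the live pool takes over the work.

Configuration:

Property	Default	Description
sinks	–	Space-separated list of the sinks in the group
processor.type	`default`	Must be `failover`
processor.priority.<sinkName>	–	Priority value of the sink
processor.maxpenalty	30000	Maximum backoff period for a failed sink, in milliseconds

Example: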

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
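The selection rule itself is simple enough to sketch in Python. This is a simplified illustration rather than Flume's code: `pick_failover_sink` is a hypothetical function, and the cool-down timing governed by maxpenalty is reduced to a plain set of currently failed sinks.

```python
def pick_failover_sink(priorities, failed):
    """Returns the live sink with the largest priority value, or None."""
    live = {name: prio for name, prio in priorities.items() if name not in failed}
    if not live:
        return None
    return max(live, key=live.get)


priorities = {"k1": 5, "k2": 10}              # mirrors the example configuration
print(pick_failover_sink(priorities, failed=set()))      # k2
print(pick_failover_sink(priorities, failed={"k2"}))     # k1
print(pick_failover_sink(priorities, failed={"k1", "k2"}))  # None
```

With the priorities from the example, k2 does all the work while it is healthy, k1 takes over only while k2 is cooling down, and k2 resumes once it returns to the live pool.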

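7.2.3 Load Balancing Sink Processor

The Load Balancing Sink Processor balances load across several active sinks (the sinks take turns doing the work), using either the `round_robin` or the `random` algorithm.

Configuration:

Property	Default	Description
sinks	–	Space-separated list of the sinks in the group
processor.type	`default`	Must be `load_balance`
processor.backoff	false	Whether failed sinks are backed off exponentially
processor.selector	`round_robin`	`round_robin`, `random` or FQCN of custom class that inherits from `AbstractSinkSelector`

Example: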

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
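The two selection strategies differ only in how the next sink is chosen. The Python sketch below illustrates that difference and nothing more: `round_robin_selector` and `random_selector` are hypothetical generators, and the backoff of failed sinks is not modelled.

```python
import itertools
import random


def round_robin_selector(sinks):
    """Yields sinks in a fixed rotating order."""
    return itertools.cycle(sinks)


def random_selector(sinks, seed=None):
    """Yields a uniformly random sink on every step."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(sinks)


rotation = round_robin_selector(["k1", "k2"])
first_four = [next(rotation) for _ in range(4)]
print(first_four)  # ['k1', 'k2', 'k1', 'k2']

picks = random_selector(["k1", "k2"], seed=0)
print([next(picks) for _ in range(4)])
```

Round robin spreads work evenly and predictably, while random selection only evens out over many batches, which is worth keeping in mind when a short test run seems to favour one sink.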

7.3 Failover Case

7.3.1 Agent1

NetCat Source -> Memory Channel -> 2 Avro Sinks (hadoop102)

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# Configure the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Configure the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=hadoop103
a1.sinks.k1.port=12345

a1.sinks.k2.type = avro
a1.sinks.k2.hostname=hadoop104
a1.sinks.k2.port=1234

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Start command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example2-agent1.conf -Dflume.root.logger=INFO,console

7.3.2 Agent2

Agent2: Avro Source -> Memory Channel -> Logger Sink (hadoop103)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 12345

# Configure the sink
a1.sinks.k1.type = logger

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example2-agent2.conf -Dflume.root.logger=INFO,console

7.3.3 Agent3

Agent3: Avro Source -> Memory Channel -> Logger Sink (hadoop104)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 1234

# Configure the sink
a1.sinks.k1.type = logger

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example2-agent3.conf -Dflume.root.logger=INFO,console

7.4 Load Balancing Case

Only the sink processor of Agent1 changes; the rest of the configuration stays the same.

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# Configure the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = random


# Configure the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=hadoop103
a1.sinks.k1.port=12345

a1.sinks.k2.type = avro
a1.sinks.k2.hostname=hadoop104
a1.sinks.k2.port=1234

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Start command

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example2-agent4.conf -Dflume.root.logger=INFO,console

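8. Complex Case 3: Using an Interceptor with the Multiplexing Channel Selector

8.1 Requirement

This case mainly demonstrates the Multiplexing Channel Selector.

On hadoop102:

(1) agent1 (NetCat Source -> Memory Channel -> Avro Sink)

(2) agent2 (Exec Source -> Memory Channel -> Avro Sink)

On hadoop103:

agent3 (Avro Source -> 2 Memory Channels -> 2 sinks: Logger Sink and HDFS Sink)

Logger Sink writes only the data that comes from agent1, and HDFS Sink writes only the data that comes from agent2.
The case relies on an interceptor and the Multiplexing Channel Selector: in agent1 and agent2 an interceptor marks the origin of the data in the event header, and in agent3 the Multiplexing Channel Selector routes events according to that origin.
[Figure: processing flow of Complex Case 3]

8.2 Required Components

8.2.1 Multiplexing Channel Selector

The Multiplexing Channel Selector sorts events into different channels.

How it sorts: it reads the value of a configured key in the event header and uses the mapping for that value to decide which channel or channels receive the event.

Configuration:

Property	Default	Description
selector.type	replicating	Must be set to `multiplexing`
selector.header	flume.selector.header	Name of the event header whose value is read
selector.default	–	Channel(s) used when no mapping matches
selector.mapping.*	–	Custom mapping rules from header value to channel(s)

Example: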

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
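The routing rule expressed by this configuration can be sketched in Python. It is an illustration only: `select_channels` is a hypothetical function, and the mapping below simply mirrors the example above.

```python
def select_channels(headers, header_name, mapping, default):
    """Returns the channels mapped to the header's value, else the defaults."""
    value = headers.get(header_name)
    return mapping.get(value, default)


mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}   # mirrors the example configuration
print(select_channels({"state": "CZ"}, "state", mapping, ["c4"]))  # ['c1']
print(select_channels({"state": "US"}, "state", mapping, ["c4"]))  # ['c2', 'c3']
print(select_channels({"state": "FR"}, "state", mapping, ["c4"]))  # ['c4']
print(select_channels({}, "state", mapping, ["c4"]))               # ['c4']
```

An event whose header is missing or carries an unmapped value falls through to the default channel, so leaving `selector.default` unset means such events have nowhere to go.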

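8.2.2 Static Interceptor

The Static Interceptor lets you add a static key-value pair to the header of every event.

Property	Default	Description
type	–	Must be `static`
preserveExisting	true	If configured header already exists, should it be preserved - true or false
key	key	Name of the header key
value	value	Value stored under that key

Example: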

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels =  c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
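The effect of preserveExisting is easiest to see side by side. The Python sketch below models only the header logic and is not Flume's implementation: `static_intercept` is a hypothetical function that works on plain dictionaries.

```python
def static_intercept(headers, key, value, preserve_existing=True):
    """Returns a copy of headers with key=value added, honoring preserve_existing."""
    result = dict(headers)
    if preserve_existing and key in result:
        return result                          # keep the value already present
    result[key] = value
    return result


print(static_intercept({}, "datacenter", "NEW_YORK"))
print(static_intercept({"datacenter": "BOSTON"}, "datacenter", "NEW_YORK"))
print(static_intercept({"datacenter": "BOSTON"}, "datacenter", "NEW_YORK",
                       preserve_existing=False))
```

With the default setting, an event that already carries the header keeps its original value, which matters when events pass through several agents that each apply a static interceptor with the same key.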

8.3 Configuration Files

8.3.1 agent1 (hadoop102)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# Configure the interceptors
# Interceptor i1 is the one this case relies on
# Interceptor i2 demonstrates the timestamp interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = mykey
a1.sources.r1.interceptors.i1.value = agent1
a1.sources.r1.interceptors.i2.type = timestamp

# Configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=hadoop103
a1.sinks.k1.port=12345

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command (start agent3 first)

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example3-agent1.conf -Dflume.root.logger=INFO,console

8.3.2 agent2 (hadoop102)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command=tail -f /home/atguigu/hello.txt

# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = mykey
a1.sources.r1.interceptors.i1.value = agent2

# Configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=hadoop103
a1.sinks.k1.port=12345

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command (start agent3 first)

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example3-agent2.conf -Dflume.root.logger=INFO,console

8.3.3 agent3 (hadoop103)

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 12345

# Configure the channel selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = mykey
a1.sources.r1.selector.mapping.agent1 = c2
a1.sources.r1.selector.mapping.agent2 = c1

# Configure sink k2 (logger)
a1.sinks.k2.type = logger

# Configure sink k1 (HDFS)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop103:9000/flume/%Y%m%d/%H%M
# Prefix of the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Directory rolling: start a new directory every minute
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File rolling
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
# Store the data as a plain data stream
a1.sinks.k1.hdfs.fileType=DataStream

# Configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000

# Bind and connect the components
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Start command (start agent3 first)

bin/flume-ng agent -c conf/ -n a1 -f flumeagents/example3-agent3.conf -Dflume.root.logger=INFO,console
