flume.apache.org
Flume Concepts
Flume is a distributed tool for efficiently collecting, aggregating, and moving large volumes of log data. It provides reliable failover and recovery mechanisms and is highly fault tolerant.
Flume has two release lines, Flume OG and Flume NG; this guide uses
apache-flume-1.9.0-bin.tar.gz.
Flume Architecture
An Agent is the smallest unit of log collection; a Flume pipeline is assembled by chaining one or more Agents.
Within an agent, the source pulls the raw log stream from a web server; interceptors attached to the source filter and decorate each Event; a channel selector then replicates or routes events into the channels; sinks, organized into sink groups (for load balancing or failover), drain the channels and finally deliver the data to a Kafka cluster or the HDFS file system.
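Schematically, the flow through a single agent is:
web server -> source (interceptors, channel selector) -> channel(s) -> sink group / sinks -> Kafka or HDFS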
Flume Installation
1. Make sure JDK 1.8 runs correctly and the JAVA_HOME environment variable is set.
2. Install Flume: download it from flume.apache.org (Download link in the left sidebar).
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/soft/
cd /usr/soft/apache-flume-1.9.0-bin/
# verify the installation by running ./bin/flume-ng version
Flume 1.9.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: d4fcab4f501d41597bc616921329a4339f73585e
Compiled by fszabo on Mon Dec 17 20:45:25 CET 2018
From source with checksum 35db629a3bda49d23e9b3690c80737f9
Agent Configuration Template
# declare the components
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# configure the components
<Agent>.sources.<Source>.<someProperty> = <someValue>
<Agent>.channels.<Channel>.<someProperty> = <someValue>
<Agent>.sinks.<Sink>.<someProperty> = <someValue>
# wire the components together
<Agent>.sources.<Source>.channels = <Channel1> <Channel2>
<Agent>.sinks.<Sink>.channel = <Channel1>
You should know this template structure by heart; it makes later lookup and configuration much easier.
A Simple Example
1. Create the Flume configuration file
e1.properties configures a single Agent; place the file in the conf directory of the Flume installation.
# declare the basic components
a1.sources = sr1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive data via netcat - look up the exact properties in the Flume User Guide
a1.sources.sr1.type = netcat
a1.sources.sr1.bind = centos
a1.sources.sr1.port = 44444
# configure the sink: print the data to the log console - generally used for testing and debugging
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
# the channel holds at most 1000 events
a1.channels.c1.capacity = 1000
# at most 100 events transferred per transaction
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.sr1.channels = c1
a1.sinks.sk1.channel = c1
Testing requires the netcat service; on Linux install it with:
yum -y install nmap-ncat
yum -y install telnet
The component property reference is at flume.apache.org: choose Documentation in the left sidebar, then Flume User Guide.
2. Start the a1 agent
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e1.properties -Dflume.root.logger=INFO,console
# agent: required subcommand
# --conf: location of the Flume configuration directory
# --name: the name of the agent to run
# --conf-file: the configuration file that defines the agent
# -Dflume.root.logger=INFO,console: print the logger sink's output to the console
For the full list of startup options, run:
[root@centos apache-flume-1.9.0-bin]# ./bin/flume-ng help
3. Test a1
[root@CentOS apache-flume-1.9.0-bin]# telnet CentOS 44444
Trying 192.168.52.134...
Connected to CentOS.
Escape character is '^]'.
hello world
2020-02-05 11:44:43,546 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO -
org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{}
body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }
Component Overview
Source - input
1. Avro Source
Starts an Avro server inside the agent; it accepts requests from Avro clients and stores the received data in the channel.
Avro is a transport protocol, similar in role to HTTP; only Avro-formatted traffic can be exchanged.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | component type; must be avro
bind | – | host name or IP address to listen on
port | – | port to bind and listen on
Create e2.properties in Flume's conf directory.
# declare the basic components
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive events from Avro clients
a1.sources.s1.type = avro
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Start a1 with e2.properties:
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e2.properties -Dflume.root.logger=INFO,console
Because an avro source is configured, the sender must be an Avro client. You can run:
[root@centos apache-flume-1.9.0-bin]# ./bin/flume-ng help
to look up the parameters the avro-client subcommand takes, then execute:
[root@centos apache-flume-1.9.0-bin]# ./bin/flume-ng avro-client --host centos --port 44444 --filename /root/t_user
2.Exec Source
Captures a command's console (stdout) output as events.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | must be exec
command | – | the command to execute
Create e3.properties in Flume's conf directory.
# declare the basic components
a1.sources = sr1
a1.sinks = sk1
a1.channels = c1
# configure the source: capture the output of a command
a1.sources.sr1.type = exec
a1.sources.sr1.command = tail -f /root/t_user
# configure the sink
a1.sinks.sk1.type = logger
# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.sr1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e3.properties -Dflume.root.logger=INFO,console
Note: this source type captures a command's output continuously, but each time the agent starts, the command runs again from scratch, so incremental collection is not possible (see the Taildir source below for that).
3.Spooling Directory Source
Ingests text files newly added to a static directory. Once a file is fully ingested it is renamed (a suffix is appended) rather than deleted; if you want files handled differently, override the source's default behavior.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | must be spooldir
spoolDir | – | the directory to watch
fileSuffix | .COMPLETED | suffix appended to fully ingested files
deletePolicy | never | when to delete ingested files: never or immediate
includePattern | ^.*$ | regex for files to include (default matches all files)
ignorePattern | ^$ | regex for files to ignore
# declare the basic components (Source, Channel, Sink) - e4.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: ingest new files from the spooling directory
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /root/spooldir
a1.sources.s1.fileHeader = true
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Start a1:
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e4.properties -Dflume.root.logger=INFO,console
4.Taildir Source
Watches a set of text files for appended lines in near real time and records each file's read offset, so the next run resumes where the previous one stopped (incremental collection).
positionFile records the read positions; if you do not want incremental collection, simply delete the file under ~/.flume/.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | must be TAILDIR
filegroups | – | space-separated list of file group names
filegroups.<filegroupName> | – | absolute path of the file group; regular expressions (not file-system globs) may be used only in the file name
positionFile | ~/.flume/taildir_position.json | JSON file recording each file's read offset, enabling incremental collection
Create e5.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e5.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: tail the files matched by the file groups
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = g1 g2
a1.sources.s1.filegroups.g1 = /root/taildir/.*\.log$
a1.sources.s1.filegroups.g2 = /root/taildir/.*\.java$
# add a header to every event collected from group g1
a1.sources.s1.headers.g1.type = log
# add a header to every event collected from group g2
a1.sources.s1.headers.g2.type = java
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Start a1:
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e5.properties -Dflume.root.logger=INFO,console
5.Kafka Source
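Consumes messages from one or more Kafka topics, with the agent joining Kafka as a consumer group. A minimal sketch follows (the broker address CentOS:9092, topic topic01, and group id g1 are placeholders); the Avro Sink example below uses the same source in a complete configuration:
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.kafka.bootstrap.servers = CentOS:9092
a1.sources.s1.kafka.topics = topic01
a1.sources.s1.kafka.consumer.group.id = g1
# maximum number of messages written to the channel in one batch
a1.sources.s1.batchSize = 100
# maximum time in ms before a batch is committed even if batchSize was not reached
a1.sources.s1.batchDurationMillis = 2000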
Sink - output
1.Logger Sink
Typically used for testing and debugging.
2.File Roll Sink
Writes the collected data to files on the local file system.
Property | Default | Description
---|---|---
channel | – | the channel to drain
type | – | must be file_roll
sink.directory | – | directory where the collected data is stored
sink.rollInterval | 30 | seconds before rolling to a new file; 0 means never roll
Create e6.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e6.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll
a1.sinks.sk1.sink.rollInterval = 0
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e6.properties
3.HDFS Sink
Writes data to the HDFS file system.
Property | Default | Description
---|---|---
channel | – | the channel to drain
type | – | must be hdfs
hdfs.path | – | destination path in HDFS, e.g. hdfs://namenode/flume/webdata/
Create e7.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e7.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: write received events to HDFS
a1.sinks.sk1.type = hdfs
# without a scheme, the path resolves against this machine's default HDFS;
# if HDFS is not local, copy the hadoop directory here and set the Hadoop environment variables
a1.sinks.sk1.hdfs.path = /flume-hdfs/%y-%m-%d
a1.sinks.sk1.hdfs.rollInterval = 0
a1.sinks.sk1.hdfs.rollSize = 0
a1.sinks.sk1.hdfs.rollCount = 0
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
a1.sinks.sk1.hdfs.fileType = DataStream
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
4.Kafka Sink
Writes data into a Kafka topic.
Create e8.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e8.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: publish received events to a Kafka topic
a1.sinks.sk1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sk1.kafka.bootstrap.servers = CentOS:9092
a1.sinks.sk1.kafka.topic = topic01
a1.sinks.sk1.kafka.flumeBatchSize = 20
a1.sinks.sk1.kafka.producer.acks = 1
a1.sinks.sk1.kafka.producer.linger.ms = 1
a1.sinks.sk1.kafka.producer.compression.type = snappy
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
5.Avro Sink
Sends data to an Avro Source.
An avro sink acts as an Avro client: it forwards the events it drains to another agent's avro source, which is how agents are chained together.
# declare the basic components (Source, Channel, Sink) - e9.properties, agent a1
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: consume messages from the Kafka topic
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.batchSize = 100
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = CentOS:9092
a1.sources.s1.kafka.topics = topic01
a1.sources.s1.kafka.consumer.group.id = g1
# configure the sink: forward events to agent a2's Avro source
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = CentOS
a1.sinks.sk1.port = 44444
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
# declare the basic components (Source, Channel, Sink) - e9.properties, agent a2
a2.sources = s1
a2.sinks = sk1
a2.channels = c1
# configure the source: receive events from agent a1's Avro sink
a2.sources.s1.type = avro
a2.sources.s1.bind = CentOS
a2.sources.s1.port = 44444
# configure the sink: write received events to local files
a2.sinks.sk1.type = file_roll
a2.sinks.sk1.sink.directory = /root/file_roll
a2.sinks.sk1.sink.rollInterval = 0
# configure the channel, which buffers the data
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# wire the components together
a2.sources.s1.channels = c1
a2.sinks.sk1.channel = c1
Start the receiving agent a2 first, then a1, then produce test messages into topic01:
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/e9.properties --name a2
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/e9.properties --name a1
[root@CentOS kafka_2.11-2.2.0]# ./bin/kafka-console-producer.sh --broker-list CentOS:9092 --topic topic01
Channel - buffer
1.Memory Channel
Fast: data from the source is written directly to memory. Not durable, so events may be lost if the agent fails.
Parameter | Default | Description
---|---|---
type | – | must be memory
capacity | 100 | maximum number of events stored in the channel
transactionCapacity | 100 | maximum number of events per transaction (one source put or one sink take); must be <= capacity
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
2.JDBC Channel
Events are stored in a database, giving durable, transactional storage. The channel embeds the Derby database; it is a persistent channel, ideal for flows where recoverability matters. Use the JDBC channel for data you cannot afford to lose.
a1.channels.c1.type = jdbc
3.Kafka Channel
Stores the data collected by the source in an external Kafka cluster; toward Kafka the channel effectively acts as a consumer group.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = centos:9092
a1.channels.c1.kafka.topic = topic_channel
a1.channels.c1.kafka.consumer.group.id = g1
4.File Channel
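Persists events on local disk using data files plus a checkpoint: slower than the memory channel, but durable across agent restarts.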
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data
Advanced Components
1. Interceptors
Interceptors attach to the source and decorate or filter the Events the source assembles (an Event consists of an Event header and an Event body). Flume ships with many built-in interceptors.
Timestamp Interceptor: decorating; adds timestamp information to the Event header.
Host Interceptor: decorating; adds host information to the Event header.
Static Interceptor: decorating; adds a user-defined key/value pair to the Event header.
Remove Header Interceptor: decorating; removes the specified key from the Event header.
UUID Interceptor: decorating; adds a random, unique UUID string to the Event header.
Search and Replace Interceptor: decorating; searches the Event body and replaces the matching content.
Regex Filtering Interceptor: filtering; keeps or drops events whose body matches a regular expression.
Property | Default | Description
---|---|---
type | – | must be regex_filter
regex | – | the regular expression to match against the event body
excludeEvents | false | false: keep only matching events; true: drop matching events
Regex Extractor Interceptor: decorating; searches the Event body and adds the matched groups to the Event header.
Property | Default | Description
---|---|---
type | – | must be regex_extractor
regex | – | the regular expression to search for, e.g. ^(INFO|ERROR)
serializers | – | names the header key(s) under which the extracted groups are stored
Example
# declare the basic components (Source, Channel, Sink) - example11.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = centos
a1.sources.s1.port = 44444
# add the interceptors
a1.sources.s1.interceptors = i1 i2 i3 i4 i5 i6
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i3.type = static
# user-defined key/value pair added to the event header
a1.sources.s1.interceptors.i3.key = from
a1.sources.s1.interceptors.i3.value = baizhi
a1.sources.s1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.s1.interceptors.i4.headerName = uuid
a1.sources.s1.interceptors.i5.type = remove_header
a1.sources.s1.interceptors.i5.withName = from
# search the event body and replace the matching content
a1.sources.s1.interceptors.i6.type = search_replace
a1.sources.s1.interceptors.i6.searchPattern = ^jiangzz
a1.sources.s1.interceptors.i6.replaceString = baizhi
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Dissecting the Regex Filtering Interceptor and the Regex Extractor Interceptor (filtering on the Event body and enriching the Event header):
# declare the basic components (Source, Channel, Sink) - example12.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = centos
a1.sources.s1.port = 44444
# add the interceptors
a1.sources.s1.interceptors = i1 i2
# extract matching content from the event body into the event header under the key loglevel
a1.sources.s1.interceptors.i1.type = regex_extractor
a1.sources.s1.interceptors.i1.regex = ^(INFO|ERROR)
a1.sources.s1.interceptors.i1.serializers = s1
a1.sources.s1.interceptors.i1.serializers.s1.name = loglevel
# keep only events that contain 'baizhi'
a1.sources.s1.interceptors.i2.type = regex_filter
a1.sources.s1.interceptors.i2.regex = .*baizhi.*
a1.sources.s1.interceptors.i2.excludeEvents = false
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
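With this configuration, a telnet line such as INFO this is baizhi arrives at the sink with the header loglevel=INFO, while a line that does not contain baizhi is dropped by interceptor i2.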
2. Channel Selectors
When one source is connected to several channels, the channel selector decides which channel each event from the source enters. If you do not specify a selector, the source broadcasts its data to all of its channels (the default replicating selector).
replicating
Configuration:
# declare the basic components (Source, Channel, Sink) - example13.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1 c2
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = centos
a1.sources.s1.port = 44444
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
# configure the channels, which buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = jdbc
# wire the components together
a1.sources.s1.channels = c1 c2
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c2
Note: if Hive is already installed, the derby-10.14.1.0.jar under Hive's lib directory conflicts with the derby-10.14.1.0.jar in Flume's lib directory; remove the jar from Flume's lib to resolve it.
If you do not specify a selector type explicitly, the replicating (broadcast) selector is used by default.
Equivalent configuration:
# channel selector: replicating mode
a1.sources.s1.selector.type = replicating
a1.sources.s1.channels = c1 c2
Multiplexing
This selector splits the source's data across different channels based on an event header value, so events ultimately reach different sink groups.
Configuration file:
# declare the basic components (Source, Channel, Sink) - example15.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1 c2
# channel selector: multiplexing mode
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.channels = c1 c2
# route on the event header key 'level', whose values here are INFO and ERROR
a1.sources.s1.selector.header = level
a1.sources.s1.selector.mapping.INFO = c1
a1.sources.s1.selector.mapping.ERROR = c2
a1.sources.s1.selector.default = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# interceptor that decorates the event header
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_extractor
a1.sources.s1.interceptors.i1.regex = ^(INFO|ERROR)
a1.sources.s1.interceptors.i1.serializers = s1
a1.sources.s1.interceptors.i1.serializers.s1.name = level
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
# configure the channels, which buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = jdbc
# wire the components together
a1.sources.s1.channels = c1 c2
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c2
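With this configuration, a line starting with INFO is routed to c1 (and ends up in /root/file_roll_1), a line starting with ERROR goes to c2 (/root/file_roll_2), and anything else falls back to the default channel c1.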
3. Sink Processors
Flume uses a sink group to wrap several sink instances into one logical sink; internally, a sink processor provides the group's load balancing and failover.
Load Balancing Sink Processor
# declare the basic components (Source, Channel, Sink) - example16.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk1.sink.batchSize = 1
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
a1.sinks.sk2.sink.batchSize = 1
# configure the sink processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = sk1 sk2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c1
To observe the load-balancing effect, sink.batchSize and transactionCapacity must both be set to 1.
Failover Sink Processor
# declare the basic components (Source, Channel, Sink) - example17.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk1.sink.batchSize = 1
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
a1.sinks.sk2.sink.batchSize = 1
# configure the sink processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = sk1 sk2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.sk1 = 20
a1.sinkgroups.g1.processor.priority.sk2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c1
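With this configuration, sk1 (priority 20) receives all events while it is healthy; if it fails, the processor fails over to sk2 (priority 10) and periodically retries sk1, with maxpenalty capping the retry backoff at 10000 ms.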
Flume Application Integration API
Native API integration: the Flume SDK
When integrating an application directly, the receiving agent must expose an avro source.
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
Create a class under the test directory:
package com.baizhi;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;

public class RpcClientTest {
    private RpcClient client;

    @Before
    public void before() {
        // connect to the agent's avro source
        client = RpcClientFactory.getDefaultInstance("centos", 44444);
    }

    @Test
    public void testClient() throws EventDeliveryException {
        // build an event with a body and a custom header, then send it
        Event event = EventBuilder.withBody("this is a demo".getBytes());
        Map<String, String> map = new HashMap<String, String>();
        map.put("from", "world");
        event.setHeaders(map);
        client.append(event);
    }

    @After
    public void after() {
        client.close();
    }
}
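To try it, start an agent whose avro source listens on centos:44444 (for example the e2.properties agent above), run the test, and the event should appear at that agent's sink.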
Failover client configuration: replace the body of before() above with the following (add import java.util.Properties at the top of the class):
// Setup properties for the failover
Properties props = new Properties();
props.put("client.type", "default_failover");
// List of hosts (space-separated list of user-chosen host aliases)
props.put("hosts", "h1 h2 h3");
// host/port pair for each host alias
String host1 = "host1.example.org:41414";
String host2 = "host2.example.org:41414";
String host3 = "host3.example.org:41414";
props.put("hosts.h1", host1);
props.put("hosts.h2", host2);
props.put("hosts.h3", host3);
props.put("host-selector", "random"); // For random host selection
// props.put("host-selector", "round_robin"); // For round-robin host
// // selection
props.put("backoff", "true"); // Disabled by default.
props.put("maxBackoff", "10000"); // Defaults 0, which effectively
// becomes 30000 ms
// create the client with failover properties
RpcClient client = RpcClientFactory.getInstance(props);
Log4j integration
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.5</version>
</dependency>
Standalone configuration:
log4j.rootLogger=DEBUG,FLUME
log4j.appender.FLUME=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.FLUME.Hostname = 192.168.40.129
log4j.appender.FLUME.Port = 44444
log4j.appender.FLUME.UnsafeMode = true
log4j.appender.FLUME.layout=org.apache.log4j.PatternLayout
log4j.appender.FLUME.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss} %c %m%n
Note that the appender name referenced by rootLogger (FLUME) must match the appender property keys exactly.
Load-balancing configuration:
log4j.rootLogger=DEBUG,FLUME
log4j.appender.FLUME=org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.FLUME.Hosts = 192.168.40.129:44444 ...
log4j.appender.FLUME.Selector = ROUND_ROBIN
log4j.appender.FLUME.UnsafeMode = true
log4j.appender.FLUME.layout=org.apache.log4j.PatternLayout
log4j.appender.FLUME.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss} %c %m%n
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TestLog4j {
    private static Log log = LogFactory.getLog(TestLog4j.class);

    public static void main(String[] args) {
        log.debug("hello_debug");
        log.info("hello_info");
        log.warn("hello_warn");
        log.error("hello_error");
    }
}
Spring Boot integration
Copy the com folder from springboot-flume.zip in the course materials into your project, then copy the provided logback.xml into the project's resources directory.
https://github.com/gilt/logback-flume-appender
Add the corresponding appender to logback.xml; the appender configuration can be looked up at the URL above.