flume.apache.org
Flume Concepts
Flume is a distributed tool for efficiently collecting, aggregating, and moving large volumes of log data. It provides reliable failover and recovery mechanisms and is highly fault tolerant.
Flume has two release lines, Flume OG and Flume NG; this guide uses
apache-flume-1.9.0-bin.tar.gz.
Flume Architecture
An Agent is the smallest unit of log collection; a Flume pipeline is assembled by chaining one or more Agents.
Within an agent, the source pulls the raw log stream from a web server; interceptors attached to the source filter and decorate each Event; a channel selector then replicates or routes events into the channels; sinks, organized into sink groups (for load balancing or failover), drain the channels and finally deliver the data to a Kafka cluster or the HDFS file system.
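Schematically, the flow through a single agent is:
web server -> source (interceptors, channel selector) -> channel(s) -> sink group / sinks -> Kafka or HDFS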
Flume Installation
1. Make sure JDK 1.8 runs correctly and the JAVA_HOME environment variable is set.
2. Install Flume: download it from flume.apache.org (Download link in the left sidebar).
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/soft/
cd /usr/soft/apache-flume-1.9.0-bin/
# verify the installation by running ./bin/flume-ng version
Flume 1.9.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: d4fcab4f501d41597bc616921329a4339f73585e
Compiled by fszabo on Mon Dec 17 20:45:25 CET 2018
From source with checksum 35db629a3bda49d23e9b3690c80737f9
Agent Configuration Template
# declare the components
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# configure the components
<Agent>.sources.<Source>.<someProperty> = <someValue>
<Agent>.channels.<Channel>.<someProperty> = <someValue>
<Agent>.sinks.<Sink>.<someProperty> = <someValue>
# wire the components together
<Agent>.sources.<Source>.channels = <Channel1> <Channel2>
<Agent>.sinks.<Sink>.channel = <Channel1>
You should know this template structure by heart; it makes later lookup and configuration much easier.
A Simple Example
1. Create the Flume configuration file
e1.properties configures a single Agent; place the file in the conf directory of the Flume installation.
# declare the basic components
a1.sources = sr1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive data via netcat - look up the exact properties in the Flume User Guide
a1.sources.sr1.type = netcat
a1.sources.sr1.bind = centos
a1.sources.sr1.port = 44444
# configure the sink: print the data to the log console - generally used for testing and debugging
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
# the channel holds at most 1000 events
a1.channels.c1.capacity = 1000
# at most 100 events transferred per transaction
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.sr1.channels = c1
a1.sinks.sk1.channel = c1
Testing requires the netcat service; on Linux install it with:
yum -y install nmap-ncat
yum -y install telnet
The component property reference is at flume.apache.org: choose Documentation in the left sidebar, then Flume User Guide.
2. Start the a1 agent
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e1.properties -Dflume.root.logger=INFO,console
# agent: required subcommand
# --conf: location of the Flume configuration directory
# --name: the name of the agent to run
# --conf-file: the configuration file that defines the agent
# -Dflume.root.logger=INFO,console: print the logger sink's output to the console
For the full list of startup options, run:
[root@centos apache-flume-1.9.0-bin]# ./bin/flume-ng help
3. Test a1
[root@CentOS apache-flume-1.9.0-bin]# telnet CentOS 44444
Trying 192.168.52.134...
Connected to CentOS.
Escape character is '^]'.
hello world
2020-02-05 11:44:43,546 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO -
org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{}
body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }
Component Overview
Source - input
1. Avro Source
Starts an Avro server inside the agent; it accepts requests from Avro clients and stores the received data in the channel.
Avro is a transport protocol, similar in role to HTTP; only Avro-formatted traffic can be exchanged.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | component type; must be avro
bind | – | host name or IP address to listen on
port | – | port to bind and listen on
Create e2.properties in Flume's conf directory.
# declare the basic components
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive events from Avro clients
a1.sources.s1.type = avro
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Start a1 with e2.properties:
./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e2.properties -Dflume.root.logger=INFO,console
Because an avro source is configured, the sender must be an Avro client. You can run:
[root@centos apache-flume-1.9.0-bin]# ./bin/flume-ng help
to look up the parameters the avro-client subcommand takes, then execute:
[root@centos apache-flume-1.9.0-bin]# ./bin/flume-ng avro-client --host centos --port 44444 --filename /root/t_user
2.Exec Source
Captures a command's console (stdout) output as events.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | must be exec
command | – | the command to execute
Create e3.properties in Flume's conf directory.
# declare the basic components
a1.sources = sr1
a1.sinks = sk1
a1.channels = c1
# configure the source: capture the output of a command
a1.sources.sr1.type = exec
a1.sources.sr1.command = tail -f /root/t_user
# configure the sink
a1.sinks.sk1.type = logger
# configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.sr1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e3.properties -Dflume.root.logger=INFO,console
Note: this source type captures a command's output continuously, but each time the agent starts, the command runs again from scratch, so incremental collection is not possible (see the Taildir source below for that).
3.Spooling Directory Source
Ingests text files newly added to a static directory. Once a file is fully ingested it is renamed (a suffix is appended) rather than deleted; if you want files handled differently, override the source's default behavior.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | must be spooldir
spoolDir | – | the directory to watch
fileSuffix | .COMPLETED | suffix appended to fully ingested files
deletePolicy | never | when to delete ingested files: never or immediate
includePattern | ^.*$ | regex for files to include (default matches all files)
ignorePattern | ^$ | regex for files to ignore
# declare the basic components (Source, Channel, Sink) - e4.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: ingest new files from the spooling directory
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /root/spooldir
a1.sources.s1.fileHeader = true
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Start a1:
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e4.properties -Dflume.root.logger=INFO,console
4.Taildir Source
Watches a set of text files for appended lines in near real time and records each file's read offset, so the next run resumes where the previous one stopped (incremental collection).
positionFile records the read positions; if you do not want incremental collection, simply delete the file under ~/.flume/.
Property | Default | Description
---|---|---
channels | – | the channel(s) to attach
type | – | must be TAILDIR
filegroups | – | space-separated list of file group names
filegroups.<filegroupName> | – | absolute path of the file group; regular expressions (not file-system globs) may be used only in the file name
positionFile | ~/.flume/taildir_position.json | JSON file recording each file's read offset, enabling incremental collection
Create e5.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e5.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: tail the files matched by the file groups
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = g1 g2
a1.sources.s1.filegroups.g1 = /root/taildir/.*\.log$
a1.sources.s1.filegroups.g2 = /root/taildir/.*\.java$
# add a header to every event collected from group g1
a1.sources.s1.headers.g1.type = log
# add a header to every event collected from group g2
a1.sources.s1.headers.g2.type = java
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Start a1:
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e5.properties -Dflume.root.logger=INFO,console
5.Kafka Source
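Consumes messages from one or more Kafka topics, with the agent joining Kafka as a consumer group. A minimal sketch follows (the broker address CentOS:9092, topic topic01, and group id g1 are placeholders); the Avro Sink example below uses the same source in a complete configuration:
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.kafka.bootstrap.servers = CentOS:9092
a1.sources.s1.kafka.topics = topic01
a1.sources.s1.kafka.consumer.group.id = g1
# maximum number of messages written to the channel in one batch
a1.sources.s1.batchSize = 100
# maximum time in ms before a batch is committed even if batchSize was not reached
a1.sources.s1.batchDurationMillis = 2000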
Sink - output
1.Logger Sink
Typically used for testing and debugging.
2.File Roll Sink
Writes the collected data to files on the local file system.
Property | Default | Description
---|---|---
channel | – | the channel to drain
type | – | must be file_roll
sink.directory | – | directory where the collected data is stored
sink.rollInterval | 30 | seconds before rolling to a new file; 0 means never roll
Create e6.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e6.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll
a1.sinks.sk1.sink.rollInterval = 0
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/e6.properties
3.HDFS Sink
Writes data to the HDFS file system.
Property | Default | Description
---|---|---
channel | – | the channel to drain
type | – | must be hdfs
hdfs.path | – | destination path in HDFS, e.g. hdfs://namenode/flume/webdata/
Create e7.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e7.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: write received events to HDFS
a1.sinks.sk1.type = hdfs
# without a scheme, the path resolves against this machine's default HDFS;
# if HDFS is not local, copy the hadoop directory here and set the Hadoop environment variables
a1.sinks.sk1.hdfs.path = /flume-hdfs/%y-%m-%d
a1.sinks.sk1.hdfs.rollInterval = 0
a1.sinks.sk1.hdfs.rollSize = 0
a1.sinks.sk1.hdfs.rollCount = 0
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
a1.sinks.sk1.hdfs.fileType = DataStream
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
4.Kafka Sink
Writes data into a Kafka topic.
Create e8.properties in Flume's conf directory.
# declare the basic components (Source, Channel, Sink) - e8.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sink: publish received events to a Kafka topic
a1.sinks.sk1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sk1.kafka.bootstrap.servers = CentOS:9092
a1.sinks.sk1.kafka.topic = topic01
a1.sinks.sk1.kafka.flumeBatchSize = 20
a1.sinks.sk1.kafka.producer.acks = 1
a1.sinks.sk1.kafka.producer.linger.ms = 1
a1.sinks.sk1.kafka.producer.compression.type = snappy
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
5.Avro Sink
Sends data to an Avro Source.
An avro sink acts as an Avro client: it forwards the events it drains to another agent's avro source, which is how agents are chained together.
# declare the basic components (Source, Channel, Sink) - e9.properties, agent a1
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: consume messages from the Kafka topic
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.batchSize = 100
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = CentOS:9092
a1.sources.s1.kafka.topics = topic01
a1.sources.s1.kafka.consumer.group.id = g1
# configure the sink: forward events to agent a2's Avro source
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = CentOS
a1.sinks.sk1.port = 44444
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
# declare the basic components (Source, Channel, Sink) - e9.properties, agent a2
a2.sources = s1
a2.sinks = sk1
a2.channels = c1
# configure the source: receive events from agent a1's Avro sink
a2.sources.s1.type = avro
a2.sources.s1.bind = CentOS
a2.sources.s1.port = 44444
# configure the sink: write received events to local files
a2.sinks.sk1.type = file_roll
a2.sinks.sk1.sink.directory = /root/file_roll
a2.sinks.sk1.sink.rollInterval = 0
# configure the channel, which buffers the data
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# wire the components together
a2.sources.s1.channels = c1
a2.sinks.sk1.channel = c1
Start the receiving agent a2 first, then a1, then produce test messages into topic01:
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/e9.properties --name a2
[root@CentOS apache-flume-1.9.0-bin]# ./bin/flume-ng agent --conf conf/ --conf-file conf/e9.properties --name a1
[root@CentOS kafka_2.11-2.2.0]# ./bin/kafka-console-producer.sh --broker-list CentOS:9092 --topic topic01
Channel - buffer
1.Memory Channel
Fast: data from the source is written directly to memory. Not durable, so events may be lost if the agent fails.
Parameter | Default | Description
---|---|---
type | – | must be memory
capacity | 100 | maximum number of events stored in the channel
transactionCapacity | 100 | maximum number of events per transaction (one source put or one sink take); must be <= capacity
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
2.JDBC Channel
Events are stored in a database, giving durable, transactional storage. The channel embeds the Derby database; it is a persistent channel, ideal for flows where recoverability matters. Use the JDBC channel for data you cannot afford to lose.
a1.channels.c1.type = jdbc
3.Kafka Channel
Stores the data collected by the source in an external Kafka cluster; toward Kafka the channel effectively acts as a consumer group.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = centos:9092
a1.channels.c1.kafka.topic = topic_channel
a1.channels.c1.kafka.consumer.group.id = g1
4.File Channel
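Persists events on local disk using data files plus a checkpoint: slower than the memory channel, but durable across agent restarts.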
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data
Advanced Components
1. Interceptors
Interceptors attach to the source and decorate or filter the Events the source assembles (an Event consists of an Event header and an Event body). Flume ships with many built-in interceptors.
Timestamp Interceptor: decorating; adds timestamp information to the Event header.
Host Interceptor: decorating; adds host information to the Event header.
Static Interceptor: decorating; adds a user-defined key/value pair to the Event header.
Remove Header Interceptor: decorating; removes the specified key from the Event header.
UUID Interceptor: decorating; adds a random, unique UUID string to the Event header.
Search and Replace Interceptor: decorating; searches the Event body and replaces the matching content.
Regex Filtering Interceptor: filtering; keeps or drops events whose body matches a regular expression.
Property | Default | Description
---|---|---
type | – | must be regex_filter
regex | – | the regular expression to match against the event body
excludeEvents | false | false: keep only matching events; true: drop matching events
Regex Extractor Interceptor: decorating; searches the Event body and adds the matched groups to the Event header.
Property | Default | Description
---|---|---
type | – | must be regex_extractor
regex | – | the regular expression to search for, e.g. ^(INFO|ERROR)
serializers | – | names the header key(s) under which the extracted groups are stored
Example
# declare the basic components (Source, Channel, Sink) - example11.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = centos
a1.sources.s1.port = 44444
# add the interceptors
a1.sources.s1.interceptors = i1 i2 i3 i4 i5 i6
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i3.type = static
# user-defined key/value pair added to the event header
a1.sources.s1.interceptors.i3.key = from
a1.sources.s1.interceptors.i3.value = baizhi
a1.sources.s1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.s1.interceptors.i4.headerName = uuid
a1.sources.s1.interceptors.i5.type = remove_header
a1.sources.s1.interceptors.i5.withName = from
# search the event body and replace the matching content
a1.sources.s1.interceptors.i6.type = search_replace
a1.sources.s1.interceptors.i6.searchPattern = ^jiangzz
a1.sources.s1.interceptors.i6.replaceString = baizhi
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
Dissecting the Regex Filtering Interceptor and the Regex Extractor Interceptor (filtering on the Event body and enriching the Event header):
# declare the basic components (Source, Channel, Sink) - example12.properties
a1.sources = s1
a1.sinks = sk1
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = centos
a1.sources.s1.port = 44444
# add the interceptors
a1.sources.s1.interceptors = i1 i2
# extract matching content from the event body into the event header under the key loglevel
a1.sources.s1.interceptors.i1.type = regex_extractor
a1.sources.s1.interceptors.i1.regex = ^(INFO|ERROR)
a1.sources.s1.interceptors.i1.serializers = s1
a1.sources.s1.interceptors.i1.serializers.s1.name = loglevel
# keep only events that contain 'baizhi'
a1.sources.s1.interceptors.i2.type = regex_filter
a1.sources.s1.interceptors.i2.regex = .*baizhi.*
a1.sources.s1.interceptors.i2.excludeEvents = false
# configure the sink: print received events to the log console
a1.sinks.sk1.type = logger
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
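With this configuration, a telnet line such as INFO this is baizhi arrives at the sink with the header loglevel=INFO, while a line that does not contain baizhi is dropped by interceptor i2.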
2. Channel Selectors
When one source is connected to several channels, the channel selector decides which channel each event from the source enters. If you do not specify a selector, the source broadcasts its data to all of its channels (the default replicating selector).
replicating
Configuration:
# declare the basic components (Source, Channel, Sink) - example13.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1 c2
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = centos
a1.sources.s1.port = 44444
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
# configure the channels, which buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = jdbc
# wire the components together
a1.sources.s1.channels = c1 c2
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c2
Note: if Hive is already installed, the derby-10.14.1.0.jar under Hive's lib directory conflicts with the derby-10.14.1.0.jar in Flume's lib directory; remove the jar from Flume's lib to resolve it.
If you do not specify a selector type explicitly, the replicating (broadcast) selector is used by default.
Equivalent configuration:
# channel selector: replicating mode
a1.sources.s1.selector.type = replicating
a1.sources.s1.channels = c1 c2
Multiplexing
This selector splits the source's data across different channels based on an event header value, so events ultimately reach different sink groups.
Configuration file:
# declare the basic components (Source, Channel, Sink) - example15.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1 c2
# channel selector: multiplexing mode
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.channels = c1 c2
# route on the event header key 'level', whose values here are INFO and ERROR
a1.sources.s1.selector.header = level
a1.sources.s1.selector.mapping.INFO = c1
a1.sources.s1.selector.mapping.ERROR = c2
a1.sources.s1.selector.default = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# interceptor that decorates the event header
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_extractor
a1.sources.s1.interceptors.i1.regex = ^(INFO|ERROR)
a1.sources.s1.interceptors.i1.serializers = s1
a1.sources.s1.interceptors.i1.serializers.s1.name = level
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
# configure the channels, which buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = jdbc
# wire the components together
a1.sources.s1.channels = c1 c2
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c2
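With this configuration, a line starting with INFO is routed to c1 (and ends up in /root/file_roll_1), a line starting with ERROR goes to c2 (/root/file_roll_2), and anything else falls back to the default channel c1.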
3. Sink Processors
Flume uses a sink group to wrap several sink instances into one logical sink; internally, a sink processor provides the group's load balancing and failover.
Load Balancing Sink Processor
# declare the basic components (Source, Channel, Sink) - example16.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk1.sink.batchSize = 1
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
a1.sinks.sk2.sink.batchSize = 1
# configure the sink processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = sk1 sk2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c1
To observe the load-balancing effect, sink.batchSize and transactionCapacity must both be set to 1.
Failover Sink Processor
# declare the basic components (Source, Channel, Sink) - example17.properties
a1.sources = s1
a1.sinks = sk1 sk2
a1.channels = c1
# configure the source: receive text data over a TCP socket (netcat)
a1.sources.s1.type = netcat
a1.sources.s1.bind = CentOS
a1.sources.s1.port = 44444
# configure the sinks: write received events to local files
a1.sinks.sk1.type = file_roll
a1.sinks.sk1.sink.directory = /root/file_roll_1
a1.sinks.sk1.sink.rollInterval = 0
a1.sinks.sk1.sink.batchSize = 1
a1.sinks.sk2.type = file_roll
a1.sinks.sk2.sink.directory = /root/file_roll_2
a1.sinks.sk2.sink.rollInterval = 0
a1.sinks.sk2.sink.batchSize = 1
# configure the sink processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = sk1 sk2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.sk1 = 20
a1.sinkgroups.g1.processor.priority.sk2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# configure the channel, which buffers the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1
# wire the components together
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
a1.sinks.sk2.channel = c1
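With this configuration, sk1 (priority 20) receives all events while it is healthy; if it fails, the processor fails over to sk2 (priority 10) and periodically retries sk1, with maxpenalty capping the retry backoff at 10000 ms.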
Flume Application Integration API
Native API integration: the Flume SDK
When integrating an application directly, the receiving agent must expose an avro source.
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
Create a class under the test directory:
package com.baizhi;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;

public class RpcClientTest {
    private RpcClient client;

    @Before
    public void before() {
        // connect to the agent's avro source
        client = RpcClientFactory.getDefaultInstance("centos", 44444);
    }

    @Test
    public void testClient() throws EventDeliveryException {
        // build an event with a body and a custom header, then send it
        Event event = EventBuilder.withBody("this is a demo".getBytes());
        Map<String, String> map = new HashMap<String, String>();
        map.put("from", "world");
        event.setHeaders(map);
        client.append(event);
    }

    @After
    public void after() {
        client.close();
    }
}
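To try it, start an agent whose avro source listens on centos:44444 (for example the e2.properties agent above), run the test, and the event should appear at that agent's sink.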
Failover client configuration: replace the body of before() above with the following (add import java.util.Properties at the top of the class):
// Setup properties for the failover
Properties props = new Properties();
props.put("client.type", "default_failover");
// List of hosts (space-separated list of user-chosen host aliases)
props.put("hosts", "h1 h2 h3");
// host/port pair for each host alias
String host1 = "host1.example.org:41414";
String host2 = "host2.example.org:41414";
String host3 = "host3.example.org:41414";
props.put("hosts.h1", host1);
props.put("hosts.h2", host2);
props.put("hosts.h3", host3);
props.put("host-selector", "random"); // For random host selection
// props.put("host-selector", "round_robin"); // For round-robin host
// // selection
props.put("backoff", "true"); // Disabled by default.
props.put("maxBackoff", "10000"); // Defaults 0, which effectively
// becomes 30000 ms
// create the client with failover properties
RpcClient client = RpcClientFactory.getInstance(props);
Log4j integration
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.5</version>
</dependency>
Standalone configuration:
log4j.rootLogger=DEBUG,FLUME
log4j.appender.FLUME=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.FLUME.Hostname = 192.168.40.129
log4j.appender.FLUME.Port = 44444
log4j.appender.FLUME.UnsafeMode = true
log4j.appender.FLUME.layout=org.apache.log4j.PatternLayout
log4j.appender.FLUME.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss} %c %m%n
Note that the appender name referenced by rootLogger (FLUME) must match the appender property keys exactly.
Load-balancing configuration:
log4j.rootLogger=DEBUG,FLUME
log4j.appender.FLUME=org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.FLUME.Hosts = 192.168.40.129:44444 ...
log4j.appender.FLUME.Selector = ROUND_ROBIN
log4j.appender.FLUME.UnsafeMode = true
log4j.appender.FLUME.layout=org.apache.log4j.PatternLayout
log4j.appender.FLUME.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss} %c %m%n
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TestLog4j {
    private static Log log = LogFactory.getLog(TestLog4j.class);

    public static void main(String[] args) {
        log.debug("hello_debug");
        log.info("hello_info");
        log.warn("hello_warn");
        log.error("hello_error");
    }
}
Spring Boot integration
Copy the com folder from springboot-flume.zip in the course materials into your project, then copy the provided logback.xml into the project's resources directory.
https://github.com/gilt/logback-flume-appender
Add the corresponding appender to logback.xml; the appender configuration can be looked up at the URL above.