I. Flume overview
Apache Flume, originally developed at Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. Website: http://flume.apache.org/
At its core, Apache Flume collects data from a data source (source) and delivers it to a specified destination (sink). To guarantee delivery, the data is buffered (channel) before being sent to the sink; only after the data has actually reached the sink does Flume delete its buffered copy.
The core role in a Flume system is the agent. An agent is a Java process that typically runs on a log-collection node.
Each agent acts as a data courier and contains three components:
• Source: the collection endpoint; connects to the data source and ingests data;
• Sink: the delivery endpoint; forwards collected data to the next-level agent or to the final storage system;
• Channel: the data transfer conduit inside the agent; moves data from the source to the sink;
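The hand-off described above — the source fills the channel, the sink drains it, and the buffered copy is deleted only after delivery succeeds — can be sketched as a toy model (plain Python, all names hypothetical; real Flume wraps these steps in channel transactions):

```python
from collections import deque

# Toy model of a Flume agent, for illustration only.
class ToyAgent:
    def __init__(self):
        self.channel = deque()   # buffers events between source and sink
        self.delivered = []      # stands in for the final destination

    def source_put(self, event):
        self.channel.append(event)   # source -> channel

    def sink_take(self):
        event = self.channel[0]      # peek, don't remove yet
        self.delivered.append(event) # "send" to the destination
        self.channel.popleft()       # delete the buffered copy only after delivery

agent = ToyAgent()
agent.source_put("log line 1")
agent.source_put("log line 2")
agent.sink_take()
print(agent.delivered)     # ['log line 1']
print(len(agent.channel))  # 1 event still buffered
```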
1. Common sources
(1) Avro (typically used when chaining multiple Flume agents end to end, acting as the bridge between a sink and the next source; enables cross-server transfer and multi-hop flows)
(2) Exec: tails a single specified file (e.g. tail -F xxx.log)
(3) Taildir *: monitors a directory; regular expressions can match files in the directory for real-time collection; supports resuming from a checkpoint
(4) Kafka *
(5) Spooling Directory: monitors all files under a directory (no checkpoint/resume support; certain files can be excluded; rarely used)
2. Common channels
(channels provide transactions and buffering)
(1) Memory * (fast reads/writes, small capacity, weaker durability)
(2) JDBC: uses the embedded Derby database
(3) File: on disk (relatively slow reads/writes, large capacity, better durability)
(4) Kafka *
3. Common sinks
(1) HDFS
(2) Logger
(3) Avro: converts data into Avro events and sends them to an RPC port (typically used when chaining multiple Flume agents end to end, as the bridge between a sink and the next source)
(4) File: local file system
(5) HBase
(6) ES
(7) Kafka
II. Installing and deploying Flume
1. Upload the installation package; install on the node1 machine
2. Extract and install
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /export/server/
cd /export/server
ln -s apache-flume-1.9.0-bin flume
chown -R root:root flume/
3. Edit the Flume environment file
cd /export/server/flume/conf/
mv flume-env.sh.template flume-env.sh
vim flume-env.sh
# edit line 22
export JAVA_HOME=/export/server/jdk
# edit line 35
export HADOOP_HOME=/export/server/hadoop
4. Integrate with HDFS: copy the HDFS configuration files
cp /export/server/hadoop/etc/hadoop/core-site.xml /export/server/hadoop/etc/hadoop/hdfs-site.xml /export/server/flume/conf/
5. Remove the guava jar bundled with Flume and replace it with Hadoop's
# remove the older jar
rm -rf /export/server/flume/lib/guava-11.0.2.jar
# copy in the newer jar
cp /export/server/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /export/server/flume/lib/
III. Flume examples
1. (exec-memory-logger) Collect a log file, buffer it in memory, print to the console
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /export/server/flume/datas/test.log
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
#define the sink
a1.sinks.k1.type = logger
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
2. (exec-memory-hdfs) Collect log data, buffer it in memory, write to HDFS
#define the agent and the names of its source, channel, and sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1: where to read data from
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /export/server/flume/datas/test.log
#define c1: where to buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
#define k1: where to send the data
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node1.itcast.cn:8020/flume/test1
#prefix for generated files
a1.sinks.k1.hdfs.filePrefix = nginx
#suffix for generated files
a1.sinks.k1.hdfs.fileSuffix = .log
#type of file written to HDFS: plain data stream
a1.sinks.k1.hdfs.fileType = DataStream
# Problem: by default Flume produces many small files (around 1 KB each) on HDFS, which is bad for HDFS storage
# File rolling policy:
#roll files on a time interval; usually disabled
a1.sinks.k1.hdfs.rollInterval = 0
#roll files by size; usually the byte count for 120-125 MB (kept small here for demonstration)
a1.sinks.k1.hdfs.rollSize = 1024
#roll files by event count; usually disabled
a1.sinks.k1.hdfs.rollCount = 0
#which channel s1 writes to
a1.sources.s1.channels = c1
#which channel k1 reads from
a1.sinks.k1.channel = c1
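The roll-by-size guideline above aims each rolled file just under the common 128 MB HDFS block size; the 1024 in this demo config is deliberately tiny. A quick calculation of a production-style value (the 131072000 figure is derived here, not taken from the source):

```python
# rollSize is given in bytes; 120-125 MB keeps each rolled file
# just under the common 128 MB HDFS block size.
hdfs_block = 128 * 1024 * 1024   # 134217728 bytes
roll_size = 125 * 1024 * 1024    # 131072000 bytes

print(roll_size)               # 131072000 -> a1.sinks.k1.hdfs.rollSize = 131072000
print(roll_size < hdfs_block)  # True: a rolled file never spills into a second block
```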
3. (exec-memory-kafka) Collect log data, buffer it in memory, send to Kafka
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define s1: where to read data from
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /export/server/flume/datas/test.log
#define c1: where to buffer the data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
#define k1: where to send the data
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.sinks.k1.kafka.topic = topic01
#which channel s1 writes to
a1.sources.s1.channels = c1
#which channel k1 reads from
a1.sinks.k1.channel = c1
IV. Sink processors
1. Load-balancing sink processor
Goal: collect a specified file on node1 and use a load-balancing sink processor to send the data to node2 and node3
(1) (TAILDIR-file-avro) Collect the log file on node1; configuration:
# load balancing lets k1 and k2 take data from the channel in turn
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
#define the source
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1
#the source file being monitored
a1.sources.s1.filegroups.f1 = /export/server/flume/logs/test.log
a1.sources.s1.channels = c1
#define the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
a1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define the sink
# sink group
a1.sinkgroups.g1.sinks = k1 k2
# configure the processor: load balancing
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
# how the processor distributes data (default round_robin; random is also available)
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
#a sink that sends data to node2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.batchSize = 1
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 1234
#a sink that sends data to node3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.batchSize = 1
a1.sinks.k2.hostname = node3
a1.sinks.k2.port = 1234
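With the round_robin selector, the processor simply alternates between the sinks in group g1; a minimal sketch of that selection order (sink names from the config above, logic simplified, not Flume's actual implementation):

```python
from itertools import cycle

sinks = ["k1", "k2"]      # the sink group g1
selector = cycle(sinks)   # round_robin: alternate between the sinks

# the first four batches go k1, k2, k1, k2
order = [next(selector) for _ in range(4)]
print(order)  # ['k1', 'k2', 'k1', 'k2']
```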
(2) (avro-file-logger) Configuration for node2 and node3:
agent1.sources=r1
agent1.channels=c1
agent1.sinks=k1
#define an avro source
agent1.sources.r1.type = avro
agent1.sources.r1.channels = c1
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
#define a file channel
agent1.channels.c1.type = file
#channel checkpoint location
agent1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
#where the channel stores its data
agent1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define a logger sink
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
(3) Start the Flume services on node2 and node3 first, then start the one on node1
2. Failover sink processor
Goal: collect a specified file on node1 and use a failover sink processor to send the data to node2 and node3
(1) (TAILDIR-file-avro) Collect the log file on node1; configuration:
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
#define the source
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1
a1.sources.s1.filegroups.f1 = /export/server/flume-1.9.0/logs/test.log
a1.sources.s1.channels = c1
#define the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
a1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define the sink
# sink group
a1.sinkgroups.g1.sinks = k1 k2
# configure the processor: failover
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
#maximum backoff (ms) applied to a failed higher-priority sink; if it recovers within this window, traffic is not shifted to the lower-priority sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
#a sink that sends data to node2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.batchSize = 1
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 1234
#a sink that sends data to node3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.batchSize = 1
a1.sinks.k2.hostname = node3
a1.sinks.k2.port = 1234
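Under failover, all events go to the highest-priority live sink (k2 here, priority 10), and lower-priority sinks are used only when every higher-priority one is down. A simplified sketch of that selection rule (not Flume's actual implementation):

```python
# priorities from the config above
priorities = {"k1": 5, "k2": 10}
alive = {"k1": True, "k2": True}

def pick_sink():
    # highest-priority sink that is still alive
    candidates = [s for s in priorities if alive[s]]
    return max(candidates, key=lambda s: priorities[s])

print(pick_sink())   # 'k2' (priority 10 beats 5)
alive["k2"] = False  # node3 goes down
print(pick_sink())   # failover to 'k1'
```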
(2) (avro-file-logger) Configuration for node2 and node3:
agent1.sources=r1
agent1.channels=c1
agent1.sinks=k1
#define an avro source
agent1.sources.r1.type = avro
agent1.sources.r1.channels = c1
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
#define a file channel
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
agent1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define a logger sink
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
(3) Start the Flume services on node2 and node3 first, then start the one on node1
V. Flume interceptors
1. Built-in interceptors
(1) Timestamp interceptor
Inserts a timestamp into the event headers
(2) Host interceptor
Inserts the host name into the headers
(3) Static interceptor
Inserts a specified key/value pair into the headers (to add multiple pairs, configure multiple static interceptors)
(4) Search and replace interceptor
Replaces characters matched by a regular expression
(5) Regex filter interceptor
Filters out events containing characters matched by a regular expression
(6) Regex extractor interceptor
Uses capture groups (i.e. parentheses) in a regular expression to pull information out of the body and place it in the headers
2. Built-in interceptor example
#define the agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1
#define the source
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1
#the source file being monitored
a1.sources.s1.filegroups.f1 = /export/server/flume/logs/test.log
#to use several interceptor types, configure one per type, separated by spaces
a1.sources.s1.interceptors = i1 i2 i3 i5 i6
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i3.type = static
#key/value for the static interceptor
a1.sources.s1.interceptors.i3.key = name
a1.sources.s1.interceptors.i3.value = xt
#a1.sources.s1.interceptors.i4.type = search_replace
#pattern to search for
#a1.sources.s1.interceptors.i4.searchPattern = [a-z]
#replacement string
#a1.sources.s1.interceptors.i4.replaceString = *
a1.sources.s1.interceptors.i5.type = regex_filter
#pattern to filter on
a1.sources.s1.interceptors.i5.regex = ^A.*
a1.sources.s1.interceptors.i5.excludeEvents = true
a1.sources.s1.interceptors.i6.type = regex_extractor
a1.sources.s1.interceptors.i6.regex = (^[a-zA-Z]*)\\s([0-9]*$)
#extract two strings via the regex capture groups
a1.sources.s1.interceptors.i6.serializers = ser1 ser2
#name the two extracted strings
a1.sources.s1.interceptors.i6.serializers.ser1.name = word
a1.sources.s1.interceptors.i6.serializers.ser2.name = num
a1.sources.s1.channels = c1
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#define the sink
#define a logger sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
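The i6 regex_extractor splits a body such as `hello 123` into two capture groups and stores them in the headers under the serializer names word and num (the `\\s` in the properties file is simply an escaped `\s`). A quick check of what it extracts, sketched in Python:

```python
import re

# same pattern as i6 (the properties-file "\\s" is the regex \s)
pattern = re.compile(r"(^[a-zA-Z]*)\s([0-9]*$)")

headers = {}
body = "hello 123"
m = pattern.match(body)
if m:
    headers["word"] = m.group(1)  # serializer ser1
    headers["num"] = m.group(2)   # serializer ser2
print(headers)  # {'word': 'hello', 'num': '123'}
```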
3. Custom interceptor with a multiplexing channel selector
Goal: check whether a log line contains the string "spring". In the event headers, mark such events type_spring and all others type_other; type_spring events are sent to node2, the rest to node3.
(1) Add the dependency to the Maven project
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>
(2) Implement the Interceptor interface
Purpose: check whether a log line contains the string "spring"; if so, mark the event headers with type_spring, otherwise type_other.
public class WordInterceptor implements Interceptor {
//a list to hold the events processed by the interceptor
private List<Event> events;
public void initialize() {
//initialize the list
events = new ArrayList<Event>();
}
/**
* Handles a single event
*/
public Event intercept(Event event) {
//get the event's headers and body
Map<String, String> headers = event.getHeaders();
byte[] body = event.getBody();
//convert the raw body to a String
String bodyStr = new String(body);
//classify by whether the body contains "spring"
if (bodyStr.contains("spring")) {
headers.put("type", "type_spring");
} else {
headers.put("type", "type_other");
}
return event;
}
/**
* Handles a batch of events
*/
public List<Event> intercept(List<Event> list) {
events.clear();
for (Event event : list) {
events.add(intercept(event)); //delegate to the single-event method
}
return events;
}
public void close() {
}
/**
* This Builder is required so the custom interceptor can be
* referenced from the Flume configuration file
*/
public static class Builder implements Interceptor.Builder {
public Interceptor build() {
return new WordInterceptor();
}
public void configure(Context context) {
}
}
}
(3) Build the jar and upload it to the server
(4) Configure the Flume file on node1
# use the custom interceptor to route the data into groups
#define the agent
a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2
#define the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.88.151
a1.sources.s1.port = 2781
a1.sources.s1.interceptors = i1 i2
a1.sources.s1.interceptors.i1.type = org.example.WordInterceptor$Builder
# second interceptor
a1.sources.s1.interceptors.i2.type = timestamp
#multiplexing channel selector (routes each event to a channel based on the value of the type header)
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = type
a1.sources.s1.selector.mapping.type_spring = c1
a1.sources.s1.selector.mapping.type_other = c2
a1.sources.s1.selector.default = c2
a1.sources.s1.channels = c1 c2
#define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#define the sink
#a sink that sends data to node2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.batchSize = 1
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 1234
#a sink that sends data to node3
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.batchSize = 1
a1.sinks.k2.hostname = node3
a1.sinks.k2.port = 1234
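Putting the pieces together: the custom interceptor stamps the type header, and the multiplexing selector maps that value to a channel, falling back to the default c2. A simplified sketch of the routing decision (logic only, not Flume's API):

```python
# selector config from above, as data
mapping = {"type_spring": "c1", "type_other": "c2"}
default_channel = "c2"

def route(body):
    # what WordInterceptor does: stamp the type header
    header_type = "type_spring" if "spring" in body else "type_other"
    # what the multiplexing selector does: header value -> channel
    return mapping.get(header_type, default_channel)

print(route("spring boot started"))  # 'c1' -> avro sink k1 -> node2
print(route("plain log line"))       # 'c2' -> avro sink k2 -> node3
```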
(5) Configure the Flume files on node2 and node3
agent1.sources=r1
agent1.channels=c1
agent1.sinks=k1
#define an avro source
agent1.sources.r1.type = avro
agent1.sources.r1.channels = c1
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
#define a file channel
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /export/server/flume-1.9.0/checkpoint
agent1.channels.c1.dataDirs = /export/server/flume-1.9.0/dataDirs
#define a logger sink
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1