Flume Installation, Usage, and Parameters
Overview
Flume is a distributed log-collection system that can handle log data of many types and formats. Its built-in sources include avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
Component | Function
---|---
Agent | A JVM process that runs Flume. Each machine runs one agent, but a single agent can contain multiple sources, channels, and sinks.
Source | Collects data from clients and passes it to a Channel.
Sink | Takes events from a Channel and delivers them; runs in its own thread.
Channel | Connects sources and sinks; behaves much like a queue.
Events | The basic units of data; can be log records, Avro objects, etc.
Flume Installation
A Java environment is required; JDK 1.8 is recommended.
Flume works without configuring environment variables, but configuring them is recommended for convenience.
[root@HadoopNode00 ~]# mkdir /home/flume
[root@HadoopNode00 ~]# tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /home/flume/
[root@HadoopNode00 ~]# cd /home/flume/apache-flume-1.9.0-bin/bin
[root@HadoopNode00 bin]# ./flume-ng version # output like the following indicates Flume is installed correctly
Flume 1.9.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: d4fcab4f501d41597bc616921329a4339f73585e
Compiled by fszabo on Mon Dec 17 20:45:25 CET 2018
From source with checksum 35db629a3bda49d23e9b3690c80737f9
Flume Usage
Common Components
source
- Avro Source
- Exec Source
- NetCat TCP Source
- Taildir Source
- Kafka Source
- Spooling Directory Source
sink
- HDFS Sink
- Avro Sink
- Logger Sink
- File Roll Sink
- Kafka Sink
channel
- Memory Channel
- JDBC Channel
- Kafka Channel
- File Channel
Basic Example
# agent a1 has one source named r1
a1.sources = r1
# agent a1 has one channel named c1
a1.channels = c1
# agent a1 has one sink named k1
a1.sinks = k1
# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = HadoopNode00
a1.sources.r1.port = 6666
# channel
a1.channels.c1.type = memory
# sink
a1.sinks.k1.type = logger
# bind the source to the channel
a1.sources.r1.channels = c1
# bind the sink to the channel
a1.sinks.k1.channel = c1
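Assuming the properties above are saved as `conf/example.conf` (the file name is an assumption), the agent can be started with `flume-ng`; the value of `--name` must match the agent prefix used in the file (`a1` here):

```shell
# start the agent from the Flume install directory (config path is illustrative)
./bin/flume-ng agent --conf conf --conf-file conf/example.conf \
  --name a1 -Dflume.root.logger=INFO,console

# in another terminal, type lines into the netcat source and watch them
# appear in the agent's console via the logger sink
nc HadoopNode00 6666
```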
Flume Data Flow
Flume's data flow is driven end to end by events. An Event is Flume's basic unit of data: it carries the log payload as a byte array, along with header information. Events are produced by a Source from data outside the Agent. When a Source captures an event it applies any configured formatting, then pushes the event into one or more Channels. A Channel can be thought of as a buffer: it holds each event until a Sink has processed it. The Sink then persists the event or forwards it to another Source. Much of Flume's flexibility comes from the Agent design itself: an Agent is simply a Java process that runs on the log-collection node.
Java API
Dependency
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
import java.nio.charset.Charset;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class App {
    public static void main(String[] args) throws Exception {
        // getDefaultInstance returns an Avro RPC client, so the source
        // listening on HadoopNode00:6666 must be an Avro Source (not netcat)
        RpcClient rpcClient = RpcClientFactory.getDefaultInstance("HadoopNode00", 6666);
        Event event = EventBuilder.withBody("lqq", Charset.forName("UTF-8"));
        rpcClient.append(event);
        rpcClient.close();
    }
}
Log4J Integration
Dependencies
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.9.0</version>
</dependency>
log4j.rootLogger=info,stdout,FLUME
#console
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss} %c %m%n
# file appender (defined but not enabled; add LOGFILE to log4j.rootLogger above to use it)
flume.log.dir=./logs
flume.log.file=flume.log
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.MaxFileSize=100MB
log4j.appender.LOGFILE.MaxBackupIndex=10
log4j.appender.LOGFILE.File=${flume.log.dir}/${flume.log.file}
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d{dd MMM yyyy HH:mm:ss,SSS} %-5p [%t] (%C.%M:%L) %x - %m%n
#flume
log4j.appender.FLUME=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.FLUME.Hostname = HadoopNode00
log4j.appender.FLUME.Port = 6666
log4j.appender.FLUME.UnsafeMode = true
log4j.appender.FLUME.layout=org.apache.log4j.PatternLayout
log4j.appender.FLUME.layout.ConversionPattern=%p %d{yyyy-MM-dd HH:mm:ss} %c %m%n
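With the `FLUME` appender configured above, ordinary Log4J calls ship their messages to the agent. A minimal sketch, assuming log4j 1.2.x and the Flume appender jars are on the classpath (the class name is illustrative); note that the Log4jAppender speaks Avro RPC, so the receiving source on port 6666 must be an Avro Source:

```java
import org.apache.log4j.Logger;

public class Log4jDemo {
    private static final Logger LOG = Logger.getLogger(Log4jDemo.class);

    public static void main(String[] args) {
        // goes to stdout and, via the FLUME appender, to the agent on HadoopNode00:6666
        LOG.info("hello flume");
    }
}
```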
Flume Interceptors
Before an event enters a Channel, interceptors can process it.
Host Interceptor
Adds the agent host to the Event headers.
Timestamp Interceptor
Adds a timestamp to the Event headers.
Search and Replace Interceptor
Search-and-replace on the event body.
Regex Filtering Interceptor
Filters events by matching a regular expression against the event body.
Regex Extractor Interceptor
Extracts content from the event body with a regex and adds it to the headers.
Static Interceptor
Adds a fixed header to every Event.
UUID Interceptor
Adds a UUID header to every event.
Interceptors act on a Source, decorating or filtering events in the configured order.
# declare components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source properties
a1.sources.r1.type = netcat
a1.sources.r1.bind = HadoopNode00
a1.sources.r1.port = 6666
a1.sources.r1.interceptors = i1 i2 i3 i4 i5
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = user
a1.sources.r1.interceptors.i3.value = zs
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i5.type = regex_filter
# pass only events whose body does not contain "error"
a1.sources.r1.interceptors.i5.regex = ^((?!error).)*$
# sink properties
a1.sinks.k1.type = logger
# channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind the source to the channel
a1.sources.r1.channels = c1
# bind the sink to the channel
a1.sinks.k1.channel = c1
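The `regex_filter` pattern above uses a negative lookahead, so only events whose body never contains `error` are accepted. The pattern itself can be checked in plain Java, without Flume:

```java
import java.util.regex.Pattern;

public class RegexFilterDemo {
    public static void main(String[] args) {
        // same pattern as interceptor i5: every character must sit at a
        // position that does not start the substring "error"
        Pattern p = Pattern.compile("^((?!error).)*$");
        System.out.println(p.matcher("user login ok").matches());     // true  -> event passes
        System.out.println(p.matcher("an error occurred").matches()); // false -> event dropped
    }
}
```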
Channel Selectors
Before an event enters a Channel, a channel selector can route specific Events to specific Channels.
Flume offers two kinds of Selector:
① Replicating: the data is copied to every Channel
② Multiplexing: the data is split across Channels by header value
# declare the flow: source, channels, sinks
agent.sources = s1
agent.channels = c1 c2
agent.sinks = sk1 sk2
# define the source
agent.sources.s1.type = netcat
agent.sources.s1.bind = yingkouApp
agent.sources.s1.port = 8888
# add a regex extractor interceptor
agent.sources.s1.interceptors = i1
agent.sources.s1.interceptors.i1.type= regex_extractor
agent.sources.s1.interceptors.i1.regex=^(ERROR|INFO|WARN).*
agent.sources.s1.interceptors.i1.serializers = s1
agent.sources.s1.interceptors.i1.serializers.s1.name = level
# define the channels
agent.channels.c1.type = memory
agent.channels.c1.capacity = 100
agent.channels.c1.transactionCapacity = 100
agent.channels.c2.type = file
# set the channel selector
agent.sources.s1.selector.type = multiplexing
agent.sources.s1.selector.header = level
agent.sources.s1.selector.mapping.ERROR = c1
agent.sources.s1.selector.mapping.INFO= c2
agent.sources.s1.selector.mapping.WARN = c2
agent.sources.s1.selector.default = c2
# define the sinks
agent.sinks.sk1.type = file_roll
agent.sinks.sk1.sink.directory = /root/dir1
agent.sinks.sk1.sink.rollInterval=0
agent.sinks.sk2.type = file_roll
agent.sinks.sk2.sink.directory = /root/dir2
agent.sinks.sk2.sink.rollInterval=0
# wire up the source and sinks
agent.sources.s1.channels = c1 c2
agent.sinks.sk1.channel = c1
agent.sinks.sk2.channel = c2
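The `regex_extractor` above copies capture group 1 into the `level` header, which the multiplexing selector then routes on. The grouping behavior of that pattern can be sketched in plain Java:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractorDemo {
    public static void main(String[] args) {
        // same pattern as interceptor i1: capture the leading log level
        Pattern p = Pattern.compile("^(ERROR|INFO|WARN).*");
        Matcher m = p.matcher("ERROR disk is full");
        if (m.matches()) {
            // group(1) is what serializer s1 writes into the "level" header
            System.out.println("level = " + m.group(1)); // level = ERROR
        }
    }
}
```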
Multi-Stage Chaining
First-stage agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = HadoopNode00
a1.sources.r1.port = 6666
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = HadoopNode00
a1.sinks.k1.port = 8888
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
Second-stage agent
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = HadoopNode00
a2.sources.r1.port = 8888
a2.sources.r1.channels = c1
a2.channels.c1.type = memory
a2.sinks.k1.type =logger
a2.sinks.k1.channel = c1
The only requirement is that the upstream agent's sink pairs with the downstream agent's source: here, a1's Avro sink feeds a2's Avro source, both on port 8888.
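To run the chain, start the second-stage agent first so its Avro source is already listening on port 8888 when the first-stage Avro sink connects (the config file names are assumptions):

```shell
# downstream agent first
./bin/flume-ng agent --conf conf --conf-file conf/agent2.conf --name a2 -Dflume.root.logger=INFO,console
# then the upstream agent
./bin/flume-ng agent --conf conf --conf-file conf/agent1.conf --name a1 -Dflume.root.logger=INFO,console
```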