What is Flume
A distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.
Flume's internal components
1. Source: connects to the data source; collects and ingests data
2. Channel: buffers data in transit (inside the Flume agent)
3. Sink: sends data onward, i.e. "sinks" it to its destination (inside the Flume agent)
Installing and deploying Flume
1. Upload the tarball and unpack it
2. cp flume-env.sh.template flume-env.sh
3. Edit flume-env.sh and set JAVA_HOME
How to write a Flume configuration file
1. Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1
2. Configure the three components
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.100.211
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = logger
3. Bind the three together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
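Putting the three steps together, the complete conf/netcat-logger.conf launched below reads:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.100.211
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1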
Starting the agent with the configuration file
bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c conf specifies the directory holding Flume's own configuration files
-f conf/netcat-logger.conf specifies the collection plan we wrote
-n a1 specifies the name of this agent
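The long-form flags are equivalent (--conf for -c, --conf-file for -f, --name for -n):
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console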
Case: receiving data over the network
1. Name the components (same as the template above)
2. Configure the three components
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.100.211
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
3. Bind the three together (as in the template above)
Install telnet on another node (yum install -y telnet)
and use telnet to send data to port 44444 on 192.168.100.211, as sketched below.
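A minimal test session from the other node (the agent must already be running):
telnet 192.168.100.211 44444
Type a line and press Enter; each line should show up as an event on the agent's console, since the logger sink prints events at INFO level.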
Monitoring a directory (spooldir)
1. Name the components
2. Configure the three components
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/export/dir
a1.sources.r1.fileHeader = true
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://node01:8020/spooldir/
3. Bind the three together
Files in the watched directory that have already been processed are renamed with a suffix (.COMPLETED by default); files not yet processed carry no suffix.
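A complete sketch of this agent; the channel settings and bindings are filled in from the netcat template above, everything else is taken from this section:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/dir
a1.sources.r1.fileHeader = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:8020/spooldir/
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1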
Collecting new data appended to a file (exec source)
1. Name the components
2. Configure the three components
a1.sources.r1.type=exec
a1.sources.r1.command =tail -F /export/taillogs/access_log
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://node01:8020/spooldir/
3. Bind the three together
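In practice the HDFS sink above is usually given roll settings so it does not produce one tiny file per flush; a hedged sketch (the values are illustrative, not from the original notes):
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
Here fileType = DataStream writes plain text instead of SequenceFiles, the file is rolled every 30 seconds or at 128 MB, and rollCount = 0 disables rolling by event count.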
Cascading two agents
The first node collects the data and sends it to the second node; the second node writes the data to HDFS.
Node 1
1. Name the components
2. Configure the three components
a1.sources.r1.type=exec
a1.sources.r1.command =tail -F /export/taillogs/access_log
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.100.212
a1.sinks.k1.port = 4141
3. Bind the three together
Node 2
1. Name the components
2. Configure the three components
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.100.212
a1.sources.r1.port = 4141
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node01:8020/avro
3. Bind the three together
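Start the downstream agent first, so its Avro source is already listening when node 1's Avro sink connects. Assuming the two plans are saved as conf/avro-hdfs.conf (node 2) and conf/exec-avro.conf (node 1), both hypothetical file names:
On node 2: bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
On node 1: bin/flume-ng agent -c conf -f conf/exec-avro.conf -n a1 -Dflume.root.logger=INFO,console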
Failover
Configure node01 as follows
1. Name the components
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
2. Configure the three components
Set the sink priorities (the sink with the higher priority is used first; if it fails, Flume fails over to the next one):
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 2
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
3. Bind the three together
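The sinks k1 and k2 are declared above but never configured; in this topology they would be Avro sinks pointing at the two collectors. A sketch, where the host names and the port 52020 are assumptions, not from the notes:
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
agent1.sinks.k2.channel = c1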
Configure node02 and node03 as follows
1. Name the components
2. Configure the three components
3. Bind the three together
(Each collector is an Avro source receiving from node01, a channel, and a sink; a sketch follows.)
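A minimal sketch for each collector, assuming the same port as above and a logger sink for visibility (all values here are assumptions):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 52020
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1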
Load balancing (load_balance)
Configure node01 as follows
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut=10000
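These processor settings replace the failover ones; the component and sink-group declarations are the same as in the failover case, just with the agent named a1:
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2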
Configure node02 and node03 as follows
1. Name the components
2. Configure the three components
3. Bind the three together
(Identical to the collectors in the failover case.)
Interceptors
node01 and node02 do the same job; configure them as follows
1. Name the components (three sources this time)
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
2. Configure the three components
# Describe/configure the sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/taillogs/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
(The static interceptor inserts a user-defined key-value pair into the header of every event collected.)
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /export/taillogs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /export/taillogs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
3. Bind the three together
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
====
Notes:
a1.sources.r2.interceptors = i2 attaches an interceptor named i2 to source r2
a1.sources.r2.interceptors.i2.type = static sets the interceptor type to static
a1.sources.r2.interceptors.i2.key = type sets the header key
a1.sources.r2.interceptors.i2.value = nginx sets the header value
====
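One thing the configuration above leaves out is the channel and the sink k1. On node01/node02 the sink would be an Avro sink forwarding everything to node03; a sketch, where the host name and port 41414 are assumptions:
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node03
a1.sinks.k1.port = 41414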
node03's configuration
1. Name the components
2. Configure the three components
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://192.168.52.100:8020/source/logs/%{type}/%Y%m%d
3. Bind the three together
===
Note: the source on node03 needs an interceptor of type
org.apache.flume.interceptor.TimestampInterceptor$Builder; the timestamp it adds lets the HDFS sink resolve the %Y%m%d escapes in the path, and the header key set upstream is read with the %{type} escape when writing.
===
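Assembled, node03's agent might look like this; the Avro source, channel, and bindings are assumptions consistent with the sink sketch above (port 41414 matches it):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.52.100:8020/source/logs/%{type}/%Y%m%d
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1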
Flume sources in detail
- Avro Source properties example
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
- Spooling Directory Source
Watches the configured directory for newly added files and reads events out of them.
a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
- NetCat Source
Listens on a specified port and turns each line of received text into an event.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
- HTTP Source
Accepts HTTP GET and POST requests as Flume events; GET should only be used for experimentation.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props
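With the default JSONHandler (i.e. without the custom handler lines above), events can be posted as a JSON array of objects with "headers" and "body" fields; for example:
curl -X POST -H 'Content-Type: application/json' -d '[{"headers":{"type":"test"},"body":"hello flume"}]' http://localhost:5140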
- Kafka Source
Example of topic subscription via a comma-separated topic list:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
Example of topic subscription via a regex:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# the default kafka.consumer.group.id = flume is used
- Thrift Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
- Exec Source
Runs a command such as tail -F and turns its output into events.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
- Taildir Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
- Syslog TCP Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
- Multiport Syslog TCP Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port
- Syslog UDP Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
- Custom Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.example.MySource
a1.sources.r1.channels = c1
- Scribe Source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
a1.sources.r1.port = 1463
a1.sources.r1.workerThreads = 5
a1.sources.r1.channels = c1