一. Introduction to Apache Flume
Flume streams log data from many different sources into Hadoop or other destinations.
- A reliable, available, and efficient distributed data-collection service
Flume has a simple, flexible architecture based on streaming data flows, with support for fault tolerance, failover, and recovery.
Donated to Apache by Cloudera in 2009; now an Apache top-level project.
二. Flume Architecture
Client: where the data is produced, e.g. a web server
Event: a single unit of data transferred through an Agent; for log data, one event usually corresponds to one line
Agent: an independent JVM process
- Flume is deployed as one or more Agents
- An Agent contains three components: Source, Channel, and Sink
Hello Flume (save the following configuration as h0.conf)
agent.sources = s1
agent.channels = c1
agent.sinks = sk1
#Source: netcat listening on localhost:5678, connected to channel c1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
agent.sources.s1.channels = c1
#Sink: logger, reading from channel c1
agent.sinks.sk1.type = logger
agent.sinks.sk1.channel = c1
#Channel: in-memory
agent.channels.c1.type = memory
bin/flume-ng agent --name agent -f h0.conf -Dflume.root.logger=INFO,console
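Once the agent is running, the netcat source can be tested from another session (assuming nc is installed); every line typed is printed to the console by the logger sink:
nc localhost 5678
hello flume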
Flume Components
- Source
- SourceRunner
- Interceptor
- Channel
- ChannelSelector
- ChannelProcessor
- Sink
- SinkRunner
- SinkProcessor
- SinkSelector
Flume Workflow
An event enters the Agent through a Source, may be modified by Interceptors, is routed by the ChannelSelector/ChannelProcessor into one or more Channels, and is finally drained by the SinkRunner/SinkProcessor through a Sink to the next hop or final destination.
三. Source
1. exec source
- Executes a Linux command and consumes the command's output, e.g. "tail -f"
Property | Default | Description |
---|---|---|
type | - | exec |
command | - | the command to execute, e.g. "tail -f xxx.log" |
shell | - | the shell used to run the command, e.g. "/bin/sh" |
batchSize | 20 | maximum number of lines to batch and send to the channel |
Exercise 1
Use an Exec Source to walk a Linux directory and print the names of all .txt files to the console
- 1. Create the job file practice1:
vi practice1
#name the components
a2.sources = s1
a2.sinks = k1
a2.channels = c1
#source type
a2.sources.s1.type = exec
#sink type
a2.sinks.k1.type = logger
#command to execute
a2.sources.s1.command = /data/ex
#channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#wire the source and sink to the channel
a2.sources.s1.channels = c1
a2.sinks.k1.channel = c1
- 2. Write the command script (a shebang is added so the exec source can launch the script directly)
touch /data/ex
echo -e '#!/bin/bash\nfor i in /data/*.txt; do echo $i; done' > /data/ex
chmod 777 /data/ex
- 3. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f practice1 -n a2 -Dflume.root.logger=INFO,console
- 4. The result is as follows
2. spooling directory source
Reads file data from a directory on disk; data is not lost after an agent restart or a failed send, and it can also be used to watch a directory for new files
Property | Default | Description |
---|---|---|
type | - | spooldir |
spoolDir | - | the directory to read files from |
fileSuffix | .COMPLETED | suffix appended to a file once it has been fully read |
deletePolicy | never | what to do with a file after it has been fully read: never or immediate |
Exercise 2: monitor a directory
Requirement:
Use a Spooling Directory Source to watch a directory for new files in real time and print their contents to the console
- 1. Create the job file practice2:
vi practice2
e1.sources = s1
e1.channels = c11
e1.sinks = k1
e1.sinks.k1.type = logger
e1.sources.s1.type = spooldir
e1.sources.s1.spoolDir = /data/p2
e1.channels.c11.type = memory
e1.sources.s1.channels = c11
e1.sinks.k1.channel = c11
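Before starting the agent, make sure the spool directory from the config exists; the spooling directory source fails at startup if it does not:
mkdir -p /data/p2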
- 2. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f practice2 -n e1 -Dflume.root.logger=INFO,console
- 3. The result is as follows
3. http source
- Accepts events via HTTP GET and POST requests
Property | Default | Description |
---|---|---|
type | - | http |
port | - | listening port |
bind | 0.0.0.0 | IP address to bind to |
handler | org.apache.flume.source.http.JSONHandler | fully qualified class name of the handler that turns requests into events |
- 1. Create the job file:
vi http
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = http
a1.sources.s1.port = 5140
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = logger
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f http -n a1 -Dflume.root.logger=INFO,console
- 3. In another session, send JSON with the following command
curl -XPOST localhost:5140 -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello body"}]'
- 4. The Flume console output is as follows
4. avro source
- Listens on an Avro port and receives events from an external Avro client
Property | Default | Description |
---|---|---|
type | - | avro |
bind | - | IP address to bind to |
port | - | port to listen on |
threads | - | maximum number of worker threads |
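A minimal Avro Source definition might look like the following (a sketch; the bind address and port are placeholders); the avro sink example in section 五 shows a complete sender/receiver pair:
a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 4141
a1.sources.s1.channels = c1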
5. taildir source
- Tails a set of files in near real time; the read position of each file is persisted in a JSON position file, so no data is lost when the agent restarts
- 1. Create the job file:
vi taildir
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = TAILDIR
#define the file groups
a1.sources.s1.filegroups = f1 f2
#files/patterns for groups f1 and f2
a1.sources.s1.filegroups.f1 = /data/tail_1/example.log
a1.sources.s1.filegroups.f2 = /data/tail_2/.*log.*
#where to store the position file
a1.sources.s1.positionFile = /data/tail_position/taildir_position.json
#headers added to events from each group
a1.sources.s1.headers.f1.headerKey1 = value1
a1.sources.s1.headers.f2.headerKey1 = value2
a1.sources.s1.headers.f2.headerKey2 = value2-2
a1.sources.s1.fileHeader = true
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = logger
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Create the directories and run the agent
mkdir -p /data/tail_1
mkdir -p /data/tail_2
flume-ng agent -c /opt/install/flume/conf/ -f taildir -n a1 -Dflume.root.logger=INFO,console
- 3. In other sessions, create files and append to them
#session1
tail -f /data/tail_position/taildir_position.json
#session2
touch /data/tail_1/example.log
echo "hello world" >> /data/tail_1/example.log
touch /data/tail_2/hello.log
touch /data/tail_2/test.log.txt
echo "hello spark" >> /data/tail_2/hello.log
echo "hello flume" >> /data/tail_2/hello.log
- 4. The result is as follows:
See the official documentation for more examples.
四. Channel
Memory Channel
- Events are kept in the Java heap. Recommended when a small amount of data loss is acceptable
File Channel
- Events are kept in local files. Highly reliable, but throughput is lower than the Memory Channel (see the sketch below)
Kafka Channel
- Events are stored in a Kafka cluster, which provides both durability and high throughput
JDBC Channel
- Events are stored in a relational database; generally not recommended
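A minimal File Channel configuration might look like the following (a sketch; the checkpoint and data directories are placeholders):
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000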
五. Sink
Responsible for taking events from a Channel and delivering them to the next hop or final destination
1. avro sink
- Acts as an Avro client and sends Avro events to an Avro server (typically an Avro Source)
Property | Default | Description |
---|---|---|
type | - | avro |
hostname | - | IP address or hostname of the Avro server |
port | - | port of the Avro server |
batch-size | 100 | number of events to send in one batch |
- 1. Write the job for the Avro Source (receiving) side:
vi avro_source
#name the components
a2.sources = s1
a2.channels = c1
a2.sinks = sk1
#source
a2.sources.s1.type = avro
a2.sources.s1.bind = localhost
a2.sources.s1.port = 44444
#channel
a2.channels.c1.type = memory
#sink
a2.sinks.sk1.type = logger
#wiring
a2.sources.s1.channels = c1
a2.sinks.sk1.channel = c1
- 2. Write the job for the Avro Sink (sending) side:
vi avro_sink
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /data/customers.csv
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = localhost
a1.sinks.sk1.port = 44444
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 3. Run both agents (the receiving side first)
#session1
flume-ng agent -c /opt/install/flume/conf/ -f avro_source -n a2 -Dflume.root.logger=INFO,console
#session2
flume-ng agent -c /opt/install/flume/conf/ -f avro_sink -n a1 -Dflume.root.logger=INFO,console
- 4. The result is as follows
Exercise 3: tiered collection
Requirement
Create two Agents to implement tiered (multi-hop) collection
Agent-1: use an Exec Source to collect the contents of the "/var/log/secure" log file and send it out through an Avro Sink
Agent-2: use an Avro Source to receive the output of Agent-1 and print it to the console
Hint
Start Agent-2 first, then Agent-1, and think about why (the Avro Sink of Agent-1 can only connect if the Avro Source of Agent-2 is already listening)
- 1) Create the job file:
vi practice3
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.s1.type = exec
#command to execute
a1.sources.s1.command = tail -f /var/log/secure
#sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 44444
#channel
a1.channels.c1.type = memory
#wiring
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
#name the components
a2.sources = s2
a2.sinks = k2
a2.channels = c2
#source
a2.sources.s2.type = avro
a2.sources.s2.bind = localhost
a2.sources.s2.port = 44444
#sink
a2.sinks.k2.type = logger
#channel
a2.channels.c2.type = memory
#wiring
a2.sources.s2.channels = c2
a2.sinks.k2.channel = c2
- 2) Start the agents
#session1
flume-ng agent -c /opt/install/flume/conf/ -f practice3 -n a2 -Dflume.root.logger=INFO,console
#session2
flume-ng agent -c /opt/install/flume/conf/ -f practice3 -n a1 -Dflume.root.logger=INFO,console
- 3) The result is as follows:
2. HDFS sink
- Writes events to the Hadoop Distributed File System (HDFS)
Property | Default | Description |
---|---|---|
type | - | hdfs |
hdfs.path | - | target directory in HDFS |
hdfs.filePrefix | FlumeData | prefix of the files written by Flume |
hdfs.fileSuffix | - | suffix of the files written by Flume |
- 1. Create the job file:
vi hdfsFlume
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /data/customers.csv
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Run the agent
hdfs dfs -mkdir /flume
flume-ng agent -c /opt/install/flume/conf/ -f hdfsFlume -n a1 -Dflume.root.logger=INFO,console
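The files written by the sink can be listed with (path taken from the config above):
hdfs dfs -ls -R /flume/events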
- 3. The result is as follows
**Exercise 4: collect logs to HDFS**
Requirement
Use an Exec Source to collect the local "/var/log/secure" log file and write it to HDFS under "/var/log/secure" with an HDFS Sink
- 1. Create the job file:
vi practice4
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /var/log/secure
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /var/log/secure/%y-%m-%d/%H%M/%S
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f practice4 -n a1 -Dflume.root.logger=INFO,console
- 3. The result is as follows
3. Hive sink
- Streams events containing delimited text or JSON data directly into a Hive table or partition
- Fields in the incoming event data are mapped to the corresponding columns of the Hive table
Property | Default | Description |
---|---|---|
type | - | hive |
hive.metastore | - | Hive metastore URI |
hive.database | - | Hive database name |
hive.table | - | Hive table name |
serializer | - | The serializer parses fields out of the event and maps them to columns of the Hive table. The choice of serializer depends on the data format. Supported serializers: DELIMITED and JSON |
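A minimal Hive Sink configuration might look like this (a sketch; the metastore URI, database, table, delimiter, and column names are placeholders, and Hive streaming requires a transactional, bucketed target table):
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://localhost:9083
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.fieldnames = id,name,msg
a1.sinks.k1.channel = c1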
4. kafka sink
a1.sources = s1
a1.sinks = k1
a1.channels = c2
#spooling directory source: read CSV files whose names match the pattern
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/kb07file/prodata
a1.sources.s1.deserializer = LINE
a1.sources.s1.deserializer.maxLineLength = 3000
a1.sources.s1.includePattern = train_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
#regex_filter interceptor: drop the CSV header line (starts with "user_id")
a1.sources.s1.interceptors = head_filter
a1.sources.s1.interceptors.head_filter.type = regex_filter
a1.sources.s1.interceptors.head_filter.regex = ^user_id*
a1.sources.s1.interceptors.head_filter.excludeEvents = true
#memory channel
a1.channels.c2.type = memory
a1.channels.c2.capacity = 100000
a1.channels.c2.transactionCapacity = 10000
#Kafka sink: write events to the "train2" topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = 192.168.11.201:9092
a1.sinks.k1.topic = train2
#wiring
a1.sources.s1.channels = c2
a1.sinks.k1.channel = c2
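To verify that events reach Kafka, the topic can be read with the console consumer that ships with Kafka (broker address taken from the config above; on older Kafka versions use --zookeeper instead of --bootstrap-server):
kafka-console-consumer.sh --bootstrap-server 192.168.11.201:9092 --topic train2 --from-beginning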
六. Multi-tier agents (topologies)
Simple chaining (multi-agent flow)
Multiplexing the flow: one source fanned out to several channels (see the sketch below)
Consolidation: merging several sources into one destination
Load balancing and failover
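A multiplexing flow is configured with a channel selector on the source. A minimal sketch (the header name and mapping values are placeholders; the routing example in section 八 uses the same mechanism):
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.gree = c1
a1.sources.r1.selector.mapping.lijia = c2
a1.sources.r1.selector.default = c1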
七. Flume Sink Groups
- A sink group is a logical grouping of sinks
- The behavior of a sink group is determined by its sink processor, which decides how events are routed among the sinks
- There are two kinds of processors: failover and load balancing
#failover
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
#load balancing
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
八. Interceptors
An interceptor can modify or drop events
- It sits between the source and the channel
Built-in interceptors
- HostInterceptor: inserts the host name or IP into the event header
- TimestampInterceptor: inserts a timestamp into the event header
- StaticInterceptor: inserts a fixed key-value pair into the event header
- UUIDInterceptor: inserts a UUID into the event header
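Built-in interceptors are attached to a source by their alias. For example, a timestamp interceptor plus a static interceptor could be configured as follows (a sketch; the key and value are placeholders):
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = datacenter
a1.sources.r1.interceptors.i2.value = dc01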
Custom interceptors
- 1) Write the interceptor code and upload the jar to the $FLUME_HOME/lib directory
package njzb.kb07;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
/**
* @author sun_0128
* @date 2020/08/17
* @Software:IntelliJ IDEA
* @description
*/
//Interceptor example: tag each event with a "type" header based on its body
public class InterceptorDemo implements Interceptor {
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<Event>();
    }

    //Add a "type" header depending on whether the body starts with "gree"
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        if (body.startsWith("gree")) {
            headers.put("type", "gree");
        } else {
            headers.put("type", "lijia");
        }
        return event;
    }

    //Apply the single-event intercept to every event in the batch
    @Override
    public List<Event> intercept(List<Event> list) {
        addHeaderEvents.clear();
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    //Builder used by Flume to instantiate the interceptor (referenced in the config as njzb.kb07.InterceptorDemo$Builder)
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
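Assuming a standard Maven project, packaging and deploying the jar might look like this (the artifact name below is hypothetical):
mvn clean package
cp target/kb07-interceptor-1.0.jar $FLUME_HOME/lib/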
- 2) Write the interceptor job file:
vi lanjieqi.conf
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
#netcat source using the custom interceptor
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 55555
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = njzb.kb07.InterceptorDemo$Builder
#multiplexing selector: route by the "type" header set by the interceptor
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.gree = c1
a1.sources.r1.selector.mapping.lijia = c2
#two memory channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#HDFS sink for events routed to c1 ("gree")
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = gree
a1.sinks.k1.hdfs.fileSuffix = .csv
a1.sinks.k1.hdfs.path = /flume/events/gree/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#a1.sinks.k1.hdfs.batchSize = 640
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollInterval = 3
#HDFS sink for events routed to c2 ("lijia")
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.filePrefix = lijia
a1.sinks.k2.hdfs.fileSuffix = .csv
a1.sinks.k2.hdfs.path = /flume/events/lijia/%Y-%m-%d
a1.sinks.k2.hdfs.useLocalTimeStamp = true
#a1.sinks.k2.hdfs.batchSize = 640
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 100
a1.sinks.k2.hdfs.rollInterval = 3
#wiring
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- 3) Run the agent:
flume-ng agent -c ../../conf/ -f lanjieqi.conf -n a1 -Dflume.root.logger=INFO,console
- In another session, run
telnet localhost 55555
Type some content; lines starting with "gree" are written to the gree HDFS path, everything else to the lijia path.
- 4) The result is as follows