Official documentation: https://flume.apache.org/FlumeUserGuide.html
Flume overview
Apache Flume is a distributed, highly reliable, and highly available tool for collecting, aggregating, and moving large volumes of log data from many different sources into a centralized data store.
Flume architecture
Client: where data is produced, e.g. a web server
Event: a single unit of data transferred through an Agent; for log data, one line typically maps to one event
Agent: an independent JVM process; Flume is deployed as one or more Agents
Each Agent contains three components: Source, Channel, and Sink
Source: where data is read from
Channel: how events are buffered in transit
Sink: where events are delivered to
Sources
netcat (monitor a port)
Create a conf file named netcat-flume-logger.conf under /opt/flume1.6/conf/job:
#Name: alias the three components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Source: netcat type, listening on local port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Sink: logger type, which prints to the console for easy testing
a1.sinks.k1.type = logger
#Channel: memory type
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind: wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Save the file, then start the agent:
flume-ng agent \
--conf /opt/flume1.6/conf \
--conf-file /opt/flume1.6/conf/job/netcat-flume-logger.conf \
--name a1 \
-Dflume.root.logger=INFO,console
--conf: the conf directory of the Flume installation
--conf-file: the path of the conf file you wrote
--name: the agent name used inside the conf file (a1)
-Dflume.root.logger=INFO,console: print output to the console; only needed when the sink type is logger
Shorthand form:
flume-ng agent -c conf -f /opt/flume1.6/conf/job/netcat-flume-logger.conf -n a1 -Dflume.root.logger=INFO,console
After it starts, open a second terminal window and connect to port 44444:
telnet localhost 44444
If that errors out, install telnet first:
yum list telnet*           # list telnet-related packages
yum install telnet-server  # install the telnet server
yum install telnet.*       # install the telnet client
Once connected, any characters typed into the telnet session are printed to the console of the agent window started earlier.
exec (monitor via a command, usually tail or cat)
Create a conf file named exec-flume-logger.conf under /opt/flume1.6/conf/job:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
# any file format works; a log file is the usual target
a1.sources.r1.command = tail -f /data/test.log
# the sink again prints to the console
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
After starting the agent, append text to /data/test.log and it shows up on the console.
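For example, from another terminal (a minimal sketch; the path matches the config above):
echo "hello flume" >> /data/test.log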
spooldir (monitor a directory)
Create a conf file named spooldir-flume-hdfs.conf under /opt/flume1.6/conf/job:
a1.sources = r1
a1.channels = c1
a1.sinks = s1
# source type: spooldir
a1.sources.r1.type = spooldir
# the directory to watch
a1.sources.r1.spoolDir = /data/test
a1.sources.r1.deserializer = LINE
a1.sources.r1.deserializer.maxLineLength = 600000
# only pick up files matching this regex
a1.sources.r1.includePattern = test__[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# channel type: file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
# sink type: hdfs, so results land on HDFS
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.fileType = DataStream
# prefix added to files written to HDFS
a1.sinks.s1.hdfs.filePrefix = test_
# suffix added to files written to HDFS
a1.sinks.s1.hdfs.fileSuffix = .csv
# target HDFS path
a1.sinks.s1.hdfs.path = hdfs://192.168.226.101:9000/data/test/%Y-%m-%d
# use the local timestamp when resolving the path escapes
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.batchSize = 640
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.rollSize = 64000000
a1.sinks.s1.hdfs.rollInterval = 30
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
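Start this agent with the same pattern as before; the -Dflume.root.logger flag is not needed here since the sink is hdfs, not logger:
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/spooldir-flume-hdfs.conf -n a1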
taildir (monitor multiple files or directories; its distinguishing feature is resuming from where it left off)
Note: the TAILDIR source ships with Apache Flume 1.7+; on a stock 1.6 build it is only available if backported (as in some vendor distributions).
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/test1.log
a1.sources.r1.filegroups.f2 = /data/test2.txt
#store inode info so reading resumes after a restart; if unset, a default location is used
a1.sources.r1.positionFile = /opt/position/position.json
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
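Once the agent has read some data, the position file records each tracked file's inode, read offset, and path, which is what makes resuming possible. A sample of what /opt/position/position.json might contain (values are illustrative):
[{"inode":657425,"pos":128,"file":"/data/test1.log"},{"inode":657426,"pos":64,"file":"/data/test2.txt"}]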
avro
The avro source and sink are shown in the load-balancing and failover examples below.
Channels
memory (in-memory storage: fast, but events are lost if the agent dies)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /data/test.log
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
file (local-disk storage: durable, but slower)
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/kb07Flume/flumeFile/test
a1.sources.s1.deserializer = LINE
a1.sources.s1.deserializer.maxLineLength = 60000
a1.sources.s1.includePattern = test_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.batchSize = 640
a1.sinks.k1.brokerList = 192.168.226.101:9092
a1.sinks.k1.topic = test
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
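While this agent runs, a quick way to confirm the file channel is persisting events is to list its checkpoint and data directories (paths from the config above); Flume's checkpoint and log files should appear there:
ls /opt/kb07Flume/flumeFile/checkpoint/test
ls /opt/kb07Flume/flumeFile/data/test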
Sinks
logger (print to the console)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /data/test.log
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
hdfs (write to HDFS)
a1.sources = r1
a1.channels = c1
a1.sinks = s1
# source type: spooldir
a1.sources.r1.type = spooldir
# the directory to watch
a1.sources.r1.spoolDir = /data/test
a1.sources.r1.deserializer = LINE
a1.sources.r1.deserializer.maxLineLength = 600000
# only pick up files matching this regex
a1.sources.r1.includePattern = test__[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# channel type: file
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
# sink type: hdfs, so results land on HDFS
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.fileType = DataStream
# prefix added to files written to HDFS
a1.sinks.s1.hdfs.filePrefix = test_
# suffix added to files written to HDFS
a1.sinks.s1.hdfs.fileSuffix = .csv
# target HDFS path
a1.sinks.s1.hdfs.path = hdfs://192.168.226.101:9000/data/test/%Y-%m-%d
# use the local timestamp when resolving the path escapes
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.batchSize = 640
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.rollSize = 64000000
a1.sinks.s1.hdfs.rollInterval = 30
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
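Once events have flowed through, the output can be verified on HDFS; the dated subdirectory comes from the %Y-%m-%d escape in hdfs.path:
hdfs dfs -ls /data/test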
kafka (write to Kafka)
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/kb07Flume/flumeFile/test
a1.sources.s1.deserializer = LINE
a1.sources.s1.deserializer.maxLineLength = 60000
a1.sources.s1.includePattern = test_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/kb07Flume/flumeFile/checkpoint/test
a1.channels.c1.dataDirs = /opt/kb07Flume/flumeFile/data/test
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.batchSize = 640
a1.sinks.k1.brokerList = 192.168.226.101:9092
a1.sinks.k1.topic = test
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
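Delivery can be checked with a console consumer. The exact flags depend on your Kafka version; on newer Kafka it would look like this (older 0.8/0.9 clients use --zookeeper instead of --bootstrap-server):
kafka-console-consumer.sh --bootstrap-server 192.168.226.101:9092 --topic test --from-beginning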
file_roll (write to local files)
See the replicating example below.
Channel selector: replicating (copying)
The pattern is one input, multiple outputs. Two channels and two sinks are needed: one copy goes to HDFS, the other is saved to local files. (Replicating is Flume's default channel selector, so no selector.type has to be set.)
Create flume1.conf:
#Name (replicating example)
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
#Source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /data/test1.log
#store inode info so reading resumes after a restart; if unset, a default location is used
a1.sources.r1.positionFile = /opt/position/position1.json
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop1
a1.sinks.k2.port = 4142
#Bind
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Create flume2.conf:
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop1
a2.sources.r1.port = 4141
#Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.filePrefix = test_
a2.sinks.k1.hdfs.fileSuffix = .csv
a2.sinks.k1.hdfs.path = hdfs://192.168.226.101:9000/data/test/%Y-%m-%d
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.hdfs.batchSize = 640
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.rollSize = 64000000
a2.sinks.k1.hdfs.rollInterval = 30
#Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Create flume3.conf:
a3.sources = r1
a3.channels = c1
a3.sinks = k1
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop1
a3.sources.r1.port = 4142
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
#Sink: file_roll writes to a local directory, which must be created beforehand
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /data/file_row
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
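Because a1's avro sinks dial into the avro sources of a2 and a3, start the downstream agents first. A plausible startup sequence, one terminal per agent (assuming the files were saved under /opt/flume1.6/conf/job like the earlier examples):
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume3.conf -n a3
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume2.conf -n a2
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume1.conf -n a1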
Failover
The pattern is one channel feeding two sinks in a group; the sink with the higher priority is used first. Here k2 (priority 10) is preferred, and if k2 goes down, k1 takes over automatically.
Create three conf files.
flume1.conf
#Failover
#Name
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
a1.sinkgroups = g1
#Sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop1
a1.sinks.k2.port = 4142
#Sink Group
a1.sinkgroups.g1.sinks = k1 k2
#failover policy
a1.sinkgroups.g1.processor.type = failover
#priorities
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
#Bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
flume2.conf
#Name
a2.sources = r1
a2.sinks = k1
a2.channels = c1
#Sources
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop1
a2.sources.r1.port = 4141
#Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Sink
a2.sinks.k1.type = logger
#Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
flume3.conf
#Name
a3.sources = r1
a3.sinks = k1
a3.channels = c1
#Sources
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop1
a3.sources.r1.port = 4142
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
#Sink
a3.sinks.k1.type = logger
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
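To exercise the failover (a sketch, with conf paths assumed under /opt/flume1.6/conf/job as before): start a3 and a2 with the console logger flag so their logger sinks are visible, then start a1, and feed events through the netcat source:
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume3.conf -n a3 -Dflume.root.logger=INFO,console
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume2.conf -n a2 -Dflume.root.logger=INFO,console
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/flume1.conf -n a1
telnet localhost 44444
Events should appear on a3's console (k2 has priority 10). Kill the a3 JVM (find its pid with jps) and subsequent events switch to a2's console.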
Load balancing
The pattern is one source whose events are spread across the sinks in the group (randomly here), so no single sink becomes a hotspot.
First create three conf files.
Create flume1.conf:
#Load balancing
#Name
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
a1.sinkgroups = g1
#Sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop1
a1.sinks.k2.port = 4142
#Sink Group
a1.sinkgroups.g1.sinks = k1 k2
#load-balancing policy
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
#Bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Create flume2.conf:
#Name
a2.sources = r1
a2.sinks = k1
a2.channels = c1
#Sources
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop1
a2.sources.r1.port = 4141
#Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Sink
a2.sinks.k1.type = logger
#Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Create flume3.conf:
#Name
a3.sources = r1
a3.sinks = k1
a3.channels = c1
#Sources
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop1
a3.sources.r1.port = 4142
#Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
#Sink
a3.sinks.k1.type = logger
#Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
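The test mirrors the failover one: start a2 and a3 (with the console logger flag) and then a1, and type lines into the netcat source:
telnet localhost 44444
With the random selector, the lines should spread across the consoles of a2 and a3 rather than all landing on one.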
Custom interceptor
Write a custom interceptor in IDEA, package it, and upload the jar to /opt/flume1.6/lib.
Add the Maven dependency:
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.6.0</version>
</dependency>
package cn.kgc.kb07.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @author WGY
 * Interceptor: tags each event with a "type" header based on its body.
 */
public class InterceptorDemo implements Interceptor {
    // a list to hold the events after headers have been added
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        // initialization
        addHeaderEvents = new ArrayList<>();
    }

    // process a single event
    @Override
    public Event intercept(Event event) {
        // get the event's headers
        Map<String, String> headers = event.getHeaders();
        // get the event's body
        String body = new String(event.getBody());
        // decide which header to add based on whether the body contains "hello"
        if (body.contains("hello")) {
            headers.put("type", "1");
        } else {
            headers.put("type", "2");
        }
        // return the tagged event (returning null would drop it)
        return event;
    }

    // process a batch of events
    @Override
    public List<Event> intercept(List<Event> list) {
        // 1. clear the list
        addHeaderEvents.clear();
        // 2. iterate over the batch
        for (Event event : list) {
            // 3. add headers to each event
            addHeaderEvents.add(intercept(event));
        }
        // return the result
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
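A typical packaging-and-deploy sequence (the jar name depends on your pom, so it is shown as a placeholder):
mvn clean package
cp target/<your-artifact>.jar /opt/flume1.6/lib/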
Then create a conf file that registers the interceptor and routes events by the "type" header it sets:
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# register the custom interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = cn.kgc.kb07.flume.InterceptorDemo$Builder
# multiplexing selector: route on the "type" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
# the interceptor sets type to "1" or "2", so map those values to the channels
a1.sources.r1.selector.mapping.1 = c1
a1.sources.r1.selector.mapping.2 = c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = gree
a1.sinks.k1.hdfs.fileSuffix = .csv
a1.sinks.k1.hdfs.path = hdfs://192.168.226.101:9000/data/greedemo/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 640
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.filePrefix = lijia
a1.sinks.k2.hdfs.fileSuffix = .csv
a1.sinks.k2.hdfs.path = hdfs://192.168.226.101:9000/data/lijiademo/%Y-%m-%d
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.batchSize = 640
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 100
a1.sinks.k2.hdfs.rollInterval = 3
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
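Start the agent as before and feed it data through the netcat source (interceptor-demo.conf is a hypothetical name; use whatever you saved the file as):
flume-ng agent -c /opt/flume1.6/conf -f /opt/flume1.6/conf/job/interceptor-demo.conf -n a1
telnet localhost 44444
Lines containing "hello" should end up under /data/greedemo on HDFS, and everything else under /data/lijiademo.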