Official site: http://flume.apache.org/index.html
Official configuration guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Chinese translation: http://www.51niux.com/?id=197
The client produces the data and runs in its own thread.
The source collects data from the client and hands it to a channel.
The sink pulls data from the channel and runs in its own thread.
The channel is what connects sources and sinks.
Source type: netcat
Description: collects data from a network port.
Example: capture whatever is typed into local port 44444.
Pitfall: if the source's bind is set to localhost, the telnet target must also be localhost (keep them consistent; better yet, avoid localhost altogether).
1. Install the telnet service
[root@master ~]# yum install -y telnet telnet-server
2. Write the conf file
[root@master ~]# cat netcat.conf
#name the components (multiple can be listed at once, separated by spaces, e.g. a1.sources = s1 s2 s3)
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Run an agent:
[root@master ~]# flume-ng agent --conf-file netcat.conf --name a1 -Dflume.root.logger=INFO,console
3. Run telnet (start the agent first) and send data to the port the agent's source is listening on, so the agent has something to collect.
[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello    # typed by hand
OK
lucky
OK
4. Result:
19/03/29 05:24:33 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
19/03/29 05:25:35 INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }
19/03/29 05:25:39 INFO sink.LoggerSink: Event: { headers:{} body: 6C 75 63 6B 79 0D lucky. }
Source type: exec
An exec source runs a configured Unix/Linux command and keeps consuming its output as events. If the process exits, the exec source exits as well and produces no further data.
Example 1: continuously collect the system log /var/log/secure and store the collected data locally.
Configuration parameters for the file_roll sink, which writes to the local filesystem (channel, type and sink.directory are required):
Property Name | Default | Description |
---|---|---|
channel | - | |
type | - | The component type name, needs to be file_roll. |
sink.directory | - | The directory where files will be stored. |
sink.pathManager | DEFAULT | The PathManager implementation to use. |
sink.pathManager.extension | - | The file extension if the default PathManager is used. |
sink.pathManager.prefix | - | A character string to add to the beginning of the file name if the default PathManager is used. |
sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface. |
batchSize | 100 | |
1. Write the conf file
[root@master flume]# cat exec.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source (tail -F would also survive log rotation)
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /var/log/secure
#configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/flume/secure-log/
a1.sinks.k1.sink.pathManager.extension = aaa
a1.sinks.k1.sink.pathManager.prefix = aaa-
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.batchSize = 100
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
2. Create the storage directory /root/flume/secure-log/
[root@master ~]# mkdir /root/flume/secure-log
3. Run the agent
[root@master flume]# flume-ng agent --conf-file exec.conf --name a1 -Dflume.root.logger=INFO,console
4. Check the result
[root@master flume]# ll secure-log/
-rw-r--r--. 1 root root 0 Apr 19 03:39 aaa-1555659597380-1.aaa
Source type: avro
Listens on an Avro port and receives events from an external Avro client stream. When paired with the built-in Avro sink on another (previous-hop) Flume agent, it can be used to build tiered collection topologies.
Example 1: implement the following
1) Install Flume on node slaver1 and collect logs with an exec source,
2) send them through an avro sink to the aggregation node master,
3) receive the data from slaver1 on master through an avro source,
4) and upload the logs to HDFS.
Common parameters for the HDFS sink (see the official guide for the full list):
Property Name | Default | Description |
---|---|---|
channel | - | |
type | - | - |
path | - | HDFS path to write to; must include the filesystem URI. |
filePrefix | FlumeData | Prefix for file names written to HDFS; supports Flume date escapes and expressions such as %{host}. |
fileType | SequenceFile | File format: SequenceFile, DataStream or CompressedStream. With DataStream the file is not compressed and hdfs.codeC is not needed; with CompressedStream a valid hdfs.codeC must be set. |
fileSuffix | - | Suffix for file names written to HDFS, e.g. .lzo or .log. |
inUsePrefix | - | Prefix for temporary files; the HDFS sink first writes a temporary file into the target directory and later renames it to the final file according to the roll rules. |
inUseSuffix | .tmp | Suffix for temporary files. |
roundUnit | second | Unit used when rounding down the timestamp: second, minute or hour. |
roundValue | 1 | Value by which the timestamp is rounded down. |
round | false | Whether to round down the timestamp at all. |
rollInterval | 30 | Number of seconds after which the HDFS sink rolls the temporary file into the final target file; 0 disables time-based rolling. Rolling means the sink renames the temporary file to its final name and opens a new temporary file for writing. |
rollSize | 1024 | Roll the temporary file into the final file once it reaches this size in bytes; 0 disables size-based rolling. |
rollCount | 10 | Roll the temporary file into the final file once this many events have been written; 0 disables event-count-based rolling. |
writeFormat | Writable | Record format for sequence files: Text or Writable (default). |
threadsPoolSize | 10 | Number of threads the HDFS sink uses for HDFS I/O operations. |
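To see how the date escapes and the round/roll settings from this table combine, here is a minimal sketch of an HDFS sink section (the agent/component names and the exact path are illustrative, not taken from the example below; the time escapes need either a timestamp header or useLocalTimeStamp):
# Sketch: one directory per day and per 10-minute bucket, roll only every 60 seconds
b2.sinks.k1.type = hdfs
b2.sinks.k1.hdfs.path = hdfs://master.hadoop:8020/data/flume/%Y-%m-%d/%H%M
b2.sinks.k1.hdfs.filePrefix = events-
b2.sinks.k1.hdfs.fileType = DataStream
b2.sinks.k1.hdfs.round = true
b2.sinks.k1.hdfs.roundValue = 10
b2.sinks.k1.hdfs.roundUnit = minute
b2.sinks.k1.hdfs.rollInterval = 60
b2.sinks.k1.hdfs.rollSize = 0
b2.sinks.k1.hdfs.rollCount = 0
b2.sinks.k1.hdfs.useLocalTimeStamp = true
b2.sinks.k1.channel = c1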
Implementation:
1. Configure the conf file on the master node
[root@master ~]# cat /root/flume/conf/avro.conf
#name the components
b1.sources = s1
b1.sinks = k1
b1.channels = c1
#configure the source (bind to the master node's IP)
b1.sources.s1.type = avro
b1.sources.s1.bind = 10.0.0.13
b1.sources.s1.port = 44444
#configure the sink
b1.sinks.k1.type = hdfs
b1.sinks.k1.hdfs.path = hdfs://master.hadoop:8020/data/flume/
b1.sinks.k1.hdfs.filePrefix = slver1-abc-log-
b1.sinks.k1.hdfs.fileType = DataStream
#configure the channel
b1.channels.c1.type = memory
b1.channels.c1.capacity = 1000
b1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
b1.sources.s1.channels = c1
b1.sinks.k1.channel = c1
2. Configure the conf file on the slaver1 node (10.0.0.14)
[root@slaver1 ~]# cat /root/flume/conf/exec.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/abc.txt
#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master.hadoop
a1.sinks.k1.port = 44444
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.keep-alive = 20
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
3. Create a script that appends data to abc.txt automatically, and make it executable
[root@slaver1 ~]# cat bash.sh
#!/bin/bash
# append 100 numbered lines to /root/abc.txt for the exec source to pick up
for k in $(seq 1 100)
do
echo $k" :aaaaa" >> /root/abc.txt
done
[root@slaver1 ~]# chmod +x bash.sh
4. Start the agent on the master node first, then the agent on slaver1
[root@master ~]# flume-ng agent --conf-file /root/flume/conf/avro.conf --name b1 -Dflume.root.logger=DEBUG,console
[root@slaver1 ~]# flume-ng agent --conf-file /root/flume/conf/exec.conf --name a1 -Dflume.root.logger=DEBUG,console
On the master node you can see the incoming connection:
19/03/30 01:55:49 INFO source.AvroSource: Starting Avro source s1: { bindAddress: 10.0.0.13, port: 44444 }...
19/03/30 01:55:50 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
19/03/30 01:55:50 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
19/03/30 01:55:50 INFO source.AvroSource: Avro source s1 started.
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] OPEN
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] BOUND: /10.0.0.13:44444
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] CONNECTED: /10.0.0.14:38978
5. Run the script to insert data
[root@slaver1 ~]# ./bash.sh
6. Check the log files uploaded to /data/flume on HDFS
[root@master ~]# hadoop fs -ls /data/flume
Found 16 items
-rw-r--r-- 3 root hdfs 40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996063
-rw-r--r-- 3 root hdfs 40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996064
-rw-r--r-- 3 root hdfs 40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996065
-rw-r--r-- 3 root hdfs 40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996066
Source type: http
Receives JSON events over HTTP GET and POST requests.
http-memory-logger
1) Configuration file
[root@master conf]# cat http-memory-logger.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = http
a1.sources.s1.bind = master.hadoop
a1.sources.s1.port = 44444
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 5000
a1.channels.c1.transactionCapacity = 200
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
[root@master conf]#
2) Run the agent
[root@master conf]# flume-ng agent --conf-file http-memory-logger.conf --name a1 -Dflume.root.logger=INFO,console
3) Send test data
[root@master conf]# curl -X POST -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello"}]' http://master.hadoop:44444
4) Result
19/03/30 03:07:18 INFO mortbay.log: Started SelectChannelConnector@master.hadoop:44444
19/03/30 03:07:18 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
19/03/30 03:07:18 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
19/03/30 03:07:49 INFO sink.LoggerSink: Event: { headers:{h1=v1, h2=v2} body: 68 65 6C 6C 6F hello }
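The default JSON handler of the http source accepts an array per request, so several events can be sent in a single POST; a sketch against the same agent:
[root@master conf]# curl -X POST -d '[{"headers":{"h1":"v1"},"body":"event-1"},{"headers":{"h1":"v2"},"body":"event-2"}]' http://master.hadoop:44444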
Interceptors
Interceptors modify or drop events in flight.
Commonly used interceptors: Host, Regex Filtering
Host interceptor: writes the agent host's IP address or hostname into an event header.
Property Name | Default | Description |
---|---|---|
type | - | The component type name, has to be host |
preserveExisting | false | If the host header already exists, should it be preserved - true or false |
useIP | true | Use the IP address if true, else use the hostname. |
hostHeader | host | The header key to be used. |
[root@master flume]# cat netcat.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = host
#a1.sources.s1.interceptors.i1.useIP = false    (when false the hostname is used instead of the IP; default is true)
#a1.sources.s1.interceptors.i1.hostHeader = myhost
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Regex Filtering interceptor: keeps or drops events whose body matches a regular expression.
Property Name | Default | Description |
---|---|---|
type | - | The component type name, has to be regex_filter |
regex | ”.*” | Regular expression for matching against events |
excludeEvents | false | If true, regex determines events to exclude, otherwise regex determines events to include. |
Configuration example
[root@master flume]# cat regex_filter.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1 i2
#i1: regex filter, keeps only events whose body contains {...}
a1.sources.s1.interceptors.i1.type = regex_filter
a1.sources.s1.interceptors.i1.regex = \\{.*\\}
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i2.hostHeader = myhost
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Timestamp Interceptor
Property Name | Default | Description |
---|---|---|
type | - | The component type name, has to be timestamp or the FQCN |
headerName | timestamp | The name of the header in which to place the generated timestamp. |
preserveExisting | false | If the timestamp already exists, should it be preserved - true or false |
Static Interceptor
Property Name | Default | Description |
---|---|---|
type | - | The component type name, has to be static |
preserveExisting | true | If configured header already exists, should it be preserved - true or false |
key | key | Name of header that should be created |
value | value | Static value that should be created |
UUID Interceptor
Property Name | Default | Description |
---|---|---|
type | - | The component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
headerName | id | The name of the Flume header to modify |
preserveExisting | true | If the UUID header already exists, should it be preserved - true or false |
prefix | “” | The prefix string constant to prepend to each generated UUID |
Morphline Interceptor
Property Name | Default | Description |
---|---|---|
type | - | The component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder |
morphlineFile | - | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |
Search and Replace Interceptor
Property Name | Default | Description |
---|---|---|
type | - | The component type name has to be search_replace |
searchPattern | - | The pattern to search for and replace. |
replaceString | - | The replacement string. |
charset | UTF-8 | The charset of the event body. Assumed by default to be UTF-8. |
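No example of the search-and-replace interceptor appears in these notes; a minimal sketch, assuming we want to mask every digit in the event body (the interceptor name i1 and the source name s1 are illustrative):
# Sketch: replace every digit in the body with *
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = search_replace
a1.sources.s1.interceptors.i1.searchPattern = [0-9]
a1.sources.s1.interceptors.i1.replaceString = *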
Regex Extractor Interceptor
Property Name | Default | Description |
---|---|---|
type | - | The component type name has to be regex_extractor |
regex | - | Regular expression for matching against events |
serializers | - | Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer |
serializers..type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer |
serializers..name | - | |
serializers.* | - | Serializer-specific properties |
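The serializers column above refers to an example; a minimal sketch, adapted from the Flume user guide, that splits a body such as 1:2:3.4foobar5 into the headers one, two and three (interceptor and serializer names are illustrative):
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_extractor
a1.sources.s1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.s1.interceptors.i1.serializers = s1 s2 s3
a1.sources.s1.interceptors.i1.serializers.s1.name = one
a1.sources.s1.interceptors.i1.serializers.s2.name = two
a1.sources.s1.interceptors.i1.serializers.s3.name = three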
[root@master flume]# cat static-interceptor.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1 i2 i3 i4 i5
a1.sources.s1.interceptors.i1.type = host
a1.sources.s1.interceptors.i1.hostHeader = ip
a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.preserveExisting = false
a1.sources.s1.interceptors.i2.key = ID
a1.sources.s1.interceptors.i2.value = 1
a1.sources.s1.interceptors.i3.type = timestamp
a1.sources.s1.interceptors.i3.headerName = time
a1.sources.s1.interceptors.i4.type = remove_header
a1.sources.s1.interceptors.i4.withName = ID
a1.sources.s1.interceptors.i5.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.s1.interceptors.i5.headerName = uuid
a1.sources.s1.interceptors.i5.prefix = 1--
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Flume Channel Selectors
If the type is not specified, it defaults to “replicating”, which copies every event to all of the source's channels.
Property Name | Default | Description |
---|---|---|
selector.type | replicating | The component type name, needs to be multiplexing |
selector.header | flume.selector.header | |
selector.default | - | |
selector.mapping.* | - | |
Implementation:
Route events to different channels based on the value of a header.
1) Configure the selector (aggregation agent)
[root@master flume]# cat multiplexing-selector1.conf
#name the components
a1.sources = s1
a1.sinks = k1 k2
a1.channels = c1 c2
#configure the source and its channel selector
a1.sources.s1.type = avro
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 44444
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = type
a1.sources.s1.selector.mapping.1 = c1
a1.sources.s1.selector.mapping.2 = c2
#configure the sinks
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/flume/file/secure-log/
a1.sinks.k1.sink.pathManager.extension = aaa
a1.sinks.k1.sink.pathManager.prefix = aaa-
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.batchSize = 100
a1.sinks.k2.type = logger
#configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#bind the sources and sinks to the channels
a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
2) Configure tomcat1
[root@master flume]# cat tomcat1.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source (the static interceptor tags every event with type=1)
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 55555
a1.sources.s1.interceptors = i2
a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.key = type
a1.sources.s1.interceptors.i2.value = 1
#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.17.150
a1.sinks.k1.port = 44444
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
3) Configure tomcat2
[root@master flume]# cat tomcat2.conf
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source (the static interceptor tags every event with type=2)
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 55556
a1.sources.s1.interceptors = i2
a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.key = type
a1.sources.s1.interceptors.i2.value = 2
#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.17.150
a1.sinks.k1.port = 44444
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
4) Start the agent for multiplexing-selector1.conf first, then the agents for tomcat1.conf and tomcat2.conf (startup commands sketched below)
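A sketch of the startup and test commands, assuming the three conf files sit in the current directory and every agent is named a1 as configured above (one terminal per agent):
# flume-ng agent --conf-file multiplexing-selector1.conf --name a1 -Dflume.root.logger=INFO,console
# flume-ng agent --conf-file tomcat1.conf --name a1 -Dflume.root.logger=INFO,console
# flume-ng agent --conf-file tomcat2.conf --name a1 -Dflume.root.logger=INFO,console
Data sent to port 55555 is tagged type=1 and should land in /root/flume/file/secure-log/ (channel c1, file_roll sink); data sent to port 55556 is tagged type=2 and should appear in the aggregation agent's console output (channel c2, logger sink):
# telnet 192.168.17.150 55555
# telnet 192.168.17.150 55556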
Kafka && Flume
1) Flume configuration file
[root@master kafka]# cat /root/flume/kafka.conf
#name
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#sources
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 44444
#sinks
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = 192.168.17.150:9092
a1.sinks.k1.topic = test
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder
#channels
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100
#link
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
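brokerList, topic and serializer.class are legacy (pre-1.7) property names; recent Flume releases still translate brokerList and topic for backwards compatibility, but the Flume 1.9 documentation uses kafka.-prefixed keys. A sketch of the equivalent sink section:
#sinks (Flume 1.7+ style property names)
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = 192.168.17.150:9092
a1.sinks.k1.kafka.topic = test
a1.sinks.k1.kafka.flumeBatchSize = 100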
2) Start ZooKeeper and Kafka
# /home/zookeeper-3.5.3-beta/bin/zkServer.sh start
# /home/kafka/bin/kafka-server-start.sh /home/kafka/config/server.properties &    (& runs it in the background)
3) In a new terminal, create the topic
# bin/kafka-topics.sh --create --zookeeper 192.168.17.150:2181 --replication-factor 1 --partitions 1 --topic test
4) Start a consumer
# bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.150:9092 --topic test --from-beginning
5) In a new terminal, start Flume
# flume-ng agent --conf-file kafka.conf --name a1 -Dflume.root.logger=DEBUG,console
6) Send test data and check the result
[root@master ~]# telnet 192.168.17.150 44444
Trying 192.168.17.150...
Connected to 192.168.17.150.
Escape character is '^]'.
qweqweqwe
OK
hello kafka&flume
OK
Check the consumer started in step 4; the data has been received.
[root@master kafka]# bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.150:9092 --topic test --from-beginning
qweqweqwe
hello kafka&flume
Kafka failover & load balancing
Failover sink processor (for load balancing, set processor.type = load_balance)
Property Name | Default | Description |
---|---|---|
sinks | - | Space-separated list of sinks that are participating in the group |
processor.type | default | The component type name, needs to be failover |
processor.priority.<sinkName> | - | Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A sink with a higher priority value is activated earlier; a larger absolute value indicates higher priority. |
processor.maxpenalty | 30000 | The maximum backoff period for the failed Sink (in millis) |
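Neither processor is shown in a configuration above; a minimal sketch of a failover sink group, assuming agent a1 already defines two sinks k1 and k2 (for load balancing, replace failover with load_balance and drop the priorities):
# k1 is preferred (higher priority); k2 takes over if k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000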