hadoop-flume

Official documentation: http://flume.apache.org/index.html
Official configuration guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Chinese translation: http://www.51niux.com/?id=197

The client produces the data and runs in its own thread.

The source collects data from the client and passes it to a channel.

The sink drains data from the channel and also runs in its own thread.

The channel is the buffer that connects sources and sinks.

Source type: netcat

Description: collects data from a network port.

Example: capture whatever is typed into port 44444 on the local machine.

Pitfall: if the source is bound to localhost, telnet must also connect to localhost (keep the two consistent; ideally avoid localhost altogether).
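
One way to sidestep the pitfall (a sketch; it assumes binding to all interfaces is acceptable in this lab setup and that TCP 44444 is open in the firewall):

# bind the netcat source to all interfaces instead of localhost,
# so telnet can connect via any address of the agent host
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444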

1. Install the telnet client and server

[root@master ~]# yum install -y telnet telnet-server

2. Create the configuration file

[root@master ~]# cat netcat.conf
#name the components
#the source is named s1; several can be declared at once, separated by spaces (a1.sources = s1 s2 s3)
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444

#configure the sink
a1.sinks.k1.type = logger


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Run the agent:

[root@master ~]# flume-ng  agent  --conf-file netcat.conf  --name  a1  -Dflume.root.logger=INFO,console
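
Optionally, --conf can point at Flume's configuration directory so flume-env.sh and log4j.properties are picked up; the directory below is an assumption, adjust it to your installation:

flume-ng agent --conf /usr/local/flume/conf --conf-file netcat.conf --name a1 -Dflume.root.logger=INFO,console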

3. Run telnet (start the agent first) and send data to the port the agent's source is listening on, so the agent has something to collect.

[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello    #typed by hand
OK
lucky
OK

4. Result:

19/03/29 05:24:33 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
19/03/29 05:25:35 INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
19/03/29 05:25:39 INFO sink.LoggerSink: Event: { headers:{} body: 6C 75 63 6B 79 0D                               lucky. }

Source type: exec

An exec source is configured with a Unix (Linux) command, and the command's output is continuously consumed as event data. If the process exits, the exec source exits with it and produces no further data.
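
Because of that behaviour, a sketch like the following can make the source more resilient: restart and restartThrottle are standard exec source properties, and tail -F (rather than -f) keeps following the file across log rotation.

a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /var/log/secure
# restart the command if it exits, waiting 10 seconds between attempts
a1.sources.s1.restart = true
a1.sources.s1.restartThrottle = 10000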

Example 1: continuously collect the system log (/var/log/secure) and store the collected content on the local filesystem.

Configuration parameters for the file_roll sink, which writes to the local filesystem (bold entries are required):

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be `file_roll`. |
| **sink.directory** | – | The directory where files will be stored |
| sink.pathManager | DEFAULT | The PathManager implementation to use. |
| sink.pathManager.extension | – | The file extension if the default PathManager is used. |
| sink.pathManager.prefix | – | A character string to add to the beginning of the file name if the default PathManager is used |
| sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
| sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface |
| batchSize | 100 | |

1. Create the configuration file

[root@master flume]# cat exec.conf    
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#configure the source
a1.sources.s1.type = exec
a1.sources.s1.command  = tail -f /var/log/secure

#configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/flume/secure-log/
a1.sinks.k1.sink.pathManager.extension = aaa
a1.sinks.k1.sink.pathManager.prefix = aaa-
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.batchSize = 100

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Create the output directory /root/flume/secure-log/

[root@master ~]# mkdir /root/flume/secure-log

3. Run the agent

[root@master flume]# flume-ng agent --conf-file exec.conf --name a1  -Dflume.root.logger=INFO,console

4. Check the result

[root@master flume]# ll  secure-log/
-rw-r--r--. 1 root root    0 Apr 19 03:39 aaa-1555659597380-1.aaa
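
To see new data arrive, write a test entry to /var/log/secure (a sketch that assumes a typical rsyslog setup where the authpriv facility is routed to that file) and then inspect the rolled file:

# log a message to the authpriv facility, which rsyslog usually routes to /var/log/secure
logger -p authpriv.notice "flume exec-source test"
# the file_roll sink should have appended it to the current output file
cat /root/flume/secure-log/aaa-*.aaa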

Source type: avro

Listens on an Avro port and receives events from external Avro client streams. Paired with the built-in Avro sink on another (previous-hop) Flume agent, it can be used to build tiered collection topologies.

Example 1: build the following pipeline:
1) on node slaver1, use an exec source to collect a log file;
2) send it through an avro sink to the aggregation node master;
3) on master, receive the data from slaver1 with an avro source;
4) upload the log to HDFS.

Common parameters for the HDFS sink (see the official user guide for the full list):

| Property Name | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | The component type name, needs to be `hdfs` |
| hdfs.path | – | HDFS path to write to, including the filesystem URI |
| hdfs.filePrefix | FlumeData | File-name prefix for files written to HDFS; Flume date escapes and %{host} can be used |
| hdfs.fileType | SequenceFile | File format: SequenceFile, DataStream, or CompressedStream. With DataStream the file is not compressed and hdfs.codeC is not needed; with CompressedStream a valid hdfs.codeC must be set |
| hdfs.fileSuffix | – | File-name suffix for files written to HDFS, e.g. .lzo or .log |
| hdfs.inUsePrefix | – | Prefix for temporary files; the HDFS sink first writes a temporary file into the target directory and later renames it to the final file |
| hdfs.inUseSuffix | .tmp | Suffix for temporary files |
| hdfs.roundUnit | second | Unit used when rounding down the event timestamp: second, minute, or hour |
| hdfs.roundValue | 1 | Value the timestamp is rounded down to |
| hdfs.round | false | Whether to round down the event timestamp |
| hdfs.rollInterval | 30 | Seconds before the temporary file is rolled into the final target file; 0 disables time-based rolling. "Rolling" means the HDFS sink renames the temporary file to the final file and opens a new temporary file for writing |
| hdfs.rollSize | 1024 | Roll once the temporary file reaches this size in bytes; 0 disables size-based rolling |
| hdfs.rollCount | 10 | Roll once this many events have been written; 0 disables count-based rolling |
| hdfs.writeFormat | Writable | Record format for sequence files: Text or Writable (default) |
| hdfs.threadsPoolSize | 10 | Number of threads the HDFS sink starts for HDFS I/O operations |
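
As an illustration of these knobs (a sketch only; the daily path and roll thresholds are assumptions, not part of this tutorial's setup), a date-partitioned path with time- and size-based rolling could look like:

# write into a per-day directory and roll every 5 minutes or 128 MB, whichever comes first
b1.sinks.k1.hdfs.path = hdfs://master.hadoop:8020/data/flume/%Y-%m-%d
# use the agent's local time to resolve the date escapes (no timestamp header required)
b1.sinks.k1.hdfs.useLocalTimeStamp = true
b1.sinks.k1.hdfs.rollInterval = 300
b1.sinks.k1.hdfs.rollSize = 134217728
b1.sinks.k1.hdfs.rollCount = 0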

Implementation:
1. Configure the conf file on the master node

[root@master ~]# cat /root/flume/conf/avro.conf 
#name the components
b1.sources = s1
b1.sinks = k1
b1.channels = c1

#configure the source
b1.sources.s1.type = avro
#bind to the master node's IP
b1.sources.s1.bind = 10.0.0.13
b1.sources.s1.port = 44444

#configure the sink
b1.sinks.k1.type = hdfs
b1.sinks.k1.hdfs.path = hdfs://master.hadoop:8020/data/flume/ 
b1.sinks.k1.hdfs.filePrefix = slver1-abc-log-
b1.sinks.k1.hdfs.fileType = DataStream


#configure the channel
b1.channels.c1.type = memory
b1.channels.c1.capacity = 1000
b1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
b1.sources.s1.channels = c1
b1.sinks.k1.channel = c1

2. Configure the conf file on node slaver1 (10.0.0.14)

[root@slaver1 ~]# cat /root/flume/conf/exec.conf 
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels =c1

#configure the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/abc.txt

#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master.hadoop
a1.sinks.k1.port = 44444

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.keep-alive = 20

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

3. Create a script that keeps writing data into abc.txt, and make it executable

[root@slaver1 ~]# cat bash.sh 
#!/bin/bash
# append 100 numbered lines to /root/abc.txt
for k in $(seq 1 100)
do
        echo $k"  :aaaaa" >> /root/abc.txt
done
[root@slaver1 ~]# chmod +x bash.sh

4. Start the agent on the master node first, then the agent on slaver1

[root@master ~]# flume-ng agent  --conf-file /root/flume/conf/avro.conf --name b1  -Dflume.root.logger=DEBUG,console
[root@slaver1 ~]# flume-ng agent --conf-file /root/flume/conf/exec.conf --name a1  -Dflume.root.logger=DEBUG,console

On the master node you can see the connection being established:

19/03/30 01:55:49 INFO source.AvroSource: Starting Avro source s1: { bindAddress: 10.0.0.13, port: 44444 }...
19/03/30 01:55:50 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
19/03/30 01:55:50 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
19/03/30 01:55:50 INFO source.AvroSource: Avro source s1 started.
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] OPEN
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] BOUND: /10.0.0.13:44444
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] CONNECTED: /10.0.0.14:38978

5. Run the script to insert data

[root@slaver1 ~]# ./bash.sh 

6. Check the log files uploaded to /data/flume on HDFS

[root@master ~]# hadoop fs -ls /data/flume     
Found 16 items
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996063
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996064
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996065
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996066
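
To check the contents that were written (the glob simply matches the files listed above):

[root@master ~]# hadoop fs -cat '/data/flume/slver1-abc-log-*' | head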

Source type: http

Receives JSON-formatted events via HTTP GET and POST requests.

http-memory-logger
1) Configuration file

[root@master conf]# cat http-memory-logger.conf 
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#configure the source
a1.sources.s1.type = http
a1.sources.s1.bind = master.hadoop
a1.sources.s1.port = 44444

#configure the sink
a1.sinks.k1.type = logger

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 5000
a1.channels.c1.transactionCapacity = 200

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
[root@master conf]# 

2) Run the agent

[root@master conf]# flume-ng agent --conf-file http-memory-logger.conf --name a1 -Dflume.root.logger=INFO,console

3) Send test data

[root@master conf]# curl -X POST -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello"}]'  http://master.hadoop:44444 

4) Result

    19/03/30 03:07:18 INFO mortbay.log: Started SelectChannelConnector@master.hadoop:44444
    19/03/30 03:07:18 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
    19/03/30 03:07:18 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
    19/03/30 03:07:49 INFO sink.LoggerSink: Event: { headers:{h1=v1, h2=v2} body: 68 65 6C 6C 6F                                  hello }
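
Since the default JSONHandler accepts a JSON array, several events (each with its own headers) can be posted in a single request; a quick sketch against the same endpoint:

curl -X POST -d '[{"headers":{"h1":"v1"},"body":"event-1"},{"headers":{"h2":"v2"},"body":"event-2"}]'  http://master.hadoop:44444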

Interceptors

Interceptors can modify or drop events in flight.
Commonly used interceptors: Host, Regex Filtering.
The Host interceptor adds host information (IP address or hostname) to the event headers.

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `host` |
| preserveExisting | false | If the host header already exists, should it be preserved - true or false |
| useIP | true | Use the IP address if true, else use the hostname |
| hostHeader | host | The header key to be used |

[root@master flume]# cat netcat.conf

#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = host
#a1.sources.s1.interceptors.i1.useIP = false        #when false, the hostname is used instead of the IP (default is true)
#a1.sources.s1.interceptors.i1.hostHeader = myhost
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Regex Filtering interceptor: filters events by matching the event body against a regular expression.

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `regex_filter` |
| regex | ".*" | Regular expression for matching against events |
| excludeEvents | false | If true, regex determines events to exclude, otherwise regex determines events to include. |

Configuration example:
[root@master flume]# cat regex_filter.conf

#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1


#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444

a1.sources.s1.interceptors = i1 i2
#regex filter: keep only events whose body contains {...}
a1.sources.s1.interceptors.i1.type = regex_filter
a1.sources.s1.interceptors.i1.regex = \\{.*\\}

a1.sources.s1.interceptors.i2.type = host  
a1.sources.s1.interceptors.i2.hostHeader = myhost 
#configure the sink
a1.sinks.k1.type = logger


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
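
A quick way to exercise the filter (assuming the agent is started with the console logger as in the earlier examples):

telnet localhost 44444
# a line containing braces, e.g. {"id": 1}, matches \{.*\} and reaches the logger sink;
# a plain line such as "hello" does not match and is dropped (excludeEvents defaults to false)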

Timestamp Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `timestamp` or the FQCN |
| headerName | timestamp | The name of the header in which to place the generated timestamp. |
| preserveExisting | false | If the timestamp already exists, should it be preserved - true or false |

Static Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `static` |
| preserveExisting | true | If configured header already exists, should it be preserved - true or false |
| key | key | Name of header that should be created |
| value | value | Static value that should be created |

UUID Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder` |
| headerName | id | The name of the Flume header to modify |
| preserveExisting | true | If the UUID header already exists, should it be preserved - true or false |
| prefix | "" | The prefix string constant to prepend to each generated UUID |

Morphline Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder` |
| morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf |
| morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |

Search and Replace Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `search_replace` |
| searchPattern | – | The pattern to search for and replace. |
| replaceString | | The replacement string. |
| charset | UTF-8 | The charset of the event body. Assumed by default to be UTF-8. |

Regex Extractor Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `regex_extractor` |
| regex | – | Regular expression for matching against events |
| serializers | – | Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer, org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer |
| serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer |
| serializers.<s1>.name | – | |
| serializers.* | – | Serializer-specific properties |
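
A small sketch of the regex extractor (the component names and pattern are illustrative, not part of this tutorial's configs): pull the digits following "code=" out of the event body into a header named status.

a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_extractor
# one capture group -> one serializer; the match is stored under the header key "status"
a1.sources.s1.interceptors.i1.regex = code=(\\d+)
a1.sources.s1.interceptors.i1.serializers = s1
a1.sources.s1.interceptors.i1.serializers.s1.name = status

The configuration below combines several of the interceptors listed above (host, static, timestamp, remove_header, UUID):
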
[root@master flume]# cat  static-interceptor.conf   
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1


#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1 i2 i3 i4 i5

a1.sources.s1.interceptors.i1.type = host
a1.sources.s1.interceptors.i1.hostHeader = ip
 
a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.preserveExisting = false
a1.sources.s1.interceptors.i2.key = ID
a1.sources.s1.interceptors.i2.value = 1 

a1.sources.s1.interceptors.i3.type = timestamp 
a1.sources.s1.interceptors.i3.headerName = time

a1.sources.s1.interceptors.i4.type = remove_header
a1.sources.s1.interceptors.i4.withName = ID

a1.sources.s1.interceptors.i5.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.s1.interceptors.i5.headerName = uuid
a1.sources.s1.interceptors.i5.prefix = 1--
#configure the sink
a1.sinks.k1.type = logger


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Flume Channel Selectors

If the type is not specified, it defaults to "replicating" (events are copied to every configured channel).

| Property Name | Default | Description |
| --- | --- | --- |
| selector.type | replicating | The component type name, needs to be `multiplexing` |
| selector.header | flume.selector.header | |
| selector.default | – | |
| selector.mapping.* | – | |
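
For contrast, a minimal sketch of the default replicating selector (names are placeholders): every event is copied to both channels, and c2 is marked optional so a failure to write to it does not fail the transaction.

a1.sources.s1.channels = c1 c2
a1.sources.s1.selector.type = replicating
# optional channels may fail without causing the event to be retried
a1.sources.s1.selector.optional = c2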

Goal: route events to different channels based on the value of a header.

1) Configure the selector (aggregation agent)

[root@master flume]# cat multiplexing-selector1.conf 
#name the components
a1.sources = s1
a1.sinks = k1 k2
a1.channels = c1 c2

#configure the source
a1.sources.s1.type = avro
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 44444

a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = type
a1.sources.s1.selector.mapping.1 = c1
a1.sources.s1.selector.mapping.2 = c2

#configure the sinks
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/flume/file/secure-log/
a1.sinks.k1.sink.pathManager.extension = aaa
a1.sinks.k1.sink.pathManager.prefix = aaa-
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.batchSize = 100

a1.sinks.k2.type = logger

#configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

#bind the sources and sinks to their channels
a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

2) Configure tomcat1

[root@master flume]# cat tomcat1.conf   
#name the components
a1.sources = s1
a1.sinks = k1 
a1.channels = c1

#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 55555
a1.sources.s1.interceptors = i2

a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.key = type
a1.sources.s1.interceptors.i2.value = 1

#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.17.150
a1.sinks.k1.port = 44444


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

3) Configure tomcat2

[root@master flume]# cat tomcat2.conf 
#name the components
a1.sources = s1
a1.sinks = k1 
a1.channels = c1 

#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 55556
a1.sources.s1.interceptors = i2

a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.key = type
a1.sources.s1.interceptors.i2.value = 2

#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.17.150
a1.sinks.k1.port = 44444

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

4) Start the agent for multiplexing-selector1.conf first, then the agents for tomcat1.conf and tomcat2.conf (a sketch of the commands follows).
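
A sketch of the launch order and a quick test, following the same command pattern as the earlier examples (all three agents are named a1, so run each in its own terminal):

flume-ng agent --conf-file multiplexing-selector1.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng agent --conf-file tomcat1.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng agent --conf-file tomcat2.conf --name a1 -Dflume.root.logger=INFO,console

# lines sent to 55555 are tagged type=1 and should land in the file_roll sink;
# lines sent to 55556 are tagged type=2 and should show up on the logger sink
telnet 192.168.17.150 55555
telnet 192.168.17.150 55556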

Kafka && flume


1) Flume configuration file
[root@master kafka]# cat /root/flume/kafka.conf 
#name
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#sources
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 44444

#sinks
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = 192.168.17.150:9092
a1.sinks.k1.topic = test          
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder 

#channels
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

#link
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2) Start ZooKeeper and Kafka

# /home/zookeeper-3.5.3-beta/bin/zkServer.sh start    
# /home/kafka/bin/kafka-server-start.sh  /home/kafka/config/server.properties &     (&: run in the background)

3) In another terminal, create the topic

# bin/kafka-topics.sh --create --zookeeper 192.168.17.150:2181 --replication-factor 1 --partitions 1 --topic test    
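
To confirm the topic was created (same ZooKeeper address as above):

# bin/kafka-topics.sh --list --zookeeper 192.168.17.150:2181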

4) Start a console consumer

# bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.150:9092 --topic test   --from-beginning

5) In another terminal, start Flume

# flume-ng agent --conf-file kafka.conf --name a1 -Dflume.root.logger=DEBUG,console

6) Send test data and check the result

[root@master ~]# telnet 192.168.17.150 44444
Trying 192.168.17.150...
Connected to 192.168.17.150.
Escape character is '^]'.
qweqweqwe
OK
hello  kafka&flume
OK

Looking back at the consumer started in step 4, the data has arrived.

[root@master kafka]# bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.150:9092 --topic test --from-beginning
qweqweqwe
hello  kafka&flume

Failover & load balancing (sink groups)

Failover sink processor

| Property Name | Default | Description |
| --- | --- | --- |
| sinks | – | Space-separated list of sinks that are participating in the group |
| processor.type | default | The component type name, needs to be `failover` |
| processor.priority.<sinkName> | – | Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A higher priority value Sink gets activated earlier. A larger absolute value indicates higher priority |
| processor.maxpenalty | 30000 | The maximum backoff period for the failed Sink (in millis) |
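
A sketch of both sink processors (the group and sink names g1/k1/k2 are placeholders): a failover group that prefers k1 and falls back to k2, with the load-balancing variant shown commented out for comparison.

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# failover: k1 (priority 10) is used while healthy; k2 (priority 5) takes over if k1 fails
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# load balancing instead: distribute events across k1 and k2 round-robin, backing off failed sinks
#a1.sinkgroups.g1.processor.type = load_balance
#a1.sinkgroups.g1.processor.selector = round_robin
#a1.sinkgroups.g1.processor.backoff = true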