hadoop-flume

Official documentation: http://flume.apache.org/index.html
Official configuration guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Chinese translation: http://www.51niux.com/?id=197

The client produces the data and runs in its own thread.

The source collects data from the client and passes it to a channel.

The sink drains data from the channel and also runs in its own thread.

The channel is the buffer that connects sources and sinks.

Source type: netcat

Description: collects data from a network port.

Example: capture whatever is typed into port 44444 on the local machine.

Pitfall: if the source is bound to localhost, telnet must also connect to localhost (keep the two consistent; ideally avoid localhost altogether).
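
One way to sidestep the pitfall (a sketch; it assumes binding to all interfaces is acceptable in this lab setup and that TCP 44444 is open in the firewall):

# bind the netcat source to all interfaces instead of localhost,
# so telnet can connect via any address of the agent host
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444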

1. Install the telnet client and server

[root@master ~]# yum install -y telnet telnet-server

2. Create the configuration file

[root@master ~]# cat netcat.conf
#name the components
#the source is named s1; several can be declared at once, separated by spaces (a1.sources = s1 s2 s3)
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444

#configure the sink
a1.sinks.k1.type = logger


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Run the agent:

[root@master ~]# flume-ng  agent  --conf-file netcat.conf  --name  a1  -Dflume.root.logger=INFO,console
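
Optionally, --conf can point at Flume's configuration directory so flume-env.sh and log4j.properties are picked up; the directory below is an assumption, adjust it to your installation:

flume-ng agent --conf /usr/local/flume/conf --conf-file netcat.conf --name a1 -Dflume.root.logger=INFO,console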

3. Run telnet (start the agent first) and send data to the port the agent's source is listening on, so the agent has something to collect.

[root@master ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello    #typed by hand
OK
lucky
OK

4. Result:

19/03/29 05:24:33 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
19/03/29 05:25:35 INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
19/03/29 05:25:39 INFO sink.LoggerSink: Event: { headers:{} body: 6C 75 63 6B 79 0D                               lucky. }

Source type: exec

An exec source is configured with a Unix (Linux) command, and the command's output is continuously consumed as event data. If the process exits, the exec source exits with it and produces no further data.
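
Because of that behaviour, a sketch like the following can make the source more resilient: restart and restartThrottle are standard exec source properties, and tail -F (rather than -f) keeps following the file across log rotation.

a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /var/log/secure
# restart the command if it exits, waiting 10 seconds between attempts
a1.sources.s1.restart = true
a1.sources.s1.restartThrottle = 10000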

Example 1: continuously collect the system log (/var/log/secure) and store the collected content on the local filesystem.

Configuration parameters for the file_roll sink, which writes to the local filesystem (bold entries are required):

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be `file_roll`. |
| **sink.directory** | – | The directory where files will be stored |
| sink.pathManager | DEFAULT | The PathManager implementation to use. |
| sink.pathManager.extension | – | The file extension if the default PathManager is used. |
| sink.pathManager.prefix | – | A character string to add to the beginning of the file name if the default PathManager is used |
| sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
| sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface |
| batchSize | 100 | |

1. Create the configuration file

[root@master flume]# cat exec.conf    
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#configure the source
a1.sources.s1.type = exec
a1.sources.s1.command  = tail -f /var/log/secure

#configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/flume/secure-log/
a1.sinks.k1.sink.pathManager.extension = aaa
a1.sinks.k1.sink.pathManager.prefix = aaa-
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.batchSize = 100

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Create the output directory /root/flume/secure-log/

[root@master ~]# mkdir /root/flume/secure-log

3. Run the agent

[root@master flume]# flume-ng agent --conf-file exec.conf --name a1  -Dflume.root.logger=INFO,console

4. Check the result

[root@master flume]# ll  secure-log/
-rw-r--r--. 1 root root    0 Apr 19 03:39 aaa-1555659597380-1.aaa
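
To see new data arrive, write a test entry to /var/log/secure (a sketch that assumes a typical rsyslog setup where the authpriv facility is routed to that file) and then inspect the rolled file:

# log a message to the authpriv facility, which rsyslog usually routes to /var/log/secure
logger -p authpriv.notice "flume exec-source test"
# the file_roll sink should have appended it to the current output file
cat /root/flume/secure-log/aaa-*.aaa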

Source type: avro

Listens on an Avro port and receives events from external Avro client streams. Paired with the built-in Avro sink on another (previous-hop) Flume agent, it can be used to build tiered collection topologies.

Example 1: build the following pipeline:
1) on node slaver1, use an exec source to collect a log file;
2) send it through an avro sink to the aggregation node master;
3) on master, receive the data from slaver1 with an avro source;
4) upload the log to HDFS.

Common parameters for the HDFS sink (see the official user guide for the full list):

| Property Name | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | The component type name, needs to be `hdfs` |
| hdfs.path | – | HDFS path to write to, including the filesystem URI |
| hdfs.filePrefix | FlumeData | File-name prefix for files written to HDFS; Flume date escapes and %{host} can be used |
| hdfs.fileType | SequenceFile | File format: SequenceFile, DataStream, or CompressedStream. With DataStream the file is not compressed and hdfs.codeC is not needed; with CompressedStream a valid hdfs.codeC must be set |
| hdfs.fileSuffix | – | File-name suffix for files written to HDFS, e.g. .lzo or .log |
| hdfs.inUsePrefix | – | Prefix for temporary files; the HDFS sink first writes a temporary file into the target directory and later renames it to the final file |
| hdfs.inUseSuffix | .tmp | Suffix for temporary files |
| hdfs.roundUnit | second | Unit used when rounding down the event timestamp: second, minute, or hour |
| hdfs.roundValue | 1 | Value the timestamp is rounded down to |
| hdfs.round | false | Whether to round down the event timestamp |
| hdfs.rollInterval | 30 | Seconds before the temporary file is rolled into the final target file; 0 disables time-based rolling. "Rolling" means the HDFS sink renames the temporary file to the final file and opens a new temporary file for writing |
| hdfs.rollSize | 1024 | Roll once the temporary file reaches this size in bytes; 0 disables size-based rolling |
| hdfs.rollCount | 10 | Roll once this many events have been written; 0 disables count-based rolling |
| hdfs.writeFormat | Writable | Record format for sequence files: Text or Writable (default) |
| hdfs.threadsPoolSize | 10 | Number of threads the HDFS sink starts for HDFS I/O operations |
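
As an illustration of these knobs (a sketch only; the daily path and roll thresholds are assumptions, not part of this tutorial's setup), a date-partitioned path with time- and size-based rolling could look like:

# write into a per-day directory and roll every 5 minutes or 128 MB, whichever comes first
b1.sinks.k1.hdfs.path = hdfs://master.hadoop:8020/data/flume/%Y-%m-%d
# use the agent's local time to resolve the date escapes (no timestamp header required)
b1.sinks.k1.hdfs.useLocalTimeStamp = true
b1.sinks.k1.hdfs.rollInterval = 300
b1.sinks.k1.hdfs.rollSize = 134217728
b1.sinks.k1.hdfs.rollCount = 0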

Implementation:
1. Configure the conf file on the master node

[root@master ~]# cat /root/flume/conf/avro.conf 
#name the components
b1.sources = s1
b1.sinks = k1
b1.channels = c1

#configure the source
b1.sources.s1.type = avro
#bind to the master node's IP
b1.sources.s1.bind = 10.0.0.13
b1.sources.s1.port = 44444

#configure the sink
b1.sinks.k1.type = hdfs
b1.sinks.k1.hdfs.path = hdfs://master.hadoop:8020/data/flume/ 
b1.sinks.k1.hdfs.filePrefix = slver1-abc-log-
b1.sinks.k1.hdfs.fileType = DataStream


#configure the channel
b1.channels.c1.type = memory
b1.channels.c1.capacity = 1000
b1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
b1.sources.s1.channels = c1
b1.sinks.k1.channel = c1

2. Configure the conf file on node slaver1 (10.0.0.14)

[root@slaver1 ~]# cat /root/flume/conf/exec.conf 
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels =c1

#configure the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/abc.txt

#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master.hadoop
a1.sinks.k1.port = 44444

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.keep-alive = 20

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

3. Create a script that keeps writing data into abc.txt, and make it executable

[root@slaver1 ~]# cat bash.sh 
#!/bin/bash
# append 100 numbered lines to /root/abc.txt
for k in $(seq 1 100)
do
        echo $k"  :aaaaa" >> /root/abc.txt
done
[root@slaver1 ~]# chmod +x bash.sh

4. Start the agent on the master node first, then the agent on slaver1

[root@master ~]# flume-ng agent  --conf-file /root/flume/conf/avro.conf --name b1  -Dflume.root.logger=DEBUG,console
[root@slaver1 ~]# flume-ng agent --conf-file /root/flume/conf/exec.conf --name a1  -Dflume.root.logger=DEBUG,console

On the master node you can see the connection being established:

19/03/30 01:55:49 INFO source.AvroSource: Starting Avro source s1: { bindAddress: 10.0.0.13, port: 44444 }...
19/03/30 01:55:50 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
19/03/30 01:55:50 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
19/03/30 01:55:50 INFO source.AvroSource: Avro source s1 started.
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] OPEN
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] BOUND: /10.0.0.13:44444
19/03/30 01:55:50 INFO ipc.NettyServer: [id: 0x6b3d911a, /10.0.0.14:38978 => /10.0.0.13:44444] CONNECTED: /10.0.0.14:38978

5. Run the script to insert data

[root@slaver1 ~]# ./bash.sh 

6. Check the log files uploaded to /data/flume on HDFS

[root@master ~]# hadoop fs -ls /data/flume     
Found 16 items
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996063
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996064
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996065
-rw-r--r--   3 root hdfs         40 2019-03-30 01:56 /data/flume/slver1-abc-log-.1553910996066
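
To check the contents that were written (the glob simply matches the files listed above):

[root@master ~]# hadoop fs -cat '/data/flume/slver1-abc-log-*' | head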

Source type: http

Receives JSON-formatted events via HTTP GET and POST requests.

http-memory-logger
1) Configuration file

[root@master conf]# cat http-memory-logger.conf 
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#configure the source
a1.sources.s1.type = http
a1.sources.s1.bind = master.hadoop
a1.sources.s1.port = 44444

#configure the sink
a1.sinks.k1.type = logger

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 5000
a1.channels.c1.transactionCapacity = 200

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
[root@master conf]# 

2) Run the agent

[root@master conf]# flume-ng agent --conf-file http-memory-logger.conf --name a1 -Dflume.root.logger=INFO,console

3) Send test data

[root@master conf]# curl -X POST -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello"}]'  http://master.hadoop:44444 

4) Result

    19/03/30 03:07:18 INFO mortbay.log: Started SelectChannelConnector@master.hadoop:44444
    19/03/30 03:07:18 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
    19/03/30 03:07:18 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started
    19/03/30 03:07:49 INFO sink.LoggerSink: Event: { headers:{h1=v1, h2=v2} body: 68 65 6C 6C 6F                                  hello }
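
Since the default JSONHandler accepts a JSON array, several events (each with its own headers) can be posted in a single request; a quick sketch against the same endpoint:

curl -X POST -d '[{"headers":{"h1":"v1"},"body":"event-1"},{"headers":{"h2":"v2"},"body":"event-2"}]'  http://master.hadoop:44444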

Interceptors

Interceptors can modify or drop events in flight.
Commonly used interceptors: Host, Regex Filtering.
The Host interceptor adds host information (IP address or hostname) to the event headers.

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `host` |
| preserveExisting | false | If the host header already exists, should it be preserved - true or false |
| useIP | true | Use the IP address if true, else use the hostname |
| hostHeader | host | The header key to be used |

[root@master flume]# cat netcat.conf

#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = host
#a1.sources.s1.interceptors.i1.useIP = false        #when false, the hostname is used instead of the IP (default is true)
#a1.sources.s1.interceptors.i1.hostHeader = myhost
#configure the sink
a1.sinks.k1.type = logger
#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Regex Filtering interceptor: filters events by matching the event body against a regular expression.

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `regex_filter` |
| regex | ".*" | Regular expression for matching against events |
| excludeEvents | false | If true, regex determines events to exclude, otherwise regex determines events to include. |

Configuration example:
[root@master flume]# cat regex_filter.conf

#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1


#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444

a1.sources.s1.interceptors = i1 i2
#regex filter: keep only events whose body contains {...}
a1.sources.s1.interceptors.i1.type = regex_filter
a1.sources.s1.interceptors.i1.regex = \\{.*\\}

a1.sources.s1.interceptors.i2.type = host  
a1.sources.s1.interceptors.i2.hostHeader = myhost 
#configure the sink
a1.sinks.k1.type = logger


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
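
A quick way to exercise the filter (assuming the agent is started with the console logger as in the earlier examples):

telnet localhost 44444
# a line containing braces, e.g. {"id": 1}, matches \{.*\} and reaches the logger sink;
# a plain line such as "hello" does not match and is dropped (excludeEvents defaults to false)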

Timestamp Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `timestamp` or the FQCN |
| headerName | timestamp | The name of the header in which to place the generated timestamp. |
| preserveExisting | false | If the timestamp already exists, should it be preserved - true or false |

Static Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name, has to be `static` |
| preserveExisting | true | If configured header already exists, should it be preserved - true or false |
| key | key | Name of header that should be created |
| value | value | Static value that should be created |

UUID Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder` |
| headerName | id | The name of the Flume header to modify |
| preserveExisting | true | If the UUID header already exists, should it be preserved - true or false |
| prefix | "" | The prefix string constant to prepend to each generated UUID |

Morphline Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder` |
| morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf |
| morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |

Search and Replace Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `search_replace` |
| searchPattern | – | The pattern to search for and replace. |
| replaceString | | The replacement string. |
| charset | UTF-8 | The charset of the event body. Assumed by default to be UTF-8. |

Regex Extractor Interceptor

| Property Name | Default | Description |
| --- | --- | --- |
| type | – | The component type name has to be `regex_extractor` |
| regex | – | Regular expression for matching against events |
| serializers | – | Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer, org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer |
| serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer |
| serializers.<s1>.name | – | |
| serializers.* | – | Serializer-specific properties |
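
A small sketch of the regex extractor (the component names and pattern are illustrative, not part of this tutorial's configs): pull the digits following "code=" out of the event body into a header named status.

a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_extractor
# one capture group -> one serializer; the match is stored under the header key "status"
a1.sources.s1.interceptors.i1.regex = code=(\\d+)
a1.sources.s1.interceptors.i1.serializers = s1
a1.sources.s1.interceptors.i1.serializers.s1.name = status

The configuration below combines several of the interceptors listed above (host, static, timestamp, remove_header, UUID):
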
[root@master flume]# cat  static-interceptor.conf   
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1


#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sources.s1.interceptors = i1 i2 i3 i4 i5

a1.sources.s1.interceptors.i1.type = host
a1.sources.s1.interceptors.i1.hostHeader = ip
 
a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.preserveExisting = false
a1.sources.s1.interceptors.i2.key = ID
a1.sources.s1.interceptors.i2.value = 1 

a1.sources.s1.interceptors.i3.type = timestamp 
a1.sources.s1.interceptors.i3.headerName = time

a1.sources.s1.interceptors.i4.type = remove_header
a1.sources.s1.interceptors.i4.withName = ID

a1.sources.s1.interceptors.i5.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.s1.interceptors.i5.headerName = uuid
a1.sources.s1.interceptors.i5.prefix = 1--
#configure the sink
a1.sinks.k1.type = logger


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Flume Channel Selectors

If the type is not specified, it defaults to "replicating" (events are copied to every configured channel).

| Property Name | Default | Description |
| --- | --- | --- |
| selector.type | replicating | The component type name, needs to be `multiplexing` |
| selector.header | flume.selector.header | |
| selector.default | – | |
| selector.mapping.* | – | |
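
For contrast, a minimal sketch of the default replicating selector (names are placeholders): every event is copied to both channels, and c2 is marked optional so a failure to write to it does not fail the transaction.

a1.sources.s1.channels = c1 c2
a1.sources.s1.selector.type = replicating
# optional channels may fail without causing the event to be retried
a1.sources.s1.selector.optional = c2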

Goal: route events to different channels based on the value of a header.

1) Configure the selector (aggregation agent)

[root@master flume]# cat multiplexing-selector1.conf 
#name the components
a1.sources = s1
a1.sinks = k1 k2
a1.channels = c1 c2

#configure the source
a1.sources.s1.type = avro
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 44444

a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = type
a1.sources.s1.selector.mapping.1 = c1
a1.sources.s1.selector.mapping.2 = c2

#configure the sinks
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/flume/file/secure-log/
a1.sinks.k1.sink.pathManager.extension = aaa
a1.sinks.k1.sink.pathManager.prefix = aaa-
a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.batchSize = 100

a1.sinks.k2.type = logger

#configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

#bind the sources and sinks to their channels
a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

2) Configure tomcat1

[root@master flume]# cat tomcat1.conf   
#name the components
a1.sources = s1
a1.sinks = k1 
a1.channels = c1

#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 55555
a1.sources.s1.interceptors = i2

a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.key = type
a1.sources.s1.interceptors.i2.value = 1

#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.17.150
a1.sinks.k1.port = 44444


#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

3) Configure tomcat2

[root@master flume]# cat tomcat2.conf 
#name the components
a1.sources = s1
a1.sinks = k1 
a1.channels = c1 

#configure the source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 55556
a1.sources.s1.interceptors = i2

a1.sources.s1.interceptors.i2.type = static
a1.sources.s1.interceptors.i2.key = type
a1.sources.s1.interceptors.i2.value = 2

#configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.17.150
a1.sinks.k1.port = 44444

#configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#bind the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

4) Start the agent for multiplexing-selector1.conf first, then the agents for tomcat1.conf and tomcat2.conf (a sketch of the commands follows).
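
A sketch of the launch order and a quick test, following the same command pattern as the earlier examples (all three agents are named a1, so run each in its own terminal):

flume-ng agent --conf-file multiplexing-selector1.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng agent --conf-file tomcat1.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng agent --conf-file tomcat2.conf --name a1 -Dflume.root.logger=INFO,console

# lines sent to 55555 are tagged type=1 and should land in the file_roll sink;
# lines sent to 55556 are tagged type=2 and should show up on the logger sink
telnet 192.168.17.150 55555
telnet 192.168.17.150 55556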

Kafka && flume


1) Flume configuration file
[root@master kafka]# cat /root/flume/kafka.conf 
#name
a1.sources = s1
a1.sinks = k1
a1.channels = c1

#sources
a1.sources.s1.type = netcat
a1.sources.s1.bind = 192.168.17.150
a1.sources.s1.port = 44444

#sinks
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = 192.168.17.150:9092
a1.sinks.k1.topic = test          
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder 

#channels
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

#link
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2) Start ZooKeeper and Kafka

# /home/zookeeper-3.5.3-beta/bin/zkServer.sh start    
# /home/kafka/bin/kafka-server-start.sh  /home/kafka/config/server.properties &     (&: run in the background)

3) In another terminal, create the topic

# bin/kafka-topics.sh --create --zookeeper 192.168.17.150:2181 --replication-factor 1 --partitions 1 --topic test    
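
To confirm the topic was created (same ZooKeeper address as above):

# bin/kafka-topics.sh --list --zookeeper 192.168.17.150:2181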

4) Start a console consumer

# bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.150:9092 --topic test   --from-beginning

5) In another terminal, start Flume

# flume-ng agent --conf-file kafka.conf --name a1 -Dflume.root.logger=DEBUG,console

6) Send test data and check the result

[root@master ~]# telnet 192.168.17.150 44444
Trying 192.168.17.150...
Connected to 192.168.17.150.
Escape character is '^]'.
qweqweqwe
OK
hello  kafka&flume
OK

Looking back at the consumer started in step 4, the data has arrived.

[root@master kafka]# bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.150:9092 --topic test --from-beginning
qweqweqwe
hello  kafka&flume

Failover & load balancing (sink groups)

Failover sink processor

| Property Name | Default | Description |
| --- | --- | --- |
| sinks | – | Space-separated list of sinks that are participating in the group |
| processor.type | default | The component type name, needs to be `failover` |
| processor.priority.<sinkName> | – | Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A higher priority value Sink gets activated earlier. A larger absolute value indicates higher priority |
| processor.maxpenalty | 30000 | The maximum backoff period for the failed Sink (in millis) |
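
A sketch of both sink processors (the group and sink names g1/k1/k2 are placeholders): a failover group that prefers k1 and falls back to k2, with the load-balancing variant shown commented out for comparison.

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# failover: k1 (priority 10) is used while healthy; k2 (priority 5) takes over if k1 fails
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# load balancing instead: distribute events across k1 and k2 round-robin, backing off failed sinks
#a1.sinkgroups.g1.processor.type = load_balance
#a1.sinkgroups.g1.processor.selector = round_robin
#a1.sinkgroups.g1.processor.backoff = true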