I. Understanding Flume
The Flume architecture is built around the following core concepts:
Event: a unit of data, carrying an optional set of message headers
Flow: an abstraction of an Event's journey from its origin to its destination
Client: operates on Events at their origin and sends them to a Flume Agent
Agent: an independent Flume process containing the Source, Channel, and Sink components
Source: consumes the Events delivered to it
Channel: temporary storage that buffers the Events handed over by the Source
Sink: reads Events from the Channel, removes them, and passes them to the next Agent in the flow pipeline (if there is one)
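To make the Event concept concrete, here is a minimal sketch (illustrative Python only, not Flume's actual Java API) of an event as a byte body plus an optional header map:

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative model of a Flume Event: a byte-array body plus an optional map
# of string headers. This mirrors the concept only, not Flume's real Java API.
@dataclass
class Event:
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)

e = Event(body=b"hello flume", headers={"host": "web01"})
print(e.headers["host"], e.body.decode())  # prints "web01 hello flume"
```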
II. Installing Flume
1. Prepare the base environment
OS: CentOS 6.7
Java: JDK 8u92
http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.tar.gz?AuthParam=1466493996_149f31c41a3a9ef17975ade95149bfcf
#tar zxvf jdk-8u92-linux-x64.tar.gz
#mv jdk1.8.0_92 /usr/local/jdk
2. Flume download
Official download page:
http://flume.apache.org/download.html
Two files need to be downloaded:
apache-flume-1.6.0-bin.tar.gz
apache-flume-1.6.0-src.tar.gz
3. Install Flume
Extract both downloaded tarballs:
#tar zxvf apache-flume-1.6.0-bin.tar.gz
#tar zxvf apache-flume-1.6.0-src.tar.gz
Overlay the contents of the src package onto the extracted bin directory:
#cp -ri apache-flume-1.6.0-src/* apache-flume-1.6.0-bin
Rename the directory:
#mv apache-flume-1.6.0-bin /usr/local/flume
Set the environment variables:
#vi /etc/profile
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export FLUME_HOME=/usr/local/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$JAVA_HOME/bin:$FLUME_HOME/bin
Apply the changes:
source /etc/profile
Edit the Flume environment file:
#vi /usr/local/flume/conf/flume-env.sh
export JAVA_HOME=/usr/local/jdk
Test whether Flume is installed correctly:
[root@localhost ~]# flume-ng version
Flume 1.6.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015
From source with checksum b29e416802ce9ece3269d34233baf43f
If you see output like the above, Flume is installed successfully.
4. Configure Flume
4.1 avro
#vi /usr/local/flume/conf/avro.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
#flume-ng agent --conf conf --conf-file avro.conf --name a1 -Dflume.root.logger=INFO,console
Note: -Dflume.root.logger=INFO,console is for debugging only; do not copy it blindly into production, or large amounts of log output will be dumped to the terminal.
-c/--conf takes the configuration directory, -f/--conf-file the specific configuration file, and -n/--name the agent name.
Then open another shell and send events to the listening port with the bundled avro-client (an Avro source speaks the Avro RPC protocol, so raw telnet text is not a valid event stream), for example: #flume-ng avro-client -H localhost -p 4141 -F /etc/hosts
4.2 spooldir
The spooldir source watches the configured directory for new files and reads the data out of them. Two caveats:
1) Files copied into the spool directory must not be opened or edited afterwards.
2) The spool directory must not contain subdirectories.
#vi /usr/local/flume/conf/spool.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type= spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir =/data/logs/web
a1.sources.r1.fileHeader =true
# Describe the sink
a1.sinks.k1.type= logger
# Use a channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/spool.conf -n a1 -Dflume.root.logger=INFO,console
Then manually copy a file into /data/logs/web and watch the log output as well as the changes inside /data/logs/web.
4.3 exec
The exec source runs a given command and uses its output as the source of events. When using the tail command, the file must already contain enough data before any output appears.
#vi /usr/local/flume/conf/exec_tail.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /data/logs/web/access.log
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/exec_tail.conf -n a1 -Dflume.root.logger=INFO,console
Input:
#echo "exec_tail test1" >> /data/logs/web/access.log
#echo "exec_tail test2" >> /data/logs/web/access.log
#echo "exec_tail test3" >> /data/logs/web/access.log
Output:
2016-08-01 15:49:47,837 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 6F 6F 6C 20 74 65 73 74 31 exec_tail test1 }
2016-08-01 15:49:56,838 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 6F 6F 6C 20 74 65 73 74 32 exec_tail test2 }
2016-08-01 15:50:05,841 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 6F 6F 6C 20 74 65 73 74 33 exec_tail test3 }
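The logger sink prints an event body as a hex dump followed by a printable preview. The hex portion can be decoded by hand; a quick sketch (the hex string below is illustrative, not copied from a specific log line above):

```python
# Decode the kind of hex byte dump that Flume's logger sink prints for an
# event body. Each two-digit hex token is one byte of the body.
hex_dump = "68 65 6C 6C 6F 20 66 6C 75 6D 65"
body = bytes(int(token, 16) for token in hex_dump.split())
print(body.decode("ascii"))  # prints "hello flume"
```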
4.4 syslogtcp
The syslogtcp source listens on a TCP port as its data source.
#vi /usr/local/flume/conf/syslog_tcp.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type= syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type= logger
# Use a channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/syslog_tcp.conf -n a1 -Dflume.root.logger=INFO,console
Generate a test log message:
#echo "syslog test1" | nc localhost 5140
Output:
2016-08-01 16:31:00,779 (New I/O worker #1) [WARN - org.apache.flume.source.SyslogUtils.buildEvent(SyslogUtils.java:316)] Event created from Invalid Syslog data.
2016-08-01 16:31:05,320 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 73 79 73 6C 6F 67 20 74 65 73 74 31 syslog test1 }
2016-08-01 16:31:16,710 (New I/O worker #2) [WARN - org.apache.flume.source.SyslogUtils.buildEvent(SyslogUtils.java:316)] Event created from Invalid Syslog data.
2016-08-01 16:31:16,710 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 73 79 73 6C 6F 67 20 74 65 73 74 32 syslog test2 }
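The "Invalid Syslog data" warning appears because `echo ... | nc` sends a bare string without the RFC 3164 `<PRI>` prefix, so Flume falls back to Severity=0 and Facility=0. The PRI value encodes facility and severity as a single number; a quick sketch of that arithmetic:

```python
# RFC 3164 syslog lines start with a "<PRI>" prefix, where
# PRI = facility * 8 + severity. A bare string (what `echo ... | nc` sends)
# has no such prefix, hence the "Invalid Syslog data" warning above.
def encode_pri(facility, severity):
    return facility * 8 + severity

def decode_pri(pri):
    return divmod(pri, 8)  # (facility, severity)

print(encode_pri(1, 5))  # user-level (1), notice (5) -> prints 13
print(decode_pri(13))    # prints (1, 5)
```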
4.5 JSONHandler
#vi /usr/local/flume/conf/post_json.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type= org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 8888
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type= logger
# Use a channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/post_json.conf -n a1 -Dflume.root.logger=INFO,console
Test:
# curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "idoall.org_body"}]' http://localhost:8888
Output:
2016-08-02 14:03:26,708 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{a=a1, b=b1} body: 69 64 6F 61 6C 6C 2E 6F 72 67 5F 62 6F 64 79 idoall.org_body }
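The default JSONHandler of HTTPSource accepts a JSON array in which each element is one event with a string-to-string headers map and a string body, which is exactly what the curl test sends. A small sketch building and checking the same payload (the urllib call is left commented out because it needs a running agent):

```python
import json

# Build the payload the default JSONHandler accepts: a JSON array where each
# element is one event with a headers map and a body string (the same shape
# as the curl test above).
events = [{"headers": {"a": "a1", "b": "b1"}, "body": "idoall.org_body"}]
payload = json.dumps(events)

# With the agent running, the POST could be sent like this:
# import urllib.request
# urllib.request.urlopen("http://localhost:8888", payload.encode("utf-8"))

parsed = json.loads(payload)
print(parsed[0]["headers"]["a"], parsed[0]["body"])  # prints "a1 idoall.org_body"
```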
4.6 HDFS sink
#vi /usr/local/flume/conf/hdfs_sink.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type= syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type= hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://10.1.13.201:9000/flume/syslogtcp
a1.sinks.k1.hdfs.filePrefix = Syslog
a1.sinks.k1.hdfs.round =true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
Test:
#echo "hello idoall flume -> hadoop testing one" | nc localhost 5140
Output:
2016-08-02 20:40:25,960 (New I/O worker #1) [WARN - org.apache.flume.source.SyslogUtils.buildEvent(SyslogUtils.java:316)] Event created from Invalid Syslog data.
2016-08-02 20:40:29,556 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSSequenceFile.configure(HDFSSequenceFile.java:63)] writeFormat = Writable, UseRawLocalFileSystem = false
2016-08-02 20:40:29,858 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556.tmp
2016-08-02 20:40:30,111 (hdfs-k1-call-runner-0) [WARN - org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:62)] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-08-02 20:41:01,273 (hdfs-k1-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:363)] Closing hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556.tmp
2016-08-02 20:41:01,317 (hdfs-k1-call-runner-4) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:629)] Renaming hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556.tmp to hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556
2016-08-02 20:41:01,453 (hdfs-k1-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:394)] Writer callback called.
[root@master ~]# hadoop fs -ls /flume/syslogtcp
16/08/02 20:41:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 2 root supergroup 155 2016-08-02 20:41 /flume/syslogtcp/Syslog.1470141629556
4.7 File Roll sink
#vi /usr/local/flume/conf/file_roll.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5555
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /data/logs/web/logs
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start:
[root@master conf]# flume-ng agent -c . -f ./file_roll.conf -n a1 -Dflume.root.logger=INFO,console
Test:
[root@master conf]# echo "hello idoall.org syslog" | nc 127.0.0.1 5555
[root@master conf]# echo "hello idoall.org syslog 2" | nc 127.0.0.1 5555
Output:
[root@master conf]# ll /data/logs/web/logs/
total 8
-rw-r--r--. 1 root root 0 Aug 2 20:57 1470142592748-1
-rw-r--r--. 1 root root 24 Aug 2 20:57 1470142633346-1
-rw-r--r--. 1 root root 26 Aug 2 20:57 1470142633346-2
4.8 Replicating Channel Selector
Flume supports fanning a flow out from one source to multiple channels. There are two fan-out modes: replicating and multiplexing. In replicating mode, an event is sent to all configured channels; in multiplexing mode, it is sent to only a subset of the available channels. A fan-out flow requires specifying the source and the fan-out rules for the channels. (Because HDFS was configured in the previous example, the two machines in this example are addressed by the hostnames used in that HDFS setup.)
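As a toy illustration of the two fan-out modes (plain Python lists stand in for Flume channels; this is not Flume's API):

```python
# Toy model of fan-out. A replicating selector copies every event to all
# channels; a multiplexing selector routes by a header value.
def replicate(event, channels):
    """Replicating selector: every configured channel gets the event."""
    for ch in channels:
        ch.append(event)

def multiplex(event, mapping, default_channel):
    """Multiplexing selector: route by a header value, else use the default."""
    ch = mapping.get(event["headers"].get("type"), default_channel)
    ch.append(event)

c1, c2 = [], []
replicate({"headers": {}, "body": b"hello"}, [c1, c2])
print(len(c1), len(c2))  # prints "1 1": both channels received a copy
```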
This example uses two machines: master.hadoop and slave01.hadoop.
Create the replicating_Channel_Selector configuration file on master.hadoop:
#vi /usr/local/flume/conf/replicating_Channel_Selector.conf
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = 127.0.0.1
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = master.hadoop
a1.sinks.k1.port = 5555
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slave01.hadoop
a1.sinks.k2.port = 5555
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
Create the replicating_Channel_Selector_avro configuration file on master.hadoop:
#vi /usr/local/flume/conf/replicating_Channel_Selector_avro.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Copy the two configuration files from master.hadoop to slave01.hadoop:
# scp /usr/local/flume/conf/replicating_Channel_Selector* 10.1.13.202:/usr/local/flume/conf/
Open four terminal windows, and on both master.hadoop and slave01.hadoop start the two Flume agents:
#cd /usr/local/flume/conf
#flume-ng agent -c . -f ./replicating_Channel_Selector_avro.conf -n a1 -Dflume.root.logger=INFO,console
#flume-ng agent -c . -f ./replicating_Channel_Selector.conf -n a1 -Dflume.root.logger=INFO,console
Then, on master.hadoop or slave01.hadoop, generate a test syslog message:
# echo "hello idoall.org syslog" | nc 127.0.0.1 5140
In the sink windows on both master.hadoop and slave01.hadoop you can see output like the following, which shows the message was replicated to both:
2016-08-03 14:13:47,397 (New I/O server boss #1 ([id: 0x895a7f83, /0:0:0:0:0:0:0:0:5555])) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x17efd5a4, /10.1.13.201:33406 => /10.1.13.202:5555] OPEN
2016-08-03 14:13:47,399 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x17efd5a4, /10.1.13.201:33406 => /10.1.13.202:5555] BOUND: /10.1.13.202:5555
2016-08-03 14:13:47,399 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x17efd5a4, /10.1.13.201:33406 => /10.1.13.202:5555] CONNECTED: /10.1.13.201:33406
2016-08-03 14:13:48,481 (New I/O server boss #1 ([id: 0x895a7f83, /0:0:0:0:0:0:0:0:5555])) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x5e2a9e2e, /10.1.13.202:36400 => /10.1.13.202:5555] OPEN
2016-08-03 14:13:48,482 (New I/O worker #2) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x5e2a9e2e, /10.1.13.202:36400 => /10.1.13.202:5555] BOUND: /10.1.13.202:5555
2016-08-03 14:13:48,483 (New I/O worker #2) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x5e2a9e2e, /10.1.13.202:36400 => /10.1.13.202:5555] CONNECTED: /10.1.13.202:36400
2016-08-03 14:25:09,861 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 68 65 6C 6C 6F 32 20 69 64 6F 61 6C 6C 2E 6F 72 hello2 idoall.or }
4.9 Flume Sink Processors
With failover, events are always sent to one sink; when that sink becomes unavailable, they are automatically sent to the next sink.
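The failover selection rule can be sketched as "always use the live sink with the highest priority" (illustrative Python, not Flume's implementation):

```python
# Sketch of failover sink selection: always deliver to the live sink with the
# highest priority; when it goes down, fall back to the next highest.
def pick_sink(priorities, down):
    live = {sink: prio for sink, prio in priorities.items() if sink not in down}
    if not live:
        raise RuntimeError("all sinks are down")
    return max(live, key=live.get)

priorities = {"k1": 5, "k2": 10}
print(pick_sink(priorities, down=set()))   # prints "k2": higher priority wins
print(pick_sink(priorities, down={"k2"}))  # prints "k1": k2 failed, failover
```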
Create the Flume_Sink_Processors configuration file on master.hadoop:
#vi /usr/local/flume/conf/Flume_Sink_Processors.conf
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# The key to configuring failover: a sink group is required
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# The processor type is failover
a1.sinkgroups.g1.processor.type = failover
# Priority: the higher the number, the higher the priority; every sink must have a distinct priority
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Set to 10 seconds here; tune it faster or slower for your situation
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 37240
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = master.hadoop
a1.sinks.k1.port = 5555
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slave02.hadoop
a1.sinks.k2.port = 5555
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
Create the Flume_Sink_Processors_avro configuration file on master.hadoop:
#vi /usr/local/flume/conf/Flume_Sink_Processors_avro.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Copy the two configuration files from master.hadoop to slave02.hadoop:
# scp /usr/local/flume/conf/Flume_Sink_Processors* 10.1.13.203:/usr/local/flume/conf/
Open four terminal windows, and on both master.hadoop and slave02.hadoop start the two Flume agents. Be sure to start Flume_Sink_Processors_avro.conf on both servers first and Flume_Sink_Processors.conf second; otherwise, starting Flume_Sink_Processors.conf on master.hadoop reports "Caused by: java.net.ConnectException: Connection refused" (although in testing Flume still ran normally afterwards).
#cd /usr/local/flume/conf
#flume-ng agent -c . -f ./Flume_Sink_Processors.conf -n a1 -Dflume.root.logger=INFO,console
#flume-ng agent -c . -f ./Flume_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console
On either machine, master.hadoop or slave02.hadoop, generate a test log message:
echo "aaaaaaaaaaaaaa" | nc 127.0.0.1 37240
Because slave02.hadoop has the higher priority, the following appears in the Flume_Sink_Processors_avro.conf window on slave02.hadoop, and not on master.hadoop:
2016-08-05 12:15:54,105 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 61 61 61 61 61 61 61 61 61 61 61 61 61 61 aaaaaaaaaaaaaa }
Kill the Flume_Sink_Processors_avro.conf process on slave02.hadoop, then run echo "bbbbbbbbbbbbbb" | nc 127.0.0.1 37240 on master.hadoop. The Flume_Sink_Processors_avro.conf window on master.hadoop then shows:
2016-08-05 12:15:54,105 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 61 61 61 61 61 61 61 61 61 61 61 61 61 61 aaaaaaaaaaaaaa }
2016-08-05 12:15:54,106 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 62 62 62 62 62 62 62 62 62 62 62 62 62 62 bbbbbbbbbbbbbb }
An error seen during testing:
2016-08-05 11:21:40,161 (lifecycleSupervisor-1-5) [INFO - org.apache.flume.source.SyslogTcpSource.start(SyslogTcpSource.java:119)] Syslog TCP Source starting...
2016-08-05 11:21:40,163 (lifecycleSupervisor-1-5) [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.SyslogTcpSource{name:r1,state:IDLE} } - Exception follows.
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:33340
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:297)
at org.apache.flume.source.SyslogTcpSource.start(SyslogTcpSource.java:122)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.bind(NioServerSocketPipelineSink.java:140)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleServerSocket(NioServerSocketPipelineSink.java:90)
at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:64)
at org.jboss.netty.channel.Channels.bind(Channels.java:569)
at org.jboss.netty.channel.AbstractChannel.bind(AbstractChannel.java:189)
at org.jboss.netty.bootstrap.ServerBootstrap$Binder.channelOpen(ServerBootstrap.java:342)
at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:80)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:158)
at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:86)
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:276)
... 10 more
Troubleshooting
1. If you see a message like the following:
2016-08-05 12:04:14,800 (lifecycleSupervisor-1-9) [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.SyslogTcpSource{name:r1,state:IDLE} } - Exception follows.
...........................
Caused by: java.net.BindException: Address already in use
..............................
try changing the a1.sources.r1.port and a1.sinks.k1.port values.
2. During debugging you can enable debug logging: #flume-ng agent -c . -f ./Flume_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=DEBUG,console. Change it back to INFO when finished, otherwise you will not be able to see the test output.
3. Before restarting, run netstat -antp and kill any process still holding the ports used in the configuration files.
4. If, after changing ports, the logs on slave02.hadoop still show the old port, test on another server or reboot.
4.10 Load balancing Sink Processor
Unlike failover, the load_balance type offers two strategies: round-robin and random. In either case, if the selected sink is unavailable, events are automatically retried on the next available sink.
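The round_robin strategy with backoff can be sketched as cycling through the sinks while skipping any that are currently backed off (a simplification of Flume's actual processor, in illustrative Python):

```python
from itertools import cycle

# Sketch of the load_balance processor's round_robin strategy with backoff:
# cycle through the sinks, skipping any that are currently backed off.
def round_robin(sinks, down, n):
    if not set(sinks) - down:
        raise RuntimeError("no live sinks")
    order, chosen = cycle(sinks), []
    while len(chosen) < n:
        sink = next(order)
        if sink not in down:
            chosen.append(sink)
    return chosen

print(round_robin(["k1", "k2"], down=set(), n=4))   # alternates k1, k2
print(round_robin(["k1", "k2"], down={"k2"}, n=2))  # only k1 is live
```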
Create the Load_balancing_Sink_Processors configuration file on master.hadoop:
#vi /usr/local/flume/conf/Load_balancing_Sink_Processors.conf
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# The key to configuring load balancing: a sink group is required
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 25140
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = master.hadoop
a1.sinks.k1.port = 5555
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = slave01.hadoop
a1.sinks.k2.port = 5555
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Create the Load_balancing_Sink_Processors_avro configuration file on master.hadoop:
#vi /usr/local/flume/conf/Load_balancing_Sink_Processors_avro.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Copy the two configuration files from master.hadoop to slave01.hadoop:
# scp /usr/local/flume/conf/Load_balancing_Sink_Processors* 10.1.13.202:/usr/local/flume/conf/
Open four terminal windows, and on both master.hadoop and slave01.hadoop start the two Flume agents. Be sure to start Load_balancing_Sink_Processors_avro.conf on both servers first, then Load_balancing_Sink_Processors.conf:
#cd /usr/local/flume/conf
#flume-ng agent -c . -f ./Load_balancing_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console
#flume-ng agent -c . -f ./Load_balancing_Sink_Processors.conf -n a1 -Dflume.root.logger=INFO,console
Then, in a new window on either master.hadoop or slave01.hadoop, generate test log lines. Type them one at a time; if you send them too quickly, they tend to all land on the same machine:
[root@master conf]# echo "test1" | nc 127.0.0.1 25140
[root@master conf]# echo "test2" | nc 127.0.0.1 25140
[root@master conf]# echo "test3" | nc 127.0.0.1 25140
[root@master conf]# echo "test4" | nc 127.0.0.1 25140
Output:
In the Load_balancing_Sink_Processors_avro.conf window on master.hadoop you can see:
2016-08-05 14:35:15,543 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 32 test2 }
2016-08-05 14:35:20,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 34 test4 }
In the Load_balancing_Sink_Processors_avro.conf window on slave01.hadoop you can see:
2016-08-05 14:35:05,567 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 31 test1 }
2016-08-05 14:35:15,205 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 33 test3 }
4.11 HBase sink
Not tested.
Further reading
I. How the Client obtains data
The Client consumes data at its origin. Flume supports Avro, log4j, syslog, and HTTP POST (with a JSON body). You can let an application talk directly to an existing Source such as AvroSource or SyslogTcpSource, or write your own Source and connect the application over IPC or RPC; both Avro and Thrift work (NettyAvroRpcClient and ThriftRpcClient respectively implement the RpcClient interface), with Avro as the default RPC protocol. For code-level details of client-side data ingestion, see the official manual.
The approach requiring the least change to an existing program is to read the log files it already writes, which gives essentially seamless integration with no modification to the application.
For reading files directly, there are two Sources:
ExecSource: runs a Linux command and continuously emits its latest output, for example tail -F filename (the file name must be fixed in this mode). ExecSource can collect logs in real time, but if Flume is not running or the command fails, log data is lost and completeness cannot be guaranteed.
SpoolSource: watches the configured directory for new files and reads the data out of them. Two caveats: files copied into the spool directory must not be opened or edited again, and the directory must not contain subdirectories.
SpoolSource cannot collect data in real time, but if files are rolled on a per-minute basis, it comes close.
If the application cannot roll its log files by the minute, the two collection methods can be combined. In practice this works well with log4j: set log4j's file-rolling interval to one minute and copy the rolled files into the spool directory.
log4j has a TimeRolling plugin that can roll files directly into the spool directory, which achieves essentially real-time monitoring. After Flume finishes ingesting a file, it renames it with a .COMPLETED suffix (the suffix can also be configured).
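The spooling-directory contract described above (read each file once, then rename it with a .COMPLETED suffix so it is never re-read) can be mimicked in a few lines of illustrative Python:

```python
import os
import tempfile

# Mimic SpoolSource's contract: read each file in the spool directory exactly
# once, then rename it with a .COMPLETED suffix so it is never re-read.
spool = tempfile.mkdtemp()
with open(os.path.join(spool, "app.log"), "w") as f:
    f.write("line1\nline2\n")

events = []
for name in os.listdir(spool):
    if name.endswith(".COMPLETED"):
        continue  # already ingested
    path = os.path.join(spool, name)
    with open(path) as f:
        events.extend(f.read().splitlines())  # each line becomes one event
    os.rename(path, path + ".COMPLETED")

print(events)
print(sorted(os.listdir(spool)))
```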
For the full list of supported Flume Source types, see the official User Guide.
II. How the Channel exchanges data
Several channels are currently available: Memory Channel, JDBC Channel, File Channel, and Pseudo Transaction Channel; the first three are the most common.
MemoryChannel provides high throughput but cannot guarantee data integrity.
MemoryRecoverChannel is deprecated; the official documentation recommends replacing it with FileChannel.
FileChannel guarantees data integrity and consistency. When configuring it, put the FileChannel directory and the application's own log files on different disks to improve throughput.
File Channel is a persistent channel: it persists every event to disk, so even if the JVM dies, the operating system crashes or reboots, or an event fails to reach the next agent, no data is lost. Memory Channel is a volatile channel because it keeps all events in memory: if the Java process dies, any events still in memory are lost. Memory capacity is also limited by RAM size, whereas File Channel's advantage is that, given enough disk space, it can store all event data on disk.
For the full list of supported Flume Channel types, see the official User Guide.
III. Sink storage
A Sink can store data in the file system, a database, or Hadoop. With small log volumes, data can be kept in the file system with a fixed save interval; with larger volumes, the log data can be stored in Hadoop for later analysis.
For the full list of supported Flume Sink types, see the official User Guide.
IV. Reliability
Flume's core job is to collect data from a source and deliver it to a destination. To guarantee delivery, it buffers the data before sending it on and deletes its own copy only after the data has actually arrived. Flume uses transactions to make the end-to-end delivery of an Event reliable: a Sink may remove an Event from the Channel only after the Event has been stored in the next agent's Channel or in the external destination. This keeps events reliable whether the flow passes through one agent or several, because the transactions guarantee that each event is successfully stored. The different Channel implementations offer different degrees of recoverability, and therefore of reliability: for example, a file channel keeps a local on-disk copy as a backup, whereas a memory channel keeps events in an in-memory queue, which is faster but unrecoverable if lost.
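The transactional take/commit/rollback contract described above can be sketched as follows (illustrative Python; Flume's real channels also handle puts transactionally and enforce capacity limits):

```python
from collections import deque

# Sketch of the channel transaction contract: a sink "takes" events, but the
# channel only forgets them on commit; on rollback they are requeued in order.
class Channel:
    def __init__(self):
        self.queue = deque()
        self.in_flight = []

    def put(self, event):
        self.queue.append(event)

    def take(self):
        event = self.queue.popleft()
        self.in_flight.append(event)
        return event

    def commit(self):
        self.in_flight.clear()  # delivery confirmed: events are removed

    def rollback(self):
        while self.in_flight:   # delivery failed: requeue in original order
            self.queue.appendleft(self.in_flight.pop())

ch = Channel()
ch.put("e1")
ch.take()
ch.rollback()            # failed delivery: e1 goes back into the queue
print(len(ch.queue))     # prints 1
ch.take()
ch.commit()              # successful delivery: e1 is removed for good
print(len(ch.queue))     # prints 0
```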
Common Flume configuration (sources, channels, sinks)
1. Common source configurations
1.1 Avro source
Avro is a data serialization system. Flume listens on an Avro port and receives events from an external Avro client. The configuration format is as follows:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
1.2 Exec source
This source starts a given Unix command and expects the process to continuously produce data on standard output (stderr is simply discarded unless logStdErr=true). If the process exits for any reason, the source also exits and produces no further data.
The configuration format is as follows:
exec-agent.sources= tail
exec-agent.channels= memoryChannel-1
exec-agent.sinks= logger
exec-agent.sources.tail.type= exec
exec-agent.sources.tail.command= tail -f /var/log/secure
In this example, tail -f /var/log/secure is started first, and data is then collected continuously as it is produced.
1.3 Netcat source
A netcat-style source listens on a given port and turns each line of text into an event. It behaves like "nc -k -l [host] [port]": it opens the specified port and listens for data, expecting newline-separated text. Each line of text becomes a Flume event and is sent through the connected channel.
1.4 Syslog TCP source
Listens on a TCP port; it can receive messages sent over a TCP socket. The format is as follows:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
1.5 Syslog UDP source
Listens on a UDP port; it can receive messages sent over a UDP socket. The format is as follows:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1
1.6 Spooling directory source
#Name the components on this agent
agent-1.sinks= k1
agent-1.channels= ch-1
agent-1.sources= src-1
#Describe/configure the source
agent-1.sources.src-1.type= spooldir
agent-1.sources.src-1.channels= ch-1
agent-1.sources.src-1.spoolDir= /home/storm/test
agent-1.sources.src-1.fileHeader= true
#Describe the sink
agent-1.sinks.k1.type= hdfs
agent-1.sinks.k1.hdfs.path= hdfs://192.168.2.238:9000/user/hadoop/input
agent-1.sinks.k1.hdfs.filePrefix= events-
agent-1.sinks.k1.hdfs.fileType= DataStream
#Use a channel which buffers events in memory
agent-1.channels.ch-1.type= memory
agent-1.channels.ch-1.capacity= 1000
agent-1.channels.ch-1.transactionCapacity= 100
#Bind the source and sink to the channel
agent-1.sources.src-1.channels= ch-1
agent-1.sinks.k1.channel= ch-1
2. Common channel configurations
2.1 Memory Channel
Stores the data collected by the sources in memory. Configuration:
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
2.2 JDBC Channel
Stores the data collected by the sources in a database; currently only the Derby database is supported. Configuration:
a1.channels = c1
a1.channels.c1.type = jdbc
a1.channels.c1.driver.url= ** # JDBC connection URL
2.3 File Channel
Uses files as the channel to hold in-flight data. Configuration:
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint #where checkpoint files are kept
a1.channels.c1.dataDirs = /mnt/flume/data #where the data files are stored
3. Common sink configurations
3.1 HDFS sink
Stores the events from the channel in HDFS. Configuration:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
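The round settings shown above control how timestamps are bucketed for the escape sequences in hdfs.path: with roundValue = 10 and roundUnit = minute, the timestamp is rounded down to the nearest 10 minutes. A quick sketch of the arithmetic:

```python
from datetime import datetime

# What hdfs.round = true with roundValue = 10, roundUnit = minute does to the
# timestamp used for the %H%M escapes in hdfs.path: round down to 10 minutes.
def round_down(ts, minutes=10):
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

print(round_down(datetime(2016, 8, 2, 20, 47, 31)))  # 20:47 buckets to 20:40
```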
3.2 Logger sink
Prints received events to the console. Configuration:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
3.3 File Roll sink
Stores events in local files. Configuration:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume #directory where the files are stored
a1.sinks.k1.sink.rollInterval = 30 #how often, in seconds, to roll to a new file
a1.sinks.k1.sink.serializer = TEXT #output format
3.4 HBase sink
Writes events to an HBase table. Configuration (the hbase-site.xml configuration file must be in the current directory):
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1