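Flume Installation and Configuration Examples

I. Getting to know Flume

Flume's architecture is built around the following core concepts:

Event: a unit of data, with optional headers

Flow: an abstraction of the movement of Events from their point of origin to their destination

Client: operates on Events at the point of origin and sends them to a Flume Agent

Agent: an independent Flume process containing the components Source, Channel, and Sink

Source: consumes the Events delivered to it and puts them into a Channel

Channel: a temporary store for Events in transit, holding the Events handed over by the Source

Sink: reads and removes Events from the Channel and passes them to the next Agent in the flow pipeline (if there is one)

II. Installing Flume

1. Preparing the base environment

Operating system: CentOS 6.7

Java environment: JDK 1.8.0_92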

http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.tar.gz?AuthParam=1466493996_149f31c41a3a9ef17975ade95149bfcf

#tar zxvf jdk-8u92-linux-x64.tar.gz

#mv jdk1.8.0_92 /usr/local/jdk

2. Downloading Flume

Official download page:

http://flume.apache.org/download.html

Two files need to be downloaded:

apache-flume-1.6.0-bin.tar.gz

apache-flume-1.6.0-src.tar.gz

3. Installing Flume

Extract the two downloaded tarballs:

#tar zxvf apache-flume-1.6.0-bin.tar.gz    

#tar zxvf apache-flume-1.6.0-src.tar.gz    

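Copy the contents of the src tree over the extracted bin directory (the binary package alone is enough to run Flume, so this step is optional):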

#cp -ri apache-flume-1.6.0-src/* apache-flume-1.6.0-bin

Move the directory to its final location:

#mv apache-flume-1.6.0-bin /usr/local/flume

Set the environment variables:

#vi /etc/profile

export JAVA_HOME=/usr/local/jdk

export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin

export FLUME_HOME=/usr/local/flume

export FLUME_CONF_DIR=$FLUME_HOME/conf

export PATH=$PATH:$FLUME_HOME/bin

Apply the environment variables:

#source /etc/profile

Edit the Flume environment file:

#vi /usr/local/flume/conf/flume-env.sh

export JAVA_HOME=/usr/local/jdk

Verify that Flume is installed correctly:

[root@localhost ~]# flume-ng version

Flume 1.6.0

Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git

Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015

From source with checksum b29e416802ce9ece3269d34233baf43f

If the information above is printed, Flume has been installed successfully.

4. Configuring Flume

4.1 Avro

#vi /usr/local/flume/conf/avro.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1  

# Describe/configure the source

a1.sources.r1.type = avro

a1.sources.r1.channels = c1

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 4141

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

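Start Flume:

#flume-ng agent --conf /usr/local/flume/conf --conf-file /usr/local/flume/conf/avro.conf --name a1 -Dflume.root.logger=INFO,console

PS: -Dflume.root.logger=INFO,console is meant for debugging only. Do not copy it blindly into production, or large volumes of log output will be written to the terminal.

-c/--conf specifies the configuration directory, -f/--conf-file the configuration file, and -n/--name the agent name (it must match the prefix used in the configuration file, here a1).

Then open another shell window and send data to the port the source listens on. An Avro source expects Avro RPC rather than plain text, so use the avro-client mode of flume-ng instead of telnet to send a test file and watch the result in the agent window.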
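A minimal sketch of such a test, assuming the agent from avro.conf is running on the same host. The sample file path and the function name send_to_avro are made up for illustration; the -H, -p and -F options are the host, port and file options of flume-ng avro-client.

```shell
# Hypothetical sample file; any text file works.
sample_file="${TMPDIR:-/tmp}/avro_test.log"
printf 'hello avro 1\nhello avro 2\n' > "$sample_file"

# Sends each line of the given file as one event to the Avro source on port 4141.
# Wrapped in a function so nothing is sent until it is called explicitly.
send_to_avro() {
    flume-ng avro-client -H localhost -p 4141 -F "$1"
}

# Usage (with the agent running): send_to_avro "$sample_file"
```

Each line of the file should then show up as one Event in the logger output of the agent window.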

 

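4.2 Spooldir

The spooling directory source watches the configured directory for new files and reads the data out of them. Two points to note:

1) A file must not be opened or edited again once it has been copied into the spool directory.

2) The spool directory must not contain subdirectories.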
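One way to respect the first rule is to stage the file elsewhere and then rename it into the spool directory, so that the source only ever sees complete files. The following is a sketch under that assumption; the helper name spool_drop and the throwaway demo directories are hypothetical and are not part of Flume.

```shell
# spool_drop SRC SPOOL_DIR STAGING_DIR
# Copies SRC into STAGING_DIR first, then renames it into SPOOL_DIR.
# The rename is atomic when both directories are on the same filesystem,
# so the watcher never sees a half-written file.
spool_drop() {
    src=$1; spool_dir=$2; staging_dir=$3
    name=$(basename "$src")
    cp "$src" "$staging_dir/$name" && mv "$staging_dir/$name" "$spool_dir/$name"
}

# Demo with throwaway directories (not the real /data/logs/web).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/staging" "$demo_root/spool"
echo "spool test1" > "$demo_root/access.log"
spool_drop "$demo_root/access.log" "$demo_root/spool" "$demo_root/staging"
ls "$demo_root/spool"
```

In a real setup the staging directory would sit next to the spool directory on the same disk, and the application would never touch a file again after the rename.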

#vi /usr/local/flume/conf/spool.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type= spooldir

a1.sources.r1.channels = c1

a1.sources.r1.spoolDir =/data/logs/web

a1.sources.r1.fileHeader =true

# Describe the sink

a1.sinks.k1.type= logger

# Use a channel which buffers events in memory

a1.channels.c1.type= memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/spool.conf -n a1 -Dflume.root.logger=INFO,console

Then manually copy a file into the /data/logs/web directory and watch the log output as well as the changes in /data/logs/web.

 

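4.3 Exec

The exec source runs a given command and uses its output as the data source. When using the tail command, the file must contain enough data for any output to appear.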

#vi /usr/local/flume/conf/exec_tail.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.channels = c1

a1.sources.r1.command = tail -F /data/logs/web/access.log

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

Start Flume:

#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/exec_tail.conf -n a1 -Dflume.root.logger=INFO,console

Input:

#echo "exec_tail test1" >> /data/logs/web/access.log

#echo "exec_tail test2" >> /data/logs/web/access.log

#echo "exec_tail test3" >> /data/logs/web/access.log

 

Output:

2016-08-01 15:49:47,837 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 6F 6F 6C 20 74 65 73 74 31                exec_tail test1 }

2016-08-01 15:49:56,838 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 6F 6F 6C 20 74 65 73 74 32              exec_tail test2 }

2016-08-01 15:50:05,841 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 6F 6F 6C 20 74 65 73 74 33                exec_tail test3 }

 

4.4 Syslogtcp

The syslogtcp source listens on a TCP port and uses it as the data source.

#vi /usr/local/flume/conf/syslog_tcp.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type= syslogtcp

a1.sources.r1.port = 5140

a1.sources.r1.host = localhost

a1.sources.r1.channels = c1

# Describe the sink

a1.sinks.k1.type= logger

# Use a channel which buffers events in memory

a1.channels.c1.type= memory  

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/syslog_tcp.conf -n a1 -Dflume.root.logger=INFO,console

Generate a test log message:

#echo "syslog test1" | nc localhost 5140

Output:

2016-08-01 16:31:00,779 (New I/O  worker #1) [WARN - org.apache.flume.source.SyslogUtils.buildEvent(SyslogUtils.java:316)] Event created from Invalid Syslog data.

2016-08-01 16:31:05,320 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 73 79 73 6C 6F 67 20 74 65 73 74 31            syslog test1 }

2016-08-01 16:31:16,710 (New I/O  worker #2) [WARN - org.apache.flume.source.SyslogUtils.buildEvent(SyslogUtils.java:316)] Event created from Invalid Syslog data.

2016-08-01 16:31:16,710 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 73 79 73 6C 6F 67 20 74 65 73 74 32            syslog test2 }
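The "Invalid Syslog data" warning above appears because the test message is plain text without a syslog priority header. As a sketch of what a well-formed message looks like, the helpers below build the `<PRI>` prefix, where PRI = facility * 8 + severity in the standard syslog format. The function names are made up for illustration, and whether Flume then marks the event as fully valid is an assumption that is worth checking against your version.

```shell
# syslog_pri FACILITY SEVERITY -> prints the PRI value (facility * 8 + severity)
syslog_pri() {
    echo $(( $1 * 8 + $2 ))
}

# syslog_line FACILITY SEVERITY MESSAGE -> prints "<PRI>MESSAGE"
syslog_line() {
    printf '<%s>%s\n' "$(syslog_pri "$1" "$2")" "$3"
}

# Facility 1 (user-level) with severity 5 (notice) gives PRI 13.
syslog_line 1 5 "syslog test1"

# To send it to the source configured above:
#   syslog_line 1 5 "syslog test1" | nc localhost 5140
```

With such a prefix, the Severity and Facility headers of the event should reflect the values sent instead of the 0/0 shown in the output above.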

 

4.5 JSONHandler (HTTP source)

#vi /usr/local/flume/conf/post_json.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type= org.apache.flume.source.http.HTTPSource

a1.sources.r1.port = 8888

a1.sources.r1.channels = c1

# Describe the sink

a1.sinks.k1.type= logger

# Use a channel which buffers events in memory

a1.channels.c1.type= memory

a1.channels.c1.capacity = 100

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/post_json.conf -n a1 -Dflume.root.logger=INFO,console

Test:

# curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "idoall.org_body"}]' http://localhost:8888

Output:

2016-08-02 14:03:26,708 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{a=a1, b=b1} body: 69 64 6F 61 6C 6C 2E 6F 72 67 5F 62 6F 64 79    idoall.org_body }
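The logger sink prints the event body as a hex dump followed by the text. When checking output by hand it can help to decode the hex yourself; the helper below is a small sketch for that (hex_to_text is a made-up name, not a Flume tool).

```shell
# hex_to_text "69 64 6F ..." -> prints the decoded ASCII text
hex_to_text() {
    for byte in $1; do
        # Convert each hex pair to octal, which printf accepts as a \ooo escape.
        printf "\\$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}

# Body bytes taken from the logger output above.
hex_to_text "69 64 6F 61 6C 6C 2E 6F 72 67 5F 62 6F 64 79"
```

For the bytes above this prints idoall.org_body, matching the text column of the log line.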

 

4.6 HDFS sink

#vi /usr/local/flume/conf/hdfs_sink.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type= syslogtcp

a1.sources.r1.port = 5140

a1.sources.r1.host = localhost

a1.sources.r1.channels = c1

# Describe the sink

a1.sinks.k1.type= hdfs

a1.sinks.k1.channel = c1

a1.sinks.k1.hdfs.path = hdfs://10.1.13.201:9000/flume/syslogtcp

a1.sinks.k1.hdfs.filePrefix = Syslog

a1.sinks.k1.hdfs.round =true

a1.sinks.k1.hdfs.roundValue = 10

a1.sinks.k1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory

a1.channels.c1.type= memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

#flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/conf/hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console

Test:

#echo "hello idoall flume -> hadoop testing one" | nc localhost 5140

Output:

2016-08-02 20:40:25,960 (New I/O  worker #1) [WARN - org.apache.flume.source.SyslogUtils.buildEvent(SyslogUtils.java:316)] Event created from Invalid Syslog data.

2016-08-02 20:40:29,556 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSSequenceFile.configure(HDFSSequenceFile.java:63)] writeFormat = Writable, UseRawLocalFileSystem = false

2016-08-02 20:40:29,858 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556.tmp

2016-08-02 20:40:30,111 (hdfs-k1-call-runner-0) [WARN - org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:62)] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2016-08-02 20:41:01,273 (hdfs-k1-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:363)] Closing hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556.tmp

2016-08-02 20:41:01,317 (hdfs-k1-call-runner-4) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:629)] Renaming hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556.tmp to hdfs://10.1.13.201:8020/flume/syslogtcp/Syslog.1470141629556

2016-08-02 20:41:01,453 (hdfs-k1-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:394)] Writer callback called.

 

 

[root@master ~]# hadoop fs -ls /flume/syslogtcp

16/08/02 20:41:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 1 items

-rw-r--r--   2 root supergroup        155 2016-08-02 20:41 /flume/syslogtcp/Syslog.1470141629556
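The hdfs.round, hdfs.roundValue and hdfs.roundUnit settings in the configuration above round the event timestamp down before it is used in time-based escape sequences of hdfs.path. The path in this example contains no such escapes, so the rounding has no visible effect here. The arithmetic itself can be sketched as follows; round_down_minutes is a hypothetical helper written for illustration, not Flume code.

```shell
# round_down_minutes EPOCH_SECONDS MINUTES -> prints the rounded-down epoch seconds
round_down_minutes() {
    echo $(( $1 - $1 % ($2 * 60) ))
}

# 1470141629 is the seconds part of the timestamp in the file name
# Syslog.1470141629556 above; rounding to 10 minutes drops the last 29 seconds.
round_down_minutes 1470141629 10
```

This prints 1470141600, so with a path such as /flume/syslogtcp/%H%M all events from the same ten-minute window would land in the same directory.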

 

4.7 File Roll Sink

 

#vi /usr/local/flume/conf/file_roll.conf

 

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = syslogtcp

a1.sources.r1.port = 5555

a1.sources.r1.host = localhost

a1.sources.r1.channels = c1

# Describe the sink

a1.sinks.k1.type = file_roll

a1.sinks.k1.sink.directory = /data/logs/web/logs

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

Start:

[root@master conf]# flume-ng agent -c . -f ./file_roll.conf -n a1 -Dflume.root.logger=INFO,console

Test:

[root@master conf]# echo "hello idoall.org syslog" | nc 127.0.0.1 5555

[root@master conf]# echo "hello idoall.org syslog 2" | nc 127.0.0.1 5555

Output:

[root@master conf]# ll /data/logs/web/logs/

total 8

-rw-r--r--. 1 root root  0 Aug  2 20:57 1470142592748-1

-rw-r--r--. 1 root root 24 Aug  2 20:57 1470142633346-1

-rw-r--r--. 1 root root 26 Aug  2 20:57 1470142633346-2

 

 

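4.8 Replicating Channel Selector

Flume supports fanning out the flow from one source to multiple channels. There are two fan-out modes: replicating and multiplexing. In replicating mode, every event is sent to all configured channels. In multiplexing mode, an event is sent only to a subset of the available channels. A fan-out flow requires specifying the source and the rules for the fan-out channels. (Because HDFS was configured in the previous example, the two machines used here are referred to by the hostnames from that HDFS setup.)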
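This section only walks through replicating mode. For comparison, a multiplexing selector could be configured roughly as below. This is a sketch: the header name state and the mapping values CZ and US are example values, and the snippet only writes the fragment to a scratch file so it can be inspected.

```shell
# Write a hypothetical multiplexing fragment to a scratch file.
mux_conf=$(mktemp)
cat > "$mux_conf" <<'EOF'
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
# Route on the value of the (example) event header "state".
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# Events whose header matches no mapping go to the default channel.
a1.sources.r1.selector.default = c1
EOF
cat "$mux_conf"
```

With such a selector, an event only reaches the channel mapped to its header value, whereas the replicating selector copies every event to every channel.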

This example uses two machines, master.hadoop and slave01.hadoop.

On master.hadoop, create the replicating_Channel_Selector configuration file:

#vi /usr/local/flume/conf/replicating_Channel_Selector.conf

a1.sources = r1

a1.sinks = k1 k2

a1.channels = c1 c2

# Describe/configure the source

a1.sources.r1.type = syslogtcp

a1.sources.r1.port = 5140

a1.sources.r1.host = 127.0.0.1

a1.sources.r1.channels = c1 c2

a1.sources.r1.selector.type = replicating

# Describe the sink

a1.sinks.k1.type = avro

a1.sinks.k1.channel = c1

a1.sinks.k1.hostname = master.hadoop

a1.sinks.k1.port = 5555

a1.sinks.k2.type = avro

a1.sinks.k2.channel = c2

a1.sinks.k2.hostname = slave01.hadoop

a1.sinks.k2.port = 5555

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory

a1.channels.c2.capacity = 1000

a1.channels.c2.transactionCapacity = 100

 

 

On master.hadoop, create the replicating_Channel_Selector_avro configuration file:

#vi /usr/local/flume/conf/replicating_Channel_Selector_avro.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = avro

a1.sources.r1.channels = c1

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 5555

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

On master.hadoop, copy the two configuration files to slave01.hadoop:

  # scp /usr/local/flume/conf/replicating_Channel_Selector* 10.1.13.202:/usr/local/flume/conf/

 

Open four terminal windows and start both Flume agents on master.hadoop and on slave01.hadoop:

#cd /usr/local/flume/conf

#flume-ng agent -c . -f ./replicating_Channel_Selector_avro.conf -n a1 -Dflume.root.logger=INFO,console

#flume-ng agent -c . -f ./replicating_Channel_Selector.conf -n a1 -Dflume.root.logger=INFO,console

 

 

Then generate a test syslog message on either master.hadoop or slave01.hadoop:

  # echo "hello idoall.org syslog" | nc 127.0.0.1 5140

 

In the sink windows on both master.hadoop and slave01.hadoop the following output appears, which shows that the event was replicated to both:

2016-08-03 14:13:47,397 (New I/O server boss #1 ([id: 0x895a7f83, /0:0:0:0:0:0:0:0:5555])) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x17efd5a4, /10.1.13.201:33406 => /10.1.13.202:5555] OPEN

2016-08-03 14:13:47,399 (New I/O  worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x17efd5a4, /10.1.13.201:33406 => /10.1.13.202:5555] BOUND: /10.1.13.202:5555

2016-08-03 14:13:47,399 (New I/O  worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x17efd5a4, /10.1.13.201:33406 => /10.1.13.202:5555] CONNECTED: /10.1.13.201:33406

2016-08-03 14:13:48,481 (New I/O server boss #1 ([id: 0x895a7f83, /0:0:0:0:0:0:0:0:5555])) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x5e2a9e2e, /10.1.13.202:36400 => /10.1.13.202:5555] OPEN

2016-08-03 14:13:48,482 (New I/O  worker #2) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x5e2a9e2e, /10.1.13.202:36400 => /10.1.13.202:5555] BOUND: /10.1.13.202:5555

2016-08-03 14:13:48,483 (New I/O  worker #2) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x5e2a9e2e, /10.1.13.202:36400 => /10.1.13.202:5555] CONNECTED: /10.1.13.202:36400

 

2016-08-03 14:25:09,861 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 68 65 6C 6C 6F 32 20 69 64 6F 61 6C 6C 2E 6F 72 hello2 idoall.or }

 

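4.9 Flume Sink Processors

With the failover processor, events are always sent to one sink; when that sink becomes unavailable, they are automatically sent to the next sink.

On master.hadoop, create the Flume_Sink_Processors configuration file:

#vi /usr/local/flume/conf/Flume_Sink_Processors.conf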

a1.sources = r1

a1.sinks = k1 k2

a1.channels = c1 c2

 

# The key to failover: a sink group is required

a1.sinkgroups = g1

a1.sinkgroups.g1.sinks = k1 k2

# The processor type is failover

a1.sinkgroups.g1.processor.type = failover

# Priorities: a larger number means a higher priority; each sink must have a distinct priority

a1.sinkgroups.g1.processor.priority.k1 = 5

a1.sinkgroups.g1.processor.priority.k2 = 10

# Maximum backoff for a failed sink in milliseconds (10 seconds here); adjust it to your situation

a1.sinkgroups.g1.processor.maxpenalty = 10000

 

# Describe/configure the source

a1.sources.r1.type = syslogtcp

a1.sources.r1.port = 37240

a1.sources.r1.channels = c1 c2

a1.sources.r1.selector.type = replicating

 

 

# Describe the sink

a1.sinks.k1.type = avro

a1.sinks.k1.channel = c1

a1.sinks.k1.hostname = master.hadoop

a1.sinks.k1.port = 5555

a1.sinks.k2.type = avro

a1.sinks.k2.channel = c2

a1.sinks.k2.hostname = slave02.hadoop

a1.sinks.k2.port = 5555

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory

a1.channels.c2.capacity = 1000

a1.channels.c2.transactionCapacity = 100

 

On master.hadoop, create the Flume_Sink_Processors_avro configuration file:

#vi /usr/local/flume/conf/Flume_Sink_Processors_avro.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

 

# Describe/configure the source

a1.sources.r1.type = avro

a1.sources.r1.channels = c1

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 5555

 

# Describe the sink

a1.sinks.k1.type = logger

 

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

 

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

On master.hadoop, copy the two configuration files to slave02.hadoop:

# scp /usr/local/flume/conf/Flume_Sink_Processors* 10.1.13.203:/usr/local/flume/conf/

Open four terminal windows and start both Flume agents on master.hadoop and on slave02.hadoop. Be sure to start Flume_Sink_Processors_avro.conf on both servers first and only then start Flume_Sink_Processors.conf; otherwise starting Flume_Sink_Processors.conf on master.hadoop reports "Caused by: java.net.ConnectException: Connection refused", although in testing Flume still ran normally afterwards.

#cd /usr/local/flume/conf

#flume-ng agent -c . -f ./Flume_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console

#flume-ng agent -c . -f ./Flume_Sink_Processors.conf -n a1 -Dflume.root.logger=INFO,console

On either master.hadoop or slave02.hadoop, generate a test log message:

#echo "aaaaaaaaaaaaaa" | nc 127.0.0.1 37240

 

Because slave02.hadoop has the higher priority, the following appears in the Flume_Sink_Processors_avro.conf window on slave02.hadoop, but not on master.hadoop:

 

2016-08-05 12:15:54,105 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 61 61 61 61 61 61 61 61 61 61 61 61 61 61       aaaaaaaaaaaaaa }

 

 

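Kill the Flume_Sink_Processors_avro.conf process on slave02.hadoop, then run echo "bbbbbbbbbbbbbb" | nc 127.0.0.1 37240 on master.hadoop. The Flume_Sink_Processors_avro.conf window on master.hadoop now shows the output below. The earlier "aaaaaaaaaaaaaa" event appears as well, presumably because the replicating selector had also placed a copy in channel c1, which sink k1 only starts draining after the failover: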

2016-08-05 12:15:54,105 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 61 61 61 61 61 61 61 61 61 61 61 61 61 61       aaaaaaaaaaaaaa }

2016-08-05 12:15:54,106 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 62 62 62 62 62 62 62 62 62 62 62 62 62 62       bbbbbbbbbbbbbb }
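The selection rule seen in this test (use the highest-priority sink that is still available) can be modelled with a few lines of shell. This is only a toy model of the behaviour, not how Flume implements it; pick_sink and the name:priority:state notation are invented for this sketch.

```shell
# pick_sink "name:priority:state ..." -> prints the highest-priority sink whose state is "up"
pick_sink() {
    best_name=""
    best_prio=-1
    for entry in $1; do
        name=${entry%%:*}      # text before the first colon
        rest=${entry#*:}       # "priority:state"
        prio=${rest%%:*}
        state=${rest#*:}
        if [ "$state" = "up" ] && [ "$prio" -gt "$best_prio" ]; then
            best_name=$name
            best_prio=$prio
        fi
    done
    echo "$best_name"
}

pick_sink "k1:5:up k2:10:up"      # both sinks available
pick_sink "k1:5:up k2:10:down"    # k2 (slave02.hadoop) has been killed
```

The first call selects k2, the sink with priority 10, and the second falls back to k1, which matches what the two terminal windows showed.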

 

Error encountered:

2016-08-05 11:21:40,161 (lifecycleSupervisor-1-5) [INFO - org.apache.flume.source.SyslogTcpSource.start(SyslogTcpSource.java:119)] Syslog TCP Source starting...

2016-08-05 11:21:40,163 (lifecycleSupervisor-1-5) [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.SyslogTcpSource{name:r1,state:IDLE} } - Exception follows.

org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:33340

at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:297)

at org.apache.flume.source.SyslogTcpSource.start(SyslogTcpSource.java:122)

at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)

at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.net.BindException: Address already in use

at sun.nio.ch.Net.bind0(Native Method)

at sun.nio.ch.Net.bind(Net.java:433)

at sun.nio.ch.Net.bind(Net.java:425)

at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)

at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)

at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.bind(NioServerSocketPipelineSink.java:140)

at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleServerSocket(NioServerSocketPipelineSink.java:90)

at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:64)

at org.jboss.netty.channel.Channels.bind(Channels.java:569)

at org.jboss.netty.channel.AbstractChannel.bind(AbstractChannel.java:189)

at org.jboss.netty.bootstrap.ServerBootstrap$Binder.channelOpen(ServerBootstrap.java:342)

at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)

at org.jboss.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:80)

at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:158)

at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:86)

at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:276)

... 10 more

 

 

Debugging notes

1. If you see a message like the following:

2016-08-05 12:04:14,800 (lifecycleSupervisor-1-9) [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.SyslogTcpSource{name:r1,state:IDLE} } - Exception follows.

...........................

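Caused by: java.net.BindException: Address already in use

..............................

try changing the ports set in a1.sources.r1.port and a1.sinks.k1.port.

2. While debugging you can turn on debug logging with #flume-ng agent -c . -f ./Flume_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=DEBUG,console. Switch back to INFO when you are done, otherwise the test output is hard to spot.

3. Before restarting, check netstat -antp and kill any process that still holds a port used in the configuration file.

4. If you change the port but the old port still shows up in the logs on slave02.hadoop, try testing on another server or rebooting.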

 

 

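4.10 Load balancing Sink Processor

The load_balance type differs from failover in that it offers two selection strategies: round robin and random. In both cases, if the selected sink is unavailable, the processor automatically tries the next available sink.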
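The walkthrough in this section uses round robin. For the random strategy only the selector line of the sink group changes, as in the fragment below. This is a sketch that writes the fragment to a scratch file for inspection; the sink and group names simply follow the naming used in this article.

```shell
# Write a load_balance sink group fragment that uses the random selector.
lb_conf=$(mktemp)
cat > "$lb_conf" <<'EOF'
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily blacklist a failed sink instead of retrying it immediately.
a1.sinkgroups.g1.processor.backoff = true
# "random" picks a sink at random; "round_robin" cycles through the sinks in order.
a1.sinkgroups.g1.processor.selector = random
EOF
cat "$lb_conf"
```

Unlike the failover example, both sinks of a load-balancing group read from the same channel, so each event is delivered by exactly one of them.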

On master.hadoop, create the Load_balancing_Sink_Processors configuration file:

  #vi /usr/local/flume/conf/Load_balancing_Sink_Processors.conf

a1.sources = r1

a1.sinks = k1 k2

a1.channels = c1

 

# The key to load balancing: a sink group is required

a1.sinkgroups = g1

a1.sinkgroups.g1.sinks = k1 k2

a1.sinkgroups.g1.processor.type = load_balance

a1.sinkgroups.g1.processor.backoff = true

a1.sinkgroups.g1.processor.selector = round_robin

 

# Describe/configure the source

a1.sources.r1.type = syslogtcp

a1.sources.r1.port = 25140

a1.sources.r1.channels = c1

 

 

# Describe the sink

a1.sinks.k1.type = avro

a1.sinks.k1.channel = c1

a1.sinks.k1.hostname = master.hadoop

a1.sinks.k1.port = 5555

 

a1.sinks.k2.type = avro

a1.sinks.k2.channel = c1

a1.sinks.k2.hostname = slave01.hadoop

a1.sinks.k2.port = 5555

 

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

 

On master.hadoop, create the Load_balancing_Sink_Processors_avro configuration file:

#vi /usr/local/flume/conf/Load_balancing_Sink_Processors_avro.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

 

# Describe/configure the source

a1.sources.r1.type = avro

a1.sources.r1.channels = c1

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 5555

 

# Describe the sink

a1.sinks.k1.type = logger

 

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

 

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

 

On master.hadoop, copy the two configuration files to slave01.hadoop:

# scp /usr/local/flume/conf/Load_balancing_Sink_Processors* 10.1.13.202:/usr/local/flume/conf/

Open four terminal windows and start both Flume agents on master.hadoop and on slave01.hadoop. Be sure to start Load_balancing_Sink_Processors_avro.conf on both servers first, and only then start Load_balancing_Sink_Processors.conf:

  #cd /usr/local/flume/conf

  #flume-ng agent -c . -f ./Load_balancing_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console

  #flume-ng agent -c . -f ./Load_balancing_Sink_Processors.conf -n a1 -Dflume.root.logger=INFO,console

 

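Then, in a new window on either master.hadoop or slave01.hadoop, generate test log messages. Enter them one line at a time; if they are sent too quickly they tend to land on the same machine.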

[root@master conf]#  echo "test1" | nc 127.0.0.1 25140

[root@master conf]#  echo "test2" | nc 127.0.0.1 25140

[root@master conf]#  echo "test3" | nc 127.0.0.1 25140

[root@master conf]#  echo "test4" | nc 127.0.0.1 25140

 

Output:

The Load_balancing_Sink_Processors_avro.conf window on master.hadoop shows the following:

2016-08-05 14:35:15,543 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 32                                  test2 }

2016-08-05 14:35:20,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 34                                  test4 }

 

The Load_balancing_Sink_Processors_avro.conf window on slave01.hadoop shows the following:

2016-08-05 14:35:05,567 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 31                                  test1 }

2016-08-05 14:35:15,205 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{Severity=0, Facility=0, flume.syslog.status=Invalid} body: 74 65 73 74 33                                  test3 }

 

4.11 HBase sink

Not tested.

 

 

 

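Further reading

I. How the client side obtains data

The client operates at the origin of the data being consumed. Flume supports Avro, log4j, syslog, and HTTP POST (with a JSON body). An application can talk directly to an existing source such as AvroSource or SyslogTcpSource. Alternatively, you can write your own Source and connect your application through IPC or RPC; both Avro and Thrift are supported (NettyAvroRpcClient and ThriftRpcClient implement the RpcClient interface), and Avro is the default RPC protocol. For code-level details of client-side data ingestion, see the official documentation.

The approach that requires the fewest changes to an existing program is to read the log files the program already writes. This gives a nearly seamless integration with no changes to the program at all.

There are two ways to read files directly as a source:

ExecSource: runs a Linux command that continuously outputs the newest data, such as tail -F <filename>; the file name must be given explicitly. ExecSource can collect logs in real time, but if Flume is not running or the command fails, log data is not collected, so the completeness of the log data cannot be guaranteed.

SpoolSource: watches the configured directory for new files and reads the data out of them. Two points to note: a file copied into the spool directory must not be opened or edited again, and the spool directory must not contain subdirectories.

Although SpoolSource cannot collect data in real time, files can be split by the minute to get close to real time.

If the application cannot split its log files by the minute, the two collection methods can be combined. In practice this works well with log4j: set log4j's file rolling to once per minute and copy the rolled files into the spool directory being watched.

log4j has a TimeRolling plugin that can roll log4j files into the spool directory, which achieves near-real-time monitoring. After Flume has finished transferring a file, it renames the file with the suffix .COMPLETED (the suffix can be changed in the configuration file).

Source types supported by Flume: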

 

 

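II. How channels exchange data

Several channels are currently available: Memory Channel, JDBC Channel, File Channel, and Pseudo Transaction Channel. The first three are the most common.

MemoryChannel achieves high throughput but cannot guarantee data integrity.

MemoryRecoverChannel: the official documentation recommends replacing it with FileChannel.

FileChannel guarantees data integrity and consistency. When configuring a FileChannel, it is advisable to place its directories on a different disk from the one that holds the application's log files, to improve throughput.

File Channel is a durable channel: it persists all events and stores them on disk. Even if the Java virtual machine dies, the operating system crashes or reboots, or an event is not successfully delivered to the next agent in the pipeline, no data is lost. Memory Channel is a volatile channel because it keeps all events in memory; if the Java process dies, every event held in memory is lost. In addition, its capacity is limited by the amount of RAM, whereas File Channel has the advantage that it can store all event data on disk as long as there is enough disk space.

Channel types supported by Flume: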

 

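III. Sink storage

A sink can store data in a file system, a database, or Hadoop. When the volume of log data is small, the data can be stored in the file system and saved at a fixed time interval. When the volume is large, the log data can be stored in Hadoop, which makes later analysis easier.

Sink types supported by Flume: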

 

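IV. Reliability

At its core, Flume collects data from a data source and delivers it to a destination. To make sure delivery succeeds, the data is buffered before it is sent, and the buffered copy is deleted only after the data has actually reached the destination. Flume uses transactions to guarantee the reliability of the whole Event delivery process. A Sink may remove an Event from its Channel only after the Event has been stored in the Channel of the next agent or in the external destination. This keeps events reliable whether they flow within a single agent or across several agents, because the transaction guarantees that the event has been stored successfully. The different Channel implementations offer different recoverability guarantees and therefore different degrees of reliability. For example, Flume supports a file channel that keeps a copy on local disk as a backup, while the memory channel keeps events in an in-memory queue, which is fast but cannot be recovered after a loss.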

 

 

 

Common Flume configurations (sources, channels, sinks)

1  Common source configurations

1.1  Avro sources

Avro is a data serialization system. Flume listens on an Avro port and receives events from external Avro clients. The configuration format is as follows:

a1.sources = r1

a1.channels = c1

a1.sources.r1.type = avro

a1.sources.r1.channels = c1

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 4141

 

 

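1.2  Exec sources

This source runs a given Unix command on start-up and expects that process to continuously produce data on standard output (stderr is simply discarded unless logStdErr=true). If the process exits for any reason, the source also exits and produces no further data.

The configuration format is as follows:

exec-agent.sources = tail

exec-agent.channels = memoryChannel-1

exec-agent.sinks = logger

exec-agent.sources.tail.type = exec

exec-agent.sources.tail.command = tail -f /var/log/secure

exec-agent.sources.tail.channels = memoryChannel-1

exec-agent.channels.memoryChannel-1.type = memory

exec-agent.sinks.logger.type = logger

exec-agent.sinks.logger.channel = memoryChannel-1

In this example, the command tail -f /var/log/secure is started first, and data is then collected continuously as it is produced.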

 

 

 

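1.3  Netcat sources

A netcat-like source that listens on a given port and turns each line of text into an event. It behaves like "nc -k -l [host] [port]": it opens the specified port and listens for data. The data is expected to be newline-separated text. Each line becomes a Flume event and is sent through the connected channel.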
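No configuration is given for this source, so here is a minimal sketch in the same style as the other sources. The agent, source and channel names follow the a1/r1/c1 convention used throughout, and port 6666 is an arbitrary choice. The snippet writes the fragment to a scratch file for inspection.

```shell
# Write a hypothetical netcat source fragment to a scratch file.
netcat_conf=$(mktemp)
cat > "$netcat_conf" <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
# Port 6666 is an arbitrary choice for this sketch.
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
EOF
cat "$netcat_conf"
```

Because this source accepts plain newline-separated text, it is the one that can be tested with telnet or nc directly, for example echo "hello" | nc localhost 6666.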

 

 

 

 

 

 

 

1.4  Syslog TCP sources

Listens on a TCP port and receives messages that are sent to it over TCP sockets. The format is as follows:

a1.sources = r1

a1.channels = c1

a1.sources.r1.type = syslogtcp

a1.sources.r1.port = 5140

a1.sources.r1.host = localhost

a1.sources.r1.channels = c1

 

1.5  Syslog UDP sources

Listens on a UDP port and receives messages that are sent to it over UDP. The format is as follows:

a1.sources = r1

a1.channels = c1

a1.sources.r1.type = syslogudp

a1.sources.r1.port = 5140

a1.sources.r1.host = localhost

a1.sources.r1.channels = c1

 

1.6  Spooling directory sources

# Name the components on this agent

agent-1.sinks = k1

agent-1.channels = ch-1

agent-1.sources = src-1

# Describe/configure the source

agent-1.sources.src-1.type = spooldir

agent-1.sources.src-1.channels = ch-1

agent-1.sources.src-1.spoolDir = /home/storm/test

agent-1.sources.src-1.fileHeader = true

# Describe the sink

agent-1.sinks.k1.type = hdfs

agent-1.sinks.k1.hdfs.path = hdfs://192.168.2.238:9000/user/hadoop/input

agent-1.sinks.k1.hdfs.filePrefix = events-

agent-1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory

agent-1.channels.ch-1.type = memory

agent-1.channels.ch-1.capacity = 1000

agent-1.channels.ch-1.transactionCapacity = 100

# Bind the source and sink to the channel

agent-1.sources.src-1.channels = ch-1

agent-1.sinks.k1.channel = ch-1

 

 

 

 

 

 

 

2  Common channel configurations

2.1  Memory Channel

Stores the data collected by sources in memory. The configuration is as follows:

a1.channels = c1

a1.channels.c1.type = memory

a1.channels.c1.capacity = 10000

a1.channels.c1.transactionCapacity = 10000

a1.channels.c1.byteCapacityBufferPercentage = 20

a1.channels.c1.byteCapacity = 800000
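As far as I understand these two settings, byteCapacityBufferPercentage reserves a share of byteCapacity for event headers, so only the remainder is available for event bodies. The arithmetic below is a sketch of that reading; usable_body_bytes is a made-up helper, and the exact accounting should be checked against the Flume documentation for your version.

```shell
# usable_body_bytes BYTE_CAPACITY BUFFER_PERCENT -> bytes left for event bodies
usable_body_bytes() {
    echo $(( $1 * (100 - $2) / 100 ))
}

# Values from the configuration above: 800000 bytes with a 20 percent buffer.
usable_body_bytes 800000 20
```

With the values above this prints 640000, which means the channel would accept roughly 640 KB of event body data in addition to the limit of 10000 events set by capacity.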

 

2.2  JDBC Channel

Stores the data collected by sources in a database; currently only Derby is supported. The configuration is as follows:

a1.channels = c1

a1.channels.c1.type = jdbc

# JDBC connection URL

a1.channels.c1.driver.url = **

 

 

 

 

 

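2.3  File Channel

Uses files on disk as the channel to store in-flight data. The configuration is as follows:

a1.channels = c1

a1.channels.c1.type = file

# Directory where the checkpoint files are stored

a1.channels.c1.checkpointDir = /mnt/flume/checkpoint

# Directory where the data log files are stored

a1.channels.c1.dataDirs = /mnt/flume/data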

 

 

 

 

 

 

 

 

 

3  Common sink configurations

3.1  HDFS Sinks

Stores the events from the channel in HDFS. The configuration is as follows:

a1.channels = c1

a1.sinks = k1

a1.sinks.k1.type = hdfs

a1.sinks.k1.channel = c1

a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S

a1.sinks.k1.hdfs.filePrefix = events-

a1.sinks.k1.hdfs.round = true

a1.sinks.k1.hdfs.roundValue = 10

a1.sinks.k1.hdfs.roundUnit = minute

 

3.2  Logger Sinks

Prints the received events to the console. The configuration is as follows:

a1.channels = c1

a1.sinks = k1

a1.sinks.k1.type = logger

a1.sinks.k1.channel = c1

 

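3.3  File Roll Sinks

Stores events in files on the local file system. The configuration is as follows:

a1.channels = c1

a1.sinks = k1

a1.sinks.k1.type = file_roll

a1.sinks.k1.channel = c1

# Directory where the files are stored

a1.sinks.k1.sink.directory = /var/log/flume

# Roll over to a new file every 30 seconds

a1.sinks.k1.sink.rollInterval = 30

# Output format

a1.sinks.k1.sink.serializer = TEXT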

 

 

 

 

 

3.4  HBase Sinks

Writes events to an HBase database. The configuration is as follows (the hbase-site.xml configuration file must be in the current directory):

a1.channels = c1

a1.sinks = k1

a1.sinks.k1.type = hbase

a1.sinks.k1.table = foo_table

a1.sinks.k1.columnFamily = bar_cf

a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer

a1.sinks.k1.channel = c1

Reposted from: https://my.oschina.net/duxuefeng/blog/731567
