Flume

Flume usage examples

Key characteristics of Flume:

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It supports pluggable data senders in the logging system for collecting data, and it can apply simple processing to the data and write it to a variety of receivers (such as plain text, HDFS, or HBase).

Flume's data flow is carried end to end by events (Event). An Event is Flume's basic unit of data: it carries the log payload (as a byte array) along with a set of headers. Events originate from data produced outside the Agent; when a Source captures an event it applies any configured formatting and then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds each event until a Sink has finished processing it. The Sink is responsible for persisting the log or forwarding the event on to another Source.

 

Flume's reliability:

When a node fails, logs can be forwarded to other nodes without being lost. Flume offers three levels of reliability guarantee, from strongest to weakest: end-to-end (on receiving data, the agent first writes the event to disk and deletes it only once the transfer has succeeded; if the send fails, it can be retried), store-on-failure (the strategy also used by Scribe: when the receiver crashes, data is written locally and sending resumes after recovery), and best-effort (no acknowledgement is performed after data is sent to the receiver).

Flume's recoverability:

Recoverability also relies on the Channel. The FileChannel is recommended: events are persisted to the local file system (at the cost of lower performance).
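As a minimal sketch (the directory paths are placeholders you would adapt), a durable channel is declared like this:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/flume-1.5.0-bin/file-channel/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume-1.5.0-bin/file-channel/data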

 

Some core Flume concepts:

An Agent runs Flume inside a JVM. Each machine typically runs one agent, but a single agent can contain multiple sources and sinks.

A Client produces data and runs in its own independent thread.

A Source collects data from a Client and passes it to a Channel.

A Sink drains data from a Channel and runs in its own independent thread.

A Channel connects sources and sinks; it behaves much like a queue.

Events can be log records, Avro objects, and so on.

The agent is Flume's smallest independent unit of operation; one agent is one JVM. A single agent is built from three components: Source, Sink, and Channel (figure omitted).
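In configuration terms, the smallest useful agent wires one source to one sink through one channel; the examples below all follow this skeleton:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1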

 

Notably, Flume ships with a large number of built-in Source, Channel, and Sink types, and different types can be combined freely. The combinations are driven entirely by the user's configuration file, which makes the system very flexible. For example, a Channel can buffer events in memory or persist them to local disk, and a Sink can write logs to HDFS, HBase, or even another Source. Flume also lets users build multi-hop flows: multiple agents can cooperate, with support for fan-in, fan-out, contextual routing, and backup routes, which is where much of its power lies (figures omitted).

2. Installation
    1. Download the release tarball
    2. Configure environment variables (see the sketch below)
    3. Edit the configuration file (shown per example)
    4. Start the service (shown per example)
    5. Verify the install: flume-ng version
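As a sketch of steps 1, 2, and 5, assuming the tarball layout used throughout the examples below (archive name and paths are illustrative):

# unpack the release under /home/hadoop
cd /home/hadoop
tar -xzf apache-flume-1.5.0-bin.tar.gz && mv apache-flume-1.5.0-bin flume-1.5.0-bin

# environment variables, e.g. in ~/.bashrc
export FLUME_HOME=/home/hadoop/flume-1.5.0-bin
export PATH=$PATH:$FLUME_HOME/bin

# verify
flume-ng version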

3. Flume examples
Example 1: Avro. An Avro client can send a given file to Flume; the Avro source receives it via the Avro RPC mechanism.

 

(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/avro.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/avro.conf -n a1 -Dflume.root.logger=INFO,console

(c) Create the file to send

echo "hello world" > /home/hadoop/flume-1.5.0-bin/log.00

  

(d) Send the file with avro-client

flume-ng avro-client -c . -H m1 -p 4141 -F /home/hadoop/flume-1.5.0-bin/log.00

(e) In the console on m1 you should see the following; note the last line: hello world

 

 

Example 2: Spool. The spooling directory source watches a configured directory for new files and reads the data out of them. Two caveats:

1) Files copied into the spool directory must not be opened and edited afterwards (see the pattern sketched below).
2) The spool directory must not contain subdirectories.
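Because of caveat 1, a safe pattern is to finish writing the file elsewhere on the same filesystem and then mv it into the spool directory, since a same-filesystem rename is atomic (the staging path here is illustrative):

echo "spool test1" > /home/hadoop/staging/spool_text.log
mv /home/hadoop/staging/spool_text.log /home/hadoop/flume-1.5.0-bin/logs/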
 

(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/spool.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /home/hadoop/flume-1.5.0-bin/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

      

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/spool.conf -n a1 -Dflume.root.logger=INFO,console

(c) Add a file to the /home/hadoop/flume-1.5.0-bin/logs directory

echo "spool test1" > /home/hadoop/flume-1.5.0-bin/logs/spool_text.log

(d) In the console on m1 you should see output like:

Event: { headers:{file=/home/hadoop/flume-1.5.0-bin/logs/spool_text.log} body: 73 70 6F 6F 6C 20 74 65 73 74 31        spool test1 }

 

Example 3: Exec. The exec source runs a given command and uses its output as the data source. If you use the tail command, the file must grow large enough before you will see any output.

(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/exec_tail.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/flume-1.5.0-bin/log_exec_tail

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/exec_tail.conf -n a1 -Dflume.root.logger=INFO,console

(c) Generate enough content in the file

for i in {1..100};do echo "exec tail$i" >> /home/hadoop/flume-1.5.0-bin/log_exec_tail;echo $i;sleep 0.1;done

(d) In the console on m1 you should see output like:

Event: { headers:{} body: 65 78 65 63 20 74 61 69 6C 20 74 65 73 74 exec tail test }
Event: { headers:{} body: 65 78 65 63 20 74 61 69 6C 20 74 65 73 74 exec tail test }

 

Example 4: Syslogtcp. The syslogtcp source listens on a TCP port and uses it as the data source.
(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/syslog_tcp.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/syslog_tcp.conf -n a1 -Dflume.root.logger=INFO,console

(c) Produce a test syslog message

echo "hello idoall.org syslog" | nc localhost 5140

(d) In the console on m1 you should see output like:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 68 65 6C 6C 6F 20 69 64 6F 61 6C 6C 2E 6F 72 67 hello idoall.org }
  

Example 5: JSONHandler. The HTTP source accepts Flume events posted as JSON through its JSON handler.

(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/post_json.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 8888
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/post_json.conf -n a1 -Dflume.root.logger=INFO,console

(c) Generate a JSON-format POST request

curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "idoall.org_body"}]' http://localhost:8888

(d) In the console on m1 you should see output like:

Event: { headers:{b=b1, a=a1} body: 69 64 6F 61 6C 6C 2E 6F 72 67 5F 62 6F 64 79  idoall.org_body }

    

Example 6: Hadoop (HDFS) sink
(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/hdfs_sink.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://m1:9000/user/flume/syslogtcp
a1.sinks.k1.hdfs.filePrefix = Syslog
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console

(c) Produce a test syslog message

echo "hello idoall flume -> hadoop testing one" | nc localhost 5140

(d) Open another window on m1 and check on Hadoop whether the file was created

hadoop fs -ls /user/flume/syslogtcp
hadoop fs -cat /user/flume/syslogtcp/Syslog.1407644509504

    

Example 7: File Roll Sink
(a) Create the agent configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/file_roll.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5555
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/flume-1.5.0-bin/logs

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(b) Start flume agent a1

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/file_roll.conf -n a1 -Dflume.root.logger=INFO,console

(c) Produce test log messages

echo "hello idoall.org syslog" | nc localhost 5555
echo "hello idoall.org syslog 2" | nc localhost 5555

(d) Check whether files were created under /home/hadoop/flume-1.5.0-bin/logs; by default a new file is rolled every 30 seconds

ll /home/hadoop/flume-1.5.0-bin/logs
cat /home/hadoop/flume-1.5.0-bin/logs/1407646164782-1
cat /home/hadoop/flume-1.5.0-bin/logs/1407646164782-2
hello idoall.org syslog
hello idoall.org syslog 2
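The 30-second default is the file_roll sink's sink.rollInterval setting; as a sketch, you could slow rolling down (or disable time-based rolling with 0) by adding a line such as:

a1.sinks.k1.sink.rollInterval = 60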

 

 

Example 8: Replicating Channel Selector. Flume supports fanning out a flow from one source to multiple channels. There are two fan-out modes, replicating and multiplexing. In the replicating case, the event is sent to every configured channel; in the multiplexing case, the event is sent to only a subset of the eligible channels. A fan-out flow requires rules specifying the source and the fan-out channels. This example uses two machines, m1 and m2.
(a) On m1, create the replicating_Channel_Selector configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector.conf

 

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Describe the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = m1
a1.sinks.k1.port = 5555

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = m2
a1.sinks.k2.port = 5555

# Use channels which buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

(b) On m1, create the replicating_Channel_Selector_avro configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector_avro.conf

 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(c) From m1, copy the two configuration files to m2

scp -r /home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector.conf
scp -r /home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector_avro.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector_avro.conf

(d) Open four windows and start both flume agents on m1 and m2

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector_avro.conf -n a1 -Dflume.root.logger=INFO,console
flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/replicating_Channel_Selector.conf -n a1 -Dflume.root.logger=INFO,console

(e) Then, on either m1 or m2, produce a test syslog message

echo "hello idoall.org syslog" | nc localhost 5140

(f) In the sink windows on both m1 and m2 you should see the following, showing the event was replicated to both:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 68 65 6C 6C 6F 20 69 64 6F 61 6C 6C 2E 6F 72 67 hello idoall.org }

 

Example 9: Multiplexing Channel Selector
(a) On m1, create the Multiplexing_Channel_Selector configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector.conf

 

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type

# Mappings may send a given value to overlapping channels; the default
# mapping may name any number of channels.
a1.sources.r1.selector.mapping.baidu = c1
a1.sources.r1.selector.mapping.ali = c2
a1.sources.r1.selector.default = c1

# Describe the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = m1
a1.sinks.k1.port = 5555

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = m2
a1.sinks.k2.port = 5555

# Use channels which buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

(b) On m1, create the Multiplexing_Channel_Selector_avro configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector_avro.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(c) Copy the two configuration files to m2

scp -r /home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector.conf
scp -r /home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector_avro.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector_avro.conf

(d) Open four windows and start both flume agents on m1 and m2

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector_avro.conf -n a1 -Dflume.root.logger=INFO,console
flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Multiplexing_Channel_Selector.conf -n a1 -Dflume.root.logger=INFO,console

(e) Then, on either m1 or m2, send test events over HTTP

curl -X POST -d '[{ "headers" :{"type" : "baidu"},"body" : "idoall_TEST1"}]' http://localhost:5140 &&
curl -X POST -d '[{ "headers" :{"type" : "ali"},"body" : "idoall_TEST2"}]' http://localhost:5140 &&
curl -X POST -d '[{ "headers" :{"type" : "qq"},"body" : "idoall_TEST3"}]' http://localhost:5140

(f) In the sink window on m1 you should see the following (the "qq" event falls through to the default channel, c1):

Event: { headers:{type=baidu} body: 69 64 6F 61 6C 6C 5F 54 45 53 54 31}
Event: { headers:{type=qq} body: 69 64 6F 61 6C 6C 5F 54 45 53 54 33}

(g) In the sink window on m2 you should see:

Event: { headers:{type=ali} body: 69 64 6F 61 6C 6C 5F 54 45 53 54 32}

As you can see, events are routed to different channels according to the value of the header.

 

Example 10: Flume Sink Processors (failover). With the failover processor, events keep going to one sink; when that sink becomes unavailable, they are automatically sent to the next one.

(a) On m1, create the Flume_Sink_Processors configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors.conf

 

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# The key to configuring failover: a sink group is required
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# The processor type is failover
a1.sinkgroups.g1.processor.type = failover

# Priority: a higher number means higher priority; every sink must have a distinct priority
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10

# 10 seconds here; tune faster or slower to suit your situation
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Describe the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = m1
a1.sinks.k1.port = 5555

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = m2
a1.sinks.k2.port = 5555

# Use channels which buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

(b) On m1, create the Flume_Sink_Processors_avro configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors_avro.conf

  

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(c) Copy the two configuration files to m2

scp -r /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors.conf
scp -r /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors_avro.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors_avro.conf

(d) Open four windows and start both flume agents on m1 and m2

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console
flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors.conf -n a1 -Dflume.root.logger=INFO,console

(e) Then, on either m1 or m2, produce a test log message

echo "idoall.org test1 failover" | nc localhost 5140

(f) Because m2 has the higher priority, you will see the following in the sink window on m2, and nothing on m1:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 31 idoall.org test1 }

(g) Now stop the sink agent on m2 (Ctrl+C) and send another test message:

echo "idoall.org test2 failover" | nc localhost 5140

(h) In the sink window on m1 you should now see both test messages delivered:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 31 idoall.org test1 }
Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 32 idoall.org test2 }

(i) Restart the sink agent in the m2 window:

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Flume_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console

(j) Send two more batches of test data:

echo "idoall.org test3 failover" | nc localhost 5140 && echo "idoall.org test4 failover" | nc localhost 5140

(k) In the sink window on m2 you should see the following; because of the priorities, log messages fall back to m2 again:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 33 idoall.org test3 }
Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 34 idoall.org test4 }

 

 

Example 11: Load balancing Sink Processor. Unlike failover, the load_balance processor offers two selection policies, round-robin and random. In both cases, if the selected sink is unavailable, the processor automatically tries the next available sink.

(a) On m1, create the Load_balancing_Sink_Processors configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors.conf

  

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# The key to configuring load balancing: a sink group is required
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

# Describe the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = m1
a1.sinks.k1.port = 5555

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = m2
a1.sinks.k2.port = 5555

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

(b) On m1, create the Load_balancing_Sink_Processors_avro configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors_avro.conf

  

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5555

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(c) Copy the two configuration files to m2

scp -r /home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors.conf
scp -r /home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors_avro.conf root@m2:/home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors_avro.conf

(d) Open four windows and start both flume agents on m1 and m2

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors_avro.conf -n a1 -Dflume.root.logger=INFO,console
flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/Load_balancing_Sink_Processors.conf -n a1 -Dflume.root.logger=INFO,console

(e) Then, on either m1 or m2, produce test log messages. Enter them one line at a time; if you send them too quickly, they tend to all land on one machine

echo "idoall.org test1" | nc localhost 5140
echo "idoall.org test2" | nc localhost 5140
echo "idoall.org test3" | nc localhost 5140
echo "idoall.org test4" | nc localhost 5140

(f) In the sink window on m1 you should see:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 32 idoall.org test2 }
Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 34 idoall.org test4 }

(g) In the sink window on m2 you should see:

Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 31 idoall.org test1 }
Event: { headers:{Severity=0, flume.syslog.status=Invalid, Facility=0} body: 69 64 6F 61 6C 6C 2E 6F 72 67 20 74 65 73 74 33 idoall.org test3 }

This shows the round-robin policy at work.

 

 

Example 12: HBase sink
(a) Before testing, start HBase first.

(b) Then copy the following jars into Flume's lib directory:

cp /home/hadoop/hbase-0.96.2-hadoop2/lib/protobuf-java-2.5.0.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/hbase-client-0.96.2-hadoop2.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/hbase-common-0.96.2-hadoop2.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/hbase-protocol-0.96.2-hadoop2.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/hbase-server-0.96.2-hadoop2.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/hbase-hadoop2-compat-0.96.2-hadoop2.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/hbase-hadoop-compat-0.96.2-hadoop2.jar /home/hadoop/flume-1.5.0-bin/lib
cp /home/hadoop/hbase-0.96.2-hadoop2/lib/htrace-core-2.04.jar /home/hadoop/flume-1.5.0-bin/lib

(c) Make sure the test_idoall_org table already exists in HBase.

(d) On m1, create the hbase_simple configuration file

vi /home/hadoop/flume-1.5.0-bin/conf/hbase_simple.conf

  

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hbase
a1.sinks.k1.table = test_idoall_org
a1.sinks.k1.columnFamily = name
a1.sinks.k1.column = idoall
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(e) Start the flume agent

flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/hbase_simple.conf -n a1 -Dflume.root.logger=INFO,console

(f) Produce a test syslog message

echo "hello idoall.org from flume" | nc localhost 5140

(g) Log in to HBase; you should find the new data has been inserted

hbase shell

hbase(main):001:0> list
TABLE
hbase2hive_idoall
hive2hbase_idoall
test_idoall_org
=> ["hbase2hive_idoall", "hive2hbase_idoall", "test_idoall_org"]

hbase(main):002:0> scan "test_idoall_org"

hbase(main):004:0> quit

    

--------------------------------------------------------------------------------------------------------------------------

 

1. Environment setup

    You need the JDK, flume-ng, the MongoDB Java driver, and flume-ng-mongodb-sink.
(1) JDK download: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
(2) flume-ng download: http://www.apache.org/dyn/closer.cgi/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
(3) MongoDB Java driver jar download: https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar
(4) flume-ng-mongodb-sink source download: https://github.com/leonlee/flume-ng-mongodb-sink
flume-ng-mongodb-sink must be built into a jar yourself: download the code from GitHub, unpack it, and run mvn package to produce the jar. Maven must be installed for the build, and the machine needs network access.
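A minimal build sketch, assuming git and maven are already installed:

git clone https://github.com/leonlee/flume-ng-mongodb-sink.git
cd flume-ng-mongodb-sink
mvn package
# the built jar is produced under target/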

2. How it works, briefly

    This is a story about a pool. There is a pool with an inlet at one end and an outlet at the other. All kinds of pipes can be attached to the inlet, and all kinds of pipes to the outlet, and there can be multiple inlets and multiple outlets. The water is called an Event, an inlet is called a Source, an outlet is called a Sink, and the pool itself is called a Channel; Source + Channel + Sink together are called an Agent. If needed, multiple Agents can be chained together.
For more detail, see the official documentation: http://flume.apache.org/FlumeDeveloperGuide.html

3. Flume configuration

(1) env configuration

      Put the mongo-java-driver and flume-ng-mongodb-sink jars into flume's lib directory and add their paths to the FLUME_CLASSPATH variable in flume-env.sh (a sketch follows below).
  JAVA_OPTS variable: add -Dflume.monitoring.type=http -Dflume.monitoring.port=xxxx to expose monitoring information at [hostname:xxxx]/metrics; -Xms sets the JVM's initial heap and -Xmx its maximum heap.
  FLUME_HOME variable: the Flume root directory.
  JAVA_HOME variable: the Java root directory.
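As a sketch, the relevant lines of conf/flume-env.sh might look like the following (the jar file names, heap sizes, and monitoring port are placeholder assumptions, not taken from the original post):

export JAVA_HOME=/usr/local/jdk
export FLUME_HOME=/home/work/flume
export FLUME_CLASSPATH="$FLUME_HOME/lib/mongo-java-driver-2.13.0.jar:$FLUME_HOME/lib/flume-ng-mongodb-sink.jar"
export JAVA_OPTS="-Xms512m -Xmx1024m -Dflume.monitoring.type=http -Dflume.monitoring.port=34545"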

(2) log configuration

      While debugging, set the log level to debug and write it to a file: flume.root.logger=DEBUG,LOGFILE
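flume.root.logger lives in conf/log4j.properties, and it can also be overridden per run on the command line, for example:

flume-ng agent --conf ./conf/ --conf-file ./conf/flume.conf -n agent1 -Dflume.root.logger=DEBUG,LOGFILE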

(3) Transport configuration
        This setup uses an Exec source, a file channel, and flume-ng-mongodb-sink.
    Example Source configuration:

my_agent.sources.my_source_1.channels = my_channel_1
my_agent.sources.my_source_1.type = exec
my_agent.sources.my_source_1.command = python  xxx.py
my_agent.sources.my_source_1.shell = /bin/bash -c
my_agent.sources.my_source_1.restartThrottle = 10000
my_agent.sources.my_source_1.restart = true
my_agent.sources.my_source_1.logStdErr = true
my_agent.sources.my_source_1.batchSize = 1000
my_agent.sources.my_source_1.interceptors = i1 i2 i3
my_agent.sources.my_source_1.interceptors.i1.type = static
my_agent.sources.my_source_1.interceptors.i1.key = db
my_agent.sources.my_source_1.interceptors.i1.value = cswuyg_test
my_agent.sources.my_source_1.interceptors.i2.type = static
my_agent.sources.my_source_1.interceptors.i2.key = collection
my_agent.sources.my_source_1.interceptors.i2.value = cswuyg_test
my_agent.sources.my_source_1.interceptors.i3.type = static
my_agent.sources.my_source_1.interceptors.i3.key = op
my_agent.sources.my_source_1.interceptors.i3.value = upsert

    Field notes:
    The exec source's command is python xxx.py. The xxx.py script processes the raw logs and, following the convention agreed with flume-ng-mongodb-sink, prints JSON-formatted records; update-style operations must include an _id field. Each printed line becomes the Event body, and custom Event headers are then attached through interceptors (a sketch of the printed format follows below).
The static interceptors add entries to the Event header. Here they set db=cswuyg_test, collection=cswuyg_test, and op=upsert; these three keys are the convention shared with flume-ng-mongodb-sink, specifying the target MongoDB database and collection names and that the operation type is update (upsert).
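As a hedged sketch of that convention (the field names other than _id are illustrative assumptions, not taken from the plugin's documentation), each line printed by xxx.py would be one JSON document:

{"_id": "20150101_pageA", "pv": 100, "_S": "2015-01-01 12:00:00"}

With op=upsert in the header and an _id present in the body, the sink can apply the document as an update keyed on _id.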

    Example Channel configuration:

my_agent.channels.my_channel_1.type = file
my_agent.channels.my_channel_1.checkpointDir = /home/work/flume/file-channel/my_channel_1/checkPoint
my_agent.channels.my_channel_1.useDualCheckpoints = true
my_agent.channels.my_channel_1.backupCheckpointDir = /home/work/flume/file-channel/my_channel_1/checkPoint2
my_agent.channels.my_channel_1.dataDirs = /home/work/flume/file-channel/my_channel_1/data
my_agent.channels.my_channel_1.transactionCapacity = 10000
my_agent.channels.my_channel_1.checkpointInterval = 30000
my_agent.channels.my_channel_1.maxFileSize = 4292870142
my_agent.channels.my_channel_1.minimumRequiredSpace = 524288000
my_agent.channels.my_channel_1.capacity = 100000

    Field notes:

    The parameter to pay attention to is capacity, which sets how many Events the channel ("pool") can hold; pick a value suited to your log volume. If you also use a file channel and disk space is plentiful, set it as large as you reasonably can.
    dataDirs sets where the channel stores its data. If possible, choose disks with relatively low I/O pressure; multiple disk directories can be listed, separated by commas.

    Example Sink configuration:

my_agent.sinks.my_mongo_1.type = org.riderzen.flume.sink.MongoSink
my_agent.sinks.my_mongo_1.host = xxxhost
my_agent.sinks.my_mongo_1.port = yyyport
my_agent.sinks.my_mongo_1.model = dynamic
my_agent.sinks.my_mongo_1.batch = 10
my_agent.sinks.my_mongo_1.channel = my_channel_1
my_agent.sinks.my_mongo_1.timestampField = _S

    Field notes:

    Setting model to dynamic means the MongoDB database and collection names are taken from the Event header. The timestampField option converts the value of the given key in the JSON document into a datetime before storing it in MongoDB. flume-ng-mongodb-sink does not support nested key paths (such as _S.y), but you can modify the sink's code yourself to add that.

    Example agent configuration:

my_agent.channels = my_channel_1
my_agent.sources = my_source_1
my_agent.sinks = my_mongo_1

(4) Startup

    You can write a control.sh script to start, stop, and restart Flume (a sketch follows the demo below).
    Startup demo:
./bin/flume-ng agent --conf ./conf/ --conf-file ./conf/flume.conf -n agent1 > ./start.log 2>&1 &
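A minimal control.sh sketch; FLUME_HOME, the agent name, and the pkill pattern are assumptions you would adapt to your deployment:

#!/bin/bash
# control.sh start|stop|restart -- minimal helper around flume-ng
FLUME_HOME=/home/work/flume
AGENT_NAME=agent1
CONF_FILE=$FLUME_HOME/conf/flume.conf

start() {
    cd "$FLUME_HOME" || exit 1
    nohup ./bin/flume-ng agent --conf ./conf/ --conf-file "$CONF_FILE" -n "$AGENT_NAME" > ./start.log 2>&1 &
    echo "flume agent $AGENT_NAME started (pid $!)"
}

stop() {
    # match the agent JVM by its config file and agent name on the command line
    pkill -f "flume.conf.*$AGENT_NAME" && echo "flume agent $AGENT_NAME stopped"
}

case "$1" in
    start)   start ;;
    stop)    stop ;;
    restart) stop; sleep 2; start ;;
    *) echo "usage: $0 start|stop|restart"; exit 1 ;;
esac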


    From then on, log data flows out of the log files via xxx.py, into the file channel, is drained by flume-ng-mongodb-sink, and arrives at its destination, the MongoDB cluster.
Once this basic pipeline works, the remaining work is to tune xxx.py and strengthen flume-ng-mongodb-sink.

4. Miscellaneous

 1. Monitoring: the officially recommended monitoring tool is Ganglia (http://sourceforge.net/projects/ganglia/), which has a graphical interface.

 2. Version changes: from 1.X on, Flume no longer uses ZooKeeper and, for data reliability, provides end-to-end (E2E) support; the DFO (store on failure) and BE (best effort) modes of the pre-rewrite design were dropped. E2E means an event is deleted from the channel only once it is guaranteed to have reached the next agent or the final destination. Note that this says nothing about protecting data before it enters the channel; with an ingest path like Exec Source, that part is the user's responsibility.

 3. Stopping plugins: with Exec Source, restarting Flume does not kill the old plugin process; you must shut it down yourself.

 4. Exec Source cannot guarantee zero data loss, because it simply pours water into the pool without regard for the pool's state; see the Warning section of https://flume.apache.org/FlumeUserGuide.html#exec-source. The spooling directory source is not necessarily a better option, though: it watches a directory, but you must not rename files inside it, must not overwrite a file with the same name, and must not drop in half-written files. Once a file has been fully transferred it is renamed to xx.COMPLETED, and a periodic cleanup script is needed to remove those files (a sketch follows below). Restarts can produce duplicate events, because files that were only half transferred are never marked as completed.
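A minimal cleanup sketch that could run from cron, assuming the spool directory from Example 2 and a 7-day retention (both placeholders):

find /home/hadoop/flume-1.5.0-bin/logs -name "*.COMPLETED" -mtime +7 -delete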

 5. Throughput bottleneck: when using flume + mongodb to move large volumes of data reliably (tens of thousands of log lines per second is not a large volume, nor are a few hundred GB per day), the bottleneck appears in MongoDB, especially for Update-type traffic.

 6. Needed changes to the current flume-ng-mongodb-sink plugin: (1) let update support $setOnInsert; (2) fix the bug where an update with an empty $set or $inc raises an exception; (3) fix the bug where, in a bulk insert, a duplicate-key exception on one log line causes all subsequent lines in the same batch to be discarded.

 7. Flume is quite similar to fluentd, but Flume, coming from the Hadoop ecosystem, is more popular, which is why I chose it.

 8. Bulk deployment: pack the JDK and Flume into tarballs, then use Python's paramiko library to push the tarballs to every machine, unpack, and run (a plain-shell equivalent is sketched below).
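The original approach uses paramiko; for reference, a plain scp/ssh sketch of the same idea (host list, paths, and tarball name are placeholders):

for host in m1 m2 m3; do
    scp flume-1.5.0-bin.tar.gz root@$host:/home/hadoop/
    ssh root@$host "cd /home/hadoop && tar -xzf flume-1.5.0-bin.tar.gz"
done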

 
