Flume User Guide - Part 1

Latest recommended article published on 2023-09-22 23:49:35

三千大千世界

Original: http://flume.apache.org/FlumeUserGuide.html

Version: 1.6

Data flow model

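A few concepts, in my own words:

source: as the name suggests, the origin of events; it is the entry point.

channel: the conduit through which events (messages) are passed from source to sink.

sink: where events or messages end up. It can be HDFS, as in the figure in the original guide, or somewhere else; in log collection, for example, a sink can feed Kafka.

flume event: a unit of data flow consisting of a byte payload and an optional set of string attributes.

flume agent: a JVM process that hosts the components through which events flow from an external source to the next destination (hop).

An external source sends events to Flume in a format the target Flume source recognizes, and the Flume source consumes them. For example, an Avro Flume source can receive events from an Avro client, or from the Avro sink of another agent in the flow. The rest is omitted here.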

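Complex flows

Flume allows users to build multi-hop flows, in which events travel through multiple agents before reaching the final destination. It also supports fan-in and fan-out flows, contextual routing, and backup routes (fail-over) for failed hops.

Reliability

Omitted.

Recoverability

Omitted.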

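Setting up an agent

Flume agent configuration is stored in a local file, a text file in Java properties format. One file can hold the configuration of one or more agents. It contains the properties of each source, sink, and channel, and describes how they are wired together to form a data flow.

Configuring individual components

Each component (source, sink, channel) has a name, a type, and a set of properties specific to that type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port to receive data on. A memory channel can have a maximum queue size (capacity). An HDFS sink needs to know the file system URI, the path under which to create files, and how often to roll files (hdfs.rollInterval; I haven't looked closely at HDFS, but this feels similar to log4j, where you can roll one file per day, for instance). All of these properties have to be set in the agent's properties file (which goes without saying).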
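The three kinds of per-component properties mentioned above could look roughly like the fragment below. This is only a sketch: the agent name a1, the component names r1, c1, k1, the port, and the HDFS path are placeholders of mine, and the fragment leaves out the lines that list and wire the components.

```properties
# Hypothetical agent "a1" with components r1 (source), c1 (channel), k1 (sink).

# Avro source: the address and port to listen on
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Memory channel: maximum number of events held in the queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: file system URI plus the directory in which files are created
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# Roll the current file every 3600 seconds (0 would disable time-based rolling)
a1.sinks.k1.hdfs.rollInterval = 3600
# Disable size-based and count-based rolling so that only the interval applies
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```

As far as I know, hdfs.rollInterval is expressed in seconds, and the sink also rolls on file size and event count unless hdfs.rollSize and hdfs.rollCount are set to 0, which is why the sketch sets both.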

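Wiring the pieces together

The agent needs to know which components to load and how they are connected to form a flow. This is done by listing the names of the sources, sinks, and channels in the agent, and then specifying the connecting channel for each source and sink. For example, an agent flows events from an Avro source called avroWeb to an HDFS sink called hdfs-cluster1 through a file channel called file-channel. The configuration file contains the names of these components, with file-channel as the channel shared by the avroWeb source and the hdfs-cluster1 sink.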
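A minimal sketch of that wiring, using the component names from the paragraph above. The agent name agent1 is my own placeholder, and the type and other properties of each component are left out.

```properties
# Hypothetical agent name "agent1"; component names are the ones used in the text.
agent1.sources = avroWeb
agent1.channels = file-channel
agent1.sinks = hdfs-cluster1

# A source may write to several channels, hence the plural key "channels"
agent1.sources.avroWeb.channels = file-channel

# A sink reads from exactly one channel, hence the singular key "channel"
agent1.sinks.hdfs-cluster1.channel = file-channel
```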

Starting an agent

You need the agent name, the config directory, and the config file, as shown below.

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

A simple example

This example lets a user generate events and logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

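This configuration defines a single agent named a1. It has a source that listens for data on port 44444, a channel that buffers events in memory, and a sink that logs events to the console. A configuration file can define several agents, so when an agent process is launched it is passed a name (here via --name) that tells it which agent's configuration to load, as shown below.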

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

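In a full deployment, the directory given by --conf would contain a flume-env.sh script and possibly a log4j.properties file. In this example we pass a Java -D option to make Flume log to the console, and we run without a custom flume-env.sh. The session below shows the result.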

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

The terminal running the agent prints the event in its log output:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

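That's basically it.

ZooKeeper-based configuration

Upload the agent configuration files under a configurable base path in ZooKeeper; each file is stored as the data of a node. The tree looks like this.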

- /flume
 |- /a1 [Agent config file]
 |- /a2 [Agent config file]

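Once the configuration files are uploaded, start the agent as follows.

$ bin/flume-ng agent --conf conf -z zkhost:2181,zkhost1:2181 -p /flume --name a1 -Dflume.root.logger=INFO,console

-z: ZooKeeper connection string, a comma-separated list of hostname:port

-p: base path in ZooKeeper under which agent configurations are stored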

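Installing third-party plugins

Flume has a plugin-based architecture. Custom components can be included by adding their jars to the FLUME_CLASSPATH variable in flume-env.sh. Flume now also supports a special directory called plugins.d, from which plugins packaged in a specific format are picked up automatically.

The plugins.d directory

The plugins.d directory is located at $FLUME_HOME/plugins.d. At startup, the flume-ng script looks in this directory for plugins that conform to the format below and includes them in the proper paths when launching Java (which I take to mean the classpath).

Each plugin (subdirectory) under plugins.d can have up to three sub-directories: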

lib - the plugin’s jar(s)
libext - the plugin’s dependency jar(s)
native - any required native libraries, such as .so files

For example:

plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so

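Data ingestion

Flume supports several mechanisms for ingesting data from external sources.

The Avro client included in the Flume distribution can send a given file to a Flume Avro source: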

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

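The command above sends the contents of /usr/logs/log.10 to the Flume source listening on port 41414.

Flume also has an exec source, which executes a given command and consumes its output.

Note: Flume does not support tail as a source directly, but the tail command can be wrapped in an exec source to stream a file.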
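Wrapping tail in an exec source might look like the sketch below. The agent name a1, the component names, and the log path are placeholders of mine; the logger sink is there only so the events show up on the console.

```properties
# Hypothetical agent "a1"; /var/log/app.log is a placeholder path.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Exec source: runs the command and turns each line of its output into an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# In-memory buffer between source and sink (default capacity)
a1.channels.c1.type = memory

# Logger sink, used here only to make the events visible
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

One caveat worth keeping in mind: the process behind an exec source has no way to learn that an event was lost, so this setup is better suited to experiments than to flows that need delivery guarantees.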

Network streams

Avro
Thrift
Syslog
Netcat
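Each of the four stream types above corresponds to a source type that listens on a socket. The fragment below sketches one source of each kind in a single agent. The agent name, component names, and ports are placeholders of mine, and the logger sink is included only to drain the channel.

```properties
# Hypothetical agent "a1"; all names and ports are placeholders.
a1.sources = avroSrc thriftSrc syslogSrc netcatSrc
a1.channels = c1
a1.sinks = k1

a1.sources.avroSrc.type = avro
a1.sources.avroSrc.bind = 0.0.0.0
a1.sources.avroSrc.port = 4141
a1.sources.avroSrc.channels = c1

a1.sources.thriftSrc.type = thrift
a1.sources.thriftSrc.bind = 0.0.0.0
a1.sources.thriftSrc.port = 4142
a1.sources.thriftSrc.channels = c1

# The syslog TCP source uses the key "host" rather than "bind"
a1.sources.syslogSrc.type = syslogtcp
a1.sources.syslogSrc.host = 0.0.0.0
a1.sources.syslogSrc.port = 5140
a1.sources.syslogSrc.channels = c1

a1.sources.netcatSrc.type = netcat
a1.sources.netcatSrc.bind = 0.0.0.0
a1.sources.netcatSrc.port = 6666
a1.sources.netcatSrc.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```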

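Setting up a multi-agent flow

For data to flow across multiple agents, the sink of the previous agent and the source of the next agent must both be of type avro, and the sink must point to the hostname (or IP address) and port of the source.

Consolidation

A very common scenario in log collection is a large number of log-producing clients sending data to a small number of consumer agents. For example, logs from hundreds of web servers are sent to a dozen or so agents, which write to HDFS.

In Flume this is done by configuring a number of first-tier agents with an Avro sink, all pointing to the Avro source of a single agent (Thrift sources, sinks, and clients work here as well). The source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink that writes to the final destination, HDFS.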
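The two tiers could be sketched as follows. Everything here is illustrative: the agent names web_agent and collector, the component names, the address 10.1.1.100, the port, and the paths are my own placeholders. The same web_agent definition would be deployed on every web server, each started with --name web_agent, while the collector runs once with --name collector.

```properties
# --- First tier: one such agent per web server (hypothetical name "web_agent") ---
web_agent.sources = app-log
web_agent.channels = ch1
web_agent.sinks = to-collector

web_agent.sources.app-log.type = exec
web_agent.sources.app-log.command = tail -F /var/log/app.log
web_agent.sources.app-log.channels = ch1

web_agent.channels.ch1.type = memory

# Avro sink: points at the address and port of the collector's Avro source
web_agent.sinks.to-collector.type = avro
web_agent.sinks.to-collector.hostname = 10.1.1.100
web_agent.sinks.to-collector.port = 4545
web_agent.sinks.to-collector.channel = ch1

# --- Second tier: a single consolidating agent (hypothetical name "collector") ---
collector.sources = avro-in
collector.channels = ch1
collector.sinks = hdfs-out

# Avro source: receives events from all first-tier Avro sinks
collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 10.1.1.100
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = ch1

collector.channels.ch1.type = memory

collector.sinks.hdfs-out.type = hdfs
collector.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/weblogs
collector.sinks.hdfs-out.channel = ch1
```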

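Multiplexing the flow

By configuring a flow multiplexer, Flume can either replicate an event (send it to every channel) or selectively route it to one or more channels.

Configuration

Defining the flow

To define the flow within an agent, list the agent's sources, channels, and sinks, then link the sources and sinks through channels. A source can specify multiple channels, but each sink can be attached to only one channel.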

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

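For example, an agent named agent_foo reads data from an external Avro client and writes it to HDFS through a memory channel. Its config file, weblog.config, looks like this: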

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

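Configuring individual components

After defining the flow, set the properties of each source, sink, and channel. This uses the same hierarchical key format that is used to set a component's type and its other properties.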

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

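Every component must have its type property set so that Flume knows what kind of object it is. Each source, sink, and channel type has its own set of properties required for it to function, and these must be set as needed. The previous example describes a flow from an Avro source to an HDFS sink through the memory channel mem-channel-1; here the source and sink are named avro-AppSrv-source and hdfs-Cluster1-sink. The example below shows the configuration of each of these components.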

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

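# set channel for sources, sinks
agent_foo.sources.avro-AppSrv-source.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink.channel = mem-channel-1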

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

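Adding multiple flows in an agent

A single agent can contain several independent flows. List multiple sources, sinks, and channels in the configuration file, then link them to form the flows.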

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

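You can then link each source (to one or more channels) and each sink (to exactly one channel) to set up the different flows. For example, suppose you want two flows in one agent: one from an external Avro client to external HDFS storage, and another from the output of a tail command to an Avro sink. The configuration below does this.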

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

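Configuring a multi-agent flow

To set up a multi-tier flow, the sink of the first hop must be an Avro or Thrift sink pointing to the source of the next hop, and that source must be of the matching type (Avro or Thrift). The first agent then forwards events to the next agent. For example, suppose you periodically send files (one file per event) with the Avro client to a local agent, and the local agent forwards the events to another agent that is attached to storage. The weblog agent is configured as follows.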

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

# configure other pieces
#...

HDFS agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000

# configure other pieces
#...

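Here the avro-forward-sink of the weblog agent is linked to the avro-collection-source of the HDFS agent, so events coming from the external source end up stored in HDFS.

Fan out flow (literally "fanning out": one source sends to multiple channels, spreading like a fan)

Flume supports fanning out the flow from one source to multiple channels. There are two modes: replicating and multiplexing. In replicating mode the event is sent to all of the source's configured channels. In multiplexing mode the event is sent only to a qualifying subset of them. To fan out the flow, specify a list of channels for the source together with the fan-out policy. The policy is set by configuring a channel selector on the source, which can be either replicating or multiplexing.

For multiplexing, the selection rules must also be specified. If no selector is specified, the default is replicating.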

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

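The multiplexing selector has a further set of properties that control how the flow is split. It requires a mapping from the values of an event header to sets of channels. The selector checks the configured header of each event: if its value matches one of the specified values, the event is sent to all channels mapped to that value; if nothing matches, the event is sent to the channels configured as default.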

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...

<Agent>.sources.<Source1>.selector.default = <Channel2>

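The mappings may overlap: the same channel can appear under more than one value.

The following example is an agent named agent_foo with one Avro source fanning out to two channels, each linked to its own sink.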

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

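The selector checks a header called State. If the value is CA, the event is sent to mem-channel-1; if AZ, to file-channel-2; if NY, to both. If the State header is not set or matches none of the three, the event goes to the default channel, mem-channel-1.

The selector also supports optional channels. To specify optional channels for a header value, use the optional parameter as follows.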

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

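The selector first tries to write to the required (non-optional) channels, within a transaction. If even one of these channels fails to consume the events, the transaction fails and is reattempted on all of the channels (note: I suspect "all of the channels" means all of the matched channels; I will verify this in a later test). Once all required channels have consumed the events, the selector tries to write to the optional channels. A failure by any optional channel to consume the event is simply ignored and not retried.

If, for a given header value, a channel is listed as both required and optional, it is treated as required, and a failed write to it causes the whole set of required channels to be retried. In the example above, for the value CA, mem-channel-1 is listed under mapping.CA (required) and again under optional.CA, so it counts as required, and a failed write to it causes the event to be retried on all channels configured for the selector.

1. If a header value has no required channels, the event is written to the default channels, and a write to the optional channels for that value is also attempted.

2. Specifying optional channels without any required channels still causes the event to be written to the default channels.

3. If no default channels are configured and there are no required channels, the selector tries to write to the optional channels; any failures are simply ignored.

My own explanation of points 1 and 2.

#1. Continuing with the example above, we add one more value, ABC: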

agent_foo.sources.avro-AppSrv-source1.selector.optional.ABC = file-channel-2

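The State value ABC has no required channel, so the event is written to the default channel, mem-channel-1, and then a write to the optional channel, file-channel-2, is attempted.

This point is about a single header value that has no required channel configured.

#2. This one is about a selector whose channels are all optional, with no required channels at all. Here is an example of my own, unrelated to the official example above.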

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.optional.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1
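Point 3 can be sketched in the same style. This is my own variation rather than an example from the guide: it has no mapping entries and no default line, so every channel is optional. If the guide's description holds, an event with State = CA would be attempted on mem-channel-1 and an event with State = AZ on file-channel-2, with any write failure ignored. As far as I can tell, an event with any other State value would have no channel to go to under this configuration.

```properties
# Sketch for point 3: only optional channels, no "mapping" entries, no "default"
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.optional.AZ = file-channel-2
```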

To be continued...