Flume Introduction and Installation

Latest recommended article published on 2023-09-30 13:09:57

周天祥

Article link: https://blog.csdn.net/u014646662/article/details/82940034

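Flume 1.8.0 User Guide

1. Introduction

Overview
System Requirements

2. Architecture

Data flow model
Complex flows
Reliability
Recoverability
Setup
Multiple agents
Consolidation
Multiplexing the flow
Configuring a multi-agent flow
Fan-out flow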

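1. Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not limited to log data aggregation. Because data sources are customizable, Flume can transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and almost any other data source.

Apache Flume is a top-level project of the Apache Software Foundation.

Two release code lines are currently available: 0.9.x and 1.x.

New and existing users are encouraged to use the 1.x releases to take advantage of the performance improvements and configuration flexibility of the latest architecture.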

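System Requirements

Java Runtime Environment: Java 1.8 or later
Memory: sufficient memory for the configurations used by sources, channels, and sinks
Disk space: sufficient disk space for the configurations used by channels and sinks
Directory permissions: read/write permissions for the directories used by the agent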

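2. Architecture

Data flow model

A Flume event is defined as a unit of data flow with a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

A Flume source consumes events delivered to it by an external source such as a web server. The external source sends events to Flume in a format that the target Flume source recognizes. For example, an Avro Flume source can receive Avro events from Avro clients or from other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume source to receive events from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any language generated from the Flume Thrift protocol.

When a Flume source receives an event, it stores it into one or more channels. A channel is a passive store that keeps the event until a Flume sink consumes it. The file channel is one example; it is backed by the local file system. The sink removes the event from the channel and either puts it into an external repository such as HDFS (via the Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within a given agent run asynchronously, with the events staged in the channel.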

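Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing, and backup routes (failover) for failed hops.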

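Reliability

Events are staged in a channel on each agent. They are then delivered to the next agent or to the terminal repository (such as HDFS) in the flow. Events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee reliable delivery of events. Sources and sinks encapsulate the storage and retrieval of events, respectively, in a transaction provided by the channel. This ensures that a set of events is reliably passed from point to point in the flow. In a multi-hop flow, the sink of the previous hop and the source of the next hop both have their transactions running, to ensure that the data is safely stored in the channel of the next hop.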
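
The hand-off rule described above (an event leaves a channel only once the next hop has accepted it) can be illustrated with a small Python sketch. This is not Flume code: the names Channel, take_batch, rollback, and drain_once are hypothetical and chosen only for illustration, and a real channel transaction handles far more (durability, concurrency, capacity limits).

```python
from collections import deque


class Channel:
    """Toy in-memory channel; a stand-in for a Flume channel, not its real API."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take_batch(self, size):
        # Remove up to `size` events; the caller must then commit or roll back.
        batch = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch

    def rollback(self, batch):
        # Put the events back at the front, preserving their original order.
        self._queue.extendleft(reversed(batch))

    def __len__(self):
        return len(self._queue)


def drain_once(channel, deliver, batch_size=100):
    """Take one batch and drop it from the channel only if `deliver` succeeds."""
    batch = channel.take_batch(batch_size)
    try:
        deliver(batch)           # e.g. hand over to the next hop or to storage
    except Exception:
        channel.rollback(batch)  # delivery failed: events stay in this channel
        return False
    return True                  # "commit": the events have left this hop
```

If deliver raises, the batch goes back into the channel and can be retried later, which is the behaviour the paragraph above attributes to the sink-side transaction.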

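Recoverability

Events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel backed by the local file system. There is also a memory channel, which simply stores events in an in-memory queue; it is faster, but any events still in the memory channel when the agent process dies cannot be recovered.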

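Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same file. The configuration file includes the properties of each source, sink, and channel in an agent and how they are wired together to form data flows.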

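Configuring individual components

Each component (source, sink, or channel) in the flow has a name, a type, and a set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a maximum queue size ("capacity"), and an HDFS sink needs to know the file system URI, the path to create files, the frequency of file rotation ("hdfs.rollInterval"), and so on. All such attributes of a component need to be set in the properties file of the hosting Flume agent.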

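Wiring the pieces together

The agent needs to know which individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each source, sink, and channel in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to the HDFS sink hdfs-cluster1 via a file channel called file-channel. The configuration file will contain the names of these components, with file-channel as a shared channel for both the avroWeb source and the hdfs-cluster1 sink.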

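Starting an agent

An agent is started using a shell script called flume-ng, which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line: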

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

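The agent will now start running the sources and sinks configured in the given properties file.

A simple example

Here is an example configuration file describing a single-node Flume deployment. This configuration lets a user generate events and then logs them to the console.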

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

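This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components and then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched, a flag is passed telling it which named agent to run.

Given this configuration file, we can start Flume as follows: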

$ bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console

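Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory contains a shell script, flume-env.sh, and a log4j properties file. In this example, we pass a Java option to force Flume to log to the console, and we go without a custom environment script.

From a separate terminal, we can then telnet to port 44444 and send Flume an event: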

$  telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello Lucky!   -- type this, then press Enter
OK

The original Flume terminal will output the event in a log message:

2018-10-05 03:17:24,511 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2018-10-05 03:17:24,538 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
2018-10-05 03:25:55,613 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 4C 75 63 6B 79 21 0D          hello Lucky!. }
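
The body field in the last log line is the event payload printed as hexadecimal bytes. As a quick check, the following Python sketch (standard library only) decodes those bytes. The trailing 0D is a carriage return, most likely the one telnet sends before the newline, and the logger appears to print it as "." in the text column.

```python
# Hex dump of the event body, copied from the LoggerSink output above.
body_hex = "68 65 6C 6C 6F 20 4C 75 63 6B 79 21 0D"

# bytes.fromhex skips the spaces between the byte pairs.
body = bytes.fromhex(body_hex)
text = body.decode("ascii")

print(repr(text))  # 'hello Lucky!\r'
```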

Using environment variables in configuration files

Flume can substitute environment variables into the configuration:

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1

Start command:

NC_PORT=44444 bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties

Note: -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties must not be omitted; it is what enables the environment variable substitution.
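
To make the substitution step concrete, here is a minimal Python sketch of resolving ${NAME} placeholders from the environment. It only illustrates the idea and is not how Flume implements it; resolve_env is a hypothetical helper, and leaving unknown names untouched is a choice made for this sketch.

```python
import os
import re

# Matches placeholders of the form ${NAME}.
_VAR = re.compile(r"\$\{(\w+)\}")


def resolve_env(value, env=None):
    """Replace each ${NAME} in `value` with env[NAME]; leave unknown names as-is."""
    env = os.environ if env is None else env
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)


print(resolve_env("${NC_PORT}", {"NC_PORT": "44444"}))  # 44444
```

With NC_PORT=44444 in the environment, the value of a1.sources.r1.port in the configuration above would resolve to 44444 in the same way.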

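Multiple agents

Two agents communicating over Avro RPC

To flow data across multiple agents or hops, the sink of the previous agent and the source of the current hop need to be of type avro, with the sink pointing to the hostname (or IP address) and port of the source.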

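Consolidation

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.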

A fan-in flow using Avro RPC to consolidate events in one place

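This can be achieved in Flume by configuring a number of first-tier agents with an avro sink, all pointing to the avro source of a single agent (again, you could use thrift sources/sinks/clients in such a scenario). This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination.

Multiplexing the flow

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.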

A fan-out flow using a (multiplexing) channel selector

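The example above shows a source from agent "foo" fanning out the flow to three different channels. This fan-out can be replicating or multiplexing. In a replicating flow, each event is sent to all three channels. In the multiplexing case, an event is delivered to a subset of the available channels when an attribute of the event matches a preconfigured value. For example, if an event attribute called "txnType" is set to "customer", the event should go to channel1 and channel3; if it is "vendor", it should go to channel2; otherwise it goes to channel3. The mapping can be set in the agent's configuration file.

Configuration

As mentioned in the earlier section, Flume agent configuration is read from a file that resembles a Java properties file with hierarchical property settings.

Defining the flow

To define the flow within a single agent, you need to link the sources and sinks via a channel. You need to list the sources, sinks, and channels for the given agent, and then point the source and sink to a channel. A source instance can specify multiple channels, but a sink instance can specify only one channel. The format is as follows: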

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

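For example, an agent named agent_foo reads data from an external avro client and sends it to HDFS via a memory channel. The config file weblog.config could look like this: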

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

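This makes the events flow from avro-appserver-src-1 to hdfs-sink-1 through the memory channel mem-channel-1. When the agent is started with weblog.config as its config file, it will instantiate that flow.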
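
The wiring rules stated earlier (a source may list several channels, a sink names exactly one, and every referenced channel must be declared) can be checked mechanically. The Python sketch below is a hypothetical helper, not part of Flume; parse_properties handles only plain "key = value" lines rather than the full Java properties syntax, and the embedded config repeats the weblog.config example.

```python
def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_flow(props, agent):
    """Return a list of problems in how sources and sinks are wired to channels."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channel")
        problems += [f"source {source}: unknown channel {c}"
                     for c in used if c not in channels]
    for sink in props.get(f"{agent}.sinks", "").split():
        used = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(used) != 1:
            problems.append(f"sink {sink} must name exactly one channel")
        problems += [f"sink {sink}: unknown channel {c}"
                     for c in used if c not in channels]
    return problems


WEBLOG_CONFIG = """
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
"""

print(check_flow(parse_properties(WEBLOG_CONFIG), "agent_foo"))  # []
```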

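Configuring individual components

After defining the flow, you need to set the properties of each source, sink, and channel. This is done in the same hierarchical namespace fashion, where you set the component type and other values for the properties specific to each component: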

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

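The property "type" needs to be set for each component so that Flume knows what kind of object it needs to be. Each source, sink, and channel type has its own set of properties required for it to function as intended. All of those need to be set as needed. The example below configures a flow from an avro source named avro-AppSrv-source to an HDFS sink named hdfs-Cluster1-sink through the memory channel mem-channel-1: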

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

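# set channel for sources, sinks
agent_foo.sources.avro-AppSrv-source.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink.channel = mem-channel-1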

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

#...

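Adding multiple flows in an agent

A single Flume agent can contain several independent flows. You can list multiple sources, sinks, and channels in a config. These components can be linked to form multiple flows: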

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

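Then you can link the sources and sinks to their corresponding channels (for sources) or channel (for sinks) to set up two different flows. For example, if you need to set up two flows in an agent, one going from an external avro client to external HDFS and another from the output of a tail to an avro sink, here is a config that does it: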

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

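Configuring a multi-agent flow

To set up a multi-tier flow, you need an avro/thrift sink of the first hop pointing to the avro/thrift source of the next hop. This results in the first Flume agent forwarding events to the next Flume agent. For example, if you are periodically sending files (one file per event) using an avro client to a local Flume agent, then this local agent can forward them to another agent that has the storage mounted.

Weblog agent config: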

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

# configure other pieces
#...

HDFS agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000

# configure other pieces
#...

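Here we link the avro-forward-sink of the weblog agent to the avro-collection-source of the hdfs agent. This results in the events coming from the external appserver source eventually being stored in HDFS.

Fan-out flow

As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan-out: replicating and multiplexing. In a replicating flow, the event is sent to all the configured channels. In the multiplexing case, the event is sent to only a subset of qualifying channels. To fan out the flow, you need to specify a list of channels for the source and the policy for fanning it out. This is done by adding a channel "selector" that can be replicating or multiplexing, and then specifying the selection rules if it is a multiplexer. If you do not specify a selector, it is replicating by default: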

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

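The multiplexing selector has a further set of properties to bifurcate the flow. This requires specifying a mapping from an event attribute to a set of channels. The selector checks for each configured attribute in the event header. If it matches the specified value, the event is sent to all the channels mapped to that value. If there is no match, the event is sent to the set of channels configured as the default: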

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...

<Agent>.sources.<Source1>.selector.default = <Channel2>

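The mapping allows overlapping channels for each value.

The following example has a single flow that is multiplexed to two paths. The agent named agent_foo has a single avro source and two channels linked to two sinks: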

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

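The selector checks for a header called "State". If the value is "CA", the event is sent to mem-channel-1; if it is "AZ", it goes to file-channel-2; if it is "NY", it goes to both. If the "State" header is not set or does not match any of the three values, the event goes to mem-channel-1, which is designated as "default".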
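
The routing rule just described amounts to a dictionary lookup with a fallback. The Python sketch below mirrors the "State" mapping from the configuration above; select_channels is a hypothetical name used for illustration and is not Flume's selector API.

```python
# Mapping and default taken from the agent_foo selector configuration above.
MAPPING = {
    "CA": ["mem-channel-1"],
    "AZ": ["file-channel-2"],
    "NY": ["mem-channel-1", "file-channel-2"],
}
DEFAULT = ["mem-channel-1"]


def select_channels(headers, header="State"):
    """Return the channels for an event, falling back to the default set."""
    return MAPPING.get(headers.get(header), DEFAULT)


print(select_channels({"State": "NY"}))  # ['mem-channel-1', 'file-channel-2']
print(select_channels({}))               # ['mem-channel-1']
```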

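The selector also supports optional channels. To specify optional channels for a header, the config parameter "optional" is used in the following way: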

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

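The selector will attempt to write to the required channels first and will fail the transaction if even one of these channels fails to consume the event. The transaction is then reattempted on all of the channels. Once all required channels have consumed the event, the selector will attempt to write to the optional channels. A failure by any of the optional channels to consume the event is simply ignored and not retried.

If there is an overlap between the optional channels and required channels for a specific header, the channel is considered to be required, and a failure in the channel will cause the entire set of required channels to be retried. For instance, in the example above, mem-channel-1 is marked both required and optional for the header value "CA"; it is considered a required channel, and a failure to write to it will cause that event to be retried on all channels configured for the selector.

Note that if a header does not have any required channels, the event will be written to the default channels and will be attempted to be written to the optional channels for that header. Specifying optional channels will still cause the event to be written to the default channels if no required channels are specified. If no channels are designated as default and there are no required channels, the selector will attempt to write the events to the optional channels. Any failures are simply ignored in that case.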
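
The required/optional rules above can be summarised in a short Python sketch. It is an illustration of the described behaviour under simplifying assumptions, not Flume's implementation: split_channels, write_event, and the put callback are hypothetical names, and retrying a failed transaction is left to the caller.

```python
def split_channels(value, mapping, optional, default):
    """Return (required, optional) channel lists for one header value."""
    # With no required mapping for this value, the default channels are used.
    required = mapping.get(value) or default
    # A channel listed as both required and optional counts as required.
    extra = [c for c in optional.get(value, []) if c not in required]
    return required, extra


def write_event(event, required, extra, put):
    """Write to required channels (failures propagate), then to optional ones."""
    for name in required:
        put(name, event)      # an exception here fails the whole transaction
    for name in extra:
        try:
            put(name, event)
        except Exception:
            pass              # optional channel failures are ignored, no retry


mapping = {"CA": ["mem-channel-1"], "AZ": ["file-channel-2"],
           "NY": ["mem-channel-1", "file-channel-2"]}
optional = {"CA": ["mem-channel-1", "file-channel-2"]}
default = ["mem-channel-1"]

print(split_channels("CA", mapping, optional, default))
# (['mem-channel-1'], ['file-channel-2'])
```

For the header value "CA", mem-channel-1 appears in both lists of the configuration, so the sketch treats it as required and leaves only file-channel-2 as optional.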