一次flume exec source采集日志到kafka因为单条日志数据非常大同步失败的踩坑带来的思考

最新推荐文章于 2023-10-15 14:29:27 发布

HuaZi_Myth

最新推荐文章于 2023-10-15 14:29:27 发布

阅读量4.4k

点赞数

本文链接：https://blog.csdn.net/HuaZi_Myth/article/details/102962159

版权

本次遇到的问题描述，日志采集同步时，当单条日志（日志文件中一行日志）超过2M大小，数据无法采集同步到 kafka，分析后，共踩到如下几个坑。 1、flume采集时，通过shell+EXEC（tail -F xxx.log 的方式） source来获取日志时，当单条日志过大超过1M时，source端无法从日志中获取到Event。 2、日志超过1M后，flume的kafka sink 作为生产者发送给日志给kafka失败，kafka无法收到消息。以下针对踩的这两个坑做分析，flume 我使用的是1.9.0版本。 kafka使用的是2.11-2.0.0版本

问题一、flume采集时，通过shell+EXEC（tail -F xxx.log 的方式） source来获取日志时，当单条日志过大超过1M时，source端无法从日志中获取到Event。flume的配置如下：

 ......
 agent.sources = seqGenSrc
 ......
 # For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = exec
#agent.sources.seqGenSrc.command = tail -F /opt/logs/test.log|grep businessCollection|awk -F '- {' '{print "{"$2}'
agent.sources.seqGenSrc.command = tail -F /opt/logs/test.log|grep businessCollection
agent.sources.seqGenSrc.shell = /bin/bash -c
agent.sources.seqGenSrc.batchSize = 1
agent.sources.seqGenSrc.batchTimeout = 90000
......

　　原因：采用shell+EXEC方式的时候，flume的源码中使用的是如下的方式来获取日志

    private Process process = null;
	//使用这种方式来执行命令。
process = Runtime.getRuntime().exec(commandArgs);
//读取日志
 reader = new BufferedReader(  new InputStreamReader(process.getInputStream(), charset));

在一行日志超过1M后，这个代码就假死了，一直宕住，导致无法获取到数据。针对这个问题处理方式：方式一:修改源码的实现方式。（1.9.0的源码对应的是源码中的flume-ng-core 项目中的org.apache.flume.source.ExecSource. java 这个类）

//process的采用如下方式获和执行命令，就改一行代码。增加.redirectErrorStream(true)后，输入流就都可以获取到，哪怕超过1M
process = new ProcessBuilder(commandArgs).redirectErrorStream(true).start();

修改完成后，重新打包编译，然后将生成的jar包替换原来老的jar包。

方式二：放弃EXECSource，使用TAILDIR Source。 使用这个source时，对应的配置如下：

 ......
 agent.sources = seqGenSrc
 ......
 # For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = TAILDIR
agent.sources.seqGenSrc.positionFile = ./taildir_position.json
agent.sources.seqGenSrc.filegroups = seqGenSrc
agent.sources.seqGenSrc.filegroups.seqGenSrc = /opt/logs/test.log
agent.sources.seqGenSrc.fileHeader = false
agent.sources.seqGenSrc.batchSize = 1
......

　　建议采用TAILDIR Source 比较好，这个可以对多个日志进行监控和采集，而且日志采集时会记录日志采集位置到positionFile 中，这样日志采集不会重复。EXEC SOURCE在重启采集时数据会重复采集，还需要其他的方式去避免重复采集

问题二、日志超过1M后，flume的kafka sink 作为生产者发送给日志给kafka失败，kafka无法收到消息
原因：kafka 在默认情况下，只能接收1M大小以内的消息，在没有做自定义设置时。所以单条消息大于1M后是无法处理的。
处理方式如下：

1）、修改kafka 服务端server.properties文件，做如下设置(修改大小限制)

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=502400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=502400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600
message.max.bytes=5242880
replica.fetch.max.bytes=6291456

2）、修改producer.properties，做如下设置(修改大小限制)

# the maximum size of a request in bytes
max.request.size= 9242880

3）、java代码中在初始化kafka 生产者时，也需要指定max.request.size= 9242880

        Properties properties = new Properties();
		...
		      properties.put("max.request.size", 5242880);
			  ...
			KafkaProducer<Object,Object>  kafkaProducer = new KafkaProducer<Object,Object>(properties);

　　4)、消费者在消费kafka数据时，也需要注意设置消费消息的大小限制

            Properties properties = new Properties();
			...
            properties.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 6291456);		
				...
				 KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

对于flume不了的同学，可以看flume 1.9中文版用户指南：https://www.h3399.cn/201906/700076.html　　

flume1.9 用户指南 (中文版)

概述

Apache Flume 是一个分布式, 可靠且可用的系统, 用于有效地从许多不同的 source 收集, 聚合和移动大量日志数据到集中式数据存储.

Apache Flume 的使用不仅限于日志数据聚合. 由于数据 source 是可定制的, 因此 Flume 可用于传输大量 event 数据, 包括但不限于网络流量数据, 社交媒体生成的数据, 电子邮件消息以及几乎任何可能的数据 source.

Apache Flume 是 Apache Software Foundation 的顶级项目.

系统要求

Java 运行时环境 - Java 1.8 或更高版本

内存 - 为 source,channel 或 sink 配置的内存

磁盘空间 - channel 或 sink 配置的磁盘空间

目录权限 - agent 使用的目录的读 / 写权限

架构

数据流模型

Flume event 被定义为具有字节有效负载和可选字符串属性集的数据流单元. Flume agent 是一个 ( JVM) 进程, 它承载 event 从外部 source 流向下一个目标 (跃点) 的组件.

www.wityx.com

Flume source 消耗由外部 source(如 Web 服务器)传递给它的 event . 外部 source 以目标 Flume source 识别的格式向 Flume 发送 event . 例如, Avro Flume source 可用于从 Avro 客户端或从 Avrosink 发送 event 的流中的其他 Flume agent 接收 Avroevent . 可以使用 Thrift Flume Source 定义类似的流程, 以接收来自 Thrift Sink 或 Flume Thrift Rpc 客户端或 Thrift 客户端的 event , 这些客户端使用 Flume thrift 协议生成的任何语言编写. 当 Flume source 接收 event 时, 它将其存储到一个或多个 channels . 该 channel 是一个被动存储器, 可以保持 event 直到它被 Flume sink 消耗. 文件 channel 就是一个例子 - 它由本地文件系统支持. sink 从 channel 中移除 event 并将其放入外部存储库 (如 HDFS(通过 Flume HDFS sink)) 或将其转发到流中下一个 Flume agent (下一跳)的 Flume source. 给定 agent 中的 source 和 sink 与 channel 中暂存的 event 异步运行.

复杂的流程

Flume 允许用户构建多跳流, 其中 event 在到达最终目的地之前经过多个 agent . 它还允许 fan-in 和 fan-out, 上下文路由和故障跳跃的备份路由(故障转移).

可靠性

event 在每个 agent 的 channel 中进行. 然后将 event 传递到流中的下一个 agent 或终端存储库(如 HDFS). 只有将 event 存储在下一个 agent 的 channel 或终端存储库中后, 才会从 channel 中删除这些 event . 这就是 Flume 中的单跳消息传递语义如何提供流的端到端可靠性.

Flume 使用事务方法来保证 event 的可靠传递. source 和 sink 分别在事务中封装由 channel 提供的事务中放置或提供的 event 的存储 / 检索. 这可确保 event 集在流中从一个点到另一个点可靠地传递. 在多跳流的情况下, 来自前一跳的 sink 和来自下一跳的 source 都运行其事务以确保数据安全地存储在下一跳的 channel 中.

可恢复性

event 在 channel 中进行, 该 channel 管理从故障中恢复. Flume 支持由本地文件系统支持的持久文件 channel. 还有一个内存 channel, 它只是将 event 存储在内存中的队列中, 这更快, 但是当 agent 进程死亡时仍然留在内存 channel 中的任何 event 都无法恢复.

设置

设置 agent

Flume agent 配置存储在本地配置文件中. 这是一个遵循 Java 属性文件格式的文本文件. 可以在同一配置文件中指定一个或多个 agent 的配置. 配置文件包括 agent 中每个 source,sink 和 channel 的属性以及它们如何连接在一起以形成数据流.

配置单个组件

流中的每个组件 (source,sink 或 channel) 都具有特定于类型和实例化的名称, 类型和属性集. 例如, Avrosource 需要主机名 (或 IP 地址) 和端口号来接收数据. 内存 channel 可以具有最大队列大小 ("容量"),HDFS sink 需要知道文件系统 URI, 创建文件的路径, 文件轮换频率("hdfs.rollInterval") 等. 组件的所有此类属性需要在托管 Flume agent 的属性文件中设置.

将各个部分连接在一起

agent 需要知道要加载哪些组件以及它们如何连接以构成流程. 这是通过列出 agent 中每个 source,sink 和 channel 的名称, 然后为每个 sink 和 source 指定连接 channel 来完成的. 例如, agent 通过名为 file-channel 的文件 channel 将 event 从名为 avroWeb 的 Avrosource 流向 HDFS sink hdfs-cluster1. 配置文件将包含这些组件的名称和文件 channel, 作为 avroWebsource 和 hdfs-cluster1 sink 的共享 channel.

启动 agent

使用名为 flume-ng 的 shell 脚本启动 agent 程序, 该脚本位于 Flume 发行版的 bin 目录中. 您需要在命令行上指定 agent 名称, config 目录和配置文件:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

现在, agent 将开始运行在给定属性文件中配置的 source 和 sink.

一个简单的例子

在这里, 我们给出一个示例配置文件, 描述单节点 Flume 部署. 此配置允许用户生成 event , 然后将其记录到控制台.

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

此配置定义名为 a1 的单个 agent .a1 有一个监听端口 44444 上的数据的 source, 一个缓冲内存中 event 数据的 channel, 以及一个将 event 数据记录到控制台的 sink. 配置文件命名各种组件, 然后描述其类型和配置参数. 给定的配置文件可能会定义几个命名的 agent 当一个给定的 Flume 进程启动时, 会传递一个标志, 告诉它要显示哪个命名 agent.

鉴于此配置文件, 我们可以按如下方式启动 Flume:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

请注意, 在完整部署中, 我们通常会包含一个选项: --conf=<conf-dir> . 所述 <conf-dir> 目录将包括一个 shell 脚本 f lume-env.sh 和潜在的一个 log4j 的属性文件. 在这个例子中, 我们传递一个 Java 选项来强制 Flume 登录到控制台, 我们没有自定义环境脚本.

从一个单独的终端, 我们可以 telnet 端口 44444 并向 Flume 发送一个 event :

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

原始的 Flume 终端将在日志消息中输出 event .

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: {
headers:{
} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D Hello world!.
}

恭喜 - 您已成功配置并部署了 Flume agent ! 后续部分更详细地介绍了 agent 配置.

在配置文件中使用环境变量

Flume 能够替换配置中的环境变量. 例如:

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${
NC_PORT
}
a1.sources.r1.channels = c1

注意: 它目前仅适用于 values , 不适用于 keys . (IE. only on the "right side" of the = mark of the config lines.)

通过设置 propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties, 可以通过 agent 程序调用上的 Java 系统属性启用此功能.

例如:

$ NC_PORT=44444 bin/flume-ng agent -conf conf -conf-file example.conf -name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties

请注意, 上面只是一个示例, 可以通过其他方式配置环境变量, 包括在 conf/flume-env.sh.

记录原始数据

在许多生产环境中记录流经摄取 pipeline 的原始数据流不是所希望的行为, 因为这可能导致泄漏敏感数据或安全相关配置 (例如密钥) 泄漏到 Flume 日志文件. 默认情况下, Flume 不会记录此类信息. 另一方面, 如果数据管道被破坏, Flume 将尝试提供调试 DEBUG 的线索.

调试 event 管道问题的一种方法是设置连接到 Logger Sink 的附加内存 channel, 它将所有 event 数据输出到 Flume 日志. 但是, 在某些情况下, 这种方法是不够的.

为了能够记录 event 和配置相关的数据, 除了 log4j 属性外, 还必须设置一些 Java 系统属性.

要启用与配置相关的日志记录, 请设置 Java 系统属性 - Dorg.apache.flume.log.printconfig=true . 这可以在命令行上传递, 也可以在 flume-env.sh 中的 JAVA_OPTS 变量中设置.

要启用数据记录, 请按照上述相同方式设置 Java 系统属性 -Dorg.apache.flume.log.rawdata=true . 对于大多数组件, 还必须将 log4j 日志记录级别设置为 DEBUG 或 TRACE, 以使特定于 event 的日志记录显示在 Flume 日志中.

下面是启用配置日志记录和原始数据日志记录的示例, 同时还将 Log4j 日志级别设置为 DEBUG 以用于控制台输出:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorgwdata=true

基于 Zookeeper 的配置

Flume 通过 Zookeeper 支持 agent 配置. 这是一个实验性功能. 配置文件需要在可配置前缀下的 Zookeeper 中上传. 配置文件存储在 Zookeeper 节点数据中. 以下是 agent 商 a1 和 a2 的 Zookeeper 节点树的外观

- /flume
|- /a1 [Agent config file]
|- /a2 [Agent config file]

上载配置文件后, 使用以下选项启动 agent

$ bin/flume-ng agent -conf conf -z zkhost:2181,zkhost1:2181 -p /flume -name a1 -Dflume.root.logger=INFO,console

Argument Name	Default	Description
z	–	Zookeeper 连接字符串。以逗号分隔的主机名列表：port
p	/flume	Zookeeper 中的基本路径，用于存储 agent 配置

Flume 拥有完全基于插件的架构. 虽然 Flume 附带了许多开箱即用的 source,channels,sink,serializers 等, 但许多实现都与 Flume 分开运行. 安装第三方插件

虽然通过将自己的 jar 包添加到 flume-env.sh 文件中的 FLUME_CLASSPATH 变量中, 始终可以包含自定义 Flume 组件, 但 Flume 现在支持一个名为 plugins.d 的特殊目录, 该目录会自动获取以特定格式打包的插件. 这样可以更轻松地管理插件打包问题, 以及更简单的调试和几类问题的故障排除, 尤其是库依赖性冲突.

该 plugins.d 目录位于 $FLUME_HOME/plugins.d . 在启动时, flume-ng 启动脚本在 plugins.d 目录中查找符合以下格式的插件, 并在启动 java 时将它们包含在正确的路径中.

插件的目录布局

plugins.d 中的每个插件 (子目录) 最多可以有三个子目录:

lib - the plugin's jar(s)
libext - the plugin's dependency jar(s)
native - any required native libraries, such as .so files

plugins.d 目录中的两个插件示例:

plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so

数据摄取

Flume 支持许多从外部来 source 摄取数据的机制.

RPC

Flume 发行版中包含的 Avro 客户端可以使用 avro RPC 机制将给定文件发送到 Flume Avrosource:

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

上面的命令会将 /usr/logs/log.10 的内容发送到监听该端口的 Flume source.

执行命令

有一个 exec source 执行给定的命令并消耗输出. 输出的单 "行" 即. 文本后跟回车符 ('\ r') 或换行符 ('\ n') 或两者一起.

网络流

Flume 支持以下机制从常用日志流类型中读取数据, 例如:

Avro
Thrift
Syslog
Netcat

设置多 agent 流程

www.wityx.com

为了跨多个 agent 或跳数据流, 先前 agent 的 sink 和当前跳的 source 需要是 avro 类型, sink 指向 source 的主机名 (或 IP 地址) 和端口.

合并

日志收集中非常常见的情况是大量日志生成客户端将数据发送到连接到存储子系统的少数消费者 agent . 例如, 从数百个 Web 服务器收集的日志发送给写入 HDFS 集群的十几个 agent .

www.wityx.com

这可以通过使用 avrosink 配置多个第一层 agent 在 Flume 中实现, 所有这些 agent 都指向单个 agent 的 avrosource(同样, 您可以在这种情况下使用 thriftsource/sink / 客户端). 第二层 agent 上的此 source 将接收的 event 合并到单个信道中, 该信道由信宿器消耗到其最终目的地.

多路复用流程

Flume 支持将 event 流多路复用到一个或多个目的地. 这是通过定义可以复制或选择性地将 event 路由到一个或多个信道的流复用器来实现的.

www.wityx.com

上面的例子显示了来自 agent "foo" 的 source 代码将流程扩展到三个不同的 channel. 扇出可以复制或多路复用. 在复制流的情况下, 每个 event 被发送到所有三个 channel. 对于多路复用情况, 当 event 的属性与预配置的值匹配时, event 将被传递到可用 channel 的子集. 例如, 如果一个名为 "txnType" 的 event 属性设置为 "customer", 那么它应该转到 channel1 和 channel3, 如果它是 "vendor", 那么它应该转到 channel2, 否则转到 channel3. 可以在 agent 的配置文件中设置映射.

配置

如前面部分所述, Flume agent 程序配置是从类似于具有分层属性设置的 Java 属性文件格式的文件中读取的.

定义流程

要在单个 agent 中定义流, 您需要通过 channel 链接 source 和 sink. 您需要列出给定 agent 的 source,sink 和 channel, 然后将 source 和 sink 指向 channel.source 实例可以指定多个 channel, 但 sink 实例只能指定一个 channel. 格式如下:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>
# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

例如, 名为 agent_foo 的 agent 正在从外部 avro 客户端读取数据并通过内存 channel 将其发送到 HDFS.

配置文件 weblog.config 可能如下所示:

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1
# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1
# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

这将使 event 从 avro-AppSrv-source 流向 hdfs-Cluster1-sink, 通过内存 channelmem-channel-1.

当使用 weblog.config 作为其配置文件启动 agent 程序时, 它将实例化该流程.

配置单个组件

定义流后, 您需要设置每个 source,sink 和 channel 的属性. 这是以相同的分层命名空间方式完成的, 您可以在其中设置组件类型以及特定于每个组件的属性的其他值:

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>
# properties for channels
<Agent>.channel.<Channel>.<someProperty> = <someValue>
# properties for sinks
<Agent>.sources.<Sink>.<someProperty> = <someValue>

需要为 Flume 的每个组件设置属性 "type", 以了解它需要什么类型的对象. 每个 source,sink 和 channel 类型都有自己的一组属性, 使其能够按预期运行. 所有这些都需要根据需要进行设置. 在前面的示例中, 我们有一个从 avro-AppSrv-source 到 hdfs-Cluster1-sink 的流程通过内存 channelmem-channel-1. 这是一个显示每个组件配置的示例:

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1
# set channel for sources, sinks
# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000
# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100
# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata
#...

在 agent 中添加多个流

单个 Flume agent 可以包含多个独立流. 您可以在配置中列出多个 source,sink 和 channel. 可以链接这些组件以形成多个流:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

然后, 您可以将 source 和 sink 链接到 channel(用于 sink)的相应 channel(用于 source), 以设置两个不同的流. 例如, 如果您需要在 agent 中设置两个流, 一个从外部 avro 客户端到外部 HDFS, 另一个从尾部输出到 avrosink, 那么这是一个配置来执行此操作:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2
# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

配置多 agent 流程

要设置多层流, 您需要有 sink 指向下一跳的 avro/thrift source. 这将导致第一个 Flume agent 将 event 转发到下一个 Flume agent . 例如, 如果您使用 avro 客户端定期向本地 Flume agent 发送文件(每个 event 1 个文件), 则此本地 agent 可以将其转发到另一个已安装存储的 agent

Weblog agent 配置:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel
# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel
# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000
# configure other pieces
#...

HDFS agent 配置:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel
# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel
# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000
# configure other pieces
#...

在这里, 我们将 weblog agent 的 avro-forward-sink 链接到 hdfs agent 的 avro-collection-source. 这将导致来自外部应用程序服务器 source 的 event 最终存储在 HDFS 中.

扇出流量

如前一节所述, Flume 支持扇出从一个 source 到多个 channel 的流量. 扇出有两种模式 : 复制和多路复用. 在复制流程中, event 将发送到所有已配置的 channel. 在多路复用的情况下, event 仅被发送到合格 channels 的子集. 为了散开流量, 需要指定 source 的 channel 列表以及扇出它的策略. 这是通过添加可以复制或多路复用的 channel"选择器" 来完成的. 如果它是多路复用器, 则进一步指定选择规则. 如果您没有指定选择器, 那么默认情况下它会复制:

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>
# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>
<Agent>.sources.<Source1>.selector.type = replicating

多路复用选择具有另一组属性以分流流. 这需要指定 event 属性到 channel 集的映射. 选择器检查 event 头中的每个已配置属性. 如果它与指定的值匹配, 则该 event 将发送到映射到该值的所有 channel. 如果没有匹配项, 则将 event 发送到配置为默认值的 channel 集:

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
<Agent>.sources.<Source1>.selector.default = <Channel2>

映射允许为每个值重叠 channel.

以下示例具有多路复用到两个路径的单个流. 名为 agent_foo 的 agent 具有单个 avrosource 和两个链接到两个 sink 的 channel:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2
# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2
# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2
# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

选择器检查名为 "State" 的标头. 如果该值为 "CA", 则将其发送到 mem-channel-1, 如果其为 "AZ", 则将其发送到文件 channel-2, 或者如果其为 "NY" 则为两者. 如果 "状态" 标题未设置或与三者中的任何一个都不匹配, 则它将转到 mem-channel-1, 其被指定为 "default".

选择器还支持可选 channel. 要为标头指定可选 channel, 可通过以下方式使用 config 参数 "optional":

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

选择器将首先尝试写入所需的 channel, 如果其中一个 channel 无法使用 event , 则会使事务失败. 在所有渠道上重新尝试交易. 一旦所有必需的 channel 消耗了 event , 则选择器将尝试写入可选 channel. 任何可选 channel 使用该 event 的失败都会被忽略而不会重试.

如果可选信道与特定报头的所需信道之间存在重叠, 则认为该信道是必需的, 并且信道中的故障将导致重试所有必需信道集. 例如, 在上面的示例中, 对于标题 "CA",mem-channel-1 被认为是必需的 channel, 即使它被标记为必需和可选, 并且写入此 channel 的失败将导致该 event 在为选择器配置的所有 channel 上重试.

请注意, 如果标头没有任何所需的 channel, 则该 event 将被写入默认 channel, 并将尝试写入该标头的可选 channel. 如果未指定所需的 channel, 则指定可选 channel 仍会将 event 写入默认 channel. 如果没有将 channel 指定为默认 channel 且没有必需 channel, 则选择器将尝试将 event 写入可选 channel. 在这种情况下, 任何失败都会被忽略.

支持

多个 Flume 组件支持 SSL / TLS 协议, 以便安全地与其他系统通信.

Component	SSL server or client
Avro Source	server
Avro Sink	client
Thrift Source	server
Thrift Sink	client
Kafka Source	client
Kafka Channel	client
Kafka Sink	client
HTTP Source	server
JMS Source	client
Syslog TCP Source	server
Multiport Syslog TCP Source	server

SSL 兼容组件具有若干配置参数来设置 SSL, 例如启用 SSL 标志, 密钥库 / 信任库参数 (位置, 密码, 类型) 和其他 SSL 参数(例如禁用的协议)

始终在 agent 配置文件的组件级别指定为组件启用 SSL. 因此, 某些组件可能配置为使用 SSL, 而其他组件则不配置(即使具有相同的组件类型)

密钥库 / 信任库设置可以在组件级别或全局指定.

在组件级别设置的情况下, 通过组件特定参数在 agent 配置文件中配置密钥库 / 信任库. 此方法的优点是组件可以使用不同的密钥库(如果需要). 缺点是必须为 agent 配置文件中的每个组件复制密钥库参数. 组件级别设置是可选的, 但如果已定义, 则其优先级高于全局参数.

使用全局设置, 只需定义一次密钥库 / 信任库参数, 并对所有组件使用相同的设置, 这意味着更少和更集中的配置.

可以通过系统属性或通过环境变量来配置全局设置.

系统属性	环境变量	描述
javax.net.ssl.keyStore	FLUME_SSL_KEYSTORE_PATH	密钥库位置
javax.net.ssl.keyStorePassword	FLUME_SSL_KEYSTORE_PASSWORD	密钥库密码
javax.net.ssl.keyStoreType	FLUME_SSL_KEYSTORE_TYPE	密钥库类型（默认为 JKS）
javax.net.ssl.trustStore	FLUME_SSL_TRUSTSTORE_PATH	信任库位置
javax.net.ssl.trustStorePassword	FLUME_SSL_TRUSTSTORE_PASSWORD	信任库密码
javax.net.ssl.trustStoreType	FLUME_SSL_TRUSTSTORE_TYPE	信任库类型（默认为 JKS）
flume.ssl.include.protocols	FLUME_SSL_INCLUDE_PROTOCOLS	计算启用的协议时要包括的协议。逗号（，）分隔列表。如果提供，排除的协议将从此列表中排除。
flume.ssl.exclude.protocols	FLUME_SSL_EXCLUDE_PROTOCOLS	计算启用的协议时要排除的协议。逗号（，）分隔列表。
flume.ssl.include.cipherSuites	FLUME_SSL_INCLUDE_CIPHERSUITES	在计算启用的密码套件时包含的密码套件。逗号（，）分隔列表。如果提供，排除的密码套件将被排除在此列表之外。
flume.ssl.exclude.cipherSuites	FLUME_SSL_EXCLUDE_CIPHERSUITES	在计算启用的密码套件时要排除的密码套件。逗号（，）分隔列表。

可以在命令行上传递 SSL 系统属性, 也可以在 conf / flume-env.sh 中设置 JAVA_OPTS 环境变量(尽管使用命令行是不可取的, 因为包含密码的命令将保存到命令历史记录中.)

export JAVA_OPTS="$JAVA_OPTS -Djavax.net.ssl.keyStore=/path/to/keystore.jks"
export JAVA_OPTS="$JAVA_OPTS -Djavax.net.ssl.keyStorePassword=password"

Flume 使用 JSSE(Java 安全套接字扩展)中定义的系统属性, 因此这是设置 SSL 的标准方法. 另一方面, 在系统属性中指定密码意味着可以在进程列表中看到密码. 对于不可接受的情况, 也可以在环境变量中定义参数. 在这种情况下, Flume 在内部从相应的环境变量初始化 JSSE 系统属性.

SSL 环境变量可以在启动 Flume 之前在 shell 环境中设置, 也可以在 conf / flume-env.sh 中设置(尽管使用命令行是不可取的, 因为包含密码的命令将保存到命令历史记录中.)

export FLUME_SSL_KEYSTORE_PATH=/path/to/keystore.jks
export FLUME_SSL_KEYSTORE_PASSWORD=password

** 请注意:**

必须在组件级别启用 SSL. 仅指定全局 SSL 参数不会产生任何影响.

如果在多个级别指定全局 SSL 参数, 则优先级如下(从高到低):

agent 配置中的组件参数

系统属性

环境变量

如果为组件启用了 SSL, 但未以上述任何方式指定 SSL 参数, 则

在密钥库的情况下: 配置错误

在 truststores 的情况下: 将使用默认信任库(Oracle JDK 中的 jssecacerts / cacerts)

在所有情况下, 可信任密码都是可选的. 如果未指定, 则在 JDK 打开信任库时, 不会对信任库执行完整性检查.

source 和接收批量大小和 channel 事务容量

source 和 sink 可以具有批量大小参数, 该参数确定它们在一个批次中处理的最大 event 数. 这发生在具有称为事务容量的上限的 channel 事务中. 批量大小必须小于渠道的交易容量. 有一个明确的检查, 以防止不兼容的设置. 只要读取配置, 就会进行此检查.

Flume Source
Avro Source

监听 Avro 端口并从外部 Avro 客户端流接收 event . 当与另一个(上一跳)Flume agent 上的内置 Avro Sink 配对时, 它可以创建分层集合拓扑. 必需属性以粗体显示

属性名称	默认	描述
channels	-
type	-	组件类型名称，需要是 avro
bind	-	要侦听的主机名或 IP 地址
port	-	要绑定的端口号
threads	-	生成的最大工作线程数
selector.type
selector.*
interceptors	-	以空格分隔的拦截器列表
interceptors.*
compression-type	none	这可以是 “none” 或“deflate”。压缩类型必须与匹配 AvroSource 的压缩类型匹配
SSL	false	将其设置为 true 以启用 SSL 加密。如果启用了 SSL，则还必须通过组件级参数（请参阅下文）或全局 SSL 参数（请参阅 SSL / TLS 支持部分）指定 “密钥库” 和“密钥库密码” 。
keysore	-	这是 Java 密钥库文件的路径。如果未在此处指定，则将使用全局密钥库（如果已定义，则配置错误）。
keystore-password	-	Java 密钥库的密码。如果未在此处指定，则将使用全局密钥库密码（如果已定义，则配置错误）。
keystore-type	JKS	Java 密钥库的类型。这可以是 “JKS” 或“PKCS12”。如果未在此处指定，则将使用全局密钥库类型（如果已定义，则默认为 JKS）。
exclude-protocols	SSLv3	要排除的以空格分隔的 SSL / TLS 协议列表。除指定的协议外，将始终排除 SSLv3。
include-protocols	-	要包含的以空格分隔的 SSL / TLS 协议列表。启用的协议将是包含的协议，没有排除的协议。如果包含协议为空，则它包括每个支持的协议。
exclude-cipher-suites	-	要排除的以空格分隔的密码套件列表。
include-cipher-suites	-	以空格分隔的密码套件列表。启用的密码套件将是包含的密码套件，不包括排除的密码套件。如果 included-cipher-suites 为空，则包含每个支持的密码套件。
ipFilter	false	将此设置为 true 以启用 ipFiltering for netty
ipFilterRules	-	使用此配置定义 N netty ipFilter 模式规则。

agent 名为 a1 的示例:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

ipFilterRules 的示例

ipFilterRules 定义由逗号分隔的 N 个 netty ipFilters 模式规则必须采用此格式.

<'allow' or deny>:<'ip' or 'name' for computer name>:<pattern> or allow/deny:ip/name:pattern
example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*

请注意, 匹配的第一个规则将适用, 如下例所示, 来自 localhost 上的客户端

这将允许 localhost 上的客户端拒绝来自任何其他 ip 的客户端 "allow:name:localhost,deny:ip: 这将拒绝 localhost 上的客户端允许来自任何其他 ip 的客户端"deny:name:localhost,allow:ip:

Thrift Source

侦听 Thrift 端口并从外部 Thrift 客户端流接收 event . 当与另一个(上一跳)Flume agent 上的内置 ThriftSink 配对时, 它可以创建分层集合拓扑. 可以通过启用 kerberos 身份验证将 Thriftsource 配置为以安全模式启动. agent-principal 和 agent-keytab 是 Thriftsource 用于向 kerberos KDC 进行身份验证的属性. 必需属性以粗体显示

属性名称	默认	描述
channels	-
type	-	组件类型名称，需要节俭
bind	-	要侦听的主机名或 IP 地址
port	-	要绑定的端口号
threads	-	生成的最大工作线程数
selector.type
selector.*
interceptors	-	空格分隔的拦截器列表
interceptors.*
SSL	false	将其设置为 true 以启用 SSL 加密。如果启用了 SSL，则还必须通过组件级参数（请参阅下文）或全局 SSL 参数（请参阅 SSL / TLS 支持部分）指定 “密钥库” 和“密钥库密码”。
keystore	-	这是 Java 密钥库文件的路径。如果未在此处指定，则将使用全局密钥库（如果已定义，则配置错误）。
keystore-password	-	Java 密钥库的密码。如果未在此处指定，则将使用全局密钥库密码（如果已定义，则配置错误）。
keystore-type	JKS	Java 密钥库的类型。这可以是 “JKS” 或“PKCS12”。如果未在此处指定，则将使用全局密钥库类型（如果已定义，则默认为 JKS）。
exclude-protocols		要排除的以空格分隔的 SSL / TLS 协议列表。除指定的协议外，将始终排除 SSLv3。
include-protocols	-	要包含的以空格分隔的 SSL / TLS 协议列表。启用的协议将是包含的协议，没有排除的协议。如果包含协议为空，则它包括每个支持的协议。
exclude-cipher-suites	-	要排除的以空格分隔的密码套件列表。
include-cipher-suites	-	以空格分隔的密码套件列表。启用的密码套件将是包含的密码套件，不包括排除的密码套件。
kerberos		设置为 true 以启用 kerberos 身份验证。在 kerberos 模式下，成功进行身份验证需要 agent-principal 和 agent-keytab。安全模式下的 Thriftsource 仅接受来自已启用 kerberos 且已成功通过 kerberos KDC 验证的 Thrift 客户端的连接。
agent-principal	-	Thrift Source 使用的 kerberos 主体对 kerberos KDC 进行身份验证。
agent-keytab	-	Thrift Source 与 agent 主体结合使用的 keytab 位置，用于对 kerberos KDC 进行身份验证。

agent 名为 a1 的示例:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
Exec Source

Exec source 在启动时运行给定的 Unix 命令, 并期望该进程在标准输出上连续生成数据 (stderr 被简单地丢弃, 除非属性 logStdErr 设置为 true). 如果进程因任何原因退出, 则 source 也会退出并且不会生成其他数据. 这意味着 cat [named pipe] 或 tail -F [file] 等配置将产生所需的结果, 而日期可能不会 - 前两个命令产生数据流, 而后者产生单个 event 并退出. 必需属性以粗体显示

属性名称	默认	描述
channels	-
type	-	组件类型名称，需要是 exec
command	-	要执行的命令
shell	-	用于运行命令的 shell 调用。例如 /bin/sh -c 。仅适用于依

最低0.47元/天解锁文章

HuaZi_Myth

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
一次flume exec source采集日志到kafka因为单条日志数据非常大同步失败的踩坑带来的思考

本次遇到的问题描述，日志采集同步时，当单条日志（日志文件中一行日志）超过2M大小，数据无法采集同步到kafka，分析后，共踩到如下几个坑。 1、flume采集时，通过shell+EXEC（tail -F xxx.log 的方式） source来获取日志时，当单条日志过大超过1M时，source端无法从日志中获取到Event。 2、日志超过1M后，flume的kafka sink 作为生产者发送给...
复制链接

扫一扫