Explaining the semantics of a Flume configuration file

Latest recommended article published on 2024-07-29 09:10:44

Miriam_Taylor

Article tags: flume, big data

Article link: https://blog.csdn.net/m0_64359381/article/details/140481915


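In my earlier post on using Flume under Windows to upload a CSV data file to Kafka, I mentioned a custom Flume configuration file that reads CSV data and uploads it to Kafka.

I did not explain the statements in that file or what they mean, so this post walks through them in detail.

The original file, example.conf, is as follows: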

# Define the source, sink, and channel
a1.sources = src
a1.sinks = k1
a1.channels = c1

# Describe the source
a1.sources.src.type = spooldir
a1.sources.src.spoolDir = E:\\apache-flume-1.11.0-bin
a1.sources.src.fileHeader = false
a1.sources.src.includePattern = work12.csv
a1.sources.src.deserializer = LINE
a1.sources.src.deserializer.maxLineLength = 10000

# Define the interceptor with the correct type
a1.sources.src.interceptors = head_filter
a1.sources.src.interceptors.head_filter.type = regex_filter
a1.sources.src.interceptors.head_filter.regex = ^bus_id*
a1.sources.src.interceptors.head_filter.excludeEvents = true

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.flumeBatchSize = 500
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.batch.size = 1048576
a1.sinks.k1.kafka.consumer.session.timeout.ms = 30000
a1.sinks.k1.kafka.consumer.heartbeat.interval.ms = 10000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 5000

# Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channel = c1

This Flume configuration file defines a simple data flow: a source (spooldir) reads data, a channel (memory channel) buffers it, and a sink sends it to a Kafka topic.

Section by section:
### Define the source, channel, and sink
a1.sources = src
a1.sinks = k1
a1.channels = c1
This block declares the components of an agent named `a1`: one source (`src`), one sink (`k1`), and one channel (`c1`).
### Describe the source
a1.sources.src.type = spooldir
a1.sources.src.spoolDir = E:\\apache-flume-1.11.0-bin
a1.sources.src.fileHeader = false
a1.sources.src.includePattern = work12.csv
a1.sources.src.deserializer = LINE
a1.sources.src.deserializer.maxLineLength = 10000
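These lines configure the `src` source:
`type`: sets the source type to `spooldir`, which reads files placed in a directory.
`spoolDir`: the directory to monitor (here `E:\apache-flume-1.11.0-bin`; the backslash is doubled in the file because a single backslash is an escape character in a properties file).
`fileHeader`: whether to add a header carrying the file's path to each event (here `false`).
`includePattern`: a regular expression selecting which file names to process (here `work12.csv`).
`deserializer`: how the file content is split into events (here `LINE`, which produces one event per line of text).
`deserializer.maxLineLength`: the maximum number of characters in a single event (here 10000, counted in characters rather than bytes); a longer line is truncated and the remainder appears in a subsequent event.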
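
Because `includePattern` is a regular expression, the unescaped `.` in `work12.csv` matches any single character, not only a literal dot. The sketch below illustrates this, assuming the whole file name has to match the pattern; Python's `re.fullmatch` is used as a stand-in for Flume's Java matching, and the helper `is_included` and the extra file names are hypothetical.

```python
import re

def is_included(file_name: str, include_pattern: str) -> bool:
    """Return True if the whole file name matches the pattern (assumed semantics)."""
    return re.fullmatch(include_pattern, file_name) is not None

# Pattern from the config: the unescaped "." matches any single character.
loose = "work12.csv"
# Stricter variant with an escaped dot (a suggestion, not part of the original file).
strict = r"work12\.csv"

# Hypothetical file names, used only to compare the two patterns.
for name in ["work12.csv", "work12xcsv", "work123.csv"]:
    print(name, is_included(name, loose), is_included(name, strict))
```

If the stricter pattern is wanted, the backslash itself would have to be doubled inside the properties file (`work12\\.csv`), for the same reason the path in `spoolDir` uses `\\`.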
### Define the interceptor
a1.sources.src.interceptors = head_filter
a1.sources.src.interceptors.head_filter.type = regex_filter
a1.sources.src.interceptors.head_filter.regex = ^bus_id*
a1.sources.src.interceptors.head_filter.excludeEvents = true
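This block defines an interceptor named `head_filter`:
`type`: sets the interceptor type to `regex_filter`, which filters events with a regular expression.
`regex`: the regular expression matched against the event body (here `^bus_id*`). The `*` applies only to the final `d`, so the pattern matches any line that begins with `bus_i`; `^bus_id` would express the intent more precisely.
`excludeEvents`: whether matching events are dropped (here `true`, so matching events are excluded). As the name `head_filter` suggests, the purpose is presumably to drop the CSV header row that begins with `bus_id`.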
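
To see what `^bus_id*` actually filters, here is a minimal sketch that mimics the interceptor in Python. It assumes the pattern is searched within the event body rather than required to match it in full; the function `keep_event` and the sample CSV rows are made up for illustration and are not taken from the original data.

```python
import re

# Regex copied from the config. "d*" means zero or more "d" characters,
# so this matches any line that starts with "bus_i".
HEADER_REGEX = re.compile(r"^bus_id*")

def keep_event(body: str, exclude_events: bool = True) -> bool:
    """Decide whether an event passes the filter (sketch, not Flume's code)."""
    matched = HEADER_REGEX.search(body) is not None
    # excludeEvents = true drops matching events; false keeps only matching ones.
    return not matched if exclude_events else matched

# Hypothetical rows: a header, a data row, and a row that merely starts with "bus_i".
lines = [
    "bus_id,route,timestamp",
    "1001,12,2024-07-01 08:00:00",
    "bus_info,unexpected",
]
kept = [line for line in lines if keep_event(line)]
print(kept)
```

Only the data row survives: the header is dropped as intended, but so is the `bus_info` row, which is the side effect of the trailing `*`.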
### Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.flumeBatchSize = 500
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.batch.size = 1048576
a1.sinks.k1.kafka.consumer.session.timeout.ms = 30000
a1.sinks.k1.kafka.consumer.heartbeat.interval.ms = 10000
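These lines configure the `k1` sink:
`type`: sets the sink type to `org.apache.flume.sink.kafka.KafkaSink`, which sends events to Kafka.
`kafka.bootstrap.servers`: the address of the Kafka broker or brokers to connect to (here `localhost:9092`).
`kafka.topic`: the name of the target Kafka topic (here `mytopic`).
`kafka.flumeBatchSize`: intended as the number of events processed per batch (here 500). The Flume user guide lists this property as `flumeBatchSize`, without the `kafka.` prefix, so as written this line is probably ignored and the default batch size applies.
`kafka.producer.acks`: the producer acknowledgment level (here 1, meaning a write is acknowledged once the partition leader has written it, without waiting for followers).
`kafka.producer.linger.ms`: how long the producer waits for more records to accumulate before sending a batch (here 1 millisecond).
`kafka.producer.batch.size`: the upper bound, in bytes, on a batch of records sent to one partition (here 1048576 bytes, i.e. 1 MiB).
`kafka.consumer.session.timeout.ms`: the consumer session timeout (30000 milliseconds). A Kafka sink acts as a producer, so this consumer setting most likely has no effect here; it belongs to a Kafka source.
`kafka.consumer.heartbeat.interval.ms`: the consumer heartbeat interval (10000 milliseconds). Like the previous line, it most likely has no effect on a sink.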
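
The property names follow a prefix convention: keys under `kafka.producer.` are the ones handed on to the Kafka producer. The following sketch only illustrates that convention by sorting the sink's keys into two groups; it is not Flume's own parsing code, and `split_sink_properties` is a hypothetical helper.

```python
SINK_PREFIX = "a1.sinks.k1."
PRODUCER_PREFIX = "kafka.producer."

# Sink section copied from example.conf.
config_lines = [
    "a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink",
    "a1.sinks.k1.kafka.bootstrap.servers = localhost:9092",
    "a1.sinks.k1.kafka.topic = mytopic",
    "a1.sinks.k1.kafka.flumeBatchSize = 500",
    "a1.sinks.k1.kafka.producer.acks = 1",
    "a1.sinks.k1.kafka.producer.linger.ms = 1",
    "a1.sinks.k1.kafka.producer.batch.size = 1048576",
    "a1.sinks.k1.kafka.consumer.session.timeout.ms = 30000",
    "a1.sinks.k1.kafka.consumer.heartbeat.interval.ms = 10000",
]

def split_sink_properties(lines):
    """Separate kafka.producer.* keys from all other sink keys."""
    producer, other = {}, {}
    for line in lines:
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not key.startswith(SINK_PREFIX):
            continue
        short = key[len(SINK_PREFIX):]
        if short.startswith(PRODUCER_PREFIX):
            # Strip the prefix to get the plain Kafka producer setting name.
            producer[short[len(PRODUCER_PREFIX):]] = value
        else:
            other[short] = value
    return producer, other

producer_props, other_props = split_sink_properties(config_lines)
print("producer settings:", producer_props)
print("other sink keys:", sorted(other_props))
```

The two `kafka.consumer.` keys land in the second group, which is a quick way to spot settings that do not reach the producer.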
### Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 5000
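These lines configure the `c1` channel:
`type`: sets the channel type to `memory`, so events are buffered in memory.
`capacity`: the maximum number of events the channel can hold (here 10000).
`transactionCapacity`: the maximum number of events the channel accepts from a source or hands to a sink in a single transaction (here 5000); it must not exceed `capacity`.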
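
The sizes in this file have to fit inside one another: a sink batch must fit in a transaction, and a transaction must fit in the channel. A minimal sanity check along those lines might look like the sketch below; `check_memory_channel` is a hypothetical helper, and the second rule rests on the assumption that a sink cannot take more events in one transaction than `transactionCapacity` allows.

```python
from typing import List

def check_memory_channel(capacity: int,
                         transaction_capacity: int,
                         sink_batch_size: int) -> List[str]:
    """Return a list of sizing problems (empty if the values are consistent)."""
    problems = []
    if transaction_capacity > capacity:
        problems.append("transactionCapacity exceeds capacity")
    if sink_batch_size > transaction_capacity:
        problems.append("sink batch size exceeds transactionCapacity")
    return problems

# Values from example.conf: capacity 10000, transactionCapacity 5000, batch 500.
print(check_memory_channel(10000, 5000, 500))
# A deliberately inconsistent combination, for comparison.
print(check_memory_channel(1000, 5000, 500))
```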
### Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channel = c1
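These lines bind the `src` source and the `k1` sink to the `c1` channel, so that data flows from the source through the channel to the sink. Note the property names: a source takes `channels` (plural), while a sink takes `channel` (singular).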
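
A common mistake in this part of the file is to reference a channel that was never declared. As a rough illustration, the sketch below parses the wiring lines and reports such channels. The parser is deliberately simplistic (it ignores properties-file escapes and line continuations), and both `parse_properties` and `undeclared_channels` are hypothetical helpers rather than anything Flume provides.

```python
def parse_properties(text):
    """Parse simple key = value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def undeclared_channels(props, agent="a1"):
    """Return channels referenced by sources or sinks but not declared."""
    declared = set(props.get(f"{agent}.channels", "").split())
    referenced = set()
    for name in props.get(f"{agent}.sources", "").split():
        # A source may list several channels, separated by spaces.
        referenced.update(props.get(f"{agent}.sources.{name}.channels", "").split())
    for name in props.get(f"{agent}.sinks", "").split():
        # A sink reads from exactly one channel.
        channel = props.get(f"{agent}.sinks.{name}.channel")
        if channel:
            referenced.add(channel)
    return referenced - declared

wiring = """
a1.sources = src
a1.sinks = k1
a1.channels = c1
# Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channel = c1
"""
print(undeclared_channels(parse_properties(wiring)))
```

For the wiring in example.conf the result is an empty set; changing the last line to reference `c2` would report `c2` as undeclared.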

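To summarize, the Flume agent monitors the directory `E:\apache-flume-1.11.0-bin`, reads files whose names match the pattern `work12.csv`, drops unwanted events with the interceptor, and sends the remaining events to the Kafka topic `mytopic`. While the agent is running, the memory channel buffers events between the source and the sink; this is fast, but events still in the channel are lost if the agent process stops.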
