Flume
1. Introduction
- Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and moving large volumes of log data. Flume is built on a streaming architecture and is flexible and simple
- Agent
- An Agent is a JVM process that delivers data from a source to a destination in the form of events.
- An Agent is made up of three main components: Source, Channel, and Sink.
- Source
- The component responsible for receiving data into the Flume Agent. A Source can handle log data of many types and formats
- Sink
- A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or sends them on to another Flume Agent
- Channel
- A buffer that sits between the Source and the Sink, which allows the Source and Sink to operate at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time
- Flume ships with three Channels: Memory Channel, File Channel, and Kafka Channel
- Memory Channel is an in-memory queue. It is suitable when data loss is acceptable; if data loss matters, it should not be used, because a crashed process, a machine failure, or a restart will lose the data
- File Channel writes all events to disk, so no data is lost if the process exits or the machine goes down
2. Quick Start
Installation
- Step 1: Upload apache-flume-1.7.0-bin.tar.gz to the Linux host and extract it
- Step 2: Rename flume-env.sh.template under flume/conf to flume-env.sh and configure the JDK install path
export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281  # adjust to your own JDK install path
- Step 3: Install the netcat tool
yum install -y nc
3. Configuration and Basic Usage
Configuration
Official documentation: http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html
Source configuration (commonly used)
- netcat
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#netcat-source
- exec
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#exec-source
- spooling directory
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#spooling-directory-source
- taildir
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#taildir-source
- avro
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#avro-source
- kafka
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#kafka-source
Sink configuration (commonly used)
- logger
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#logger-sink
- hdfs
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#hdfs-sink
- avro
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#avro-sink
- file roll
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#file-roll-sink
- HBase
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#hbasesinks
- hive
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#hive-sink
- kafka
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#kafka-sink
Channel configuration (commonly used)
- Memory channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#memory-channel
- File channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#file-channel
- JDBC channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#jdbc-channel
- Kafka channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#kafka-channel
Usage
Monitoring port data and printing to the console (netcat-memory-logger)
source:netcat ------------- channel:memory ------------- sink:logger
- Step 1: In the job directory, create the Flume Agent configuration file netcat-memory-logger.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
a1.sinks.k1.type = logger

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Start Flume listening on the port
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example (-Dflume.root.logger=INFO,console sends Flume's own log output to the console):
bin/flume-ng agent -n a1 -c conf/ -f job/netcat-memory-logger.conf -Dflume.root.logger=INFO,console
- Step 3: Use the netcat tool to send data to the local host
nc <source host> <source port>
# Example
nc hadoop151 44444
Tailing a single appended file to the console (exec-memory-logger)
source:exec ------------- channel:memory ------------- sink:logger
- Step 1: In the job directory, create the Flume Agent configuration file exec-memory-logger.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log

# Sink configuration
a1.sinks.k1.type = logger

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Start Hive
- Step 3: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example (-Dflume.root.logger=INFO,console sends Flume's own log output to the console):
bin/flume-ng agent -n a1 -c conf/ -f job/exec-memory-logger.conf -Dflume.root.logger=INFO,console
- Step 4: Run a query in Hive; hive.log is appended to, and the new lines show up in Flume's console output
Tailing a single appended file to HDFS (exec-memory-hdfs)
source:exec ------------- channel:memory ------------- sink:hdfs
- Step 1: To write to HDFS, Flume needs the Hadoop-related jars; upload the following jars into Flume's lib directory
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
- Step 2: In the job directory, create the Flume Agent configuration file exec-memory-hdfs.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume/%Y%m%d/%H
# Prefix for uploaded file names
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3: Start Hive
- Step 4: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example
bin/flume-ng agent -n a1 -c conf/ -f job/exec-memory-hdfs.conf
- Step 5: Run a query in Hive; hive.log is appended to, and the output files on HDFS change
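A quick way to confirm the HDFS sink is writing (a sketch; it assumes the local HDFS client is configured against the hadoop151 NameNode and that the agent has produced output in the current hour):
# List today's time-partitioned output directory; files still being written carry the in-use .tmp suffix
hdfs dfs -ls /flume/$(date +%Y%m%d)
# Print the contents of rolled files from the current hour
hdfs dfs -cat /flume/$(date +%Y%m%d)/$(date +%H)/logs-*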
Monitoring a directory for new files (spooldir-memory-hdfs)
source:Spooling Directory ------------- channel:memory ------------- sink:hdfs
- Step 1: To write to HDFS, Flume needs the Hadoop-related jars; upload the following jars into Flume's lib directory
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
- Step 2: In the job directory, create the Flume Agent configuration file spooldir-memory-hdfs.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/software/flume/data/update
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.fileHeader = true
# Ignore (do not upload) any file ending in .tmp
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume/spooldir/%Y%m%d/%H
# Prefix for uploaded file names
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example
bin/flume-ng agent -n a1 -c conf/ -f job/spooldir-memory-hdfs.conf
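Since the spooling directory source treats each new file as input, a simple test is to drop a file into the spool directory and watch it get renamed (a sketch; the test file name is arbitrary):
# Create a test file and move it into the spool directory
echo "hello spooldir" > /tmp/test.log
mv /tmp/test.log /opt/software/flume/data/update/
# After Flume ingests the file, it is renamed with the configured suffix
ls /opt/software/flume/data/update/   # expect test.log.COMPLETED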
Monitoring multiple appended files under a directory (taildir-memory-hdfs)
source:taildir ------------- channel:memory ------------- sink:hdfs
- Step 1: To write to HDFS, Flume needs the Hadoop-related jars; upload the following jars into Flume's lib directory
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
- Step 2: In the job directory, create the Flume Agent configuration file taildir-memory-hdfs.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/software/flume/data/taildir/tail_dir.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/software/flume/data/taildir/file1.txt
a1.sources.r1.filegroups.f2 = /opt/software/flume/data/taildir/file2.txt

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume/taildir/%Y%m%d/%H
# Prefix for uploaded file names
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example
bin/flume-ng agent -n a1 -c conf/ -f job/taildir-memory-hdfs.conf
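The taildir source picks up appends to the files in its file groups, so appending a line to either tracked file should show up on HDFS shortly afterwards (a quick test sketch):
echo "hello file1" >> /opt/software/flume/data/taildir/file1.txt
echo "hello file2" >> /opt/software/flume/data/taildir/file2.txt
# The consumed byte offsets are recorded in the position file
cat /opt/software/flume/data/taildir/tail_dir.json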
Integrating Flume with Kafka
source:netcat ------------- channel:memory ------------- sink:kafka
- Step 1: Write the Flume configuration file (flume-kafka.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = fzk
a1.sinks.k1.kafka.bootstrap.servers = hadoop151:9092,hadoop152:9092,hadoop153:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Start a Kafka console consumer
bin/kafka-console-consumer.sh --bootstrap-server hadoop151:9092,hadoop152:9092,hadoop153:9092 \
  --topic fzk
- Step 3: Start Flume
bin/flume-ng agent -n <agent name> -c conf/ -f <path to flume-kafka.conf>
# Run:
bin/flume-ng agent -n a1 -c conf/ -f job/flume-kafka.conf
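The steps above do not include an explicit test; since the source is netcat, you can drive the pipeline by sending a few lines to the listening port and watching them arrive in the console consumer from step 2 (this assumes the fzk topic exists or topic auto-creation is enabled on the brokers):
nc hadoop151 44444
# type a few lines; each one should appear in the kafka-console-consumer window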
4. Advanced Flume
Replicating and multiplexing (Flume Channel Selectors)
- Replicating Channel Selector: replicating selector (the default)
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#replicating-channel-selector-default
- Multiplexing Channel Selector: multiplexing selector
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#multiplexing-channel-selector
- Custom Channel Selector: custom selector
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#custom-channel-selector
Replicating (Replicating Channel Selector)
Requirement: Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it on HDFS; at the same time, Flume-1 passes the content to Flume-3, which writes it to the local file system
- Step 1: Write the flume-1 configuration file (flume-1.conf)
# Names of the agent's components (source, sinks, channels)
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/software/flume/data/taildir/taildir001.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
# Channel 1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Channel 2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Channel selector (replicating, the default)
a1.sources.r1.selector.type = replicating

# Wiring between source, channels, and sinks
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- Step 2: Write the flume-2 configuration file (flume-2.conf)
# Names of the agent's components (source, sink, channel)
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop151
a2.sources.r1.port = 4141

# Sink configuration
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume2/%Y%m%d/%H
# Prefix for uploaded file names
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll directories based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a2.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a2.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 3: Write the flume-3 configuration file (flume-3.conf)
# Names of the agent's components (source, sink, channel)
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop151
a3.sources.r1.port = 4142

# Sink configuration (create the output directory beforehand)
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/software/flume/data/flumeData

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 4: Start flume-2 and flume-3 first, then flume-1
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a2 -c conf/ -f job/group1/flume-2.conf
bin/flume-ng agent -n a3 -c conf/ -f job/group1/flume-3.conf
bin/flume-ng agent -n a1 -c conf/ -f job/group1/flume-1.conf
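To verify the replication, append to the tailed log and check that the same content reaches both sinks (a sketch using the paths from the configs above):
echo "replicating test" >> /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log
# Sink k1 -> flume-2: check HDFS
hdfs dfs -ls /flume2/$(date +%Y%m%d)
# Sink k2 -> flume-3: check the local file_roll directory
ls -l /opt/software/flume/data/flumeData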
Multiplexing (Multiplexing Channel Selector)
See the Custom Interceptor section below
Failover and load balancing (Flume Sink Processors)
- Default Sink Processor: the default
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#default-sink-processor
- Failover Sink Processor: failover
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#failover-sink-processor
- Load balancing Sink Processor: load balancing
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#load-balancing-sink-processor
Failover (Failover Sink Processor)
Requirement: Flume-1 monitors a port, and the sinks in its sink group feed Flume-2 and Flume-3; use a FailoverSinkProcessor to provide failover
- Step 1: Write the flume-1 configuration file (flume-1.conf)
# Names of the agent's components (source, sinks, channel)
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink processor (failover); the higher-priority sink receives events while it is healthy
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Wiring between source, channel, and sinks
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
- Step 2: Write the flume-2 configuration file (flume-2.conf)
# Names of the agent's components (source, sink, channel)
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop151
a2.sources.r1.port = 4141

# Sink configuration
a2.sinks.k1.type = logger

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 3: Write the flume-3 configuration file (flume-3.conf)
# Names of the agent's components (source, sink, channel)
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop151
a3.sources.r1.port = 4142

# Sink configuration
a3.sinks.k1.type = logger

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 4: Start flume-2 and flume-3 first, then flume-1
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a2 -c conf/ -f job/group2/flume-2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a3 -c conf/ -f job/group2/flume-3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group2/flume-1.conf
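To watch the failover happen, send data to the port, then stop the agent behind the higher-priority sink (k2 points at port 4142, i.e. flume-3) and keep sending (a sketch; the kill command assumes flume-3 is the only agent started with flume-3.conf):
nc hadoop151 44444
# lines typed here print on flume-3's console (priority 10 beats 5)
# in another terminal, kill flume-3, e.g.:
#   kill $(jps -ml | grep flume-3.conf | awk '{print $1}')
# keep typing in the nc session: lines now print on flume-2's console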
Load balancing (Load balancing Sink Processor)
Requirement: Flume-1 monitors a port, and the sinks in its sink group feed Flume-2 and Flume-3; use a LoadBalancingSinkProcessor to provide load balancing (each log line is randomly routed to either Flume-2 or Flume-3)
- The setup is the same as the failover example above; only the flume-1 configuration file (flume-1.conf) differs
# Names
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink processor (load balancing)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

# Wiring between source, channel, and sinks
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Aggregation
Requirement
- Flume-1 on hadoop151 monitors the file /opt/software/flume/data/group.log
- Flume-2 on hadoop152 monitors the data stream on a port
- Flume-1 and Flume-2 send their data to Flume-3 on hadoop153, which prints the final data to the console
- Step 1: Write the flume-1 configuration file on hadoop151 (flume-1.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/software/flume/taildir_position1.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/software/flume/data/group.log

# Sink configuration
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop153
a1.sinks.k1.port = 4141

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Write the flume-2 configuration file on hadoop152 (flume-2.conf)
# Names
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop152
a2.sources.r1.port = 44444

# Sink configuration
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop153
a2.sinks.k1.port = 4141

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 3: Write the flume-3 configuration file on hadoop153 (flume-3.conf)
# Names
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop153
a3.sources.r1.port = 4141

# Sink configuration
a3.sinks.k1.type = logger

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 4: Start flume-3 first, then flume-1 and flume-2
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a3 -c conf/ -f job/group3/flume-3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group3/flume-1.conf
bin/flume-ng agent -n a2 -c conf/ -f job/group3/flume-2.conf
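A quick end-to-end test: append to the tailed file on hadoop151 and send a line through the port on hadoop152; both should print on flume-3's console on hadoop153 (a sketch using the paths from the configs above):
# On hadoop151: append to the monitored file
echo "from hadoop151" >> /opt/software/flume/data/group.log
# On hadoop152: send a line to the netcat source
nc hadoop152 44444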
Custom Interceptor (implementing multiplexing)
Requirement
- Use Flume to collect local server logs; logs of different types must be routed to different analysis systems
Analysis
- In real development, one server may produce many types of logs, and different types may need to go to different analysis systems. This calls for the Multiplexing structure in Flume's topology: events are routed to different Channels based on the value of a key in each event's header, so we need a custom Interceptor that assigns different header values to events of different types
- In this example we simulate logs with port data, using single digits and single letters to stand for different log types; the custom interceptor distinguishes digits from letters and sends each kind to its own analysis system (Channel)
- Step 1: Create a Maven project and add the dependency (pom.xml)
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
</dependencies>
- Step 2: Write the custom Interceptor class (TypeInterceptor)
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

    private List<Event> eventList;

    // Initialization
    public void initialize() {
        eventList = new ArrayList<Event>();
    }

    // Intercept a single event
    public Event intercept(Event event) {
        // Get the header map
        Map<String, String> header = event.getHeaders();
        // Get the body
        byte[] body = event.getBody();
        // Tag letters with type=letter, everything else with type=number
        if ((body[0] >= 'A' && body[0] <= 'Z') || (body[0] >= 'a' && body[0] <= 'z')) {
            header.put("type", "letter");
        } else {
            header.put("type", "number");
        }
        return event;
    }

    // Intercept a batch of events
    public List<Event> intercept(List<Event> list) {
        eventList.clear();
        for (Event event : list) {
            eventList.add(intercept(event));
        }
        return eventList;
    }

    // Shutdown
    public void close() {
    }

    // Builder that hands Flume the custom Interceptor
    public static class Builder implements Interceptor.Builder {
        public Interceptor build() {
            // Return the custom interceptor
            return new TypeInterceptor();
        }

        public void configure(Context context) {
        }
    }
}
- Step 3: Package the class and upload the jar to Flume's lib directory on the server, as sketched below
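A minimal packaging sketch with Maven (the destination path assumes Flume is installed under /opt/software/flume):
mvn clean package
# Copy the built jar into Flume's lib directory so the agent can load the class
cp target/*.jar /opt/software/flume/lib/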
- Step 4: Write the flume-1 configuration file (flume-1.conf)
# Names
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
# Channel 1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Channel 2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Interceptor configuration
a1.sources.r1.interceptors = i1
# Fully qualified name of the custom class
a1.sources.r1.interceptors.i1.type = com.itfzk.flume.interceptor.TypeInterceptor$Builder

# Channel selector (multiplexing)
a1.sources.r1.selector.type = multiplexing
# The header key set by the custom Java interceptor
a1.sources.r1.selector.header = type
# Route on the header value set by the interceptor: "letter" goes to c1, "number" goes to c2
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2

# Wiring between source, channels, and sinks
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- Step 5: Write the flume-2 configuration file (flume-2.conf)
# Names
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop151
a2.sources.r1.port = 4141

# Sink configuration
a2.sinks.k1.type = logger

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 6: Write the flume-3 configuration file (flume-3.conf)
# Names
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop151
a3.sources.r1.port = 4142

# Sink configuration
a3.sinks.k1.type = logger

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 7: Start flume-2 and flume-3 first, then flume-1
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a2 -c conf/ -f job/group3/flume-2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a3 -c conf/ -f job/group3/flume-3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group3/flume-1.conf
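To exercise the multiplexing, send single letters and single digits to the port and watch which console they land on (letters should go through c1 to flume-2 on port 4141, digits through c2 to flume-3 on port 4142):
nc hadoop151 44444
# type: a   -> printed by flume-2 (header type=letter -> channel c1)
# type: 1   -> printed by flume-3 (header type=number -> channel c2)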
Custom Source
Requirement
- Use Flume to receive data, add a prefix and a suffix to each record, and print the result to the console
- The prefix and suffix are configurable in the Flume configuration file; the suffix has a default value
- The custom Source class (package it and upload the jar to the server)
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;
    private String subfix;

    // Initialize from the context (reads the Flume configuration file)
    public void configure(Context context) {
        // Read a1.sources.r1.prefix and a1.sources.r1.subfix from the configuration
        prefix = context.getString("prefix");
        subfix = context.getString("subfix", "fzk");
    }

    // Build events and hand them to the channel; this method is called in a loop
    public Status process() throws EventDeliveryException {
        Status status = null;
        try {
            for (int i = 0; i < 5; i++) {
                // Create an event and set its body
                SimpleEvent event = new SimpleEvent();
                event.setBody((prefix + "--" + i + "--" + subfix).getBytes());
                // Write the event to the channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(1000);
            }
            // Ready
            status = Status.READY;
        } catch (Exception e) {
            e.printStackTrace();
            // Back off
            status = Status.BACKOFF;
        }
        return status;
    }

    public long getBackOffSleepIncrement() {
        return 0;
    }

    public long getMaxBackOffSleepInterval() {
        return 0;
    }
}
- The Flume configuration file (mysource.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = com.itfzk.flume.source.MySource
a1.sources.r1.prefix = flume
a1.sources.r1.subfix = end

# Sink configuration
a1.sinks.k1.type = logger

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
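The run command is not shown above; starting the agent follows the same pattern as the earlier examples (this assumes the configuration was saved as job/mysource.conf):
bin/flume-ng agent -n a1 -c conf/ -f job/mysource.conf -Dflume.root.logger=INFO,console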
Custom Sink
Overview
- A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or sends them on to another Flume Agent
- A Sink is fully transactional. Before removing a batch of data from the Channel, each Sink opens a transaction with the Channel. Once the batch of events has been successfully written out to the storage system or to the next Flume Agent, the Sink commits the transaction via the Channel; once the transaction is committed, the Channel removes those events from its internal buffer
Requirement: use Flume to receive data and, on the Sink side, add a prefix and suffix to each record before printing it to the console; the prefix and suffix are configurable in the Flume job configuration file
- The custom Sink class (package it and upload the jar to the server)
import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    private String prefix;
    private String subfix;
    private Logger logger;

    // Initialize from the context (reads the Flume configuration file)
    public void configure(Context context) {
        // Read a1.sinks.k1.prefix and a1.sinks.k1.subfix from the configuration
        prefix = context.getString("prefix");
        subfix = context.getString("subfix", "fzk");
        logger = LoggerFactory.getLogger(MySink.class);
    }

    // Take events from the Channel; this method is called in a loop
    public Status process() throws EventDeliveryException {
        Status status = null;
        // Get the channel
        Channel channel = getChannel();
        // Get a transaction from the channel and open it
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {
            // Take an event from the channel
            Event event = channel.take();
            // Process the event
            if (event != null) {
                String body = prefix + new String(event.getBody()) + subfix;
                logger.info(body);
            }
            // Commit the transaction
            transaction.commit();
            status = Status.READY;
        } catch (Exception e) {
            e.printStackTrace();
            // Back off and roll back the transaction
            status = Status.BACKOFF;
            transaction.rollback();
        } finally {
            // Close the transaction
            transaction.close();
        }
        return status;
    }
}
- The Flume configuration file (mysink.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
a1.sinks.k1.type = com.itfzk.flume.sink.MySink
a1.sinks.k1.prefix = flume
a1.sinks.k1.subfix = end

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
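As with the custom source, run the agent and then feed it through the netcat source (this assumes the configuration was saved as job/mysink.conf):
bin/flume-ng agent -n a1 -c conf/ -f job/mysink.conf -Dflume.root.logger=INFO,console
# In another terminal:
nc hadoop151 44444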