Flume
1. Introduction
- Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and moving large volumes of log data. Flume is built on a streaming architecture and is flexible and simple
- Agent
- An Agent is a JVM process that delivers data from a source to a destination in the form of events.
- An Agent is made up of three main components: Source, Channel, and Sink.
- Source
- The component responsible for receiving data into the Flume Agent. A Source can handle log data of many types and formats
- Sink
- A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or sends them on to another Flume Agent
- Channel
- A buffer that sits between the Source and the Sink, which allows the Source and Sink to operate at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time
- Flume ships with three Channels: Memory Channel, File Channel, and Kafka Channel
- Memory Channel is an in-memory queue. It is suitable when data loss is acceptable; if data loss matters, it should not be used, because a crashed process, a machine failure, or a restart will lose the data
- File Channel writes all events to disk, so no data is lost if the process exits or the machine goes down
2. Quick Start
Installation
- Step 1: Upload apache-flume-1.7.0-bin.tar.gz to the Linux host and extract it
- Step 2: Rename flume-env.sh.template under flume/conf to flume-env.sh and configure the JDK install path
export JAVA_HOME=/opt/software/jdk/jdk1.8.0_281  # adjust to your own JDK install path
- Step 3: Install the netcat tool
yum install -y nc
3. Configuration and Basic Usage
Configuration
Official documentation: http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html
Source configuration (commonly used)
- netcat
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#netcat-source
- exec
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#exec-source
- spooling directory
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#spooling-directory-source
- taildir
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#taildir-source
- avro
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#avro-source
- kafka
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#kafka-source
Sink configuration (commonly used)
- logger
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#logger-sink
- hdfs
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#hdfs-sink
- avro
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#avro-sink
- file roll
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#file-roll-sink
- HBase
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#hbasesinks
- hive
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#hive-sink
- kafka
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#kafka-sink
Channel configuration (commonly used)
- Memory channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#memory-channel
- File channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#file-channel
- JDBC channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#jdbc-channel
- Kafka channel
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#kafka-channel
Usage
Monitoring port data and printing to the console (netcat-memory-logger)
source:netcat ------------- channel:memory ------------- sink:logger
- Step 1: In the job directory, create the Flume Agent configuration file netcat-memory-logger.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
a1.sinks.k1.type = logger

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Start Flume listening on the port
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example (-Dflume.root.logger=INFO,console sends Flume's own log output to the console):
bin/flume-ng agent -n a1 -c conf/ -f job/netcat-memory-logger.conf -Dflume.root.logger=INFO,console
- Step 3: Use the netcat tool to send data to the local host
nc <source host> <source port>
# Example
nc hadoop151 44444
Tailing a single appended file to the console (exec-memory-logger)
source:exec ------------- channel:memory ------------- sink:logger
- Step 1: In the job directory, create the Flume Agent configuration file exec-memory-logger.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log

# Sink configuration
a1.sinks.k1.type = logger

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Start Hive
- Step 3: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example (-Dflume.root.logger=INFO,console sends Flume's own log output to the console):
bin/flume-ng agent -n a1 -c conf/ -f job/exec-memory-logger.conf -Dflume.root.logger=INFO,console
- Step 4: Run a query in Hive; hive.log is appended to, and the new lines show up in Flume's console output
Tailing a single appended file to HDFS (exec-memory-hdfs)
source:exec ------------- channel:memory ------------- sink:hdfs
- Step 1: To write to HDFS, Flume needs the Hadoop-related jars; upload the following jars into Flume's lib directory
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
- Step 2: In the job directory, create the Flume Agent configuration file exec-memory-hdfs.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume/%Y%m%d/%H
# Prefix for uploaded file names
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3: Start Hive
- Step 4: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example
bin/flume-ng agent -n a1 -c conf/ -f job/exec-memory-hdfs.conf
- Step 5: Run a query in Hive; hive.log is appended to, and the output files on HDFS change
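A quick way to confirm the HDFS sink is writing (a sketch; it assumes the local HDFS client is configured against the hadoop151 NameNode and that the agent has produced output in the current hour):
# List today's time-partitioned output directory; files still being written carry the in-use .tmp suffix
hdfs dfs -ls /flume/$(date +%Y%m%d)
# Print the contents of rolled files from the current hour
hdfs dfs -cat /flume/$(date +%Y%m%d)/$(date +%H)/logs-*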
Monitoring a directory for new files (spooldir-memory-hdfs)
source:Spooling Directory ------------- channel:memory ------------- sink:hdfs
- Step 1: To write to HDFS, Flume needs the Hadoop-related jars; upload the following jars into Flume's lib directory
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
- Step 2: In the job directory, create the Flume Agent configuration file spooldir-memory-hdfs.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/software/flume/data/update
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.fileHeader = true
# Ignore (do not upload) any file ending in .tmp
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume/spooldir/%Y%m%d/%H
# Prefix for uploaded file names
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example
bin/flume-ng agent -n a1 -c conf/ -f job/spooldir-memory-hdfs.conf
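Since the spooling directory source treats each new file as input, a simple test is to drop a file into the spool directory and watch it get renamed (a sketch; the test file name is arbitrary):
# Create a test file and move it into the spool directory
echo "hello spooldir" > /tmp/test.log
mv /tmp/test.log /opt/software/flume/data/update/
# After Flume ingests the file, it is renamed with the configured suffix
ls /opt/software/flume/data/update/   # expect test.log.COMPLETED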
Monitoring multiple appended files under a directory (taildir-memory-hdfs)
source:taildir ------------- channel:memory ------------- sink:hdfs
- Step 1: To write to HDFS, Flume needs the Hadoop-related jars; upload the following jars into Flume's lib directory
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
- Step 2: In the job directory, create the Flume Agent configuration file taildir-memory-hdfs.conf (the file name is arbitrary) with the following contents
# Names of the agent's components (source, sink, channel)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/software/flume/data/taildir/tail_dir.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/software/flume/data/taildir/file1.txt
a1.sources.r1.filegroups.f2 = /opt/software/flume/data/taildir/file2.txt

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume/taildir/%Y%m%d/%H
# Prefix for uploaded file names
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a1.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 3: Run Flume
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Example
bin/flume-ng agent -n a1 -c conf/ -f job/taildir-memory-hdfs.conf
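The taildir source picks up appends to the files in its file groups, so appending a line to either tracked file should show up on HDFS shortly afterwards (a quick test sketch):
echo "hello file1" >> /opt/software/flume/data/taildir/file1.txt
echo "hello file2" >> /opt/software/flume/data/taildir/file2.txt
# The consumed byte offsets are recorded in the position file
cat /opt/software/flume/data/taildir/tail_dir.json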
Integrating Flume with Kafka
source:netcat ------------- channel:memory ------------- sink:kafka
- Step 1: Write the Flume configuration file (flume-kafka.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = fzk
a1.sinks.k1.kafka.bootstrap.servers = hadoop151:9092,hadoop152:9092,hadoop153:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Start a Kafka console consumer
bin/kafka-console-consumer.sh --bootstrap-server hadoop151:9092,hadoop152:9092,hadoop153:9092 \
  --topic fzk
- Step 3: Start Flume
bin/flume-ng agent -n <agent name> -c conf/ -f <path to flume-kafka.conf>
# Run:
bin/flume-ng agent -n a1 -c conf/ -f job/flume-kafka.conf
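The steps above do not include an explicit test; since the source is netcat, you can drive the pipeline by sending a few lines to the listening port and watching them arrive in the console consumer from step 2 (this assumes the fzk topic exists or topic auto-creation is enabled on the brokers):
nc hadoop151 44444
# type a few lines; each one should appear in the kafka-console-consumer window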
4. Advanced Flume
Replicating and multiplexing (Flume Channel Selectors)
- Replicating Channel Selector: replicating selector (the default)
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#replicating-channel-selector-default
- Multiplexing Channel Selector: multiplexing selector
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#multiplexing-channel-selector
- Custom Channel Selector: custom selector
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#custom-channel-selector
Replicating (Replicating Channel Selector)
Requirement: Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it on HDFS; at the same time, Flume-1 passes the content to Flume-3, which writes it to the local file system
- Step 1: Write the flume-1 configuration file (flume-1.conf)
# Names of the agent's components (source, sinks, channels)
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/software/flume/data/taildir/taildir001.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
# Channel 1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Channel 2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Channel selector (replicating, the default)
a1.sources.r1.selector.type = replicating

# Wiring between source, channels, and sinks
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- Step 2: Write the flume-2 configuration file (flume-2.conf)
# Names of the agent's components (source, sink, channel)
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop151
a2.sources.r1.port = 4141

# Sink configuration
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop151:8020/flume2/%Y%m%d/%H
# Prefix for uploaded file names
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll directories based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a2.sinks.k1.hdfs.roundValue = 1
# The time unit for rolling directories
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Do not roll based on event count
a2.sinks.k1.hdfs.rollCount = 0

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 3: Write the flume-3 configuration file (flume-3.conf)
# Names of the agent's components (source, sink, channel)
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop151
a3.sources.r1.port = 4142

# Sink configuration (create the output directory beforehand)
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/software/flume/data/flumeData

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 4: Start flume-2 and flume-3 first, then flume-1
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a2 -c conf/ -f job/group1/flume-2.conf
bin/flume-ng agent -n a3 -c conf/ -f job/group1/flume-3.conf
bin/flume-ng agent -n a1 -c conf/ -f job/group1/flume-1.conf
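To verify the replication, append to the tailed log and check that the same content reaches both sinks (a sketch using the paths from the configs above):
echo "replicating test" >> /opt/software/hive/apache-hive-3.1.2-bin/logs/hive.log
# Sink k1 -> flume-2: check HDFS
hdfs dfs -ls /flume2/$(date +%Y%m%d)
# Sink k2 -> flume-3: check the local file_roll directory
ls -l /opt/software/flume/data/flumeData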
Multiplexing (Multiplexing Channel Selector)
See the Custom Interceptor section below
Failover and load balancing (Flume Sink Processors)
- Default Sink Processor: the default
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#default-sink-processor
- Failover Sink Processor: failover
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#failover-sink-processor
- Load balancing Sink Processor: load balancing
- http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#load-balancing-sink-processor
Failover (Failover Sink Processor)
Requirement: Flume-1 monitors a port, and the sinks in its sink group feed Flume-2 and Flume-3; use a FailoverSinkProcessor to provide failover
- Step 1: Write the flume-1 configuration file (flume-1.conf)
# Names of the agent's components (source, sinks, channel)
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink processor (failover); the higher-priority sink receives events while it is healthy
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Wiring between source, channel, and sinks
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
- Step 2: Write the flume-2 configuration file (flume-2.conf)
# Names of the agent's components (source, sink, channel)
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop151
a2.sources.r1.port = 4141

# Sink configuration
a2.sinks.k1.type = logger

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 3: Write the flume-3 configuration file (flume-3.conf)
# Names of the agent's components (source, sink, channel)
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop151
a3.sources.r1.port = 4142

# Sink configuration
a3.sinks.k1.type = logger

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 4: Start flume-2 and flume-3 first, then flume-1
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a2 -c conf/ -f job/group2/flume-2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a3 -c conf/ -f job/group2/flume-3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group2/flume-1.conf
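To watch the failover happen, send data to the port, then stop the agent behind the higher-priority sink (k2 points at port 4142, i.e. flume-3) and keep sending (a sketch; the kill command assumes flume-3 is the only agent started with flume-3.conf):
nc hadoop151 44444
# lines typed here print on flume-3's console (priority 10 beats 5)
# in another terminal, kill flume-3, e.g.:
#   kill $(jps -ml | grep flume-3.conf | awk '{print $1}')
# keep typing in the nc session: lines now print on flume-2's console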
Load balancing (Load balancing Sink Processor)
Requirement: Flume-1 monitors a port, and the sinks in its sink group feed Flume-2 and Flume-3; use a LoadBalancingSinkProcessor to provide load balancing (each log line is randomly routed to either Flume-2 or Flume-3)
- The setup is the same as the failover example above; only the flume-1 configuration file (flume-1.conf) differs
# Names
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink processor (load balancing)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

# Wiring between source, channel, and sinks
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Aggregation
Requirement
- Flume-1 on hadoop151 monitors the file /opt/software/flume/data/group.log
- Flume-2 on hadoop152 monitors the data stream on a port
- Flume-1 and Flume-2 send their data to Flume-3 on hadoop153, which prints the final data to the console
- Step 1: Write the flume-1 configuration file on hadoop151 (flume-1.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/software/flume/taildir_position1.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/software/flume/data/group.log

# Sink configuration
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop153
a1.sinks.k1.port = 4141

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Step 2: Write the flume-2 configuration file on hadoop152 (flume-2.conf)
# Names
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop152
a2.sources.r1.port = 44444

# Sink configuration
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop153
a2.sinks.k1.port = 4141

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 3: Write the flume-3 configuration file on hadoop153 (flume-3.conf)
# Names
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop153
a3.sources.r1.port = 4141

# Sink configuration
a3.sinks.k1.type = logger

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 4: Start flume-3 first, then flume-1 and flume-2
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a3 -c conf/ -f job/group3/flume-3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group3/flume-1.conf
bin/flume-ng agent -n a2 -c conf/ -f job/group3/flume-2.conf
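A quick end-to-end test: append to the tailed file on hadoop151 and send a line through the port on hadoop152; both should print on flume-3's console on hadoop153 (a sketch using the paths from the configs above):
# On hadoop151: append to the monitored file
echo "from hadoop151" >> /opt/software/flume/data/group.log
# On hadoop152: send a line to the netcat source
nc hadoop152 44444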
Custom Interceptor (implementing multiplexing)
Requirement
- Use Flume to collect local server logs; logs of different types must be routed to different analysis systems
Analysis
- In real development, one server may produce many types of logs, and different types may need to go to different analysis systems. This calls for the Multiplexing structure in Flume's topology: events are routed to different Channels based on the value of a key in each event's header, so we need a custom Interceptor that assigns different header values to events of different types
- In this example we simulate logs with port data, using single digits and single letters to stand for different log types; the custom interceptor distinguishes digits from letters and sends each kind to its own analysis system (Channel)
- Step 1: Create a Maven project and add the dependency (pom.xml)
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
</dependencies>
- Step 2: Write the custom Interceptor class (TypeInterceptor)
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

    private List<Event> eventList;

    // Initialization
    public void initialize() {
        eventList = new ArrayList<Event>();
    }

    // Intercept a single event
    public Event intercept(Event event) {
        // Get the header map
        Map<String, String> header = event.getHeaders();
        // Get the body
        byte[] body = event.getBody();
        // Tag letters with type=letter, everything else with type=number
        if ((body[0] >= 'A' && body[0] <= 'Z') || (body[0] >= 'a' && body[0] <= 'z')) {
            header.put("type", "letter");
        } else {
            header.put("type", "number");
        }
        return event;
    }

    // Intercept a batch of events
    public List<Event> intercept(List<Event> list) {
        eventList.clear();
        for (Event event : list) {
            eventList.add(intercept(event));
        }
        return eventList;
    }

    // Shutdown
    public void close() {
    }

    // Builder that hands Flume the custom Interceptor
    public static class Builder implements Interceptor.Builder {
        public Interceptor build() {
            // Return the custom interceptor
            return new TypeInterceptor();
        }

        public void configure(Context context) {
        }
    }
}
- Step 3: Package the class and upload the jar to Flume's lib directory on the server, as sketched below
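A minimal packaging sketch with Maven (the destination path assumes Flume is installed under /opt/software/flume):
mvn clean package
# Copy the built jar into Flume's lib directory so the agent can load the class
cp target/*.jar /opt/software/flume/lib/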
- Step 4: Write the flume-1 configuration file (flume-1.conf)
# Names
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
# Sink 1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop151
a1.sinks.k1.port = 4141
# Sink 2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop151
a1.sinks.k2.port = 4142

# Channel configuration
# Channel 1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Channel 2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Interceptor configuration
a1.sources.r1.interceptors = i1
# Fully qualified name of the custom class
a1.sources.r1.interceptors.i1.type = com.itfzk.flume.interceptor.TypeInterceptor$Builder

# Channel selector (multiplexing)
a1.sources.r1.selector.type = multiplexing
# The header key set by the custom Java interceptor
a1.sources.r1.selector.header = type
# Route on the header value set by the interceptor: "letter" goes to c1, "number" goes to c2
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2

# Wiring between source, channels, and sinks
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- Step 5: Write the flume-2 configuration file (flume-2.conf)
# Names
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Source configuration
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop151
a2.sources.r1.port = 4141

# Sink configuration
a2.sinks.k1.type = logger

# Channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
- Step 6: Write the flume-3 configuration file (flume-3.conf)
# Names
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Source configuration
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop151
a3.sources.r1.port = 4142

# Sink configuration
a3.sinks.k1.type = logger

# Channel configuration
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Step 7: Start flume-2 and flume-3 first, then flume-1
bin/flume-ng agent -n <agent name> -c <conf dir> -f <path to your conf file>
# Run:
bin/flume-ng agent -n a2 -c conf/ -f job/group3/flume-2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a3 -c conf/ -f job/group3/flume-3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group3/flume-1.conf
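To exercise the multiplexing, send single letters and single digits to the port and watch which console they land on (letters should go through c1 to flume-2 on port 4141, digits through c2 to flume-3 on port 4142):
nc hadoop151 44444
# type: a   -> printed by flume-2 (header type=letter -> channel c1)
# type: 1   -> printed by flume-3 (header type=number -> channel c2)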
Custom Source
Requirement
- Use Flume to receive data, add a prefix and a suffix to each record, and print the result to the console
- The prefix and suffix are configurable in the Flume configuration file; the suffix has a default value
- The custom Source class (package it and upload the jar to the server)
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;
    private String subfix;

    // Initialize from the context (reads the Flume configuration file)
    public void configure(Context context) {
        // Read a1.sources.r1.prefix and a1.sources.r1.subfix from the configuration
        prefix = context.getString("prefix");
        subfix = context.getString("subfix", "fzk");
    }

    // Build events and hand them to the channel; this method is called in a loop
    public Status process() throws EventDeliveryException {
        Status status = null;
        try {
            for (int i = 0; i < 5; i++) {
                // Create an event and set its body
                SimpleEvent event = new SimpleEvent();
                event.setBody((prefix + "--" + i + "--" + subfix).getBytes());
                // Write the event to the channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(1000);
            }
            // Ready
            status = Status.READY;
        } catch (Exception e) {
            e.printStackTrace();
            // Back off
            status = Status.BACKOFF;
        }
        return status;
    }

    public long getBackOffSleepIncrement() {
        return 0;
    }

    public long getMaxBackOffSleepInterval() {
        return 0;
    }
}
- The Flume configuration file (mysource.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = com.itfzk.flume.source.MySource
a1.sources.r1.prefix = flume
a1.sources.r1.subfix = end

# Sink configuration
a1.sinks.k1.type = logger

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
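The run command is not shown above; starting the agent follows the same pattern as the earlier examples (this assumes the configuration was saved as job/mysource.conf):
bin/flume-ng agent -n a1 -c conf/ -f job/mysource.conf -Dflume.root.logger=INFO,console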
Custom Sink
Overview
- A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or sends them on to another Flume Agent
- A Sink is fully transactional. Before removing a batch of data from the Channel, each Sink opens a transaction with the Channel. Once the batch of events has been successfully written out to the storage system or to the next Flume Agent, the Sink commits the transaction via the Channel; once the transaction is committed, the Channel removes those events from its internal buffer
Requirement: use Flume to receive data and, on the Sink side, add a prefix and suffix to each record before printing it to the console; the prefix and suffix are configurable in the Flume job configuration file
- The custom Sink class (package it and upload the jar to the server)
import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    private String prefix;
    private String subfix;
    private Logger logger;

    // Initialize from the context (reads the Flume configuration file)
    public void configure(Context context) {
        // Read a1.sinks.k1.prefix and a1.sinks.k1.subfix from the configuration
        prefix = context.getString("prefix");
        subfix = context.getString("subfix", "fzk");
        logger = LoggerFactory.getLogger(MySink.class);
    }

    // Take events from the Channel; this method is called in a loop
    public Status process() throws EventDeliveryException {
        Status status = null;
        // Get the channel
        Channel channel = getChannel();
        // Get a transaction from the channel and open it
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {
            // Take an event from the channel
            Event event = channel.take();
            // Process the event
            if (event != null) {
                String body = prefix + new String(event.getBody()) + subfix;
                logger.info(body);
            }
            // Commit the transaction
            transaction.commit();
            status = Status.READY;
        } catch (Exception e) {
            e.printStackTrace();
            // Back off and roll back the transaction
            status = Status.BACKOFF;
            transaction.rollback();
        } finally {
            // Close the transaction
            transaction.close();
        }
        return status;
    }
}
- The Flume configuration file (mysink.conf)
# Names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop151
a1.sources.r1.port = 44444

# Sink configuration
a1.sinks.k1.type = com.itfzk.flume.sink.MySink
a1.sinks.k1.prefix = flume
a1.sinks.k1.subfix = end

# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wiring between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
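As with the custom source, run the agent and then feed it through the netcat source (this assumes the configuration was saved as job/mysink.conf):
bin/flume-ng agent -n a1 -c conf/ -f job/mysink.conf -Dflume.root.logger=INFO,console
# In another terminal:
nc hadoop151 44444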