Flume

Overview

Definition

Flume is a highly available, highly reliable distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It is based on a streaming architecture and is flexible and simple to use.

Streaming architecture: the unit of data being processed is very small.
MapReduce reads the whole file before processing it, so it is not streaming.
Spark is not a streaming architecture either.


Advantages of Flume

  1. It can integrate with any storage process.
  2. When the input rate exceeds the rate at which the destination can store data, Flume buffers the data, reducing the pressure on HDFS.
  3. Transactions in Flume are based on the channel. Two transaction models are used, one on the sending side (source to channel) and one on the receiving side (channel to sink), to make sure messages are delivered reliably.
    These two transactions guarantee that data is not lost, but they do not guarantee that there are no duplicate records.

Flume Architecture (Key Points)

spooling directory: watches a directory; whenever the directory changes, the new data is ingested
exec: runs a command (e.g. tail -f file); when the file changes, the new data is ingested
syslog: tails the system log
avro: chains two Flume agents together
netcat: reads data from a network port

logger: writes events to the log (console)

Agent

An agent is a JVM process that moves data from a source to a destination in the form of events.
It consists of three main components: Source, Channel, and Sink.

Source

The component that receives data into the Flume agent. A source can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

Channel

The buffer that sits between the Source and the Sink; it therefore allows the Source and the Sink to operate at different rates. A channel is thread-safe and can handle writes from several sources and reads from several sinks at the same time.
Flume ships with two channel types: Memory Channel and File Channel.
A Memory Channel is an in-memory queue. It is suitable when losing data is acceptable: if the process dies or the machine crashes or restarts, the data in the channel is lost.
A File Channel writes all events to disk, so no data is lost when the process is shut down or the machine goes down.

Sink

The sink continuously polls the channel for events, removes them in batches, and writes those batches to a storage or indexing system, or to another Flume agent.
Sinks are fully transactional. Before removing a batch of events from the channel, each sink starts a transaction on the channel. Once the batch has been written successfully to the storage system or to the next Flume agent, the sink commits the transaction; only after the commit does the channel delete those events from its internal buffer.
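
As an illustration, the take-side transaction described above looks roughly like the following minimal sketch against the public Flume channel API (the class name is made up, the batching and error handling are simplified, and the write-out step is only a placeholder):

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class TakeTransactionSketch {
    // One round of the take-side transaction a sink runs against its channel.
    static void drainOnce(Channel channel) {
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();   // may be null if the channel is empty
            if (event != null) {
                // write the event to the external store or the next agent here
            }
            txn.commit();                   // only now may the channel drop the event
        } catch (Throwable t) {
            txn.rollback();                 // the event stays in the channel and is retried
        } finally {
            txn.close();
        }
    }
}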

Event

The basic unit of data transferred by Flume; data travels from the source to the destination in the form of events. An event consists of an optional header and a byte array that carries the payload. The header is a HashMap of key-value string pairs.
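
For instance, an event can be built programmatically with Flume's EventBuilder; the header key and body text below are arbitrary, so this is just a small sketch of the event structure:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventSketch {
    public static void main(String[] args) {
        // optional headers: a map of key-value string pairs
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "hadoop01");

        // the payload is just a byte array
        Event event = EventBuilder.withBody("hello flume".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
        System.out.println(event.getHeaders());
    }
}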

Flume Topologies

Chaining agents in series: the advantage is extra buffering, which makes HDFS more stable; the drawback is that if any one agent in the chain goes down, the whole chain goes down.
Flume can send an event stream to one or more destinations. In this mode the source replicates the data to multiple channels; each channel holds the same data, and each sink can deliver it to a different destination.
Flume can group several sinks into one logical sink group and distribute data among the sinks in the group; this is mainly used for load balancing and failover.
This is the most common pattern. In practice, web logs are spread across hundreds or even thousands of servers, which makes them very cumbersome to process. This aggregation topology solves the problem nicely: one Flume agent is deployed on every server to collect its logs and forward them to a central log-collecting Flume agent, which then uploads them to HDFS, Hive, HBase, etc. for log analysis.

Flume Agent Internals (Key Points)

Interceptors must not contain complex processing logic; keep them to simple operations such as trimming the head or tail of an event, because heavy logic costs too much performance (see the interceptor sketch after these notes).

Within a channel, a given message goes to either sink 1 or sink 2; there is no notion of broadcasting it to every sink in a group.
There are two dispatch modes: one hands events to the sinks in turn, one event per sink (round-robin); the other sends each event to whichever sink is currently idle (load balancing).
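
As a concrete illustration, a minimal custom interceptor could look like the sketch below. It assumes the standard org.apache.flume.interceptor.Interceptor API; the class name and the header key it adds are made up for the example:

import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// An interceptor that only does cheap work: it tags every event with one header.
public class TagInterceptor implements Interceptor {

    public void initialize() { }

    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        headers.put("tagged", "true");   // simple, lightweight processing only
        return event;                    // returning null would drop the event
    }

    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    public void close() { }

    // Flume creates interceptors through a Builder named in the agent configuration.
    public static class Builder implements Interceptor.Builder {
        public Interceptor build() {
            return new TagInterceptor();
        }
        public void configure(Context context) { }
    }
}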

Installation

Download from http://flume.apache.org/
Copy the tarball to the software directory
Extract it with tar
Rename the directory

[yyx@hadoop01 module]$ mv apache-flume-1.7.0-bin/ flume

Rename the file flume-env.sh.template under flume/conf to flume-env.sh and configure it:

[yyx@hadoop01 conf]$ cp flume-env.sh.template flume-env.sh
[yyx@hadoop01 conf]$ vim flume-env.sh
# export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JAVA_HOME=/opt/module/jdk1.8.0_144

Examples

Official example: monitoring port data

Requirement: start a Flume job that monitors port 44444 on the local machine (the server side),
then use the netcat tool to send messages to port 44444 on the local machine (the client side);
finally, Flume displays the data it listens to on the console in real time.
Install netcat:

sudo yum install -y nc

Check whether port 44444 is already in use:

[yyx@hadoop01 module]$  sudo netstat -tunlp | grep 44444

Description: netstat is a very useful tool for monitoring TCP/IP networks; it can display the routing table, the actual network connections, and status information for every network interface.
Basic syntax: netstat [options]
Options:
-t or --tcp: show TCP connections;
-u or --udp: show UDP connections;
-n or --numeric: show numeric IP addresses instead of resolving host names;
-l or --listening: show listening server sockets;
-p or --programs: show the PID and name of the program that owns each socket;

Create the Flume agent configuration file flume-netcat-logger.conf.
Create a working directory under the flume directory, then create and fill in the configuration file:

[yyx@hadoop01 flume]$ mkdir listen_44444
[yyx@hadoop01 listen_44444]$  touch flume-netcat-logger.conf
[yyx@hadoop01 listen_44444]$ vim flume-netcat-logger.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:

[yyx@hadoop01 flume]$  bin/flume-ng agent --conf conf/ --name a1 --conf-file listen_44444/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Parameter description:
--conf conf/ : the configuration files live in the conf/ directory
--name a1 : names the agent a1
--conf-file listen_44444/flume-netcat-logger.conf : the configuration file Flume reads for this run is flume-netcat-logger.conf in the listen_44444 folder.
-Dflume.root.logger=INFO,console : -D overrides the flume.root.logger property at runtime and sets the console log level to INFO. Log levels include debug, info, warn, and error.

In another terminal session:

[yyx@hadoop01 ~]$ nc localhost 44444
nihao
OK
2021-03-08 18:19:51,112 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 6E 69 68 61 6F                                  nihao }

Reading a local file into HDFS in real time

Requirement: monitor the Hive log in real time and upload it to HDFS.
First, copy the required jars into /opt/module/flume/lib.
Create and fill in the configuration file under job:

[yyx@hadoop01 job]$  vim flume-file-hdfs.conf


# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /tmp/yyx/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop01:9000/flume/%Y%m%d/%H
# prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# whether to roll folders based on time
a2.sinks.k2.hdfs.round = true
# how many time units before a new folder is created
a2.sinks.k2.hdfs.roundValue = 1
# redefine the time unit
a2.sinks.k2.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# how many events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# file type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# how often (seconds) a new file is rolled
a2.sinks.k2.hdfs.rollInterval = 60
# roll size of each file (bytes)
a2.sinks.k2.hdfs.rollSize = 134217700
# rolling is independent of the number of events
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Run the monitoring configuration:

[yyx@hadoop01 flume]$  bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf


Monitoring a directory and uploading its files to HDFS in real time

Create the configuration file under job:

[yyx@hadoop01 job]$ vim flume-dir-hdfs.conf

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
# the directory to watch
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# ignore (do not upload) any file ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop01:9000/flume/upload/%Y%m%d/%H
# prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# whether to roll folders based on time
a3.sinks.k3.hdfs.round = true
# how many time units before a new folder is created
a3.sinks.k3.hdfs.roundValue = 1
# redefine the time unit
a3.sinks.k3.hdfs.roundUnit = hour
# whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# how many events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# how often (seconds) a new file is rolled
a3.sinks.k3.hdfs.rollInterval = 60
# roll size of each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Start Flume:

[yyx@hadoop01 flume]$  bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

Create a file in the upload folder:

[yyx@hadoop01 upload]$ vim test1.txt
[yyx@hadoop01 upload]$ ll
total 4
-rw-rw-r--. 1 yyx yyx 29 Mar  9 09:07 test1.txt.COMPLETED

Modify the previously created file (to the source this is the same as creating a new file):

[yyx@hadoop01 upload]$ vim test1.txt

dwadwadadsaafwad

However, the file in the directory is not renamed again, because a file with that name already exists:

[yyx@hadoop01 upload]$ ll
total 8
-rw-rw-r--. 1 yyx yyx 17 Mar  9 09:11 test1.txt
-rw-rw-r--. 1 yyx yyx 29 Mar  9 09:07 test1.txt.COMPLETED

Modify a file that has already been marked as completed:

[yyx@hadoop01 upload]$ vim test1.txt.COMPLETED 

我
I
wawdjaoidjadjiafnioaoc
adadsadwadwa

Nothing changes:

When using the Spooling Directory Source:
do not create and then keep modifying files inside the monitored directory;
files that have finished uploading are renamed with the .COMPLETED suffix;
the monitored directory is scanned for file changes every 500 milliseconds.

Example: single source, multiple outputs

Requirement: Flume1 monitors a file for changes and passes the new content to Flume2, which stores it in HDFS; at the same time Flume1 passes the content to Flume3, which writes it to the local file system.
Preparation:
Create a group1 folder under /opt/module/flume/job:

[yyx@hadoop01 job]$ mkdir group1
[yyx@hadoop01 job]$ cd group1/
[yyx@hadoop01 group1]$ mkdir flume3

Create flume-file-flume.conf:

vim flume-file-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/yyx/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# an avro sink is a data sender (this is the key part)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop01 
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop01
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Next, create the Flume configuration whose avro source receives the upstream agent's output and whose sink writes to HDFS:

[yyx@hadoop01 group1]$ vim flume-flume-hdfs.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# an avro source is a data receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop01
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop01:9000/group1/%Y%m%d/%H 
# prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# whether to roll folders based on time
a2.sinks.k1.hdfs.round = true
# how many time units before a new folder is created
a2.sinks.k1.hdfs.roundValue = 1
# redefine the time unit
a2.sinks.k1.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# how many events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# file type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# how often (seconds) a new file is rolled
a2.sinks.k1.hdfs.rollInterval = 600
# roll size of each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Create flume-flume-dir.conf.
Configure an avro source that receives the upstream Flume's output and a sink that writes to a local directory:

vim flume-flume-dir.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop01
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/datas/flume3 
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Note: the local output directory must already exist; the file_roll sink will not create it.

[yyx@hadoop01 ~]$ cd /opt/module/datas/
[yyx@hadoop01 datas]$ mkdir flume3

Run the configuration files flume-flume-dir, flume-flume-hdfs and flume-file-flume.
Start the downstream agents first and the upstream one last (otherwise the upstream Flume has nowhere to send its data):

[yyx@hadoop01 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[yyx@hadoop01 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[yyx@hadoop01 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf
Start Hive:
[yyx@hadoop01 hive]$ bin/hive
This produces error log entries:


[yyx@hadoop01 flume3]$ cat 1615259524810-1
	... 22 more
2021-03-09 10:58:09,784 INFO  [main]: hive.metastore (HiveMetaStoreClient.java:open(376)) - Trying to connect to metastore with URI thrift://hadoop01:9083
2021-03-09 10:58:09,784 WARN  [main]: hive.metastore (HiveMetaStoreClient.java:open(428)) - Failed to connect to the MetaStore Server...
2021-03-09 10:58:09,784 INFO  [main]: hive.metastore (HiveMetaStoreClient.java:open(459)) - Waiting 1 seconds before next connection attempt.
2021-03-09 10:58:10,785 INFO  [main]: hive.metastore (HiveMetaStoreClient.java:open(376)) - Trying to connect to metastore with URI thrift://hadoop01:9083
2021-03-09 10:58:10,786 WARN  [main]: hive.metastore (HiveMetaStoreClient.java:open(428)) - Failed to connect to the MetaStore Server...
2021-03-09 10:58:10,787 INFO  [main]: hive.metastore (HiveMetaStoreClient.java:open(459)) - Waiting 1 seconds before next connection attempt.
2021-03-09 10:58:11,788 INFO  [main]: hive.metastore (HiveMetaStoreClient.java:open(376)) - Trying to connect to metastore with URI thrift://hadoop01:9083
2021-03-09 10:58:11,788 WARN  [main]: hive.metastore (HiveMetaStoreClient.java:open(428)) - Failed to connect to the MetaStore Server...
2021-03-09 10:58:11,788 INFO  [main]: hive.metastore (HiveMetaStoreClient.java:open(459)) - Waiting 1 seconds before next connection attempt.


Example: single source, multiple outputs (sink group)

A single Source and Channel with multiple Sinks.
Requirement: Flume1 listens on a local port; its sink group load-balances the events, so each event goes either to Flume2 or to Flume3, and both of those agents print whatever they receive to the console.

[yyx@hadoop01 hadoop-2.7.2]$ cd /opt/module/flume/
[yyx@hadoop01 flume]$ cd job/
[yyx@hadoop01 job]$ ll
total 8
-rw-rw-r--. 1 yyx yyx 1389 Mar  9 09:04 flume-dir-hdfs.conf
-rw-rw-r--. 1 yyx yyx 1274 Mar  8 19:19 flume-file-hdfs.conf
drwxrwxr-x. 2 yyx yyx   92 Mar  9 11:17 group1
drwxrwxr-x. 2 yyx yyx   38 Mar  8 18:16 listen_44444
[yyx@hadoop01 job]$ mkdir group2
[yyx@hadoop01 job]$ cd group2
[yyx@hadoop01 group2]$ vim flume-netcat-flume.conf

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut=10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop01
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop01
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

load_balance means the sink group balances the load across its sinks.
round_robin means the sinks are selected in turn.
Next, create a Flume agent whose avro source receives from the upstream agent and prints to the local console:

[yyx@hadoop01 group2]$ vim flume-flume-console1.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop01
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Logger: logs events at INFO level.
Create another Flume agent whose avro source receives from the upstream agent and prints to the local console:

[yyx@hadoop01 group2]$ vim flume-flume-console2.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop01
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Start the Flume agents one by one (mind the order: downstream agents first).

Example: aggregating multiple data sources

Requirement:
Flume1 on hadoop02 monitors the file /opt/module/group.log;
Flume2 on hadoop01 monitors the data stream on a port;
Flume1 and Flume2 send their data to Flume3 on hadoop03, and Flume3 prints the data to the console.
First, distribute Flume to the other nodes:

xsync flume

Create a group3 folder under /opt/module/flume/job on hadoop01, hadoop02, and hadoop03.
Create flume1-logger-flume.conf,
which monitors group.log and configures a sink that sends the data to the next Flume agent.
Create it on hadoop02:

vim flume1-logger-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Create flume2-netcat-flume.conf.
Configure a source that monitors the data stream on port 44444 and a sink that forwards it to the next Flume agent.
Create the configuration file on hadoop01:

vim flume2-netcat-flume.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop01
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop03
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Configure flume3-flume-logger.conf
on hadoop03:

vim flume3-flume-logger.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop03
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

Start them one by one (mind the order):

[yyx@hadoop03 flume]$  bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[yyx@hadoop02 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-logger-flume.conf
[yyx@hadoop01 flume]$  bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf

Append to the monitored file on hadoop02:
Send data to the port from hadoop01:

Custom Source

Create a new Maven project and add the dependency to the pom file:

<dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.7.0</version>
        </dependency>

    </dependencies>

Create a class that extends AbstractSource and implements the Configurable and PollableSource interfaces:

import org.apache.flume.Context;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;
import java.util.Map;

public class MySource extends AbstractSource implements Configurable, PollableSource {
    // the field (string) this source will emit, read from the configuration
    private String field;
    // the interval between reads
    private long delay;

    public Status process() { // receives data, wraps it into Events and writes them to the channel; this method is called in a loop
        try {
            // create the headers
            Map<String,String> header = new HashMap<>();
            // create an event
            SimpleEvent event = new SimpleEvent();
            // build and send the events
            for (int i = 0; i < 5; i++) {
                // attach the headers to the event
                event.setHeaders(header);
                // generate the data written into the event
                event.setBody((field + i).getBytes());
                // write the event to the channel
                getChannelProcessor().processEvent(event); // the channel processor is obtained from the parent class
                // wait for the configured interval
                Thread.sleep(delay);
            }
        } catch (Exception e) {
            return Status.BACKOFF; // failure: back off
        }
        return Status.READY; // success: ready for the next call
    }

    public long getBackOffSleepIncrement() {return 0;}

    public long getMaxBackOffSleepInterval() {return 0;}

    public void configure(Context context) { // reads settings from the agent configuration file
        // initialize the configuration; keys are normally lower-case -- capitalised keys are used here on purpose, so the configuration file must use the same case
        field = context.getString("Field","YYX"); // defaults to YYX if not configured
        delay = context.getLong("Delay"); // a default can also be supplied in case the setting is forgotten:
    	//delay = context.getLong("Delay", 20000L);
    }
}

Package the project as a jar and copy it into flume's lib directory.
Write the configuration file:

[yyx@hadoop01 job]$ vim flume-mysource.conf

#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = org.yyx.MySource
# interval of 1 second (the key's case must match what the source code reads)
a1.sources.r1.Delay = 1000
# the data to emit
a1.sources.r1.Field = piu

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start it:

[yyx@hadoop01 flume]$ bin/flume-ng agent -c conf/ -f job/flume-mysource.conf -n a1 -Dflume.root.logger=INFO,console

Success:

2021-03-16 09:17:15,511 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 30                                     piu0 }
2021-03-16 09:17:16,507 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 31                                     piu1 }
2021-03-16 09:17:17,510 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 32                                     piu2 }
2021-03-16 09:17:18,511 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 33                                     piu3 }
2021-03-16 09:17:19,513 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 34                                     piu4 }
2021-03-16 09:17:20,516 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 30                                     piu0 }
2021-03-16 09:17:21,519 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 31                                     piu1 }
2021-03-16 09:17:22,521 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 32                                     piu2 }
2021-03-16 09:17:23,524 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 33                                     piu3 }
2021-03-16 09:17:24,524 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 34                                     piu4 }
2021-03-16 09:17:25,527 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 70 69 75 30                                     piu0 }

Custom Sink
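
To define a custom sink, create a class that extends AbstractSink and implements Configurable. Below is a minimal sketch of what such a sink might look like, assuming the standard Flume sink API; the prefix/suffix property names and their defaults are inferred from the configuration and log output further down, so treat this as an illustration rather than the exact original class.

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    private static final Logger LOG = LoggerFactory.getLogger(MySink.class);

    // text placed before / after each event body when it is logged
    private String prefix;
    private String suffix;

    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();          // may be null if the channel is empty
            if (event != null) {
                LOG.info(prefix + new String(event.getBody()) + suffix);
            }
            txn.commit();
            return Status.READY;
        } catch (Throwable t) {
            txn.rollback();                        // the event stays in the channel
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }

    public void configure(Context context) {
        // defaults chosen to match the log output shown below
        prefix = context.getString("prefix", "The Prefix:");
        suffix = context.getString("suffix", ":end");
    }
}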


Copy the jar into flume/lib and write the configuration file:

[yyx@hadoop01 job]$ vim flume-mysink.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = org.yyx.sink.MySink
#a1.sinks.k1.prefix = begin:
a1.sinks.k1.suffix = :end

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start it:

[yyx@hadoop01 flume]$ bin/flume-ng agent -c conf/ -f job/flume-mysink.conf -n a1 -Dflume.root.logger=INFO,console

[yyx@hadoop01 ~]$ nc localhost 44444

2021-03-16 10:02:34,767 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.yyx.sink.MySink.process(MySink.java:47)] The Prefix:nihao:end
2021-03-16 10:02:42,682 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.yyx.sink.MySink.process(MySink.java:47)] The Prefix:wodanio:end
2021-03-16 10:02:44,233 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.yyx.sink.MySink.process(MySink.java:47)] The Prefix:dwhaodjhwaio:end
2021-03-16 10:02:45,199 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.yyx.sink.MySink.process(MySink.java:47)] The Prefix:waiodhawod:end

Mind the startup order and the shutdown order of the agents.
