Flume as a Data Migration Tool

Latest recommended article published on 2023-10-09 13:14:08

绝域时空

Category: Big Data Components. Tags: flume, hadoop, big data

Article link: https://blog.csdn.net/m0_43405302/article/details/123096796

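Contents

I. Flume
II. Installing Flume
- 1. Startup command
III. Listening on a port with Flume
IV. Reading a local file into HDFS in real time
V. Monitoring multiple files and uploading them to HDFS
- 1. Configuration file
- 2. Starting Flume
VI. Configuration reference
VII. Advanced usage
VIII. Going deeper into Flume
IX. Fundamentals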

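I. Flume

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and moving large volumes of log data. It is built on a streaming architecture and is simple and flexible. Its most common job is reading data from a server's local disk in real time and writing it to HDFS.

1. Flume architecture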

[Figure: Flume architecture]

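1) Agent

An Agent is a JVM process that moves data from source to destination in the form of events. It consists of three parts: Source, Channel, and Sink.

2) Source

The Source is the component that receives data into the Flume Agent. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

3) Sink

The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system or forwards them to another Flume Agent. Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.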

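4) Channel

The Channel is a buffer between the Source and the Sink, which lets the two run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with several Channels, among them Memory Channel, File Channel, and Kafka Channel. Memory Channel is an in-memory queue. It fits cases where losing data is acceptable; if data loss matters, do not use it, because a process crash, machine failure, or restart loses whatever is still in the queue.
File Channel writes every event to disk, so data survives a process shutdown or machine failure.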

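5) Event

The Event is Flume's basic unit of data transfer; data travels from source to destination as Events. An Event has two parts, Header and Body. The Header holds the event's attributes as key-value pairs, and the Body holds the record itself as a byte array.

2. How Flume wraps data internally

Inside Flume, data exists in the form of Events.

So once a Source has obtained raw data, it wraps the data in an Event and puts it into the channel;

a Sink takes Events out of the channel and converts them to whatever output form its configuration requires.

An Event object has two main parts: Headers and Body.

Headers is a Map[String,String] that carries key-value metadata (flags, descriptions, and so on).

Body is a byte array that holds the actual data.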
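The structure described above can be sketched in a few lines of Java. This is only an illustration and not Flume's own `Event` interface: the class `SketchEvent`, its helper methods, and the `file` header key are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a Flume event: string headers plus a byte[] body. */
final class SketchEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    SketchEvent(byte[] body) {
        this.body = body.clone();
    }

    Map<String, String> getHeaders() { return headers; }

    byte[] getBody() { return body.clone(); }

    /** What a source does: wrap one raw line into an event and attach metadata. */
    static SketchEvent fromLine(String line, String sourceFile) {
        SketchEvent e = new SketchEvent(line.getBytes(StandardCharsets.UTF_8));
        e.getHeaders().put("file", sourceFile);
        return e;
    }

    /** What a sink does: turn the body back into the form it needs to write. */
    static String toText(SketchEvent e) {
        return new String(e.getBody(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SketchEvent e = fromLine("GET /index.html 200", "/var/log/access.log");
        System.out.println(e.getHeaders() + " " + toText(e));
    }
}
```

The headers travel with the body through the channel, which is what lets interceptors and sinks make decisions without parsing the payload.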

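3. Transaction: the transaction mechanism

Flume's transaction mechanism resembles database transactions:

Flume uses two independent transactions, one for delivering events from Source to Channel and one from Channel to Sink. The spooling directory source, for example, creates one transaction per batch of events from a file; once every event in the transaction has reached the Channel and the commit succeeds, the Source marks that batch as done.

The Channel-to-Sink hop works the same way. If the events cannot be written out for some reason, the transaction rolls back and all of its events stay in the Channel, waiting to be delivered again.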
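The two-transaction idea can be illustrated with a small in-memory model. This is a simplified sketch, not Flume's implementation: `SketchChannel`, `PutTx`, and `TakeTx` are hypothetical names, events are plain strings, and capacity limits and thread safety are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory channel with staged puts and takes. */
final class SketchChannel {
    private final Deque<String> queue = new ArrayDeque<>();

    int size() { return queue.size(); }

    /** Source side: stage a batch and publish it to the queue only on commit. */
    final class PutTx {
        private final List<String> staged = new ArrayList<>();
        void put(String event) { staged.add(event); }
        void commit() { queue.addAll(staged); staged.clear(); }
        void rollback() { staged.clear(); }
    }

    /** Sink side: take events; on rollback they return to the head of the queue. */
    final class TakeTx {
        private final Deque<String> taken = new ArrayDeque<>();
        String take() {
            String e = queue.pollFirst();
            if (e != null) taken.addLast(e);
            return e;
        }
        void commit() { taken.clear(); }
        void rollback() {
            // Re-insert in reverse so the original order is restored.
            while (!taken.isEmpty()) queue.addFirst(taken.removeLast());
        }
    }

    public static void main(String[] args) {
        SketchChannel ch = new SketchChannel();
        SketchChannel.PutTx put = ch.new PutTx();
        put.put("e1");
        put.put("e2");
        put.commit();
        SketchChannel.TakeTx take = ch.new TakeTx();
        take.take();
        take.rollback(); // delivery failed: e1 goes back to the front of the queue
        System.out.println(ch.size());
    }
}
```

Because a failed take puts its events back, a sink that rolls back and retries may deliver the same event more than once; the model shows why the events are not lost, not that they are delivered exactly once.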

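4. Interceptors

Interceptors sit right after the source: events produced by the source are passed to the interceptors, which process or filter them as needed.

Interceptors can also be combined into an interceptor chain.

Flume ships with several commonly used built-in interceptors.

You can also write custom interceptors for your own processing needs.

This is one of Flume's extension points.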
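An interceptor chain can be pictured as a list of functions applied in order. The sketch below works on header maps only and is not Flume's `Interceptor` interface; `SketchInterceptorChain`, `fixed`, and `require` are hypothetical names, and the convention that a `null` result drops the event is an assumption made for this model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical interceptor chain that operates on event headers only. */
final class SketchInterceptorChain {
    private final List<UnaryOperator<Map<String, String>>> interceptors = new ArrayList<>();

    SketchInterceptorChain add(UnaryOperator<Map<String, String>> interceptor) {
        interceptors.add(interceptor);
        return this;
    }

    /** Runs the interceptors in order; a null result means the event is dropped. */
    Map<String, String> intercept(Map<String, String> headers) {
        Map<String, String> current = headers;
        for (UnaryOperator<Map<String, String>> i : interceptors) {
            current = i.apply(current);
            if (current == null) return null;
        }
        return current;
    }

    /** Adds a fixed key/value pair, in the spirit of a static interceptor. */
    static UnaryOperator<Map<String, String>> fixed(String key, String value) {
        return h -> { h.put(key, value); return h; };
    }

    /** Drops events whose headers lack the given key. */
    static UnaryOperator<Map<String, String>> require(String key) {
        return h -> h.containsKey(key) ? h : null;
    }

    public static void main(String[] args) {
        SketchInterceptorChain chain = new SketchInterceptorChain()
                .add(fixed("datacenter", "dc1"))
                .add(require("datacenter"));
        System.out.println(chain.intercept(new HashMap<>()));
    }
}
```

Order matters in a chain: here `require` only passes because `fixed` ran first.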

II. Installing Flume

Flume can be downloaded from http://archive.apache.org/dist/flume

1. Startup command

# Flume startup command
bin/flume-ng agent -c ./conf ....

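commands

| Command | Description |
| --- | --- |
| help | Show this help text |
| agent | Start an agent process |
| avro-client | Start a client for testing an avro source (it can send an avro-serialized stream) |
| version | Show the current Flume version |

Global options

| Option | Description |
| --- | --- |
| `--conf,-c <conf>` | Directory that holds Flume's system configuration files |
| `--classpath,-C <cp>` | Add extra jar paths |
| `--dryrun,-d` | Do not actually start the Flume agent; just print the command |
| `--plugins-path <dirs>` | Paths that hold plugins (jars) |
| `-Dproperty=value` | Set a Java system property |
| `-Xproperty=value` | Pass a JVM option |

agent options

| Option | Description |
| --- | --- |
| `--name,-n <name>` | Name of the agent (as used in the user's collection configuration file) |
| `--conf-file,-f <file>` | Path of the user's collection configuration file |
| `--zkConnString,-z <str>` | ZooKeeper connection string |
| `--zkBasePath,-p <path>` | ZooKeeper path that holds the user configuration, for example /flume/config |
| `--no-reload-conf` | Disable dynamic reloading of the configuration file |
| `--help,-h` | Show help |

avro-client options

| Option | Description |
| --- | --- |
| `--rpcProps,-P <file>` | RPC client properties file with server connection params |
| `--host,-H <host>` | Target host for the avro-serialized data (the machine running the avro source) |
| `--port,-p <port>` | Port on the target host |
| `--dirname <dir>` | Directory holding the data to serialize and send (prepare the test data in a file beforehand) |
| `--filename,-F <file>` | File holding the data to serialize and send (default: std input) |
| `--headerFile,-R <file>` | File that stores header key-value pairs |
| `--help,-h` | Show help |
| `-Dflume.monitoring.type=http -Dflume.monitoring.port=34545` | Enable the built-in monitoring |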

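III. Listening on a port with Flume

The usual steps for listening on a port:

Use the netcat tool to send data to port 44444 on the local machine.
Flume listens on local port 44444 and reads the data through its source.
Flume writes the data it receives to the console through its sink.

1. Change directory and create the configuration file

# change to the Flume directory
cd /opt/software/flume
# create a directory
mkdir job
# create the configuration file
cd job
touch flume-netcat-logger.conf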

2. Configuration

# edit the configuration file
vim flume-netcat-logger.conf
#---------------------------------------------------------------------------------
# Name the components on this agent; a1 is the agent name
# r1, k1, c1 are the names of a1's source, sink, and channel
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
# logger sink: writes events to the console log
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
# capacity: the channel holds at most 1000 events
# transactionCapacity: at most 100 events per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
# connect r1 to c1 and k1 to c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#------------------------------------------------------------------------------
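Flume agent configuration files use the Java properties format, in which `#` starts a comment only at the beginning of a line. The following minimal sketch, assuming the file is parsed with standard `java.util.Properties` rules (the class name `SketchInlineComment` is hypothetical), shows what happens to a comment placed after a value:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class SketchInlineComment {
    /** Loads properties text and returns the raw value stored under the key. */
    static String valueOf(String text, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String trailing = "a1.sources = r1 # name of the source\n";
        String ownLine = "# name of the source\na1.sources = r1\n";
        // The trailing comment becomes part of the value.
        System.out.println("[" + valueOf(trailing, "a1.sources") + "]");
        // A comment on its own line is ignored.
        System.out.println("[" + valueOf(ownLine, "a1.sources") + "]");
    }
}
```

For that reason it is safer to keep explanatory comments on their own lines in agent configuration files.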

3. Start the Flume agent

# start the agent (generic form)
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
# start the agent with this example's configuration
flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Parameters

**--conf/-c:** the configuration files are stored in the conf/ directory
**--name/-n:** the agent is named a1
**--conf-file/-f:** the configuration file read for this run is flume-netcat-logger.conf in the job directory.
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at run time, here setting the console log level to INFO. Log levels include debug, info, warn, and error.

4. Send content to local port 44444 with netcat

# install netcat
yum -y install nc
# send data
nc localhost 44444

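IV. Reading a local file into HDFS in real time

Steps:

Create a suitable Flume configuration file.
Run Flume with that configuration to start monitoring.
Start Hive so that it generates log output.
Check the data on HDFS.

1. Dependency jars

# copy these jars into Flume's lib directory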
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar

2. Create the configuration file

# create the configuration file
vim /opt/software/flume/job/flume-file-hdfs.conf
#-----------------------------------------------------------------
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
# exec source: runs a command and reads its output
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/software/hive/logs/hive.log
# shell used to run the command
a2.sources.r2.shell = /bin/bash -c
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
# prefix of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# round the timestamp down so directories roll by time: one new directory per hour
a2.sinks.k2.hdfs.round = true
a2.sinks.k2.hdfs.roundValue = 1
a2.sinks.k2.hdfs.roundUnit = hour
# use the local time instead of a timestamp header from the event
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# number of events written before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# file type; DataStream writes plain data, compressed types are also supported
a2.sinks.k2.hdfs.fileType = DataStream
# roll to a new file every 60 seconds
a2.sinks.k2.hdfs.rollInterval = 60
# roll when the file reaches this size in bytes (just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0: rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
#-----------------------------------------------------------------
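To see how the escape sequences in `hdfs.path` interact with the `round*` settings, here is a rough model of the directory name the sink derives from an event time. It is a sketch under simplifying assumptions, not the sink's actual code: `SketchHdfsPath` and `resolve` are hypothetical names, only hour rounding is modeled, and time zones are ignored.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class SketchHdfsPath {
    /** Rounds down to a multiple of roundValue hours, then fills in %Y%m%d/%H. */
    static String resolve(String prefix, LocalDateTime eventTime, int roundValue) {
        LocalDateTime hour = eventTime.truncatedTo(ChronoUnit.HOURS);
        LocalDateTime rounded = hour.withHour(hour.getHour() / roundValue * roundValue);
        return prefix + rounded.format(DateTimeFormatter.ofPattern("yyyyMMdd/HH"));
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2022, 2, 23, 18, 47, 5);
        // roundValue = 1, roundUnit = hour: one directory per hour
        System.out.println(resolve("hdfs://hadoop102:9000/flume/", t, 1));
    }
}
```

With a path that ends in `%H`, rounding to 1 hour changes nothing further; a larger `roundValue` groups several hours into one directory.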

3. Run Flume

# run Flume
flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf

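V. Monitoring multiple files and uploading them to HDFS

Steps:

Create a suitable Flume configuration file.
Run Flume with that configuration to start monitoring.
Add files to the upload directory.
Check the data on HDFS.
Check /opt/software/flume/upload: uploaded files should now end in .COMPLETED, and files ending in .tmp should not have been uploaded.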

1. Configuration file

# create the configuration file
vim /opt/software/flume/job/flume-dir-hdfs.conf
#---------------------------------------------------------------------------------------
# Name the source, sink, and channel of agent a3
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
# spooldir source: watches a directory for new files
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/software/flume/upload
# suffix appended to files that have been fully ingested
a3.sources.r3.fileSuffix = .COMPLETED
# add a header that stores the absolute path of the file
a3.sources.r3.fileHeader = true
# ignore (do not upload) files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
# HDFS path the files are uploaded to
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# prefix of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# round the timestamp down so directories roll by time: one new directory per hour
a3.sinks.k3.hdfs.round = true
a3.sinks.k3.hdfs.roundValue = 1
a3.sinks.k3.hdfs.roundUnit = hour
# use the local time instead of a timestamp header from the event
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# number of events written before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# file type; DataStream writes plain data, compressed types are also supported
a3.sinks.k3.hdfs.fileType = DataStream
# roll to a new file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# roll when the file reaches this size in bytes
a3.sinks.k3.hdfs.rollSize = 134217700
# 0: rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
# capacity: at most 1000 events; transactionCapacity: at most 100 events per transaction
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
#---------------------------------------------------------------------------------------
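The effect of `fileSuffix` and `ignorePattern` on which files get picked up can be checked with a small regex test. This is a sketch that assumes the pattern is taken literally as written and must match the whole file name; `SketchIgnorePattern` and `eligible` are hypothetical names, not part of Flume.

```java
import java.util.regex.Pattern;

final class SketchIgnorePattern {
    // Same regex as ignorePattern in the configuration above.
    private static final Pattern IGNORE = Pattern.compile("([^ ]*\\.tmp)");
    private static final String COMPLETED_SUFFIX = ".COMPLETED";

    /** True if a file in the spool directory would be picked up under these assumptions. */
    static boolean eligible(String fileName) {
        return !fileName.endsWith(COMPLETED_SUFFIX)
                && !IGNORE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        for (String f : new String[] {"a.log", "b.tmp", "c.log.COMPLETED", "my file.tmp"}) {
            System.out.println(f + " -> " + eligible(f));
        }
    }
}
```

Note that `[^ ]*` excludes spaces, so under a full-match reading a name such as `my file.tmp` is not ignored by this pattern.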

2. Start Flume

# start Flume
flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

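VI. Configuration reference

https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

1. Starting Flume with a ZooKeeper-based configuration

flume-ng agent --conf conf -z zkhost:2181,zkhost1:2181 -p /flume --name a1 -Dflume.root.logger=INFO,console

Parameters

| Argument | Default | Description |
| --- | --- | --- |
| z | – | ZooKeeper connection string: a comma-separated list of hostname:port |
| p | /flume | Base path in ZooKeeper under which agent configurations are stored |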

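2. Source configuration

1) Avro Source ★

An Avro source receives data by listening on a network port, and the data it receives must have been serialized with Avro, a cross-language serialization framework.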

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

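Properties

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `avro` |
| bind | – | Hostname or IP address to listen on |
| port | – | Port to bind to |
| threads | – | Maximum number of worker threads to spawn |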

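2) Thrift Source

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `thrift` |
| bind | – | Hostname or IP address to listen on |
| port | – | Port to bind to |

Example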

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

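3) Exec Source

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `exec` |
| command | – | The command to execute |

Example

# example 1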
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
# example 2
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

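4) JMS Source

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `jms` |
| initialContextFactory | – | Initial context factory, for example org.apache.activemq.jndi.ActiveMQInitialContextFactory |
| connectionFactory | – | JNDI name the connection factory should appear as |
| providerURL | – | JMS provider URL |
| destinationName | – | Destination name |
| destinationType | – | Destination type (queue or topic) |

Example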

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE

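5) Spooling Directory Source ★

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `spooldir` |
| spoolDir | – | Directory from which to read files |

Example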

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true

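6) Taildir Source ★

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `TAILDIR` |
| filegroups | – | Space-separated list of file groups. Each file group indicates a set of files to be tailed |
| filegroups.&lt;filegroupName&gt; | – | Absolute path of the file group. Regular expressions (not file system patterns) can be used for the file name only |

Example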

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000

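7) Kafka Source ★

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `org.apache.flume.source.kafka.KafkaSource` |
| kafka.bootstrap.servers | – | List of brokers in the Kafka cluster used by the source |
| kafka.consumer.group.id | flume | Unique identifier of the consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group |
| kafka.topics | – | Comma-separated list of topics the Kafka consumer reads messages from |
| kafka.topics.regex | – | Regex that defines the set of topics the source subscribes to. This property has higher priority than `kafka.topics` and overrides it if both are present |

Example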

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id

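8) NetCat TCP Source ★

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `netcat` |
| bind | – | Hostname or IP address to bind to |
| port | – | Port to bind to |

Example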

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

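9) NetCat UDP Source ★

| Property | Default | Description |
| --- | --- | --- |
| channels | – | |
| type | – | Component type name; must be `netcatudp` |
| bind | – | Hostname or IP address to bind to |
| port | – | Port to bind to |

Example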

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

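10) HTTP Source

| Property | Default | Description |
| --- | --- | --- |
| type | | Component type name; must be `http` |
| port | – | Port the source should bind to |

Example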

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props
a1.sources.r1.HttpConfiguration.sendServerVersion = false
a1.sources.r1.ServerConnector.idleTimeout = 300

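3. Sink configuration

1) HDFS Sink

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `hdfs` |
| hdfs.path | – | HDFS directory path (for example hdfs://namenode/flume/webdata/) |

Example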

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

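2) Hive Sink

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `hive` |
| hive.metastore | – | Hive metastore URI (for example thrift://a.b.com:9083) |
| hive.database | – | Hive database name |
| hive.table | – | Hive table name |

Example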

a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =id,,msg

3) Logger Sink

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `logger` |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

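4) Avro Sink

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `avro` |
| hostname | – | Hostname or IP address of the remote avro source to connect to |
| port | – | Port of the remote avro source |

Example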

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

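5) Thrift Sink

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `thrift` |
| hostname | – | Hostname or IP address of the remote thrift source to connect to |
| port | – | Port of the remote thrift source |

Example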

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = thrift
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

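6) IRC Sink

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `irc` |
| hostname | – | Hostname or IP address to connect to |
| port | 6667 | Port of the remote host to connect to |
| nick | – | Nick name |
| user | – | User name |
| password | – | User password |
| chan | – | Channel |

Example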

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = irc
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = irc.yourdomain.com
a1.sinks.k1.nick = flume
a1.sinks.k1.chan = #flume

7) HBase Sink

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1

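Properties

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `hbase` |
| table | – | Name of the HBase table to write to |
| columnFamily | – | Column family in HBase to write to |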

8) HBase2 Sink

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase2
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.RegexHBase2EventSerializer
a1.sinks.k1.channel = c1

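Properties

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `hbase2` |
| table | – | Name of the HBase table to write to |
| columnFamily | – | Column family in HBase to write to |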

9) ElasticSearchSink

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = elasticsearch
a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
a1.sinks.k1.indexName = foo_index
a1.sinks.k1.indexType = bar_type
a1.sinks.k1.clusterName = foobar_cluster
a1.sinks.k1.batchSize = 500
a1.sinks.k1.ttl = 5d
a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
a1.sinks.k1.channel = c1

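Properties

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `org.apache.flume.sink.elasticsearch.ElasticSearchSink` |
| hostNames | – | Comma-separated list of hostname:port; if the port is not present, the default port "9300" is used |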

10) Kafka Sink

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy

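Properties

| Property | Default | Description |
| --- | --- | --- |
| type | – | Must be set to `org.apache.flume.sink.kafka.KafkaSink` |
| kafka.bootstrap.servers | – | List of brokers the Kafka sink connects to in order to get the list of topic partitions. This can be a partial list of brokers, but at least two are recommended for HA. The format is a comma-separated list of hostname:port |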

11) HTTP Sink

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = http
a1.sinks.k1.channel = c1
a1.sinks.k1.endpoint = http://localhost:8080/someuri
a1.sinks.k1.connectTimeout = 2000
a1.sinks.k1.requestTimeout = 2000
a1.sinks.k1.acceptHeader = application/json
a1.sinks.k1.contentTypeHeader = application/json
a1.sinks.k1.defaultBackoff = true
a1.sinks.k1.defaultRollback = true
a1.sinks.k1.defaultIncrementMetrics = false
a1.sinks.k1.backoff.4XX = false
a1.sinks.k1.rollback.4XX = false
a1.sinks.k1.incrementMetrics.4XX = true
a1.sinks.k1.backoff.200 = false
a1.sinks.k1.rollback.200 = false
a1.sinks.k1.incrementMetrics.200 = true

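Properties

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be `http` |
| endpoint | – | Fully qualified URL endpoint to POST to |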

12) Custom Sink

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.example.MySink
a1.sinks.k1.channel = c1

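Properties

| Property | Default | Description |
| --- | --- | --- |
| channel | – | |
| type | – | Component type name; must be the FQCN of your class |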

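4. Interceptors

1) timestamp interceptor

Writes one key-value pair into the event header. The key name is configurable; the value is the current timestamp in milliseconds.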

a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.batchSize = 100
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i1.preserveExisting = false 
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100 
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

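Properties

| Property | Default | Description |
| --- | --- | --- |
| type | – | Name of this interceptor: timestamp |
| headerName | timestamp | Key of the header to insert |
| preserveExisting | false | If a header with the same key already exists, whether to keep its value instead of overwriting it |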

2) static interceptor

Lets the user add a custom key-value header to every event; the key and value are fixed in the configuration file.

a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.batchSize = 100
a1.sources.s1.interceptors = i1 i2 i3
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i1.preserveExisting = false
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i2.preserveExisting = false
a1.sources.s1.interceptors.i2.useIP = true
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = hero
a1.sources.s1.interceptors.i3.value = TAOGE
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

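Properties

| Property | Default | Description |
| --- | --- | --- |
| type | – | Name of this interceptor: static |
| preserveExisting | true | If the configured header already exists, whether to keep its existing value |
| key | key | Name of the header to add |
| value | value | Static value to set for that header |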

3) Host interceptor

Inserts the host name (or IP address) into the event header.

a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.batchSize = 100
a1.sources.s1.interceptors = i1 i2
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i1.preserveExisting = false
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i2.preserveExisting = false
a1.sources.s1.interceptors.i2.useIP = true
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

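Properties

| Property | Default | Description |
| --- | --- | --- |
| type | – | Alias of this interceptor: host |
| preserveExisting | false | If the host header already exists, whether to keep its value instead of overwriting it |
| useIP | true | Insert the IP address (true) or the host name (false) |
| hostHeader | host | Key of the header to insert |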

4) UUID interceptor

a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.batchSize = 100
a1.sources.s1.interceptors = i1 i2 i3 i4
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i1.preserveExisting = false
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i2.preserveExisting = false
a1.sources.s1.interceptors.i2.useIP = true
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = hero
a1.sources.s1.interceptors.i3.value = TAOGE
a1.sources.s1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.s1.interceptors.i4.headerName = duanzong
a1.sources.s1.interceptors.i4.prefix =  666_ 
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

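Properties

| Property | Default | Description |
| --- | --- | --- |
| type | – | Full name: org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
| headerName | id | Key of the header to insert |
| preserveExisting | true | If a header with the same key already exists, whether to keep its value instead of overwriting it |
| prefix | "" | Prefix placed before the UUID |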

VII. Advanced usage

1. Multi-agent chaining

# configuration on single01
vim flume_link_netcat_memory_avro.log
#----------------------------------------
#common
a.sources = s1
a.sinks = k1
a.channels = c1
#source
a.sources.s1.type=netcat
a.sources.s1.bind=single01
a.sources.s1.port=9999
#channel
a.channels.c1.type = memory
a.channels.c1.capacity = 1024
a.channels.c1.transactionCapacity = 128
#sink
a.sinks.k1.type=avro
a.sinks.k1.hostname=single02
a.sinks.k1.port=44444
#join
a.sources.s1.channels=c1
a.sinks.k1.channel=c1
#----------------------------------------
# configuration on single02
vim flume_link_avro_memory_log.log
#----------------------------------------
#common
a.sources = s1
a.sinks = k1
a.channels = c1
#source
a.sources.s1.type=avro
a.sources.s1.bind=single02
a.sources.s1.port=44444
#channel
a.channels.c1.type = memory
a.channels.c1.capacity = 1024
a.channels.c1.transactionCapacity = 128
#sink
a.sinks.k1.type=logger
#join
a.sources.s1.channels=c1
a.sinks.k1.channel=c1
#----------------------------------------

2. Fan-in

vim flume_fanin_avro_memory_log.log
#----------------------------------------
#common
a.sources = s1 s2
a.sinks = k1
a.channels = c1
#source
a.sources.s1.type=avro
a.sources.s1.bind=single01
a.sources.s1.port=44444

#channel
a.channels.c1.type = memory
a.channels.c1.capacity = 1024
a.channels.c1.transactionCapacity = 128
#sink
a.sinks.k1.type=avro
a.sinks.k1.hostname=single02
a.sinks.k1.port=44444
#join
a.sources.s1.channels=c1
a.sinks.k1.channel=c1
#----------------------------------------
vim flume_fanin02_avro_memory_log.log
#----------------------------------------
#common
a.sources = s1
a.sinks = k1
a.channels = c1
#source
a.sources.s1.type=avro
a.sources.s1.bind=single01
a.sources.s1.port=44444

#channel
a.channels.c1.type = memory
a.channels.c1.capacity = 1024
a.channels.c1.transactionCapacity = 128
#sink
a.sinks.k1.type=avro
a.sinks.k1.hostname=single01
a.sinks.k1.port=44444
#join
a.sources.s1.channels=c1
a.sinks.k1.channel=c1
#----------------------------------------
vim flume_fanin02_avro_memory_log.log
#----------------------------------------
#common
a.sources = s1
a.sinks = k1
a.channels = c1
#source
a.sources.s1.type=avro
a.sources.s1.bind=single01
a.sources.s1.port=44444

#channel
a.channels.c1.type = memory
a.channels.c1.capacity = 1024
a.channels.c1.transactionCapacity = 128
#sink
a.sinks.k1.type=avro
a.sinks.k1.hostname=single02
a.sinks.k1.port=44444
#join
a.sources.s1.channels=c1
a.sinks.k1.channel=c1
#----------------------------------------

3. Fan-out

vim flume_fanout_taildir_file_log_kafka_hbase.log
#common
a.sources=s1
a.channels=c1 c2 c3
a.sinks=k1 k2 k3
#source
a.sources.s1.type=TAILDIR
a.sources.s1.filegroups=f1
a.sources.s1.filegroups.f1=/root/flume/spooldir/.*.csv
a.sources.s1.fileHeader=true
a.sources.s1.maxBatchCount=1000
#channel
a.channels.c1.type=memory
a.channels.c1.capacity=256
a.channels.c1.transactionCapacity=128

a.channels.c1.type=org.
a.channels.
a.channels.c1.type=file
#sink
a.sinks.k1.type=logger
a.sinks.k2.type=hbase
a.sinks.k2.table=kb16:flume_hbase_sink_20220209
a.sinks.k2.columnFamily=tags
a.sinks.k2.zookeeperQuorum=single01:2181
a.sinks.k2.batchSize=50
a.sinks.k3.type=org.apache.flume.sink.kafka.KafkaSink
a.sinks.k3.kafka.bootstrap.servers=single01:9092
a.sinks.k3.kafka.topic=flume_kafka_sink_fanout_20220209_01
a.sinks.k3.kafka.flumeBatchSize=50
#join
a.sources.s1.channels=c1 c2 c3
a.sinks.k1.channel=c1
a.sinks.k2.channel=c2
a.sinks.k3.channel=c3

kafka-topics.sh --bootstrap-server single01:9092 --create --topic flume_kafka_sink_fanout_20220209_01 --partitions 1 --replication-factor 1

VIII. Going deeper into Flume

1. Memory channel source code

As the code shows, a Transaction has four main methods: put, take, commit, and rollback. A subclass mainly has to implement these four.

private class MemoryTransaction extends BasicTransactionSemantics {
    // Like MemoryChannel itself, uses LinkedBlockingDeque internally to hold events that are not yet committed
    private LinkedBlockingDeque<Event> takeList;
    private LinkedBlockingDeque<Event> putList;
    private final ChannelCounter channelCounter;
    // These two counters track the size of the events put and the events taken
    private int putByteCounter = 0;
    private int takeByteCounter = 0;

    public MemoryTransaction(int transCapacity, ChannelCounter counter) {
      // Size the put and take lists with transCapacity
      putList = new LinkedBlockingDeque<Event>(transCapacity);
      takeList = new LinkedBlockingDeque<Event>(transCapacity);
      channelCounter = counter;
    }

    @Override
    protected void doPut(Event event) throws InterruptedException {
      // doPut: if putList still has room, append the event and update putByteCounter;
      // if it is full, throw a ChannelException
      channelCounter.incrementEventPutAttemptCount();
      int eventByteSize = (int)Math.ceil(estimateEventSize(event)/byteCapacitySlotSize);

      if (!putList.offer(event)) {
        throw new ChannelException(
          "Put queue for MemoryTransaction of capacity " +
            putList.size() + " full, consider committing more frequently, " +
            "increasing capacity or increasing thread count");
      }
      putByteCounter += eventByteSize;

    }



    @Override
    protected Event doTake() throws InterruptedException {
      // doTake: first check that takeList still has room
      channelCounter.incrementEventTakeAttemptCount();
      if(takeList.remainingCapacity() == 0) {
        throw new ChannelException("Take list for MemoryTransaction, capacity " +
            takeList.size() + " full, consider committing more frequently, " +
            "increasing capacity, or increasing thread count");
      }
      // Then use the queueStored semaphore to check whether the MemoryChannel queue holds an event to take
      if(!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) {
        return null;
      }
      Event event;
      // Take one event from the MemoryChannel queue
      synchronized(queueLock) {
        event = queue.poll();
      }
      Preconditions.checkNotNull(event, "Queue.poll returned NULL despite semaphore " +
          "signalling existence of entry");
      // Add it to takeList, then update takeByteCounter
      takeList.put(event);

      int eventByteSize = (int)Math.ceil(estimateEventSize(event)/byteCapacitySlotSize);
      takeByteCounter += eventByteSize;
      return event;

    }

    @Override
    protected void doCommit() throws InterruptedException {
      // Commits one transaction
      // First compare the sizes of putList and takeList
      int remainingChange = takeList.size() - putList.size();
      // If takeList is smaller, more events are being put than taken, so check that the MemoryChannel has room for them
      if(remainingChange < 0) {
        // 1. Use the bytesRemaining semaphore to check for remaining byte capacity
        if(!bytesRemaining.tryAcquire(putByteCounter, keepAlive,
          TimeUnit.SECONDS)) {
          throw new ChannelException("Cannot commit transaction. Byte capacity " +
            "allocated to store event body " + byteCapacity * byteCapacitySlotSize +
            "reached. Please increase heap space/byte capacity allocated to " +
            "the channel as the sinks may not be keeping up with the sources");
        }
        // 2. Then check whether enough queue slots can be acquired within the keepAlive time
        if(!queueRemaining.tryAcquire(-remainingChange, keepAlive, TimeUnit.SECONDS)) {
          bytesRemaining.release(putByteCounter);
          throw new ChannelFullException("Space for commit to queue couldn't be acquired." +" Sinks are likely not keeping up with sources, or the buffer size is too tight");
        }
      }
      int puts = putList.size();
      int takes = takeList.size();
      // Both checks passed: move the events in putList into the MemoryChannel queue
      synchronized(queueLock) {
        if(puts > 0 ) {
          while(!putList.isEmpty()) {
            if(!queue.offer(putList.removeFirst())) {
              throw new RuntimeException("Queue add failed, this shouldn't be able to happen");
            }
          }
        }
        // Clear this transaction's putList and takeList to release resources
        putList.clear();
        takeList.clear();
      }
      // takeList has been cleared, so return takeByteCounter to bytesRemaining, the semaphore that tracks byte capacity
      bytesRemaining.release(takeByteCounter);
      takeByteCounter = 0;
      putByteCounter = 0;
      // The events from putList are now in the queue, so add puts to queueStored
      queueStored.release(puts);
      // If takeList was larger than putList, the queue shrank, so release the difference to queueRemaining
      if(remainingChange > 0) {
        queueRemaining.release(remainingChange);
      }
      if (puts > 0) {
        channelCounter.addToEventPutSuccessCount(puts);
      }
      if (takes > 0) {
        channelCounter.addToEventTakeSuccessCount(takes);
      }
      channelCounter.setChannelSize(queue.size());
    }

    @Override
    protected void doRollback() {
      // Called to roll back when a transaction fails
      // First return the events in takeList to the MemoryChannel queue
      int takes = takeList.size();
      synchronized(queueLock) {
        Preconditions.checkState(queue.remainingCapacity() >= takeList.size(), "Not enough space in memory channel " +"queue to rollback takes. This should never happen, please report");
        while(!takeList.isEmpty()) {
          queue.addFirst(takeList.removeLast());
        }
        // Then clear putList
        putList.clear();
      }
      // putList was cleared, so return the space it accounted for to bytesRemaining
      bytesRemaining.release(putByteCounter);
      putByteCounter = 0;
      takeByteCounter = 0;
      // The events from takeList are back in the queue, so add takes to queueStored
      queueStored.release(takes);
      channelCounter.setChannelSize(queue.size());
    }
  }

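2. Custom Flume Source component

1) Use case

When do you need a custom source? Usually when a data source cannot be parsed by Flume's built-in sources, for example XML documents.

2) Approach

Find the parent class or interface a custom source has to extend or implement.
Override its methods with your own logic.
Package the code as a jar and put it in Flume's lib directory.
Write a configuration file that uses the custom source.

3) Dependency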

<dependency>
	<groupId>org.apache.flume</groupId>
	<artifactId>flume-ng-core</artifactId>
	<version>1.7.0</version>
</dependency>

4) Thread-pool version

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.SystemClock;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;
import org.apache.flume.source.ExecSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory; 
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * A custom source that records the offset it has read up to
 */
public class test extends AbstractSource implements EventDrivenSource, Configurable {
    private static final Logger logger = LoggerFactory.getLogger(test.class);
    private String positionfilepath;
    private String logfile;
    private int batchsize;
    private ExecutorService exec;
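    /**
     * The framework calls this method to start collecting data.
     * Our code reads the data and converts it to events, obtains the framework's
     * ChannelProcessor through getChannelProcessor() (defined in the parent class),
     * and uses it to hand the events to the channel.
     */
    @Override
    public synchronized void start() {
        super.start();
        // Processor used to submit data to the channel
        ChannelProcessor channelProcessor = getChannelProcessor();
        // Load the previously saved offset; start from 0 if none is usable
        long offset = 0;
        try {
            File positionfile = new File(this.positionfilepath);
            // On the first run the position file does not exist yet
            if (positionfile.isFile()) {
                String s = FileUtils.readFileToString(positionfile);
                offset = Long.parseLong(s.trim());
            }
        } catch (IOException | NumberFormatException e) {
            logger.warn("Could not read the offset from " + positionfilepath + ", starting from 0", e);
        }
        // Create a single-thread pool
        exec = Executors.newSingleThreadExecutor();
        // Submit the collection task to the pool
        exec.execute(new HoldOffsetRunnable(offset, logfile, channelProcessor, batchsize, positionfilepath));
    }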
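    /**
     * Called before the source stops.
     * Release resources and clean up here.
     */
    @Override
    public synchronized void stop() {
        // shutdownNow() interrupts the worker thread, which otherwise loops forever;
        // plain shutdown() only stops new tasks from being accepted.
        if (exec != null) {
            exec.shutdownNow();
        }
        super.stop();
    }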
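    /**
     * Read the parameters from the configuration file to set up this source instance.
     * Parameters: the path of the offset file, the path of the file to collect,
     * and the batch size.
     * @param context the configuration of this source
     */
    @Override
    public void configure(Context context) {
        // Path of the file in which this source records its offset
        // (the default must be a file; a bare directory such as "./" cannot be written to)
        this.positionfilepath = context.getString("positionfile", "./position");
        // Path of the log file this source collects
        this.logfile = context.getString("logfile");
        // Maximum number of events per transaction batch, as configured by the user
        this.batchsize = context.getInteger("batchsize", 100);
        // Fail fast if no log file path was given
        if (StringUtils.isBlank(logfile)) {
            throw new RuntimeException("Please configure the path of the file to collect (logfile)");
        }
    }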
    /**
     * Worker task that does the actual file collection.
     */
    private static class HoldOffsetRunnable implements Runnable {
        long offset;
        String logfilepath;
        String positionfilepath;
        // Submits events to the channel (runs the interceptors and opens the put transaction)
        ChannelProcessor channelProcessor;
        // Batch size
        int batchsize;
        // Holds the current batch of events
        List<Event> events = new ArrayList<Event>();
        SystemClock systemClock = new SystemClock();
        public HoldOffsetRunnable(long offset, String logfilepath, ChannelProcessor channelProcessor, int batchsize, String positionfilepath) {
            this.offset = offset;
            this.logfilepath = logfilepath;
            this.channelProcessor = channelProcessor;
            this.batchsize = batchsize;
            this.positionfilepath = positionfilepath;
        }
        public void run() {
            RandomAccessFile raf = null;
            try {
                // Seek to the saved offset first
                raf = new RandomAccessFile(logfilepath, "r");
                raf.seek(offset);
                String line = null;
                // Time of the last committed batch
                long lastBatchTime = systemClock.currentTimeMillis();
                while (!Thread.currentThread().isInterrupted()) {
                    line = raf.readLine();
                    if (line != null) {
                        // readLine() maps each byte to one char, so encoding with ISO-8859-1
                        // restores the original bytes (for example UTF-8 text) unchanged
                        events.add(EventBuilder.withBody(line.getBytes("ISO-8859-1")));
                    }
                    // Commit when the batch is full, or when a non-empty batch has waited too long
                    if (events.size() >= batchsize || (!events.isEmpty() && timeout(lastBatchTime))) {
                        channelProcessor.processEventBatch(events);
                        // Record the commit time
                        lastBatchTime = systemClock.currentTimeMillis();
                        // Record the offset only after the batch has been committed
                        FileUtils.writeStringToFile(new File(positionfilepath), raf.getFilePointer() + "");
                        // Clear this batch
                        events.clear();
                    }
                    // No new data yet: wait before polling the file again
                    if (line == null) {
                        Thread.sleep(2000);
                    }
                }
            } catch (FileNotFoundException e) {
                logger.error("The file to collect does not exist: " + logfilepath, e);
            } catch (IOException e) {
                logger.error("I/O error while collecting " + logfilepath, e);
            } catch (InterruptedException e) {
                // Interrupted while waiting; restore the flag and let the thread exit
                Thread.currentThread().interrupt();
            } finally {
                if (raf != null) {
                    try { raf.close(); } catch (IOException ignored) { }
                }
            }
        }
        // Whether the batch interval has elapsed
        private boolean timeout(long lastBatchTime) {
            return systemClock.currentTimeMillis() - lastBatchTime > 2000;
        }
    }
}
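
The offset bookkeeping this source relies on can be tried in isolation. The sketch below uses only the JDK; OffsetReader, Result and tempLog are hypothetical names, and decoding the lines as UTF-8 is an assumption about the log file. It shows the two steps that matter: seek to the saved offset before reading, and save getFilePointer() afterwards.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flume: reads lines starting at a saved byte offset.
final class OffsetReader {
    /** The lines that were read, plus the byte offset to resume from next time. */
    static final class Result {
        final List<String> lines = new ArrayList<>();
        long nextOffset;
    }

    static Result readFrom(File file, long offset) {
        Result result = new Result();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            String raw;
            while ((raw = raf.readLine()) != null) {
                // readLine() maps each byte to one char; re-encode as ISO-8859-1 to get
                // the original bytes back, then decode them as UTF-8 (assumed encoding).
                byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
                result.lines.add(new String(bytes, StandardCharsets.UTF_8));
            }
            result.nextOffset = raf.getFilePointer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Creates a temporary UTF-8 file for the demo.
    static File tempLog(String content) {
        try {
            File file = File.createTempFile("offset-demo", ".log");
            file.deleteOnExit();
            Files.write(file.toPath(), content.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File log = tempLog("first\nsecond\n");
        Result firstRun = readFrom(log, 0);
        System.out.println(firstRun.lines + " resume at " + firstRun.nextOffset);
        // A later run that starts from the saved offset sees none of the old lines.
        Result secondRun = readFrom(log, firstRun.nextOffset);
        System.out.println(secondRun.lines);
    }
}
```

In the source above, the offset is written to the position file only after processEventBatch() returns, so a crash between the two steps re-reads a batch instead of skipping it.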

5.Single-threaded implementation

import org.apache.commons.io.FileUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

import java.io.File;
import java.io.RandomAccessFile;
import java.util.ArrayList;

public class MySource extends AbstractSource implements EventDrivenSource, Configurable {
    String data_file_path = null;
    String position_file_path = null;
    Integer batchSize = null;
    Long batchTime = null;
    RandomAccessFile rda = null;

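    /**
     * Read the configuration parameters:
     * the path of the file to read,
     * the path of the offset file,
     * the batch size and the batch interval
     * @param context the configuration of this source
     */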
    public void configure(Context context) {
        data_file_path = context.getString("data_file_path");
        position_file_path = context.getString("position_file_path","/tmp/position");
        batchSize = context.getInteger("batchSize",100);
        batchTime = context.getLong("batchTime",3000L);
    }

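    /**
     * Core logic of the source: start working.
     * This is the entry point the framework calls on the source.
     * Requirement:
     * read the given file one line at a time, turn each line into an Event,
     * write it to the channel, and record the offset.
     * Note: the loop runs on the thread that calls start(), so start() does not
     * return while the source is reading; the thread-pool version avoids this.
     */
    @Override
    public synchronized void start() {
        super.start();
        try {
            long offset = 0;
            // Read the saved offset, if there is one
            File positionFile = new File(position_file_path);
            if(positionFile.exists()){
                String position = FileUtils.readFileToString(positionFile);
                offset = Long.parseLong(position.trim());
            }
            // Seek to the offset and start reading the file
            rda = new RandomAccessFile(data_file_path, "r");
            rda.seek(offset);
            // Get the channel processor
            ChannelProcessor channelProcessor = getChannelProcessor();
            String line = null;
            ArrayList<Event> events = new ArrayList<Event>();
            // Time the last batch was committed
            long preBatchTime = System.currentTimeMillis();
            while(true){
                // Try to read one line
                line=rda.readLine();
                if(line != null) {
                    // Wrap the line in an event; ISO-8859-1 restores the raw bytes readLine() read
                    events.add(EventBuilder.withBody(line.getBytes("ISO-8859-1")));
                }
                // Commit when the batch is full, or when a non-empty batch has waited batchTime ms
                if(events.size() >= batchSize || (!events.isEmpty() && System.currentTimeMillis() - preBatchTime >= batchTime)){
                    channelProcessor.processEventBatch(events);
                    // Clear the list
                    events.clear();
                    // Update the time of the last commit
                    preBatchTime = System.currentTimeMillis();
                    // Get the offset read so far
                    offset = rda.getFilePointer();
                    // Save the offset to the offset file
                    FileUtils.writeStringToFile(positionFile,offset+"");
                }
                // No new data in the file right now: wait 1 second, then try again
                if(line == null) {
                    Thread.sleep(1000);
                }
            }
        }catch (Exception e){
            // Do not swallow the failure silently
            e.printStackTrace();
        }
    }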

    @Override
    public synchronized void stop() {
        try {
            if(rda != null) {
                rda.close();
            }
        }catch(Exception e){
            // Closing failed; report it rather than ignoring it
            e.printStackTrace();
        }
        super.stop();
    }
}
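
To wire a source like MySource into an agent, the configuration could look like the sketch below. The agent and component names (a1, r1, c1, k1), the file paths and the choice of a logger sink are assumptions for illustration; the four source parameters are the keys read in configure() above. Because the class above declares no package, the type is just the class name; for a class in a package, use the fully qualified name.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# custom source: type is the class name (fully qualified if the class is in a package)
a1.sources.r1.type = MySource
a1.sources.r1.data_file_path = /tmp/app.log
a1.sources.r1.position_file_path = /tmp/position
a1.sources.r1.batchSize = 100
a1.sources.r1.batchTime = 3000
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```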

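3、Custom Flume interceptor components

1.Developing an interceptor

The usual routine for extending a framework through its extension interfaces:

1. Implement the interface or extend the parent class the framework provides, and implement or override its methods

2. Package the code as a jar and put it in Flume's lib directory

3. Reference the custom class in the relevant agent configuration file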
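
For step 3, a configuration sketch for the EncryptInterceptor developed in the code implementation part of this section might look as follows. The agent, source and interceptor names (a1, r1, i1) and the parameter values are assumptions; the three keys are the ones defined in its Constants class. The type has to name the nested Builder class, which is why it contains a `$`.

```properties
a1.sources.r1.interceptors = i1
# the type points at the nested Builder class, hence the "$"
a1.sources.r1.interceptors.i1.type = EncryptInterceptor$EncryptInterceptorBuilder
# mask fields 0 and 3 of each comma-separated record
a1.sources.r1.interceptors.i1.indices = 0,3
a1.sources.r1.interceptors.i1.idxSplitBy = ,
a1.sources.r1.interceptors.i1.dataSplitBy = ,
```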

2.Dependencies

<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
    <scope>provided</scope>
</dependency>

3.Code implementation

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;
public class EncryptInterceptor  implements Interceptor {
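    // Indices of the fields to mask
    String indices;
    // Separator between the indices
    String idxSplitBy;
    // Separator between the fields of the event body
    String dataSplitBy;
    /**
     * Constructor
     * @param indices indices of the fields to mask, e.g. "0,3"
     * @param idxSplitBy separator between the indices
     * @param dataSplitBy separator between the fields of the event body
     */
    public EncryptInterceptor(String indices, String idxSplitBy, String dataSplitBy) {
        this.indices = indices;
        this.idxSplitBy = idxSplitBy;
        this.dataSplitBy = dataSplitBy;
    }
    // Called once by the framework, for any initialization work
    public void initialize() {

    }
    // Intercept method: process a single event
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        String dataStr = new String(body);
        // Fields of the record; the limit -1 keeps trailing empty fields
        // (note that split() treats the separator as a regular expression)
        String[] dataFieldsArr = dataStr.split(dataSplitBy, -1);
        // Indices of the fields to mask
        String[] idxArr = indices.split(idxSplitBy);
        for (String s : idxArr) {
            int index = Integer.parseInt(s.trim());
            // Content of the field to mask
            String field = dataFieldsArr[index];
            // MD5-hash the field (a one-way digest, not reversible encryption)
            String encryptedField = DigestUtils.md5Hex(field);
            // Base64-encode the hex digest
            byte[] bytes = Base64.encodeBase64(encryptedField.getBytes());
            // Replace the original plain-text content
            dataFieldsArr[index] = new String(bytes);
        }
        // Join the fields back into one record with the original separator
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < dataFieldsArr.length; i++) {
            if (i > 0) {
                sb.append(dataSplitBy);
            }
            sb.append(dataFieldsArr[i]);
        }
        // Put the masked record back into the same event so its headers are kept
        event.setBody(sb.toString().getBytes());
        return event;
    }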
    // Intercept method: process a batch of events
    public List<Event> intercept(List<Event> events) {
        ArrayList<Event> lst = new ArrayList<Event>();
        for (Event event : events) {
            Event eventEncrpt = intercept(event);
            lst.add(eventEncrpt);
        }
        return lst;
    }
    // Called once before the agent exits, for any cleanup or closing work
    public void close() {

    }
    /**
     * Builder for the interceptor
     */
    public static class EncryptInterceptorBuilder implements Interceptor.Builder{
        // Indices of the fields to mask
        String indices;
        // Separator between the indices
        String idxSplitBy;
        // Separator between the fields of the event body
        String dataSplitBy;
        // Build an interceptor instance
        public Interceptor build() {
            return new EncryptInterceptor(indices,idxSplitBy,dataSplitBy);
        }
        // Read the interceptor parameters from the configuration file
        public void configure(Context context) {
            // Indices of the fields to mask
            this.indices = context.getString(Constants.INDICES);
            // Separator between the indices
            this.idxSplitBy = context.getString(Constants.IDX_SPLIT_BY);
            // Separator between the fields of the event body
            this.dataSplitBy = context.getString(Constants.DATA_SPLIT_BY);
        }
    }
    public static class Constants {
        public static final String INDICES = "indices";
        public static final String IDX_SPLIT_BY = "idxSplitBy";
        public static final String DATA_SPLIT_BY= "dataSplitBy";
    }
}
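
The masking step can be tested without a running agent. The following is a stand-alone sketch using only the JDK, so it is not the interceptor itself: FieldMasker, mask and maskRecord are hypothetical names, and it Base64-encodes the raw 16-byte MD5 digest rather than the hex string, so its output differs from the interceptor's. It also quotes the separator, which matters for separators such as `|` that are regex metacharacters.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper; it mirrors the interceptor's idea, not its exact output.
final class FieldMasker {
    // MD5 digest of the field, Base64-encoded (24 characters for the 16-byte digest).
    static String mask(String field) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(field.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    // Masks the fields at the given indices and rejoins the record with the same separator.
    static String maskRecord(String record, String separator, int... indices) {
        // Pattern.quote makes the separator literal; the limit -1 keeps trailing empty fields.
        String[] fields = record.split(Pattern.quote(separator), -1);
        for (int i : indices) {
            fields[i] = mask(fields[i]);
        }
        return String.join(separator, fields);
    }

    public static void main(String[] args) {
        // Fields 0 and 3 are replaced by digests; fields 1 and 2 pass through unchanged.
        System.out.println(maskRecord("alice|30|beijing|13800000000", "|", 0, 3));
    }
}
```

MD5 is a one-way digest, so this kind of masking hides the value but cannot be reversed; if the original value must be recoverable downstream, real encryption is needed instead.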

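九、Fundamentals

1、 Flume transaction mechanism

1.Delivery guarantees

It is important to understand the reliability guarantees Flume gives for event delivery; they are often one of the deciding factors in whether Flume is the right tool for a problem.

There are three kinds of delivery guarantee: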

At-least-once
At-most-once
Exactly-once

2.At-least-once

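Almost every user wants a tool or framework to guarantee Exactly-once delivery, so that the design does not have to handle lost or duplicated messages. In practice few tools and frameworks achieve it; those that do usually pay a high price, or bring side effects that make it not worth it. If Flume really provided Exactly-once, it would inevitably give up stability and throughput, so the strategy Flume chose is At-least-once.

The At-least-once here needs quotation marks, though: it does not mean that an agent assembled from any Flume components is guaranteed not to lose messages at runtime. The At-least-once principle only covers the hand-off of messages between Source, Channel and Sink. If you choose MemoryChannel and the agent crashes and restarts, the data left in the channel that the sink had not yet consumed is lost, so At-least-once cannot be guaranteed for the whole pipeline.

Flume's At-least-once guarantee is built on its own Transaction mechanism. A Flume Transaction has four lifecycle methods: begin, commit, rollback and close.

When a Source delivers a batch of events to a Channel, it first calls begin to open a transaction, puts the batch, and then calls commit; if the commit fails it calls rollback, and in either case it then closes the transaction. Finally, once the commit has succeeded, the Source acks the batch to the upstream service (for Kafka, by committing the new offset). A Sink consuming from a Channel follows the same pattern; the only difference is that the Sink commits the transaction only after it has finished writing to the destination. Both components ack upstream only after the events were delivered downstream successfully, which is what guarantees At-least-once delivery.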
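
Why this pattern gives At-least-once rather than Exactly-once can be seen in a small model. The sketch below is a toy, not Flume's API: AtLeastOnceDemo, FlakyStore and drainOnce are hypothetical names, the channel is a plain deque, and the store fails in the middle of one batch on purpose.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model of a sink draining a channel inside a transaction (illustrative names only).
final class AtLeastOnceDemo {
    /** Stand-in for an external store; when told to, it fails after writing part of a batch. */
    static final class FlakyStore {
        final List<String> written = new ArrayList<>();
        boolean failNextBatch;

        void writeBatch(List<String> batch) {
            for (int i = 0; i < batch.size(); i++) {
                if (failNextBatch && i == 1) {
                    failNextBatch = false;
                    throw new IllegalStateException("store unavailable");
                }
                written.add(batch.get(i));
            }
        }
    }

    // One sink transaction: take a batch, write it, commit on success, roll back on failure.
    static boolean drainOnce(Deque<String> channel, FlakyStore store, int batchSize) {
        List<String> taken = new ArrayList<>();
        while (taken.size() < batchSize && !channel.isEmpty()) {
            taken.add(channel.pollFirst());
        }
        try {
            store.writeBatch(taken);
            return true;   // commit: the taken events leave the channel for good
        } catch (IllegalStateException e) {
            // rollback: every taken event returns to the head of the channel, in order
            for (int i = taken.size() - 1; i >= 0; i--) {
                channel.addFirst(taken.get(i));
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Deque<String> channel = new ArrayDeque<>(Arrays.asList("e1", "e2", "e3"));
        FlakyStore store = new FlakyStore();
        store.failNextBatch = true;
        while (!channel.isEmpty()) {
            drainOnce(channel, store, 3);
        }
        System.out.println(store.written); // prints [e1, e1, e2, e3]
    }
}
```

Nothing is lost, but e1 is written twice: the first attempt had already written it when the store failed, and the rollback returned the whole batch to the channel. A destination that needs each event once has to deduplicate on its side.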

2、Flume agent internals

Components:

1.ChannelSelector

The ChannelSelector decides which Channel an Event is sent to. There are two types: Replicating and Multiplexing. The replicating selector sends the same Event to every Channel; the multiplexing selector sends different Events to different Channels according to configured rules.
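
As a configuration sketch, a multiplexing selector that routes on an event header could be declared like this. The agent, source and channel names (a1, r1, c1, c2), the header name `state` and its values are assumptions for illustration.

```properties
a1.sources.r1.channels = c1 c2

# replicating is the default: every event goes to both channels
# a1.sources.r1.selector.type = replicating

# multiplexing: route by the value of the "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# events whose header matches no mapping go to the default channel
a1.sources.r1.selector.default = c1
```

The header the selector reads has to be set before the event reaches it, typically by an interceptor on the same source.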

2.SinkProcessor

(1) There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor and FailoverSinkProcessor.

(2) DefaultSinkProcessor works with a single Sink; LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group.

(3) LoadBalancingSinkProcessor provides load balancing; FailoverSinkProcessor provides failover.
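
A sink group is declared in the agent configuration. The sketch below assumes an agent a1 with two sinks k1 and k2 that are already defined elsewhere; the group name g1 and the priority and penalty values are illustrative.

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# failover: the sink with the highest priority is used while it is healthy
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# upper limit, in milliseconds, on the back-off of a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000

# load balancing instead (one processor type per group):
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.backoff = true
# a1.sinkgroups.g1.processor.selector = round_robin
```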

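3、 Ganglia and Flume monitoring

Enable the built-in monitoring (the metrics are then served as JSON at http://<agent-host>:34545/metrics)

-Dflume.monitoring.type=http -Dflume.monitoring.port=34545

Send the metrics to Ganglia for display (the Ganglia reporter takes a list of gmond addresses in flume.monitoring.hosts, not a port)

-Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=<gmond-host>:<gmond-port>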

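4、 Flume tuning

A flume-ng agent consists of three parts: source, channel and sink. All three run in a JVM, and the JVM runs on Linux. Tuning Flume's performance therefore means tuning these three parts and the factors that affect them.

1、Source configuration

This project uses the Taildir source. It reads fast enough to keep up with the rate at which logs are written from the command line, so no special tuning was done.

2、Channel configuration

There are generally two channel options:

memory channel
file channel

If there is enough memory, prefer the memory channel.

With the same configuration, file channel was noticeably slower than memory channel in our tests. It also writes log files to disk, which apparently serve as a buffer: events that the source has accepted but the sink has not yet taken are stored there. The benefit is that data is not lost, so when data loss matters a lot and latency requirements are loose, use file channel.

At first the memory channel used the default configuration, and the console printed this warning:

The channel is full or unexpected failure. The source will try again after 1000 ms

It appeared because the file being collected was very large; increasing keep-alive works around it. The underlying cause is that the collection rate and the sink rate are mismatched: the source fills the channel faster than the sink drains it.

So the memory channel has three important parameters to configure:

# maximum number of events the channel can hold
a1.channels.c1.capacity = 5000
# maximum number of events per transaction (one source put or one sink take)
a1.channels.c1.transactionCapacity = 2000
# seconds to wait for free space (or for an event) before a put or take gives up
a1.channels.c1.keep-alive = 10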
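
The effect of keep-alive can be pictured with a semaphore, which is also how the MemoryChannel source shown earlier tracks free space (queueRemaining). The sketch below is a toy under that assumption, not Flume code: KeepAliveDemo and its methods are hypothetical, and the timeout is in milliseconds only to keep the demo fast.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy bounded channel (not Flume's MemoryChannel): free slots are permits of a semaphore.
final class KeepAliveDemo {
    private final Semaphore freeSlots;
    private final long keepAliveMillis;

    KeepAliveDemo(int capacity, long keepAliveMillis) {
        this.freeSlots = new Semaphore(capacity);
        this.keepAliveMillis = keepAliveMillis;
    }

    // A put waits up to keep-alive for a free slot; false means the channel stayed full.
    boolean put() {
        try {
            return freeSlots.tryAcquire(keepAliveMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // A take frees one slot (the demo does not track the events themselves).
    void take() {
        freeSlots.release();
    }

    public static void main(String[] args) {
        KeepAliveDemo channel = new KeepAliveDemo(2, 50);
        System.out.println(channel.put()); // true
        System.out.println(channel.put()); // true
        System.out.println(channel.put()); // false: no slot was freed within 50 ms
        channel.take();
        System.out.println(channel.put()); // true again
    }
}
```

A longer keep-alive only gives the sink more time to free space. If the sink is slower than the source on average, the channel still fills up, which is why capacity and the sink's batch size matter as well.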

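3、Sink configuration

Compression saves space and network traffic, at the cost of more CPU.

batch: the larger the batch size, the better the throughput, but too large a batch hurts latency; generally the batch size is set to match that of the upstream data source.

4、Java memory configuration

export JAVA_OPTS="-Xms512m -Xmx2048m -Dcom.sun.management.jmxremote"

The main parameters are Xms and Xmx; size them according to the server's actual memory.

5、OS kernel parameters

If a single server runs many Flume agents, the default kernel parameter settings are too small and need to be raised.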
