A note before we begin: I'm "nicedays", a big-data developer who enjoys making visual effects, listening to music, and sharing what I learn. The name comes from the WORLD ORDER song "HAVE A NICE DAY". After plenty of bumps and setbacks along the way, I've finally come to understand that a nice day is something you have to create for yourself. Time flies, so cherish the present~~
I write this blog partly to summarize and record my own learning, and partly in the hope of helping more people who are interested in big data. If you're also interested in big data and machine learning, you can follow me at https://blog.csdn.net/qq_35050438 and we can mine the value of data and AI together~
Flume: Introduction and Basic Usage
A reliable, available service for efficiently collecting data in a distributed fashion.
Flume has a simple, flexible architecture based on streaming data flows, and supports fault tolerance, failover, and recovery.
1. Flume Architecture:
- Client: where the data is produced, e.g. a web server
- Event: a single unit of data transferred through an Agent; for log data, one event usually corresponds to one line
- Agent: an independent JVM process
  - Flume is deployed as one or more Agents
  - Each Agent contains three components:
    - Source
    - Channel
    - Sink
2. Flume Workflow:
At its core, Flume consists of three parts: the input source, the channel (pipe), and the output destination.
3. Sources:
HTTP source:
- Accepts HTTP GET and POST requests
Property | Default | Description |
---|---|---|
type | - | http |
port | - | Port to listen on |
bind | 0.0.0.0 | IP address to bind to |
handler | org.apache.flume.source.http.JSONHandler | Fully qualified class name of the handler |
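A minimal configuration sketch for an HTTP source, in the same style as the agent examples later in this article (agent and component names such as `a1`, `r1`, `c1` are placeholders):

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.channels = c1
```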
Avro source:
- Listens on an Avro port and receives events from external Avro clients
Property | Default | Description |
---|---|---|
type | - | avro |
bind | - | IP address to bind to |
port | - | Port to listen on |
threads | - | Maximum number of worker threads |
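A minimal Avro source sketch (names and the port number are placeholders); this is the receiving half of a typical Avro source/sink hop between agents:

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1
```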
Spooling Directory Source:
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion by default is indicated by renaming the file or it can be deleted or the trackerDir is used to keep track of processed files.
Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:
- If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
- If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.
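The renaming step above can be sketched as follows (paths are hypothetical, for illustration only): append a timestamp while moving a rotated log into the spooling directory so that a file name is never reused.

```python
# Illustration with hypothetical paths: give each file a unique,
# timestamped name as it is moved into the spooling directory,
# so that file names are never reused.
import os
import shutil
import time

spool_dir = "/tmp/flumeSpool"
os.makedirs(spool_dir, exist_ok=True)

src = "/tmp/app.log"
with open(src, "w") as f:
    f.write("a log line\n")

# Unique name: basename plus a nanosecond timestamp
dest = os.path.join(spool_dir, "app.%d.log" % time.time_ns())
shutil.move(src, dest)
```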
Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be spooldir . |
spoolDir | – | The directory from which to read files from. |
fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
deletePolicy | never | When to delete completed files: never or immediate |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
basenameHeader | false | Whether to add a header storing the basename of the file. |
basenameHeaderKey | basename | Header Key to use when appending basename of file to event header. |
includePattern | ^.*$ | Regular expression specifying which files to include. It can be used together with ignorePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored. |
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip). It can be used together with includePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored. |
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir. |
trackingPolicy | rename | The tracking policy defines how file processing is tracked. It can be “rename” or “tracker_dir”. This parameter is only effective if the deletePolicy is “never”. “rename” - After processing files they get renamed according to the fileSuffix parameter. “tracker_dir” - Files are not renamed but a new empty file is created in the trackerDir. The new tracker file name is derived from the ingested one plus the fileSuffix. |
consumeOrder | oldest | In which order files in the spooling directory will be consumed oldest , youngest and random . In case of oldest and youngest , the last modified time of the files will be used to compare the files. In case of a tie, the file with smallest lexicographical order will be consumed first. In case of random any file will be picked randomly. When using oldest and youngest the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory. |
pollDelay | 500 | Delay (in milliseconds) used when polling for new files. |
recursiveDirectorySearch | false | Whether to monitor sub directories for new files to read. |
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter. |
batchSize | 100 | Granularity at which to batch transfer to the channel |
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text. |
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL : Throw an exception and fail to parse the file. REPLACE : Replace the unparseable character with the “replacement character” char, typically Unicode U+FFFD. IGNORE : Drop the unparseable character sequence. |
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder . |
deserializer.* | Varies per event deserializer. | |
bufferMaxLines | – | (Obsolete) This option is now ignored. |
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead. |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
```properties
a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
```
Netcat Source:
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like `nc -k -l [host] [port]`. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline separated text. Each line of text is turned into a Flume event and sent via the connected channel.
Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be netcat |
bind | – | Host name or IP address to bind to |
port | – | Port # to bind to |
max-line-length | 512 | Max line length per event body (in bytes) |
ack-every-event | true | Respond with an “OK” for every event received |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
```properties
# "agent" is the instance name
# An agent instance has three parts: sources, channels, sinks
agent.sources = s1
agent.channels = c1
agent.sinks = sk1

# Source: netcat listening on port 5678 (receiving end), using channel c1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
# Connect the source to the channel
agent.sources.s1.channels = c1

# Sink: logger mode (sending end), using channel c1
agent.sinks.sk1.type = logger
# Connect the sink to the channel
agent.sinks.sk1.channel = c1

# Channel: in-memory
agent.channels.c1.type = memory
# Holds at most 1000 events
agent.channels.c1.capacity = 1000
# Up to 100 events per transaction
agent.channels.c1.transactionCapacity = 100
```
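To get a feel for the wire behaviour described above, here is a small Python sketch (illustrative only, not Flume code) that mimics what the netcat source does: it reads newline-separated text from a socket, treats each line as an event body, and replies "OK" per event, as with `ack-every-event = true`.

```python
import socket
import threading

def handle_connection(conn, events):
    # Read newline-separated text; each line becomes one event body,
    # acknowledged with "OK" (mimicking ack-every-event = true).
    buf = b""
    while True:
        data = conn.recv(1024)
        if not data:
            break
        buf += data
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            events.append(line.decode("utf-8"))  # one event per line
            conn.sendall(b"OK\n")                # ack each event

server = socket.socket()
server.bind(("127.0.0.1", 0))   # pick a free port
server.listen(1)
port = server.getsockname()[1]

events = []
def serve():
    conn, _ = server.accept()
    handle_connection(conn, events)
    conn.close()

t = threading.Thread(target=serve)
t.start()

# Client side: what "nc localhost <port>" would do
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hello\nworld\n")
acks = b""
while acks.count(b"OK") < 2:    # wait for both acknowledgements
    acks += client.recv(64)
client.close()
t.join()
server.close()
```

After the exchange, `events` holds the two line bodies, and the client has received one "OK" per event.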
Exec Source:
- Runs a Unix command and consumes its output, e.g. "tail -f"
Property | Default | Description |
---|---|---|
type | - | exec |
command | - | e.g. "tail -f xxx.log" |
shell | - | Shell used to run the command, e.g. "/bin/sh" |
batchSize | 20 | Maximum number of lines to send to the channel at a time |
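A minimal exec source sketch (names and the log path are placeholders; `tail -F` is used so the source survives log rotation):

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.channels = c1
```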
Kafka Source:
Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics. If you have multiple Kafka sources running, you can configure them with the same Consumer Group so each will read a unique set of partitions for the topics. This currently supports Kafka server releases 0.10.1.0 or higher. Testing was done up to 2.0.1 that was the highest available version at the time of the release.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be org.apache.flume.source.kafka.KafkaSource |
kafka.bootstrap.servers | – | List of brokers in the Kafka cluster used by the source |
kafka.consumer.group.id | flume | Unique identified of consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group |
kafka.topics | – | Comma-separated list of topics the kafka consumer will read messages from. |
kafka.topics.regex | – | Regex that defines set of topics the source is subscribed on. This property has higher priority than kafka.topics and overrides kafka.topics if exists. |
batchSize | 1000 | Maximum number of messages written to Channel in one batch |
batchDurationMillis | 1000 | Maximum time (in ms) before a batch will be written to Channel The batch will be written whenever the first of size and time will be reached. |
backoffSleepIncrement | 1000 | Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
maxBackoffSleep | 5000 | Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
useFlumeEventFormat | false | By default events are taken as bytes from the Kafka topic directly into the event body. Set to true to read events as the Flume Avro binary format. Used in conjunction with the same property on the KafkaSink or with the parseAsFlumeEvent property on the Kafka Channel this will preserve any Flume headers sent on the producing side. |
setTopicHeader | true | When set to true, stores the topic of the retrieved message into a header, defined by the topicHeader property. |
topicHeader | topic | Defines the name of the header in which to store the name of the topic the message was received from, if the setTopicHeader property is set to true . Care should be taken if combining with the Kafka Sink topicHeader property so as to avoid sending the message back to the same topic in a loop. |
kafka.consumer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup. |
more consumer security props | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on consumer. | |
Other Kafka Consumer Properties | – | These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.consumer . For example: kafka.consumer.auto.offset.reset |
```properties
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
```
4. Channels:
- Memory Channel: events are held in the Java heap. Recommended when a small amount of data loss is acceptable
- File Channel: events are persisted to local files; more reliable, but lower throughput than the Memory Channel
- JDBC Channel: events are stored in a relational database; generally not recommended
- Kafka Channel
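A minimal file channel sketch (names and directories are placeholders; the checkpoint and data directories are what give the file channel its durability):

```properties
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
```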
5. Sinks:
Avro sink:
- Acts as an Avro client, sending Avro events to an Avro server
Property | Default | Description |
---|---|---|
type | - | avro |
hostname | - | IP address of the server |
port | - | Port |
batch-size | 100 | Number of events to send per batch |
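A minimal Avro sink sketch (names, host, and port are placeholders); pointing it at another agent's Avro source is the standard way to chain agents:

```properties
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.56.102
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 100
```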
HDFS sink:
- Writes events to the Hadoop Distributed File System (HDFS)
Property | Default | Description |
---|---|---|
type | - | hdfs |
hdfs.path | - | HDFS directory path |
hdfs.filePrefix | FlumeData | File name prefix |
hdfs.fileSuffix | - | File name suffix |
```properties
a2.channels = c2
a2.sources = s2
a2.sinks = k2
a2.sources.s2.type = spooldir
a2.sources.s2.spoolDir = /opt/datas
a2.sources.s2.channels = c2
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 1000
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://192.168.56.101:9000/flume/customs
a2.sinks.k2.hdfs.filePrefix = events-
a2.sinks.k2.hdfs.rollCount = 5000
a2.sinks.k2.hdfs.rollSize = 600000
a2.sinks.k2.hdfs.batchSize = 500
a2.sinks.k2.channel = c2
```
Hive sink:
- Streams events containing delimited text or JSON data directly into a Hive table or partition
- Fields in the incoming event data are mapped to the corresponding columns of the Hive table
Property | Default | Description |
---|---|---|
type | - | hive |
hive.metastore | - | Hive metastore URI |
hive.database | - | Hive database name |
hive.table | - | Hive table name |
serializer | - | The serializer parses fields out of the event and maps them to columns of the Hive table. Which serializer to use depends on the format of the data. Supported serializers: DELIMITED and JSON |
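A minimal Hive sink sketch using the DELIMITED serializer (metastore URI, database, table, and field names are placeholders):

```properties
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.fieldnames = id,msg
```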
HBase sink:
Property | Default | Description |
---|---|---|
type | - | hbase |
table | - | Name of the HBase table to write to |
columnFamily | - | HBase column family to write to |
zookeeperQuorum | - | Corresponds to hbase.zookeeper.quorum |
znodeParent | /hbase | zookeeper.znode.parent |
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | Writes each event to a single column |
serializer.payloadColumn | - | Column name, e.g. col1 |
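A minimal HBase sink sketch using the SimpleHbaseEventSerializer from the table above (table, column family, and column names are placeholders):

```properties
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = access_logs
a1.sinks.k1.columnFamily = cf1
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
a1.sinks.k1.serializer.payloadColumn = col1
```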