Flume Series (1): Flume Introduction and Basic Usage -- Source -- Channel -- Sink: the three steps of collecting and transporting data

A note before we begin: I'm 「nicedays」, a big-data developer who enjoys making visual effects, listening to music, and sharing technology. The name comes from HAVE A NICE DAY, a song by the band World Order. Having stumbled through plenty of setbacks to get this far, I've finally understood that a nice day is something you have to give yourself.
Time flies; cherish the present~~
I write this blog partly to summarize and record my own learning, and partly in the hope of helping more people who are interested in big data. If you are also interested in big data and machine learning, you can follow my updates at https://blog.csdn.net/qq_35050438 and let's mine the value of data and AI together~

Flume Introduction and Basic Usage:

A reliable, available, and efficient distributed data collection service.

Flume has a simple, flexible architecture based on streaming data flows and supports fault tolerance, failover, and recovery.

1: Flume Architecture:

  • Client: where the data is produced, e.g. a web server

  • Event: a single unit of data transported through an Agent; for log data this is typically one line

  • Agent: an independent JVM process

    • Flume is deployed and run as one or more Agents
    • An Agent consists of three components
      • Source
      • Channel
      • Sink

2: Flume Workflow:

[Figure: Flume workflow diagram]

At its core, Flume consists of three building blocks: the input (Source), the pipe (Channel), and the destination (Sink).

3: Source (input):

HTTP Source:
  • Accepts HTTP GET and POST requests

Property | Default | Description
type | - | http
port | - | port to listen on
bind | 0.0.0.0 | IP address to bind to
handler | org.apache.flume.source.http.JSONHandler | fully qualified class name of the handler
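
A minimal configuration sketch for an HTTP source (the agent name a1, channel name c1, and port 6666 are assumptions for illustration):

a1.sources = r1
a1.channels = c1

# HTTP source listening for POST/GET requests on port 6666
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
# JSONHandler (the default) expects a JSON array of {"headers": ..., "body": ...} objects
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
a1.sources.r1.channels = c1

With such a configuration, an event could be posted with something like curl -X POST -d '[{"headers":{},"body":"hello flume"}]' http://localhost:6666.
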
Avro Source:
  • Listens on an Avro port and receives events from external Avro clients

Property | Default | Description
type | - | avro
bind | - | IP address to bind to
port | - | port to listen on
threads | - | maximum number of worker threads
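
A minimal sketch of an Avro source (agent name a1, channel c1, and port 4141 are assumptions for illustration); it is typically fed by the Avro sink of another agent or by the flume-ng avro-client tool:

a1.sources = r1
a1.channels = c1

# Avro source: receives events sent over Avro RPC
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

A quick test would be something like bin/flume-ng avro-client -H localhost -p 4141 -F /path/to/some/file, which sends the file's lines to the source as events.
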
Spooling Directory Source:

This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion by default is indicated by renaming the file or it can be deleted or the trackerDir is used to keep track of processed files.

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

  1. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
  2. If a file name is reused at a later time, Flume will print an error to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

Property Name | Default | Description
channels | - | 
type | - | The component type name, needs to be spooldir.
spoolDir | - | The directory from which to read files from.
fileSuffix | .COMPLETED | Suffix to append to completely ingested files
deletePolicy | never | When to delete completed files: never or immediate
fileHeader | false | Whether to add a header storing the absolute path filename.
fileHeaderKey | file | Header key to use when appending absolute path filename to event header.
basenameHeader | false | Whether to add a header storing the basename of the file.
basenameHeaderKey | basename | Header key to use when appending basename of file to event header.
includePattern | ^.*$ | Regular expression specifying which files to include. It can be used together with ignorePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip). It can be used together with includePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored.
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
trackingPolicy | rename | The tracking policy defines how file processing is tracked. It can be "rename" or "tracker_dir". This parameter is only effective if the deletePolicy is "never". "rename" - After processing files they get renamed according to the fileSuffix parameter. "tracker_dir" - Files are not renamed but a new empty file is created in the trackerDir. The new tracker file name is derived from the ingested one plus the fileSuffix.
consumeOrder | oldest | In which order files in the spooling directory will be consumed: oldest, youngest or random. In case of oldest and youngest, the last modified time of the files will be used to compare the files. In case of a tie, the file with smallest lexicographical order will be consumed first. In case of random any file will be picked randomly. When using oldest and youngest the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory.
pollDelay | 500 | Delay (in milliseconds) used when polling for new files.
recursiveDirectorySearch | false | Whether to monitor sub directories for new files to read.
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize | 100 | Granularity at which to batch transfer to the channel
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text.
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the "replacement character" char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence.
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
deserializer.* | - | Varies per event deserializer.
bufferMaxLines | - | (Obsolete) This option is now ignored.
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type | replicating | replicating or multiplexing
selector.* | - | Depends on the selector.type value
interceptors | - | Space-separated list of interceptors
interceptors.* | - | 
Example for an agent named a1:

a1.channels = ch-1
a1.sources = src-1

# watch /var/log/apache/flumeSpool for new files and attach each file's
# absolute path to its events as a header
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
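
A variant sketch (same agent and directory as above) that, per the trackingPolicy and deletePolicy options in the table, leaves completed files unrenamed and records progress through marker files in a tracker directory:

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
# do not delete or rename completed files; track them via empty marker files in trackerDir
a1.sources.src-1.deletePolicy = never
a1.sources.src-1.trackingPolicy = tracker_dir
a1.sources.src-1.trackerDir = .flumespool
# also pick up files dropped into sub directories
a1.sources.src-1.recursiveDirectorySearch = true
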
Netcat Source:

A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline separated text. Each line of text is turned into a Flume event and sent via the connected channel.

Required properties are in bold.

Property Name | Default | Description
channels | - | 
type | - | The component type name, needs to be netcat
bind | - | Host name or IP address to bind to
port | - | Port # to bind to
max-line-length | 512 | Max line length per event body (in bytes)
ack-every-event | true | Respond with an "OK" for every event received
selector.type | replicating | replicating or multiplexing
selector.* | - | Depends on the selector.type value
interceptors | - | Space-separated list of interceptors
interceptors.* | - | 
# "agent" is the name of this agent instance
# an agent instance has three parts: sources, channels, sinks
agent.sources = s1
agent.channels = c1
agent.sinks = sk1

# Source: netcat listening on localhost:5678, feeding channel c1 (receiving end)
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
# connect the source to the channel
agent.sources.s1.channels = c1

# Sink: logger, draining channel c1 (sending end)
agent.sinks.sk1.type = logger
# connect the sink to the channel
agent.sinks.sk1.channel = c1

# Channel: in-memory
agent.channels.c1.type = memory
# holds at most 1000 events
agent.channels.c1.capacity = 1000
# hands over at most 100 events per transaction
agent.channels.c1.transactionCapacity = 100
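
Assuming this file is saved as netcat.conf (the file name is just an example), the agent can be started with something like bin/flume-ng agent -n agent -c conf -f netcat.conf -Dflume.root.logger=INFO,console; lines typed into nc localhost 5678 then appear as events in the logger output, each acknowledged with "OK".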

Exec Source:
  • Runs a Linux command and consumes its output, e.g. "tail -f"

Property | Default | Description
type | - | exec
command | - | command to run, e.g. "tail -f xxx.log"
shell | - | shell used to run the command, e.g. "/bin/sh"
batchSize | 20 | maximum number of lines to send to the channel at a time
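
A minimal sketch of an exec source tailing a log file (agent name a1, channel c1, and the log path are assumptions for illustration). As the spooling directory notes above point out, exec gives no reliability guarantee if the agent is killed, so it is best suited for convenience rather than guaranteed delivery:

a1.sources = r1
a1.channels = c1

# turn each new line appended to the log into an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /opt/datas/app.log
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.batchSize = 20
a1.sources.r1.channels = c1
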
Kafka Source:

Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics. If you have multiple Kafka sources running, you can configure them with the same Consumer Group so each will read a unique set of partitions for the topics. This currently supports Kafka server releases 0.10.1.0 or higher. Testing was done up to 2.0.1, which was the highest available version at the time of the release.

Property Name | Default | Description
channels | - | 
type | - | The component type name, needs to be org.apache.flume.source.kafka.KafkaSource
kafka.bootstrap.servers | - | List of brokers in the Kafka cluster used by the source
kafka.consumer.group.id | flume | Unique identifier of consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group
kafka.topics | - | Comma-separated list of topics the kafka consumer will read messages from.
kafka.topics.regex | - | Regex that defines set of topics the source is subscribed on. This property has higher priority than kafka.topics and overrides kafka.topics if exists.
batchSize | 1000 | Maximum number of messages written to Channel in one batch
batchDurationMillis | 1000 | Maximum time (in ms) before a batch will be written to Channel. The batch will be written whenever the first of size and time will be reached.
backoffSleepIncrement | 1000 | Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors.
maxBackoffSleep | 5000 | Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors.
useFlumeEventFormat | false | By default events are taken as bytes from the Kafka topic directly into the event body. Set to true to read events as the Flume Avro binary format. Used in conjunction with the same property on the KafkaSink or with the parseAsFlumeEvent property on the Kafka Channel this will preserve any Flume headers sent on the producing side.
setTopicHeader | true | When set to true, stores the topic of the retrieved message into a header, defined by the topicHeader property.
topicHeader | topic | Defines the name of the header in which to store the name of the topic the message was received from, if the setTopicHeader property is set to true. Care should be taken if combining with the Kafka Sink topicHeader property so as to avoid sending the message back to the same topic in a loop.
kafka.consumer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See the Flume documentation for additional info on secure setup.
more consumer security props | - | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on the consumer.
Other Kafka Consumer Properties | - | These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.consumer. For example: kafka.consumer.auto.offset.reset
Example topic subscription by comma-separated topic list (agent name tier1):

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
# read at most 5000 messages per batch, or whatever arrives within 2000 ms
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id

4: Channel (pipe):

  • Memory Channel

    • Events are kept in the Java heap. Recommended when a small amount of data loss is acceptable
  • File Channel

    • Events are persisted in local files; very reliable, but with lower throughput than the Memory Channel
  • JDBC Channel

    • Events are stored in a relational database; generally not recommended
  • Kafka Channel
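
A minimal sketch of a File Channel (agent name a1 and both directories are assumptions for illustration); checkpointDir and dataDirs are where the channel persists its state on disk:

a1.channels = c1

# durable channel backed by local disk, at the cost of throughput
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data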

5: Sink (destination):

Avro Sink:
  • Acts as an Avro client and sends Avro events to an Avro server

Property | Default | Description
type | - | avro
hostname | - | hostname or IP address of the Avro server
port | - | port of the Avro server
batch-size | 100 | number of events to batch together per send
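
A minimal sketch of an Avro sink forwarding events to a downstream agent (agent name a1, channel c1, and the target host and port are assumptions for illustration); the receiving agent would run an Avro source bound to the same port:

a1.sinks = k1
a1.channels = c1

# send events to the Avro source of a downstream agent
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.56.102
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 100
a1.sinks.k1.channel = c1
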
HDFS Sink:
  • Writes events to the Hadoop Distributed File System (HDFS)

Property | Default | Description
type | - | hdfs
hdfs.path | - | HDFS directory path
hdfs.filePrefix | FlumeData | prefix for the file names created by Flume
hdfs.fileSuffix | - | suffix for the file names
a2.channels = c2
a2.sources = s2
a2.sinks = k2

# Source: spooling directory reading files dropped into /opt/datas
a2.sources.s2.type = spooldir
a2.sources.s2.spoolDir = /opt/datas
a2.sources.s2.channels = c2

# Channel: in-memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 1000

# Sink: write events into HDFS
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://192.168.56.101:9000/flume/customs
a2.sinks.k2.hdfs.filePrefix = events-
# roll a new file after 5000 events or 600000 bytes; write 500 events per batch
# (note: HDFS sink settings must carry the hdfs. prefix)
a2.sinks.k2.hdfs.rollCount = 5000
a2.sinks.k2.hdfs.rollSize = 600000
a2.sinks.k2.hdfs.batchSize = 500

a2.sinks.k2.channel = c2
Hive Sink:
  • Streams events containing delimited text or JSON data directly into a Hive table or partition

  • Fields in the incoming event data are mapped to the corresponding columns of the Hive table

Property | Default | Description
type | - | hive
hive.metastore | - | Hive metastore URI
hive.database | - | Hive database name
hive.table | - | Hive table name
serializer | - | the serializer parses fields out of the event and maps them to columns of the Hive table; the choice depends on the data format. Supported serializers: DELIMITED and JSON
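
A minimal sketch of a Hive sink using the DELIMITED serializer (agent name a1, metastore URI, database, table, and field names are all assumptions for illustration):

a1.sinks = k1
a1.channels = c1

a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
# treat each event body as comma-delimited text and map the fields to the listed columns
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.fieldnames = id,msg
a1.sinks.k1.channel = c1
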
HBase Sink:

Property | Default | Description
type | - | hbase
table | - | name of the HBase table to write to
columnFamily | - | HBase column family to write to
zookeeperQuorum | - | corresponds to hbase.zookeeper.quorum
znodeParent | /hbase | corresponds to zookeeper.znode.parent
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | writes one column per event
serializer.payloadColumn | - | name of the column to write the event body to, e.g. col1
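
A minimal sketch of an HBase sink (agent name a1, table name, column family, and ZooKeeper quorum are assumptions for illustration):

a1.sinks = k1
a1.channels = c1

a1.sinks.k1.type = hbase
a1.sinks.k1.table = flume_events
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.zookeeperQuorum = node1:2181,node2:2181,node3:2181
# SimpleHbaseEventSerializer writes the whole event body into a single column
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
a1.sinks.k1.serializer.payloadColumn = col1
a1.sinks.k1.channel = c1
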
Kafka Sink:
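  • Publishes Flume events to a Kafka topic, acting as a Kafka producer

A minimal sketch of a Kafka sink (agent name a1, broker list, and topic are assumptions for illustration):

a1.sinks = k1
a1.channels = c1

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = flume-events
# number of events to batch into one Kafka producer request
a1.sinks.k1.flumeBatchSize = 100
a1.sinks.k1.channel = c1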