A note before we begin: I'm "nicedays", a big-data developer who enjoys making visual effects, listening to music, and sharing what I learn. The name comes from the WORLD ORDER song "HAVE A NICE DAY". After plenty of bumps and setbacks along the way, I've finally come to understand that a nice day is something you have to create for yourself. Time flies, so cherish the present~~
I write this blog partly to summarize and record my own learning, and partly in the hope of helping more people who are interested in big data. If you're also interested in big data and machine learning, you can follow me at https://blog.csdn.net/qq_35050438 and we can mine the value of data and AI together~
Flume: Introduction and Basic Usage
A reliable, available service for efficiently collecting data in a distributed fashion.
Flume has a simple, flexible architecture based on streaming data flows, and supports fault tolerance, failover, and recovery.
1. Flume Architecture:
- Client: where the data is produced, e.g. a web server
- Event: a single unit of data transferred through an Agent; for log data, one event usually corresponds to one line
- Agent: an independent JVM process
  - Flume is deployed as one or more Agents
  - Each Agent contains three components:
    - Source
    - Channel
    - Sink
2. Flume Workflow:
At its core, Flume consists of three parts: the input source, the channel (pipe), and the output destination.
3. Sources:
HTTP source:
- Accepts HTTP GET and POST requests
Property | Default | Description |
---|---|---|
type | - | http |
port | - | Port to listen on |
bind | 0.0.0.0 | IP address to bind to |
handler | org.apache.flume.source.http.JSONHandler | Fully qualified class name of the handler |
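A minimal configuration sketch for an HTTP source, in the same style as the agent examples later in this article (agent and component names such as `a1`, `r1`, `c1` are placeholders):

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.channels = c1
```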
Avro source:
- Listens on an Avro port and receives events from external Avro clients
Property | Default | Description |
---|---|---|
type | - | avro |
bind | - | IP address to bind to |
port | - | Port to listen on |
threads | - | Maximum number of worker threads |
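A minimal Avro source sketch (names and the port number are placeholders); this is the receiving half of a typical Avro source/sink hop between agents:

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1
```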
Spooling Directory Source:
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion by default is indicated by renaming the file or it can be deleted or the trackerDir is used to keep track of processed files.
Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:
- If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
- If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.
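The renaming step above can be sketched as follows (paths are hypothetical, for illustration only): append a timestamp while moving a rotated log into the spooling directory so that a file name is never reused.

```python
# Illustration with hypothetical paths: give each file a unique,
# timestamped name as it is moved into the spooling directory,
# so that file names are never reused.
import os
import shutil
import time

spool_dir = "/tmp/flumeSpool"
os.makedirs(spool_dir, exist_ok=True)

src = "/tmp/app.log"
with open(src, "w") as f:
    f.write("a log line\n")

# Unique name: basename plus a nanosecond timestamp
dest = os.path.join(spool_dir, "app.%d.log" % time.time_ns())
shutil.move(src, dest)
```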
Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be spooldir . |
spoolDir | – | The directory from which to read files from. |
fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
deletePolicy | never | When to delete completed files: never or immediate |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
basenameHeader | false | Whether to add a header storing the basename of the file. |
basenameHeaderKey | basename | Header Key to use when appending basename of file to event header. |
includePattern | ^.*$ | Regular expression specifying which files to include. It can be used together with ignorePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored. |
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip). It can be used together with includePattern. If a file matches both ignorePattern and includePattern regex, the file is ignored. |
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir. |
trackingPolicy | rename | The tracking policy defines how file processing is tracked. It can be “rename” or “tracker_dir”. This parameter is only effective if the deletePolicy is “never”. “rename” - After processing files they get renamed according to the fileSuffix parameter. “tracker_dir” - Files are not renamed but a new empty file is created in the trackerDir. The new tracker file name is derived from the ingested one plus the fileSuffix. |
consumeOrder | oldest | In which order files in the spooling directory will be consumed oldest , youngest and random . In case of oldest and youngest , the last modified time of the files will be used to compare the files. In case of a tie, the file with smallest lexicographical order will be consumed first. In case of random any file will be picked randomly. When using oldest and youngest the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory. |
pollDelay | 500 | Delay (in milliseconds) used when polling for new files. |
recursiveDirectorySearch | false | Whether to monitor sub directories for new files to read. |
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter. |
batchSize | 100 | Granularity at which to batch transfer to the channel |
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text. |
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL : Throw an exception and fail to parse the file. REPLACE : Replace the unparseable character with the “replacement character” char, typically Unicode U+FFFD. IGNORE : Drop the unparseable character sequence. |
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder . |
deserializer.* | Varies per event deserializer. | |
bufferMaxLines | – | (Obsolete) This option is now ignored. |
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead. |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
```properties
a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
```
Netcat Source:
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like `nc -k -l [host] [port]`. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline separated text. Each line of text is turned into a Flume event and sent via the connected channel.
Required properties are in bold.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be netcat |
bind | – | Host name or IP address to bind to |
port | – | Port # to bind to |
max-line-length | 512 | Max line length per event body (in bytes) |
ack-every-event | true | Respond with an “OK” for every event received |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
```properties
# "agent" is the instance name
# An agent instance has three parts: sources, channels, sinks
agent.sources = s1
agent.channels = c1
agent.sinks = sk1

# Source: netcat listening on port 5678 (receiving end), using channel c1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
# Connect the source to the channel
agent.sources.s1.channels = c1

# Sink: logger mode (sending end), using channel c1
agent.sinks.sk1.type = logger
# Connect the sink to the channel
agent.sinks.sk1.channel = c1

# Channel: in-memory
agent.channels.c1.type = memory
# Holds at most 1000 events
agent.channels.c1.capacity = 1000
# Up to 100 events per transaction
agent.channels.c1.transactionCapacity = 100
```
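To get a feel for the wire behaviour described above, here is a small Python sketch (illustrative only, not Flume code) that mimics what the netcat source does: it reads newline-separated text from a socket, treats each line as an event body, and replies "OK" per event, as with `ack-every-event = true`.

```python
import socket
import threading

def handle_connection(conn, events):
    # Read newline-separated text; each line becomes one event body,
    # acknowledged with "OK" (mimicking ack-every-event = true).
    buf = b""
    while True:
        data = conn.recv(1024)
        if not data:
            break
        buf += data
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            events.append(line.decode("utf-8"))  # one event per line
            conn.sendall(b"OK\n")                # ack each event

server = socket.socket()
server.bind(("127.0.0.1", 0))   # pick a free port
server.listen(1)
port = server.getsockname()[1]

events = []
def serve():
    conn, _ = server.accept()
    handle_connection(conn, events)
    conn.close()

t = threading.Thread(target=serve)
t.start()

# Client side: what "nc localhost <port>" would do
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hello\nworld\n")
acks = b""
while acks.count(b"OK") < 2:    # wait for both acknowledgements
    acks += client.recv(64)
client.close()
t.join()
server.close()
```

After the exchange, `events` holds the two line bodies, and the client has received one "OK" per event.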
Exec Source:
- Runs a Unix command and consumes its output, e.g. "tail -f"
Property | Default | Description |
---|---|---|
type | - | exec |
command | - | e.g. "tail -f xxx.log" |
shell | - | Shell used to run the command, e.g. "/bin/sh" |
batchSize | 20 | Maximum number of lines to send to the channel at a time |
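A minimal exec source sketch (names and the log path are placeholders; `tail -F` is used so the source survives log rotation):

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.channels = c1
```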
Kafka Source:
Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics. If you have multiple Kafka sources running, you can configure them with the same Consumer Group so each will read a unique set of partitions for the topics. This currently supports Kafka server releases 0.10.1.0 or higher. Testing was done up to 2.0.1 that was the highest available version at the time of the release.
Property Name | Default | Description |
---|---|---|
channels | – | |
type | – | The component type name, needs to be org.apache.flume.source.kafka.KafkaSource |
kafka.bootstrap.servers | – | List of brokers in the Kafka cluster used by the source |
kafka.consumer.group.id | flume | Unique identified of consumer group. Setting the same id in multiple sources or agents indicates that they are part of the same consumer group |
kafka.topics | – | Comma-separated list of topics the kafka consumer will read messages from. |
kafka.topics.regex | – | Regex that defines set of topics the source is subscribed on. This property has higher priority than kafka.topics and overrides kafka.topics if exists. |
batchSize | 1000 | Maximum number of messages written to Channel in one batch |
batchDurationMillis | 1000 | Maximum time (in ms) before a batch will be written to Channel The batch will be written whenever the first of size and time will be reached. |
backoffSleepIncrement | 1000 | Initial and incremental wait time that is triggered when a Kafka Topic appears to be empty. Wait period will reduce aggressive pinging of an empty Kafka Topic. One second is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
maxBackoffSleep | 5000 | Maximum wait time that is triggered when a Kafka Topic appears to be empty. Five seconds is ideal for ingestion use cases but a lower value may be required for low latency operations with interceptors. |
useFlumeEventFormat | false | By default events are taken as bytes from the Kafka topic directly into the event body. Set to true to read events as the Flume Avro binary format. Used in conjunction with the same property on the KafkaSink or with the parseAsFlumeEvent property on the Kafka Channel this will preserve any Flume headers sent on the producing side. |
setTopicHeader | true | When set to true, stores the topic of the retrieved message into a header, defined by the topicHeader property. |
topicHeader | topic | Defines the name of the header in which to store the name of the topic the message was received from, if the setTopicHeader property is set to true . Care should be taken if combining with the Kafka Sink topicHeader property so as to avoid sending the message back to the same topic in a loop. |
kafka.consumer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup. |
more consumer security props | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on consumer. | |
Other Kafka Consumer Properties | – | These properties are used to configure the Kafka Consumer. Any consumer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.consumer . For example: kafka.consumer.auto.offset.reset |
```properties
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
```
4. Channels:
- Memory Channel: events are held in the Java heap. Recommended when a small amount of data loss is acceptable
- File Channel: events are persisted to local files; more reliable, but lower throughput than the Memory Channel
- JDBC Channel: events are stored in a relational database; generally not recommended
- Kafka Channel
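A minimal file channel sketch (names and directories are placeholders; the checkpoint and data directories are what give the file channel its durability):

```properties
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
```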
5. Sinks:
Avro sink:
- Acts as an Avro client, sending Avro events to an Avro server
Property | Default | Description |
---|---|---|
type | - | avro |
hostname | - | IP address of the server |
port | - | Port |
batch-size | 100 | Number of events to send per batch |
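A minimal Avro sink sketch (names, host, and port are placeholders); pointing it at another agent's Avro source is the standard way to chain agents:

```properties
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.56.102
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 100
```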
HDFS sink:
- Writes events to the Hadoop Distributed File System (HDFS)
Property | Default | Description |
---|---|---|
type | - | hdfs |
hdfs.path | - | HDFS directory path |
hdfs.filePrefix | FlumeData | File name prefix |
hdfs.fileSuffix | - | File name suffix |
```properties
a2.channels = c2
a2.sources = s2
a2.sinks = k2
a2.sources.s2.type = spooldir
a2.sources.s2.spoolDir = /opt/datas
a2.sources.s2.channels = c2
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 1000
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://192.168.56.101:9000/flume/customs
a2.sinks.k2.hdfs.filePrefix = events-
a2.sinks.k2.hdfs.rollCount = 5000
a2.sinks.k2.hdfs.rollSize = 600000
a2.sinks.k2.hdfs.batchSize = 500
a2.sinks.k2.channel = c2
```
Hive sink:
- Streams events containing delimited text or JSON data directly into a Hive table or partition
- Fields in the incoming event data are mapped to the corresponding columns of the Hive table
Property | Default | Description |
---|---|---|
type | - | hive |
hive.metastore | - | Hive metastore URI |
hive.database | - | Hive database name |
hive.table | - | Hive table name |
serializer | - | The serializer parses fields out of the event and maps them to columns of the Hive table. Which serializer to use depends on the format of the data. Supported serializers: DELIMITED and JSON |
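A minimal Hive sink sketch using the DELIMITED serializer (metastore URI, database, table, and field names are placeholders):

```properties
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.fieldnames = id,msg
```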
HBase sink:
Property | Default | Description |
---|---|---|
type | - | hbase |
table | - | Name of the HBase table to write to |
columnFamily | - | HBase column family to write to |
zookeeperQuorum | - | Corresponds to hbase.zookeeper.quorum |
znodeParent | /hbase | zookeeper.znode.parent |
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | Writes each event to a single column |
serializer.payloadColumn | - | Column name, e.g. col1 |
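A minimal HBase sink sketch using the SimpleHbaseEventSerializer from the table above (table, column family, and column names are placeholders):

```properties
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = access_logs
a1.sinks.k1.columnFamily = cf1
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
a1.sinks.k1.serializer.payloadColumn = col1
```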