Introduction to Flume
Architecture Model
WebServer : data source
HDFS : storage destination
Agent
An Agent is a JVM process that moves data, in the form of Events, from the source to the destination.
An Agent consists of three components:
- Source : receives data from the data source
- Channel : a pipeline that buffers the data, usually in memory
- Sink : responsible for sending the data onward
Event
The transmission unit: the basic unit of Flume data transfer. Data travels from the data source to the storage destination in the form of Events.
An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs.
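For a concrete picture of an Event, the logger sink used in the examples below prints each event it receives with its headers and body (the body rendered as hex bytes plus a text preview); sending "hello" through telnet produces console output roughly like this (exact spacing varies by version):
Event: { headers:{} body: 68 65 6C 6C 6F                hello }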
Environment
- flume-1.6.0 (versions above 1.6 require JDK 1.8)
- JDK 1.7
Single-Node Setup
- Upload flume-1.6.0 and extract it to the /opt directory; the extracted docs directory can be deleted.
- Edit the JDK path in conf/flume-env.sh; JAVA_OPTS (the JVM memory settings) can also be adjusted there.
- Add the environment variables (a sketch follows this list), then run
flume-ng version
to verify they are configured correctly.
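A minimal sketch of the environment-variable step, assuming Flume was extracted to /opt/flume-1.6.0 (adjust the path to your install):
# append to /etc/profile (or ~/.bashrc)
export FLUME_HOME=/opt/flume-1.6.0
export PATH=$PATH:$FLUME_HOME/bin
# then reload the profile and verify
source /etc/profile
flume-ng version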
Single-Node Example
A simple example
Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node01
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.
Given this configuration file, we can start Flume as follows:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
- Create a file named option under the root directory (any directory and any file name will do) and paste the configuration from the official example above into it.
- Start Flume:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
- Once it is running, connect from another node with telnet; after the connection succeeds, type any message to test:
yum install -y telnet
telnet node01 44444
Multi-Agent Avro Flow Setup
Setting multi-agent flow
In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.
- Copy the flume directory to the other node (node02) and add the environment variables there.
- Write the configuration file for Agent foo (see the Avro Sink documentation):
# foo.conf: Agent foo, netcat source -> avro sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node01
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 10086
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Write the configuration file for Agent bar (see the Avro Source documentation):
# bar.conf: Agent bar, avro source -> logger sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port = 10086
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start flume on the Agent bar node first, so its Avro source is listening before foo's Avro sink tries to connect:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
- Then start flume on the Agent foo node:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
- Connect to Agent foo's netcat source and send a message; it should show up in Agent bar's logger output:
telnet node01 44444
Extending the Avro Flow
Consolidation
A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.
This can be achieved in Flume by configuring a number of first-tier agents with an avro sink, all pointing to the avro source of a single agent (again, you could use the thrift sources/sinks/clients in such a scenario). This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination.
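A minimal sketch of such a first tier: the bar agent from the previous section can serve as the second-tier collector unchanged, and each additional first-tier agent only needs its avro sink pointed at that same collector (the agent name a2 and the reuse of node02:10086 are assumptions carried over from the earlier example):
# on each additional first-tier node: same structure as the foo config,
# with the sink pointing at the shared collector
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = node02
a2.sinks.k1.port = 10086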
Multiplexing Flow Mode
Offline (batch) workloads + real-time processing
Multiplexing the flow
Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.
In the official documentation's example, a source on agent “foo” fans out the flow to three different channels. This fan-out can be replicating or multiplexing. In a replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of the available channels when the event's attribute matches a preconfigured value. For example, if an event attribute called “txnType” is set to “customer”, the event should go to channel1 and channel3; if it is “vendor”, it should go to channel2; otherwise, channel3. The mapping is set in the agent's configuration file.
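A sketch of the txnType routing just described, expressed as a multiplexing channel selector on the source (the component names r1/c1/c2/c3 are illustrative; the selector properties follow the official multiplexing example):
# fan out r1 across three channels, routed by the txnType header
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = txnType
a1.sources.r1.selector.mapping.customer = c1 c3
a1.sources.r1.selector.mapping.vendor = c2
a1.sources.r1.selector.default = c3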
Common Sources
Exec Source
Executes a Linux command and reads its standard output.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/test.log
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
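To try the exec source (the file path matches the tail command above), start the agent and append lines to the tailed file; each appended line should appear as an event on the console:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
echo "exec source test" >> /root/test.log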
Spooling Directory Source
Monitors a directory and ingests files placed into it.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/log
# Whether to add the source file name to the header of each event Flume creates.
a1.sources.r1.fileHeader = false
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
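To test the spooling directory source (the directory matches the config above), create the spool directory before starting the agent, then move a completed file into it; by default Flume renames ingested files with a .COMPLETED suffix:
mkdir -p /root/log
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
echo "spooldir test" > /tmp/a.txt && mv /tmp/a.txt /root/log/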
Kafka Source
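A minimal sketch of a Kafka source for Flume 1.6 (the ZooKeeper address, topic, and group id below are placeholder assumptions; note that Flume 1.7+ configures this source through kafka.bootstrap.servers and kafka.topics instead):
# Kafka source, Flume 1.6 style; replaces the source definition in the examples above
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.zookeeperConnect = node01:2181
a1.sources.r1.topic = flume-topic
a1.sources.r1.groupId = flume
a1.sources.r1.batchSize = 100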
Common Sinks
HDFS Sink
Flume can partition output by date, creating date-based directories in HDFS.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/log
# Whether to add the source file name to the header of each event Flume creates.
a1.sources.r1.fileHeader = false
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 5
a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
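With hdfs.round = true, hdfs.roundValue = 5, and hdfs.roundUnit = second, the timestamp used in the path is rounded down so that a new directory is created at most every 5 seconds. After starting the agent and dropping files into /root/log, the output can be inspected with:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
hdfs dfs -ls -R /flume/events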
Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hdfs |
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Prefix for files created after upload completes |
hdfs.fileSuffix | – | Suffix for files created after upload completes |
hdfs.inUsePrefix | – | Prefix for temporary files that are still being written |
hdfs.inUseSuffix | .tmp | Suffix for temporary files that are still being written |
hdfs.emptyInUseSuffix | false | If true, no suffix is applied to in-use temporary files |
hdfs.rollInterval | 30 | Roll the current file after this many seconds of writing and open a new one; 0 disables time-based rolling |
hdfs.rollSize | 1024 | File size limit in bytes; when exceeded, the file is closed and a new one is opened; 0 disables size-based rolling |
hdfs.rollCount | 10 | Event-count limit; after this many events the file is rolled; 0 disables count-based rolling |
hdfs.idleTimeout | 0 | Close files that have received no writes for this many seconds; 0 disables |
hdfs.batchSize | 100 | Number of events written to the file per batch |
hdfs.codeC | – | Compression codec: gzip, bzip2, lzo, lzop, snappy |
hdfs.fileType | SequenceFile | File format: SequenceFile, DataStream, or CompressedStream. (1) DataStream will not compress the output file; do not set codeC. (2) CompressedStream requires hdfs.codeC to be set to an available codec |
hdfs.maxOpenFiles | 5000 | Maximum number of files open at once (as a rough guide, 1 GB of memory can support about 100,000 open files) |
hdfs.minBlockReplicas | – | Minimum number of block replicas; defaults to the replica count configured in HDFS |
hdfs.writeFormat | Writable | Output data format |
hdfs.callTimeout | 10000 | HDFS operations that take longer than this many milliseconds throw an exception (often a sign of an under-provisioned server) |
hdfs.threadsPoolSize | 10 | Number of I/O threads |
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling |
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS |
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS |
hdfs.proxyUser | – | Proxy user |
hdfs.round | false | Whether to round down the event timestamp used in the directory path; typically used for hour/minute/second-bucketed jobs |
hdfs.roundValue | 1 | Rounding interval: a new directory is created every this many hdfs.roundUnit units |
hdfs.roundUnit | second | Unit for rounding: second, minute, or hour |
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
hdfs.useLocalTimeStamp | false | Use the local time instead of a timestamp from the event header when resolving escape sequences; usually set to true, since if false, hdfs.round and related settings require a timestamp header |
hdfs.closeTries | 0 | Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart. |
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension. |
serializer | TEXT | Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface. |
serializer.* | – | |