Introduction to Flume
Architecture Model
WebServer : data source
HDFS : storage destination
Agent
An Agent is a JVM process that moves data, in the form of Events, from the source to the destination.
An Agent consists of three components:
- Source : receives data from the data source
- Channel : a pipeline that buffers the data, usually in memory
- Sink : responsible for sending the data onward
Event
The transmission unit: the basic unit of Flume data transfer. Data travels from the data source to the storage destination in the form of Events.
An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs.
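For a concrete picture of an Event, the logger sink used in the examples below prints each event it receives with its headers and body (the body rendered as hex bytes plus a text preview); sending "hello" through telnet produces console output roughly like this (exact spacing varies by version):
Event: { headers:{} body: 68 65 6C 6C 6F                hello }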
Environment
- flume-1.6.0 (versions above 1.6 require JDK 1.8)
- JDK 1.7
Single-Node Setup
- Upload flume-1.6.0 and extract it to the /opt directory; the extracted docs directory can be deleted.
- Edit the JDK path in conf/flume-env.sh; JAVA_OPTS (the JVM memory settings) can also be adjusted there.
- Add the environment variables (a sketch follows this list), then run
flume-ng version
to verify they are configured correctly.
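A minimal sketch of the environment-variable step, assuming Flume was extracted to /opt/flume-1.6.0 (adjust the path to your install):
# append to /etc/profile (or ~/.bashrc)
export FLUME_HOME=/opt/flume-1.6.0
export PATH=$PATH:$FLUME_HOME/bin
# then reload the profile and verify
source /etc/profile
flume-ng version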
Single-Node Example
A simple example
Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node01
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.
Given this configuration file, we can start Flume as follows:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
- Create a file named option under the root directory (any directory and any file name will do) and paste the configuration from the official example above into it.
- Start Flume:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
- Once it is running, connect from another node with telnet; after the connection succeeds, type any message to test:
yum install -y telnet
telnet node01 44444
Multi-Agent Avro Flow Setup
Setting multi-agent flow
In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.
- Copy the flume directory to the other node (node02) and add the environment variables there.
- Write the configuration file for Agent foo (see the Avro Sink documentation):
# foo.conf: Agent foo, netcat source -> avro sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node01
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 10086
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Write the configuration file for Agent bar (see the Avro Source documentation):
# bar.conf: Agent bar, avro source -> logger sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port = 10086
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start flume on the Agent bar node first, so its Avro source is listening before foo's Avro sink tries to connect:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
- Then start flume on the Agent foo node:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
- Connect to Agent foo's netcat source and send a message; it should show up in Agent bar's logger output:
telnet node01 44444
Extending the Avro Flow
Consolidation
A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.
This can be achieved in Flume by configuring a number of first-tier agents with an avro sink, all pointing to the avro source of a single agent (again, you could use the thrift sources/sinks/clients in such a scenario). This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination.
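A minimal sketch of such a first tier: the bar agent from the previous section can serve as the second-tier collector unchanged, and each additional first-tier agent only needs its avro sink pointed at that same collector (the agent name a2 and the reuse of node02:10086 are assumptions carried over from the earlier example):
# on each additional first-tier node: same structure as the foo config,
# with the sink pointing at the shared collector
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = node02
a2.sinks.k1.port = 10086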
Multiplexing Flow Mode
Offline (batch) workloads + real-time processing
Multiplexing the flow
Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.
In the official documentation's example, a source on agent “foo” fans out the flow to three different channels. This fan-out can be replicating or multiplexing. In a replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of the available channels when the event's attribute matches a preconfigured value. For example, if an event attribute called “txnType” is set to “customer”, the event should go to channel1 and channel3; if it is “vendor”, it should go to channel2; otherwise, channel3. The mapping is set in the agent's configuration file.
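A sketch of the txnType routing just described, expressed as a multiplexing channel selector on the source (the component names r1/c1/c2/c3 are illustrative; the selector properties follow the official multiplexing example):
# fan out r1 across three channels, routed by the txnType header
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = txnType
a1.sources.r1.selector.mapping.customer = c1 c3
a1.sources.r1.selector.mapping.vendor = c2
a1.sources.r1.selector.default = c3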
Common Sources
Exec Source
Executes a Linux command and reads its standard output.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/test.log
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
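To try the exec source (the file path matches the tail command above), start the agent and append lines to the tailed file; each appended line should appear as an event on the console:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
echo "exec source test" >> /root/test.log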
Spooling Directory Source
Monitors a directory and ingests files placed into it.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/log
# Whether to add the source file name to the header of each event Flume creates.
a1.sources.r1.fileHeader = false
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
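To test the spooling directory source (the directory matches the config above), create the spool directory before starting the agent, then move a completed file into it; by default Flume renames ingested files with a .COMPLETED suffix:
mkdir -p /root/log
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
echo "spooldir test" > /tmp/a.txt && mv /tmp/a.txt /root/log/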
Kafka Source
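A minimal sketch of a Kafka source for Flume 1.6 (the ZooKeeper address, topic, and group id below are placeholder assumptions; note that Flume 1.7+ configures this source through kafka.bootstrap.servers and kafka.topics instead):
# Kafka source, Flume 1.6 style; replaces the source definition in the examples above
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.zookeeperConnect = node01:2181
a1.sources.r1.topic = flume-topic
a1.sources.r1.groupId = flume
a1.sources.r1.batchSize = 100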
Common Sinks
HDFS Sink
Flume can partition output by date, creating date-based directories in HDFS.
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/log
# Whether to add the source file name to the header of each event Flume creates.
a1.sources.r1.fileHeader = false
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 5
a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
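With hdfs.round = true, hdfs.roundValue = 5, and hdfs.roundUnit = second, the timestamp used in the path is rounded down so that a new directory is created at most every 5 seconds. After starting the agent and dropping files into /root/log, the output can be inspected with:
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
hdfs dfs -ls -R /flume/events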
Name | Default | Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hdfs |
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Prefix for files created after upload completes |
hdfs.fileSuffix | – | Suffix for files created after upload completes |
hdfs.inUsePrefix | – | Prefix for temporary files that are still being written |
hdfs.inUseSuffix | .tmp | Suffix for temporary files that are still being written |
hdfs.emptyInUseSuffix | false | If true, no suffix is applied to in-use temporary files |
hdfs.rollInterval | 30 | Roll the current file after this many seconds of writing and open a new one; 0 disables time-based rolling |
hdfs.rollSize | 1024 | File size limit in bytes; when exceeded, the file is closed and a new one is opened; 0 disables size-based rolling |
hdfs.rollCount | 10 | Event-count limit; after this many events the file is rolled; 0 disables count-based rolling |
hdfs.idleTimeout | 0 | Close files that have received no writes for this many seconds; 0 disables |
hdfs.batchSize | 100 | Number of events written to the file per batch |
hdfs.codeC | – | Compression codec: gzip, bzip2, lzo, lzop, snappy |
hdfs.fileType | SequenceFile | File format: SequenceFile, DataStream, or CompressedStream. (1) DataStream will not compress the output file; do not set codeC. (2) CompressedStream requires hdfs.codeC to be set to an available codec |
hdfs.maxOpenFiles | 5000 | Maximum number of files open at once (as a rough guide, 1 GB of memory can support about 100,000 open files) |
hdfs.minBlockReplicas | – | Minimum number of block replicas; defaults to the replica count configured in HDFS |
hdfs.writeFormat | Writable | Output data format |
hdfs.callTimeout | 10000 | HDFS operations that take longer than this many milliseconds throw an exception (often a sign of an under-provisioned server) |
hdfs.threadsPoolSize | 10 | Number of I/O threads |
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling |
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS |
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS |
hdfs.proxyUser | – | Proxy user |
hdfs.round | false | Whether to round down the event timestamp used in the directory path; typically used for hour/minute/second-bucketed jobs |
hdfs.roundValue | 1 | Rounding interval: a new directory is created every this many hdfs.roundUnit units |
hdfs.roundUnit | second | Unit for rounding: second, minute, or hour |
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
hdfs.useLocalTimeStamp | false | Use the local time instead of a timestamp from the event header when resolving escape sequences; usually set to true, since if false, hdfs.round and related settings require a timestamp header |
hdfs.closeTries | 0 | Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart. |
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension. |
serializer | TEXT | Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface. |
serializer.* | – | |