Flume Basics: Introduction, Setup, Getting-Started Examples, and Common Parameters

Flume Introduction

Official website: https://flume.apache.org/

Architecture Model

[Figure: Flume data flow model (DevGuide_image00.png)]

WebServer: the data source
HDFS: the storage destination

Agent

An Agent is a JVM process that carries data from its origin to its destination in the form of Events.
An Agent consists of three components:

  • Source: receives data from the data source
  • Channel: a pipe that buffers the data, usually in memory
  • Sink: responsible for sending the data onward

Event

The transmission unit: the basic unit of data transfer in Flume. Data travels from the data source to the storage destination as Events.
An Event consists of two parts, a Header and a Body; the Header stores attributes of the event as key-value pairs.



Environment

  • flume-1.6.0 (versions after 1.6 require JDK 1.8)
  • JDK 1.7

Single-Node Setup

  1. Upload flume-1.6.0 and extract it into the /opt directory; the extracted docs directory can be deleted.

  2. Set the JDK path in conf/flume-env.sh; you can also adjust JAVA_OPTS (JVM memory size) there.

  3. Add the environment variables, then run flume-ng version to verify they are configured correctly, as sketched below.
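
A minimal shell sketch of these three steps, assuming the tarball is named apache-flume-1.6.0-bin.tar.gz and the JDK lives under /usr/java/jdk1.7.0_80 (both paths are assumptions):

    # 1. extract to /opt (tarball name is an assumption)
    tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /opt

    # 2. point flume-env.sh at the JDK (JDK path is an assumption)
    cd /opt/apache-flume-1.6.0-bin
    cp conf/flume-env.sh.template conf/flume-env.sh
    echo 'export JAVA_HOME=/usr/java/jdk1.7.0_80' >> conf/flume-env.sh

    # 3. add Flume to the PATH, then verify
    echo 'export FLUME_HOME=/opt/apache-flume-1.6.0-bin' >> /etc/profile
    echo 'export PATH=$PATH:$FLUME_HOME/bin' >> /etc/profile
    source /etc/profile
    flume-ng version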

Single-Node Example

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node01
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
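
The last point above means one file may define several agents, with the --name flag selecting which one a given process runs. A minimal sketch with two hypothetical agents a1 and a2 listening on different ports (names and ports are assumptions):

# multi.conf: two independent agents defined in one file
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = node01
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a2.sources = r2
a2.channels = c2
a2.sinks = k2
a2.sources.r2.type = netcat
a2.sources.r2.bind = node01
a2.sources.r2.port = 44445
a2.channels.c2.type = memory
a2.sinks.k2.type = logger
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Starting Flume with flume-ng agent --conf-file multi.conf --name a2 -Dflume.root.logger=INFO,console would run only a2.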
  1. Create a file named option under /root (any directory and file name will do), and paste the configuration from the official example above into it.

  2. Start Flume:

    flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
    
  3. Once Flume has started, use telnet from another node to connect to it; after the connection is established, type any text to test:

    yum install -y telnet
    telnet node01 44444
    

Avro Multi-Agent Flow Setup

Setting multi-agent flow

[Figure: Multi-agent flow: the first agent's avro sink feeds the second agent's avro source (UserGuide_image03.png)]

In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.

  1. Copy the flume directory to another node (node02) and add the environment variables there.

  2. Write the configuration file for Agent foo (see the Avro Sink documentation):

    # foo.conf: first-hop agent, netcat source -> avro sink
    
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    # Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = node01
    a1.sources.r1.port = 44444
    
    # Describe the sink
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = node02
    a1.sinks.k1.port = 10086
    
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    
  3. Write the configuration file for Agent bar (see the Avro Source documentation):

    # bar.conf: second-hop agent, avro source -> logger sink
    
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    # Describe/configure the source
    a1.sources.r1.type = avro
    a1.sources.r1.bind = node02
    a1.sources.r1.port = 10086
    
    # Describe the sink
    a1.sinks.k1.type = logger
    
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    
  4. Start Flume on the Agent bar node (node02) first, so its avro source is listening before foo's sink connects:

    flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
    
  5. Then start Flume on the Agent foo node (node01):

    flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
    
  6. Connect to Agent foo's netcat source and send a message; it should appear in Agent bar's console log:

    telnet node01 44444
    
    

Extending the Avro Flow

Consolidation

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to the HDFS cluster.

[Figure: A fan-in flow using Avro RPC to consolidate events in one place]

This can be achieved in Flume by configuring a number of first-tier agents with an avro sink, all pointing to an avro source of a single agent (again, you could use thrift sources/sinks/clients in such a scenario). The source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination.
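
A sketch of the second-tier agent in this pattern; the agent name collector, host node03, and port 10087 are assumptions, and the HDFS path is only illustrative:

# collector.conf: second-tier agent fanning in events from the first tier
collector.sources = r1
collector.sinks = k1
collector.channels = c1

# a single avro source receives from every first-tier avro sink
collector.sources.r1.type = avro
collector.sources.r1.bind = node03
collector.sources.r1.port = 10087

# consolidated events are written to HDFS
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = /flume/consolidated/%y-%m-%d
collector.sinks.k1.hdfs.useLocalTimeStamp = true

collector.channels.c1.type = memory
collector.channels.c1.capacity = 1000
collector.channels.c1.transactionCapacity = 100

collector.sources.r1.channels = c1
collector.sinks.k1.channel = c1

Each first-tier agent would then use an avro sink pointing at node03:10087, just like Agent foo's sink in the previous section.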

Multiplexing Flow

Offline processing + real-time processing

Multiplexing the flow

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

[Figure: A fan-out flow using a (multiplexing) channel selector]

The above example shows a source from agent “foo” fanning out the flow to three different channels. This fan out can be replicating or multiplexing. In case of replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of available channels when an event’s attribute matches a preconfigured value. For example, if an event attribute called “txnType” is set to “customer”, then it should go to channel1 and channel3, if it’s “vendor” then it should go to channel2, otherwise channel3. The mapping can be set in the agent’s configuration file.
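
A configuration sketch of the txnType mapping described above; the agent name a1, source r1, and channels c1/c2/c3 are assumptions:

# route each event by the value of its txnType header
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = txnType
a1.sources.r1.selector.mapping.customer = c1 c3
a1.sources.r1.selector.mapping.vendor = c2
a1.sources.r1.selector.default = c3

In replicating mode (selector.type = replicating, the default) every event would instead be copied to all three channels.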

Common Sources

Exec source

Executes a Linux command and reads from its standard output:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/test.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Spooling Directory Source

Monitors a directory for new files; by default, files Flume has finished ingesting are renamed with a .COMPLETED suffix:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/log
# whether to add the source file name to a header of the events created from that file
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Kafka Source
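
For flume-1.6.0, a Kafka source can be configured roughly as follows; the ZooKeeper address, topic, and group id here are assumptions, and note that Flume 1.7+ switched to kafka.bootstrap.servers / kafka.topics style properties:

# Kafka source sketch for flume-1.6.0 (addresses and names are assumptions)
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.zookeeperConnect = node01:2181
a1.sources.r1.topic = test_topic
a1.sources.r1.groupId = flume
a1.sources.r1.batchSize = 100
a1.sources.r1.channels = c1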

Common Sinks

HDFS Sink

Flume supports splitting output by date, creating date-based directories in HDFS:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/log
# whether to add the source file name to a header of the events created from that file
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 5
a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Name | Default | Description
---- | ------- | -----------
channel | – | –
type | – | The component type name; must be hdfs
hdfs.path | – | HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.filePrefix | FlumeData | Prefix of the file names created after a successful upload
hdfs.fileSuffix | – | Suffix of the file names created after a successful upload
hdfs.inUsePrefix | – | Prefix of the temporary files produced while writing
hdfs.inUseSuffix | .tmp | Suffix of the temporary files produced while writing
hdfs.emptyInUseSuffix | false | If true, no in-use suffix is applied while writing
hdfs.rollInterval | 30 | Roll the current file after this many seconds of writing; once exceeded, the stream is closed and a new file is started (0 disables)
hdfs.rollSize | 1024 | File size limit in bytes; once exceeded, the stream is closed and a new file is started (0 disables)
hdfs.rollCount | 10 | Event count limit; after this many records, the stream is closed and a new file is started (0 disables)
hdfs.idleTimeout | 0 | Close a file after this many seconds without writes (0 disables)
hdfs.batchSize | 100 | Number of events written to the file per batch
hdfs.codeC | – | Compression codec: gzip, bzip2, lzo, lzop, snappy
hdfs.fileType | SequenceFile | File format: SequenceFile, DataStream, or CompressedStream. (1) DataStream will not compress the output file, so do not set codeC; (2) CompressedStream requires hdfs.codeC to be set to an available codec
hdfs.maxOpenFiles | 5000 | Maximum number of open file streams (as a rule of thumb, 1 GB of memory can keep about 100,000 files open)
hdfs.minBlockReplicas | – | Minimum number of block replicas; defaults to the replication configured in HDFS
hdfs.writeFormat | Writable | Output data format
hdfs.callTimeout | 10000 | Milliseconds allowed per HDFS operation before an exception is thrown (raise on under-provisioned servers); no longer present in newer Flume versions
hdfs.threadsPoolSize | 10 | Number of I/O threads
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS
hdfs.proxyUser | – | Proxy user
hdfs.round | false | Enable timestamp rounding, which controls how many directories are created; typically used when bucketing by hour, minute, or second
hdfs.roundValue | 1 | Rounding step, i.e. a new directory every this many units
hdfs.roundUnit | second | Unit of the rounding value: hour, minute, or second
hdfs.timeZone | Local Time | Name of the timezone used for resolving the directory path, e.g. America/Los_Angeles
hdfs.useLocalTimeStamp | false | Use the local time instead of the timestamp from the event header; usually set to true, otherwise hdfs.round and related settings depend on a timestamp header being present
hdfs.closeTries | 0 | Number of times the sink must try renaming a file after initiating a close attempt. If set to 1, the sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure) and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until it is eventually renamed (no limit on the number of attempts). The file may still remain open if the close call fails, but the data will be intact; in that case the file is closed only after a Flume restart
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the NameNode, so setting this too low can put heavy load on the NameNode. If set to 0 or less, the sink will not retry a failed first close attempt and may leave the file open or with a .tmp extension
serializer | TEXT | Other options include avro_event or the fully qualified class name of an implementation of the EventSerializer.Builder interface
serializer.* | – | –

Hive Sink

HBaseSinks

Logger Sink
