Flume Installation and Examples

About Flume

Flume is designed to bulk-import large volumes of event-based data into Hadoop. A typical use case is collecting log files from a fleet of web servers and storing the log events in HDFS for later analysis.

Flume's basic building blocks are the source, the channel, and the sink. To use Flume you first run a Flume agent: a JVM process hosting continuously running sources, channels, and sinks. A source produces events and passes them to a channel; the channel buffers those events until a sink forwards them on.

(figure: Flume agent architecture: source -> channel -> sink)

Installation

Download Flume and configure the environment variables.
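For example, a minimal sketch assuming Apache Flume 1.9.0 unpacked under /opt (both the version and the install directory are assumptions):

# download and unpack the binary distribution
wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
tar -xzf apache-flume-1.9.0-bin.tar.gz -C /opt
# make flume-ng available on the PATH
export FLUME_HOME=/opt/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin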

Example

  1. Listen on a network port
  2. Print the received messages to the console
#example.conf: A single-node Flume configuration

#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#Describe the sink
a1.sinks.k1.type = logger

#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the Flume agent

flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Send events to Flume

telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello flume
OK

You can see Flume output the event in a log message:

2018-12-30 22:04:42,039 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
2018-12-30 22:04:57,046 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65 0D hello flume. }

Flume can also resolve environment variables in its configuration file. Enable this by setting the agent's propertiesImplementation to org.apache.flume.node.EnvVarResolverProperties:

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1

Make sure the environment variable is set in the shell that will launch the agent.
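For example (the port value below is an arbitrary assumption; it only needs to match the port you later connect to):

# export the variable in the shell that will launch the agent
export NC_PORT=44444
# confirm it is set
echo $NC_PORT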

Start the Flume agent

flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties

Transactions and Reliability

Flume uses two separate transactions: one from source to channel and one from channel to sink. The source writes events to the channel; if any event in a batch fails, none of the batch is committed to the channel, the partially received events are discarded, and the upstream sender is relied on to redeliver them. Transactions apply in the same way to delivery from channel to sink: if delivery fails, the transaction rolls back and all events remain in the channel, awaiting redelivery.

The channel type in the example above is memory, which offers high throughput but no durability. Flume provides several channel types: memory, jdbc, kafka, file, and others.
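For durability at the cost of some throughput, a file channel can be used instead. A minimal sketch (the directory paths are assumptions):

# file channel: buffered events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data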

Flume offers at-least-once delivery: every event produced by a source reaches the sink at least once, but due to failures and retries the same event may arrive more than once. Duplicates are left for downstream jobs, such as MapReduce or Hive, to remove.

Where possible, Flume processes events in batches within a transaction to improve efficiency; for example, the memory channel's transactionCapacity of 100 above caps each transaction at 100 events.

Interceptors

Flume interceptors operate on events as they pass from source to channel: they can extract or set attributes in the event header, or preprocess the event body.

For example, with HDFS as the sink, Flume events are usually partitioned by time, which is configured through hdfs.path:

a1.sinks.k1.hdfs.path = /mine/flume/events/%y/%m/%d

The partition a Flume event is written to is determined by the timestamp in its header. Events carry no timestamp header by default; one can be added with an interceptor:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

In general, when multiple tiers of Flume agents are involved, it is recommended to use the timestamp generated by the agent running the HDFS sink:

a1.sinks.k1.hdfs.useLocalTimeStamp = true

To avoid Flume producing a large number of small files on HDFS, increase rollSize (in bytes):

a1.sinks.k1.hdfs.rollSize = 100000000
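rollSize only governs rolling on its own when the time- and count-based triggers are disabled, as the fan-out configuration later in this article does:

# roll on size only; 0 disables the interval and count triggers
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100000000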

Flume provides the following interceptors (a static-interceptor sketch follows the list):

  • Timestamp Interceptor
  • Host Interceptor
  • Static Interceptor
  • Remove Header Interceptor
  • UUID Interceptor
  • Morphline Interceptor
  • Search and Replace Interceptor
  • Regex Filtering Interceptor
  • Regex Extractor Interceptor
  • Custom interceptor implementations
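As a quick illustration, the Static Interceptor above stamps a fixed header onto every event. A minimal sketch (the header key and value are arbitrary assumptions):

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK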

Fan Out

Fan out means a single source delivers events to multiple channels, and hence to multiple sinks. The following configuration delivers each event to both an HDFS sink and a logger sink:

#Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /mine/flume/events/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100000000
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sinks.k2.type = logger

#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

#Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

The complete flow is as follows:
(figure: fan-out flow from the netcat source to the HDFS and logger sinks)

A normal fan-out flow replicates events to all channels. Sometimes different events need to go to different channels; this can be done by setting a multiplexing selector on the source, which matches a particular header value against channels:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

Tiered Agents

Flume events can be aggregated by arranging agents in tiers. The first tier collects events from the original sources (such as web server logs); the second tier aggregates events from the first tier and writes them to HDFS.
(figure: two-tier agent topology: first-tier agents collect events, a second-tier agent aggregates and writes to HDFS)
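Tiers are classically joined by pointing an Avro sink on the first tier at an Avro source on the second. A minimal sketch (the hostname and port are assumptions):

# tier 1: avro sink forwards events to the second-tier agent
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-host
agent1.sinks.k1.port = 10000

# tier 2: avro source receives events from tier 1
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 10000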

In the following example, agent1 listens on a network port and sends events to Kafka, while agent2 reads from Kafka and logs the events to the console, forming a two-tier Flume pipeline:
(figure: netcat source -> Kafka -> logger sink pipeline)

agent1:

#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = test1
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy

#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

agent2:

#Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

#Describe/configure the source
a2.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r1.batchSize = 5000
a2.sources.r1.batchDurationMillis = 2000
a2.sources.r1.kafka.bootstrap.servers = localhost:9092
a2.sources.r1.kafka.topics = test1
a2.sources.r1.kafka.consumer.group.id = custom.g.id

#Describe the sink
a2.sinks.k1.type = logger

#Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

#Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
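To run the pipeline, start each agent with its own configuration file (the file names agent1.conf and agent2.conf are assumptions; starting agent2 first ensures the consumer is ready before events flow):

flume-ng agent --conf $FLUME_HOME/conf --conf-file agent2.conf --name a2 -Dflume.root.logger=INFO,console
flume-ng agent --conf $FLUME_HOME/conf --conf-file agent1.conf --name a1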

Sink Groups

Flume allows multiple sinks to be treated as a single sink, enabling failover or load balancing:
(figure: a sink group treating two sinks as one)

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
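The configuration above load-balances randomly across k1 and k2. For failover instead, a sketch along these lines routes everything to the higher-priority sink until it fails (the priority and penalty values are assumptions):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000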

Reposted from: https://www.cnblogs.com/adia/p/10199040.html
