About Flume
Flume is designed to bulk-load massive volumes of event-based data into Hadoop. A typical use is collecting log files from a fleet of web servers and saving the log events to HDFS for later analysis.
Flume's basic building blocks are the source, the channel, and the sink. Using Flume means running a Flume agent: a long-lived JVM process hosting the sources, sinks, and the channels that connect them. A source produces events and hands them to a channel; the channel buffers the events until they are forwarded to a sink.
Installation
Download Flume and configure the environment variables (e.g. set FLUME_HOME and add its bin directory to PATH).
Example
- Listen on a network port
- Print the received messages to the console
#example.conf: A single-node Flume configuration
#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Describe the sink
a1.sinks.k1.type = logger
#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the Flume agent
flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
Send an event to Flume
telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello flume
OK
Flume logs the received event:
2018-12-30 22:04:42,039 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
2018-12-30 22:04:57,046 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65 0D hello flume. }
Flume can also read environment variables in its configuration file; this is enabled by starting the agent with propertiesImplementation = org.apache.flume.node.EnvVarResolverProperties.
a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1
Check that the referenced environment variable is set.
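For example, assuming the agent should listen on port 44444, the variable referenced as ${NC_PORT} can be set and checked in the shell that will launch the agent:

```shell
# Set the port referenced as ${NC_PORT} in the agent configuration
export NC_PORT=44444
# Confirm the variable is visible to the launching shell
echo "$NC_PORT"
```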
Start the Flume agent
flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties
Transactions and reliability
Flume uses two separate transactions: one from source to channel and one from channel to sink. The source writes each batch of events to the channel inside a transaction; if any event in the batch fails, nothing from that batch is written to the channel, the partially received data is discarded, and the upstream is expected to resend it. Transactions apply in the same way to delivery from channel to sink: on failure the transaction rolls back and all events remain in the channel awaiting redelivery.
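The two-transaction delivery described above can be sketched in pseudocode (a simplified view loosely modeled on Flume's Transaction API, not exact calls):

```
# source -> channel: one transaction per batch
tx = channel.getTransaction()
tx.begin()
try:
    for event in batch:
        channel.put(event)
    tx.commit()        # the whole batch becomes visible to the sink
except DeliveryFailure:
    tx.rollback()      # nothing is written; upstream must resend
finally:
    tx.close()

# channel -> sink follows the same pattern; on rollback the
# events stay in the channel and are redelivered later
```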
The memory channel used in the example offers high throughput but no durability. Flume ships several channel types, including memory, jdbc, kafka, and file.
at-least-once: every event produced by a source is guaranteed to reach the sink at least once; because of failures and retries, the same event may arrive more than once. Duplicates are left to downstream jobs, such as MapReduce or Hive, to remove.
Where possible, Flume processes events in transaction-sized batches to improve efficiency.
Interceptors
A Flume interceptor runs as events pass from source to channel; it can read or set attributes in the event header, or preprocess the event body.
For example, with HDFS as the sink, Flume events are typically partitioned by time, which is configured through hdfs.path:
a1.sinks.k1.hdfs.path = /mine/flume/events/%y/%m/%d
Which partition an event is written to is decided by the timestamp header. Events carry no timestamp header by default; an interceptor can add one:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
In general, with multiple tiers of Flume agents it is advisable to use the timestamp generated by the agent running the HDFS sink:
a1.sinks.k1.hdfs.useLocalTimeStamp = true
To avoid Flume producing large numbers of small files on HDFS, increase rollSize (in bytes); since rolling is also triggered by rollInterval and rollCount, set those to 0 when file size alone should drive the roll:
a1.sinks.k1.hdfs.rollSize = 100000000
Interceptors shipped with Flume:
- Timestamp Interceptor
- Host Interceptor
- Static Interceptor
- Remove Header Interceptor
- UUID Interceptor
- Morphline Interceptor
- Search and Replace Interceptor
- Regex Filtering Interceptor
- Regex Extractor Interceptor
- Custom interceptor implementations
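A custom interceptor is wired in through the fully-qualified class name of its Builder; com.example.MyInterceptor below is a hypothetical class implementing org.apache.flume.interceptor.Interceptor:

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.MyInterceptor$Builder
```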
Fan out
Fan out means a single source delivers events to multiple channels, and hence to multiple sinks. The following configuration delivers each event to both an HDFS sink and a logger sink:
#Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /mine/flume/events/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100000000
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k2.type = logger
#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
With this configuration, every event received by r1 is replicated to c1 and c2 and emitted by both sinks.
A plain fan out replicates events to every channel. Sometimes different events need to go to different channels; this is done by configuring a multiplexing selector on the source, which matches a particular header value to a channel:
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
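The multiplexing selector also supports optional channels: a write to an optional channel is attempted, but a failure there does not fail the event. For example, to additionally copy CZ events to c3 on a best-effort basis:

```properties
a1.sources.r1.selector.optional.CZ = c3
```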
Tiered agents
A tiered Flume topology can be used to aggregate events: the first tier collects events from the original sources (such as web server logs), and the second tier aggregates the first tier's events and writes them to HDFS.
In the following example, agent1 listens on a network port and sends events to Kafka, and agent2 consumes from Kafka and logs the events to the console, forming a two-tier Flume setup.
agent1:
#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = test1
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
agent2:
#Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
#Describe/configure the source
a2.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r1.batchSize = 5000
a2.sources.r1.batchDurationMillis = 2000
a2.sources.r1.kafka.bootstrap.servers = localhost:9092
a2.sources.r1.kafka.topics = test1
a2.sources.r1.kafka.consumer.group.id = custom.g.id
#Describe the sink
a2.sinks.k1.type = logger
#Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Sink groups
Flume allows several sinks to be treated as a single sink, providing failover or load balancing:
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
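The configuration above load-balances across k1 and k2, picking a sink at random. For failover instead, the failover processor sends all events to the highest-priority live sink; the property names below follow the Flume user guide:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```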