Apache Flume

最新推荐文章于 2024-07-01 13:05:14 发布

李格非

最新推荐文章于 2024-07-01 13:05:14 发布

阅读量412

点赞数 1

分类专栏： Flume 文章标签： flume

本文链接：https://blog.csdn.net/leegh1992/article/details/64443053

版权

Flume 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

Spark，Hadoop交流群，群QQ号：521066396，欢迎加入共同学习，一起进步~

一、Apache Flume简介

官方网址：http://flume.apache.org/
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
flume是一个分布式、高可靠、高可用的服务，能够有效的收集、聚合、移动大量的日志数据。
1、它有一个简单、灵活的基于流的数据流结构。
2、具有故障转移机制和负载均衡机制。
3、使用了一个简单的可扩展的数据模型（source、channel、sink）。
flume-ng处理数据有两种方式：avro-client、agent。
avro-client：一次性将数据传输到指定的avro服务的客户端。
agent：一个持续传输数据的服务。
Agent主要组件包含：Source 、Channel、Sink
数据在组件传输的单位是Event

Flume NG 节点组成图

这里写图片描述
Source +Channel +Sink = Agent
配置source
配置channel
配置sink
连接各个组件。
采集，存储，推送

二、部署Agent

1、搭建

（1）解压缩：apache-flume-1.6.0-bin.tar.gz
tar zvxf apache-flume-1.6.0-bin.tar.gz
$ cp conf/flume-env.sh.template conf/flume-env.sh
（2）在conf/flume-env.sh配置JAVA_HOME
创建配置文件example.conf 参考 conf/flume-conf.properties.template
注意：export JAVA_OPTS
（3）在conf目录下创建example.conf文件

# example.conf: A single-node Flume configuration
# Name the components on this agent
#a1表示agent名，下面三行是定义source,sink,channel组件
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#下面分别配置source,sink,channel组件
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#下面是连接source,sink,channel组件
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

启动agent

bin/flume-ng agent --conf conf/ --conf-file conf/example.conf --name a1     -Dflume.monitoring.type=http -Dflume.monitoring.port=34343 -Dflume.root.logger=INFO,console &

测试
netstat -an | grep 44444
Telnet后：ctrl + ] 回车到telnet界面然后quit【q】退出
可以在[hostname:xxxx]/metrics 上看到监控信息

三、Flume的Source

Source：source意为来源、源头。
主要作用：从外界采集各种类型的数据，将数据传递给Channel。
比如:监控某个文件只要增加数据就立即采集新增的数据、监控某个目录一旦有新文件产生就采集新文件的内容、监控某个端口等等。
常见采集的数据类型：
Exec Source、Avro Source、NetCat Source、Spooling Directory Source、Kafka Source等
详细查看：http://flume.apache.org/FlumeUserGuide.html#flume-sources
或者自带的文档查看（doc/index.html）。
Source具体作用：
AvroSource：监听一个avro服务端口，采集Avro数据序列化后的数据；
Thrift Source：监听一个Thrift服务端口，采集Thrift数据序列化后的数据；
Exec Source：基于Unix的command在标准输出上采集数据；
JMS Source：Java消息服务数据源，Java消息服务是一个与具体平台无关的API，这是支持jms规范的数据源采集；
Spooling Directory Source：通过文件夹里的新增的文件作为数据源的采集；
Kafka Source：从kafka服务中采集数据。zk,topic,consumer
NetCat Source：绑定的端口（tcp、udp），将流经端口的每一个文本行数据作为Event输入
HTTP Source：监听HTTP POST和 GET产生的数据的采集

四、Flume的Channel

Channel：一个数据的存储池，中间通道。
主要作用：接受source传出的数据，向sink指定的目的地传输。Channel中的数据直到进入到下一个channel中或者进入终端才会被删除。当sink写入失败后，可以自动重写，不会造成数据丢失，因此很可靠。
channel的类型很多比如:内存中、jdbc数据源中、文件形式存储等。
常见采集的数据类型：
Memory Channel、File Channel、JDBC Channel、Kafka Channel等
详细查看：http://flume.apache.org/FlumeUserGuide.html#flume-channels
Channel具体作用：
Memory Channel：使用内存作为数据的存储。
JDBC Channel：使用jdbc数据源来作为数据的存储。
Kafka Channel：使用kafka服务来作为数据的存储。
File Channel：使用文件来作为数据的存储，效率比较低。
Spillable Memory Channel：使用内存和文件作为数据的存储，即：先存在内存中，如果内存中数据达到阀值则flush到文件中。

五、Flume中的Sink

Sink：数据的最终的目的地。
主要作用：接受channel写入的数据以指定的形式表现出来（或存储或展示）。
sink的表现形式很多比如:打印到控制台、hdfs上、avro服务中、文件中等。
常见采集的数据类型：HDFS Sink、Hive Sink、Logger Sink、Avro Sink、Thrift Sink、File Roll Sink、HBaseSink、Kafka Sink等
详细查看：http://flume.apache.org/FlumeUserGuide.html#flume-sinks
HDFSSink需要有hdfs的配置文件和类库。一般采取多个sink汇聚到一台采集机器负责推送到hdfs。
Sink具体作用：
HDFS Sink：将数据传输到hdfs集群中。
Hive Sink：将数据传输到hive的表中。
Logger Sink：将数据作为日志处理（根据flume中的设置的日志的级别显示）。
Avro Sink：数据被转换成Avro Event，然后发送到指定的服务端口上。
Thrift Sink：数据被转换成Thrift Event，然后发送到指定的的服务端口上。
IRC Sink：数据向指定的IRC服务和端口中发送。
File Roll Sink：数据传输到本地文件中。
Null Sink：取消数据的传输，即不发送到任何目的地。
HBaseSink：将数据发往hbase数据库中。
MorphlineSolrSink：数据发送到Solr搜索服务器（集群）。
ElasticSearchSink：数据发送到Elastic Search搜索服务器（集群）。
Kafka Sink：将数据发送到kafka服务中。（注意依赖类库）

六、Flume的Event

event是Flume NG传输的数据的基本单位，也是事务的基本单位。
在文本文件，通常是一行记录就是一个event。
网络消息传输系统中，一条消息就是一个event。
一个需求：怎么实时监听一个文件的数据增加呢？
如果这个文件增加的量特别大呢？
答：加大conf文件中a1.channels.c1.capacity和transactionCapacity的容量
内存channel报错;java.lang.OutOfMemoryError 默认：-Xmx20m 修改flume-env.sh，打开export JAVA_OPTS注释
这里写图片描述

在浏览器输入：http://ip:34343/
{“SOURCE.r1”:{“OpenConnectionCount”:”0”,”Type”:”SOURCE”,”AppendBatchAcceptedCount”:”0”,”AppendBatchReceivedCount”:”0”,”EventAcceptedCount”:”15”,”StopTime”:”0”,”AppendReceivedCount”:”0”,”StartTime”:”1456742121305”,”EventReceivedCount”:”15”,”AppendAcceptedCount”:”0”},”CHANNEL.c1”:{“EventPutSuccessCount”:”15”,”ChannelFillPercentage”:”0.0”,”Type”:”CHANNEL”,”EventPutAttemptCount”:”15”,”ChannelSize”:”0”,”StopTime”:”0”,”StartTime”:”1456742121296”,”EventTakeSuccessCount”:”15”,”ChannelCapacity”:”1000000”,”EventTakeAttemptCount”:”21”}}

七、高级组件-Interceptors

Interceptors:source的拦截器

主要作用：对于一个source可以指定一个或者多个拦截器，按先后顺序依次对数据进行处理。

比如:在收集的数据的event的header中加入处理的时间戳、agent的主机或者IP、固定key-value等等。
常见Interceptors类型：Timestamp Interceptor、Host Interceptor、Static Interceptor、UUID Interceptor、Morphline Interceptor、Search and Replace Interceptor、Regex Filtering Interceptor、Regex Extractor Interceptor等
如何使用Interceptor？将source加入Interceptor测试看下header即可。
详细查看：http://flume.apache.org/FlumeUserGuide.html#flume-interceptors

a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type=org.apache.flume.interceptor.TimestampInterceptor$Builder

拦截hostname和时间戳。

八、高级组件-Channel Selectors

Channel Selectors:Channel选择器

主要作用：对于一个source发往多个channel的策略设置

Channel Selectors类型：
Replicating Channel Selector (default)、Multiplexing Channel Selector
其中：Replicating 会将source过来的events发往所有channel,而Multiplexing selector会根据event中某个header对应的value来将event发往不同的channel（header与value就是KV结构）。
详细查看：http://flume.apache.org/FlumeUserGuide.html#flume-channel-selectors
Multiplexing Channel Selector
Event里面的header类型：Map[String, String]
如果header中有一个key:state,我们在不同的数据源设置不同的state则通过设置这个key的值来分配不同的channel

a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
Switch state:
Case CZ :c1
Case US :c2 c3
Default c4

九、高级组件-Sink Processor

Sink Processor:Sink处理器

主要作用：针对Sink groups的处理策略设置

Sink Processor类型：
Default Sink Processor、Failover Sink Processor、Load balancing Sink Processor
其中：Default是默认的情况不用配置sinkgroup，Failover是故障转移机制、Load balancing是负载均衡机制，最后个需要定义sinkgroup。
详细查看：http://flume.apache.org/FlumeUserGuide.html#flume-sink-processors
应用：负载均衡、故障转移
Load balancing Sink Processor：processor.selector：round_robin,、random、自定义
轮询调度（Round Robin Scheduling）算法就是以轮询的方式依次将请求调度不同的服务器，即每次调度执行i = (i + 1) mod n，并选出第i台服务器。算法的优点是其简洁性，它无需记录当前所有连接的状态，所以它是一种无状态调度。
负载均衡

a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=load_balance
a1.sinkgroups.g1.processor.backoff=true
a1.sinkgroups.g1.processor.selector=round_robin

故障转移

a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=failover
a1.sinkgroups.g1.processor.priority.k1=10
a1.sinkgroups.g1.processor.priority.k2=5
a1.sinkgroups.g1.processor.maxpenalty=10000

十、负载均衡例子

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels=c1
a1.sources.r1.command=tail -F /opt/logs/access.log          
#define sinkgroups
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=load_balance
a1.sinkgroups.g1.processor.backoff=true
a1.sinkgroups.g1.processor.selector=round_robin
#define the sink 1
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.1.11
a1.sinks.k1.port=41414
#define the sink 2
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=192.168.1.12
a1.sinks.k2.port=41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel=c1

十一、故障转移例子

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels=c1
a1.sources.r1.command=tail -F /opt/logs/access.log

#define sinkgroups
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=failover
a1.sinkgroups.g1.processor.priority.k1=10
a1.sinkgroups.g1.processor.priority.k2=5
a1.sinkgroups.g1.processor.maxpenalty=10000

#define the sink 1
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.1.156
a1.sinks.k1.port=41414

#define the sink 2
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=192.168.1.12
a1.sinks.k2.port=41414


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel=c1
avro-sink.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/logs/access.log
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.1.11
a1.sinks.k1.port = 41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1