Flume Study Notes

ZeroTeam_麒麟

Published 2016-04-08 19:38:41

Article link: https://blog.csdn.net/qq_26840065/article/details/51099141

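What is Flume
A distributed framework for collecting and aggregating streams of event data
Typically used for log data
Compared with ad-hoc solutions, it offers clear advantages:
Reliable, scalable, manageable, customizable, and high-performance
Declarative configuration that can be updated dynamically
Contextual routing
Load balancing and failover support
Feature-rich
Fully extensible
A framework for data collection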

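Flume characteristics:
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It supports custom data senders in the logging system for collecting data, and it can apply simple processing to the data and write it to a variety of receivers (such as text files, HDFS, and HBase).
Flume's data flow is built around events. An Event is Flume's basic unit of data: it carries the log data as a byte array together with header information. Events originate outside the Agent and are captured by a Source inside the Agent; the Source formats each captured event and then pushes it into one or more Channels. A Channel can be thought of as a buffer that holds an event until a Sink has finished processing it. The Sink persists the log data or forwards the event to another Source.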
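
The flow described above (Source formats an event, pushes it into a Channel, Sink drains the Channel) can be sketched in a few lines of Python. This is only an illustrative model, not Flume code: the `Event` and `Channel` classes, the `source` and `sink` functions, and the `origin` header are all hypothetical names chosen for the sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: headers plus a byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Channel:
    """Buffer that holds events until a sink has taken them."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


def source(raw_line, channels):
    """Format an incoming line as an Event and push it into every channel."""
    event = Event(body=raw_line.rstrip("\n").encode("utf-8"),
                  headers={"origin": "sketch"})  # assumed header, for illustration
    for channel in channels:
        channel.put(event)


def sink(channel, destination):
    """Drain the channel and append each event body to the destination list."""
    while (event := channel.take()) is not None:
        destination.append(event.body.decode("utf-8"))


channel = Channel()
stored = []
source("GET /index.html 200\n", [channel])
sink(channel, stored)
print(stored)
```

In a real agent the Source and Sink run concurrently and the Channel is transactional; the sketch runs them one after the other to keep the data path visible.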

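Flume reliability
When a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the data has been delivered successfully; if delivery fails, the data can be resent), store on failure (the strategy Scribe also uses: when the receiver crashes, data is written locally and sending resumes once the receiver recovers), and best effort (data is sent to the receiver without any acknowledgement). These three levels come from the older Flume OG (0.9.x) releases; in Flume 1.x, delivery guarantees come from transactional channels.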

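Flume recoverability:
Recoverability also depends on the Channel. FileChannel is recommended: it persists events to the local file system, at the cost of lower performance.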
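
To illustrate why a file-backed channel survives a restart, here is a minimal Python sketch. It assumes a much simpler on-disk format (one JSON object per line) than the real FileChannel uses, and it leaves out removing events once they are taken; the class name `FileBackedChannel` and the file name `channel.log` are hypothetical.

```python
import json
import os
import tempfile


class FileBackedChannel:
    """Toy channel that appends every event to a log file before accepting it."""

    def __init__(self, path):
        self.path = path
        self.events = []
        if os.path.exists(path):
            # Replay events that were persisted before a restart.
            with open(path, encoding="utf-8") as log:
                self.events = [json.loads(line) for line in log if line.strip()]

    def put(self, event):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(event) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the write to disk: the performance cost
        self.events.append(event)


with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "channel.log")
    before_crash = FileBackedChannel(log_path)
    before_crash.put({"headers": {}, "body": "line 1"})
    before_crash.put({"headers": {}, "body": "line 2"})
    # Simulate a restart: a new object rebuilds its state from the file.
    after_restart = FileBackedChannel(log_path)
    recovered = [event["body"] for event in after_restart.events]

print(recovered)
```

The disk sync on every `put` is what makes the events recoverable, and it is also the reason a file-backed channel is slower than a memory channel.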

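Some core Flume concepts:
Agent       A JVM process that runs Flume. Typically one agent runs on each machine, and a single agent can contain multiple sources and sinks.
Client       Produces the data; runs in a separate thread.
Source       Collects data from the Client and passes it to the Channel.
Sink       Takes data from the Channel; runs in a separate thread.
Channel       Connects sources and sinks; it behaves somewhat like a queue.
Events       Can be log records, Avro objects, and so on.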

Agent
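Flume Source
Collects logs. It can ingest many kinds of data, including avro, netcat, thrift, exec, jms, and syslog. Collected data is held temporarily in a Channel.
Flume Channel
Temporary storage for the data. Events can be kept in memory, jdbc, file, or a custom implementation. The Channel deletes an event only after the Sink has sent it successfully.
Flume Sink
The component that delivers logs to their destination, such as hdfs, hbase, logger, avro, file, solr, or a custom sink.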
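
The rule that a Channel drops an event only after the Sink has delivered it can be modeled as a take followed by a commit or a rollback. The Python sketch below is an assumption-laden toy, not the Flume transaction API: `TransactionalChannel`, `deliver`, and `failing_send` are hypothetical names, and only one event is in flight at a time.

```python
from collections import deque


class TransactionalChannel:
    """Toy channel: a taken event is only dropped once the sink commits."""

    def __init__(self, events):
        self._queue = deque(events)
        self._in_flight = None

    def take(self):
        # Hand out the head event but keep a reference until commit or rollback.
        if self._in_flight is None and self._queue:
            self._in_flight = self._queue.popleft()
        return self._in_flight

    def commit(self):
        # The sink delivered the event, so it can be forgotten.
        self._in_flight = None

    def rollback(self):
        # Delivery failed: put the event back at the head of the queue.
        if self._in_flight is not None:
            self._queue.appendleft(self._in_flight)
            self._in_flight = None

    def pending(self):
        return list(self._queue)


def deliver(channel, send):
    """Try to send one event; return True on success, False on failure."""
    event = channel.take()
    if event is None:
        return False
    try:
        send(event)
    except OSError:
        channel.rollback()
        return False
    channel.commit()
    return True


def failing_send(event):
    raise OSError("destination unavailable")


delivered = []
channel = TransactionalChannel(["e1", "e2"])
first_try = deliver(channel, failing_send)       # fails, e1 stays queued
second_try = deliver(channel, delivered.append)  # succeeds, e1 is removed
print(first_try, second_try, delivered, channel.pending())
```

Because the failed attempt rolls back, the same event is retried on the next call, which is the behavior the Channel description relies on.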

1. Installation and testing
Extract apache-flume-1.6.0-bin.tar.gz:
tar zvxf apache-flume-1.6.0-bin.tar.gz
$ cp conf/flume-env.sh.template conf/flume-env.sh
Set JAVA_HOME in conf/flume-env.sh.
Create the agent configuration file example.conf, using conf/flume-conf.properties.template as a reference.
Start the agent:
bin/flume-ng agent --conf conf/  --conf-file conf/example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34343 -Dflume.root.logger=INFO,console &
netstat -a | grep 44444
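After testing with telnet (for example, telnet localhost 44444), press Ctrl + ] and Enter to return to the telnet prompt, then type quit (or q) to exit.
Monitoring information is available at [hostname:xxxx]/metrics, where xxxx is the port passed via -Dflume.monitoring.port.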

Connecting with the Avro client
$ bin/flume-ng avro-client --conf conf -H localhost -p 41414 -F  ~/.bashrc

vim example.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
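
A common mistake in a file like this is forgetting to bind a source or sink to a declared channel. The following Python sketch checks that wiring for one agent. It is a hypothetical helper, not part of Flume, and it assumes the simple `key = value` layout shown above (no line continuations or escapes).

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems found in the channel bindings of one agent."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound or not set(bound) <= channels:
            problems.append(f"source {source} is not bound to a declared channel")
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel", "")
        if bound not in channels:
            problems.append(f"sink {sink} is not bound to a declared channel")
    return problems


EXAMPLE_CONF = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

problems = check_wiring(parse_properties(EXAMPLE_CONF), "a1")
print(problems)
```

Note the asymmetry the check mirrors: a source uses the plural key `channels` and may list several, while a sink uses the singular key `channel`.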


Flume example:
# Define the source name
a1.sources = r1
# Define the sink name
a1.sinks = k1
# Define the channel name
a1.channels = c1

# Describe/configure the source
# Type of the source (the receiver) that accepts the logs
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
# Port
a1.sources.r1.port = 8888
# Channel in which the source stores the logs it receives
a1.sources.r1.channels = c1

# Describe the sink
# The sink (the destination) prints events to the console through the logger
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# The channel (the storage) holds the logs received by the source
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Connect both the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command: flume-ng agent -c . -f /home/hadoop/flume-1.5.0-bin/conf/avro.conf -n a1 -Dflume.root.logger=INFO,console
Invocation (s1 is the host address, 8888 is the port): curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "idoall.org_body"}]' http://s1:8888

Output:
16/03/25 05:04:12 INFO sink.LoggerSink: Event: { headers:{b=b1, a=a1} body: 69 64 6F 61 6C 6C 2E 6F 72 67 5F 62 6F 64 79 idoall.org_body }
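
The request body and the hex dump in the log line can be cross-checked with a short Python sketch. It builds the same JSON array that the curl call posts and decodes the hex bytes printed by the logger sink; the helper names `build_payload` and `decode_hex_dump` are hypothetical, and nothing is sent over the network.

```python
import json


def build_payload(events):
    """Serialize (headers, body) pairs into the JSON array sent by the curl call."""
    return json.dumps(
        [{"headers": dict(headers), "body": body} for headers, body in events]
    )


def decode_hex_dump(hex_dump):
    """Turn the space-separated hex bytes printed by the logger sink into text."""
    return bytes.fromhex(hex_dump).decode("utf-8")


payload = build_payload([({"a": "a1", "b": "b1"}, "idoall.org_body")])
logged = decode_hex_dump("69 64 6F 61 6C 6C 2E 6F 72 67 5F 62 6F 64 79")
sent_body = json.loads(payload)[0]["body"]
print(logged == sent_body)
```

The decoded bytes match the body that was posted, which confirms that the logger sink prints the body both as hex and as text.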

Flume interceptors
An event consists of two parts, headers and a body. Interceptors are mainly used to modify events, typically the content of the headers.
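
As a rough illustration of header modification, the Python sketch below chains two interceptor-like functions, loosely inspired by the idea behind Flume's timestamp and static interceptors. It is not the Flume Interceptor API: events are plain dictionaries, and every function name, the `datacenter` header, and the fixed clock value are assumptions made for the example.

```python
import time


def timestamp_interceptor(event, clock=time.time):
    """Return a copy of the event with a 'timestamp' header in milliseconds."""
    headers = dict(event["headers"])
    headers["timestamp"] = str(int(clock() * 1000))
    return {"headers": headers, "body": event["body"]}


def static_interceptor(key, value):
    """Build an interceptor that sets one fixed header unless already present."""
    def intercept(event):
        headers = dict(event["headers"])
        headers.setdefault(key, value)
        return {"headers": headers, "body": event["body"]}
    return intercept


def apply_chain(event, interceptors):
    """Run the event through each interceptor in order."""
    for intercept in interceptors:
        event = intercept(event)
    return event


event = {"headers": {"a": "a1"}, "body": b"idoall.org_body"}
chain = [
    # A fixed clock keeps the example deterministic.
    lambda e: timestamp_interceptor(e, clock=lambda: 1458882252.0),
    static_interceptor("datacenter", "dc1"),
]
result = apply_chain(event, chain)
print(result["headers"])
```

Each function returns a new event and leaves the body untouched, matching the description that interceptors work on the headers.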