flume

月夜楓

Article link: https://blog.csdn.net/cyxinda/article/details/78254176

1: After configuring two Flume instances, the following error is reported:

2016-07-11 12:40:27,845 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26

2016-07-11 12:40:27,871 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SocketConnector@0.0.0.0:41414

Server@5ea03ac7: java.net.BindException: Address already in use

java.net.BindException: Address already in use

at java.net.PlainSocketImpl.socketBind(Native Method)

at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)

at java.net.ServerSocket.bind(ServerSocket.java:376)

at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)

at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

at net.sf.json.util.JSONTokener.syntaxError(JSONTokener.java:505)

at net.sf.json.JSONObject._fromJSONTokener(JSONObject.java:1271)

at net.sf.json.JSONObject.fromObject(JSONObject.java:155)

at net.sf.json.util.JSONTokener.nextValue(JSONTokener.java:347)

at net.sf.json.JSONObject._fromJSONTokener(JSONObject.java:1180)

at net.sf.json.JSONObject.fromObject(JSONObject.java:155)

at net.sf.json.util.JSONTokener.nextValue(JSONTokener.java:347)

at net.sf.json.JSONObject._fromJSONTokener(JSONObject.java:1180)

at net.sf.json.JSONObject.fromObject(JSONObject.java:155)

at net.sf.json.util.JSONTokener.nextValue(JSONTokener.java:347)

at net.sf.json.JSONArray._fromJSONTokener(JSONArray.java:1132)

at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)

at org.mortbay.jetty.Server.handle(Server.java:326)

at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)

at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)

at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)

at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)

at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)

at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)

at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

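Solution: see http://qianshangding.iteye.com/blog/2259389 . The log shows the embedded Jetty server binding to port 41414, the default port of Flume's HTTP monitoring, so the likely cause is that both instances try to use the same monitoring port. Giving each instance its own -Dflume.monitoring.port should avoid the conflict.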
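The BindException above means the port was already taken when the second instance started. Below is a minimal Python sketch, not part of Flume, that checks whether a TCP port can still be bound before an agent is launched. The helper name port_is_free is hypothetical, and the check is only a hint, since another process can grab the port between the check and the agent start.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP server socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
            return True
        except OSError:
            # Typically "Address already in use", the same condition
            # that surfaces in Java as java.net.BindException.
            return False


if __name__ == "__main__":
    # 41414 is the monitoring port that appears in the log above.
    print("41414 free:", port_is_free(41414))
```

If the port is reported as busy, start the second agent with a different -Dflume.monitoring.port value.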

Flume monitoring that was added

Flume mainly offers the following monitoring methods:

JMX monitoring

JMX reporting can be enabled by setting the JAVA_OPTS environment variable in flume-env.sh, as follows:

export JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5445 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

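Ganglia monitoring

Flume can also report metrics to Ganglia 3 or Ganglia 3.1 metanodes. To report metrics to Ganglia, the Flume agent must be started with this support, by passing the following parameters as system properties prefixed with flume.monitoring. They can also be set in flume-env.sh:

Property	Default	Description

type	–	Component type name; must be ganglia
hosts	–	Comma-separated list of hostname:port pairs of the Ganglia servers
pollFrequency	60	Interval in seconds between reports to Ganglia
isGanglia3	false	Whether the Ganglia server is version 3; by default metrics are sent in Ganglia 3.1 format

To enable Ganglia support, start the agent with a command like this:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=com.example:1234,com.example2:5455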

JSON monitoring

Flume can also report metrics in JSON form. To enable JSON reporting, Flume needs a port to serve on. The metrics are reported in the following format:

{
"typeName1.componentName1" : {"metric1" : "metricValue1", "metric2" : "metricValue2"},
"typeName2.componentName2" : {"metric3" : "metricValue3", "metric4" : "metricValue4"}
}

For example:

{
"CHANNEL.fileChannel":{"EventPutSuccessCount":"468085",
"Type":"CHANNEL",
"StopTime":"0",
"EventPutAttemptCount":"468086",
"ChannelSize":"233428",
"StartTime":"1344882233070",
"EventTakeSuccessCount":"458200",
"ChannelCapacity":"600000",
"EventTakeAttemptCount":"458288"},
"CHANNEL.memChannel":{"EventPutSuccessCount":"22948908",
"Type":"CHANNEL",
"StopTime":"0",
"EventPutAttemptCount":"22948908",
"ChannelSize":"5",
"StartTime":"1344882209413",
"EventTakeSuccessCount":"22948900",
"ChannelCapacity":"100",
"EventTakeAttemptCount":"22948908"}
}

Property	Default	Description

type	–	Component type name; must be http
port	41414	Port on which the server is started

Flume can be started with the following command:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
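Once an agent reports metrics in the JSON shape shown in this section, a small script can turn them into something easier to watch. The following Python sketch computes the fill ratio of each channel as ChannelSize / ChannelCapacity. The function name channel_fill_ratios is hypothetical, and the sample string is a trimmed copy of the example metrics above rather than live output.

```python
import json


def channel_fill_ratios(metrics_json):
    """Map each CHANNEL component to ChannelSize / ChannelCapacity."""
    metrics = json.loads(metrics_json)
    ratios = {}
    for name, values in metrics.items():
        if values.get("Type") != "CHANNEL":
            continue  # sources and sinks have no capacity to compare against
        # The JSON report carries every metric value as a string.
        size = float(values["ChannelSize"])
        capacity = float(values["ChannelCapacity"])
        ratios[name] = size / capacity
    return ratios


# Trimmed copy of the sample metrics shown in this section.
sample = """{
  "CHANNEL.fileChannel": {"Type": "CHANNEL", "ChannelSize": "233428", "ChannelCapacity": "600000"},
  "CHANNEL.memChannel": {"Type": "CHANNEL", "ChannelSize": "5", "ChannelCapacity": "100"}
}"""

for name, ratio in channel_fill_ratios(sample).items():
    print(f"{name}: {ratio:.1%} full")
```

In a real setup the JSON text would be fetched from the agent's monitoring port instead of a hard-coded string.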

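Custom monitoring

Custom monitoring requires implementing the org.apache.flume.instrumentation.MonitorService interface. For example, if there is an HTTP monitoring class named HTTPReporting, it can be started as follows: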

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=com.example.reporting.HTTPReporting -Dflume.monitoring.node=com.example:332

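Custom components can also report metrics, but they must extend the abstract class org.apache.flume.instrumentation.MonitoredCounterGroup.

[Figure: classes in Flume that already extend MonitoredCounterGroup]

Following the conventions above, we can develop custom monitoring components.

2: Some notes on Flume configuration:

Reference: https://book.douban.com/people/cswuyg/annotation/26013531/

1. Overview and Architecture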
   
   
Flume was created to meet this need and create a standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.
   
   In June of 2011, Cloudera moved control of the Flume project to the Apache foundation.
   
   Everyone needed a tool like Flume for moving data in real time. Cloudera built one and moved it to the Apache foundation in June 2011. The refactored Flume came out in 2012 with version numbers Flume 1.x; this series is also called flume-ng.
   
   Compared with the earlier Flume, the obvious differences in flume-ng are that the master(s) and ZooKeeper were removed, and that the transport framework and the configuration changed considerably.
   
   page 9: If a file on HDFS is not closed and something goes wrong, the whole file ends up empty. But small files are unfriendly to Hadoop, so files should not be closed too frequently either.
   
   
In HDFS the file exists only as a directory entry, it shows as having zero length until the file is closed. This means if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts.
   
   Since the HDFS metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency.
   
   
If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing.
   
   page 10
   
   
A source writes events to one or more channels.
   
   A channel is the holding area as events are passed from a source to a sink.
   
   
A sink receives events from one channel only.
   
   An agent can have many sources, channels, and sinks.
   
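   The basic unit of data transfer is the event. An event is a combination of headers and a body, where the headers can consist of zero or more fields.
   
   Interceptors, channel selectors, and sink processors
   
   1. interceptor: can inspect and modify Flume events; a source can have multiple interceptors.
   
   2. channel selector: routes data from a source into one or more channels. Flume provides two kinds of selector: (1) the replicating channel selector, which copies each event to every channel, similar to replicas; (2) the multiplexing channel selector, which distributes events to channels according to header information, similar to sharding.
   
   3. sink processor: can be used to set up backup sinks, or to load-balance one channel across multiple sinks.
   
   
Tiered data collection (multiple flows and/or agents)
   
   Multiple agents can be chained together. For example, when necessary, an intermediate tier can be added between the data sources and the Hadoop cluster to buffer the data on its way to the cluster.
   
   2. Flume Quick Start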
   
   Flume configuration file overview
   
   Flume agent configuration uses the Java properties format. One configuration file can define multiple agents, so the agent name must be specified at startup.
   
   The simplest example:
   
   agent.sources=s1
   
   agent.channels=c1
   
   agent.sinks=k1
   
   agent.sources.s1.type=netcat
   
   agent.sources.s1.channels=c1
   
   agent.sources.s1.bind=0.0.0.0
   
   agent.sources.s1.port=12345
   
   agent.channels.c1.type=memory
   
   agent.sinks.k1.type=logger
   
   agent.sinks.k1.channel=c1
   
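   The agent is named agent, the source s1, the channel c1, and the sink k1.
   
   A sink can be attached to only one channel.
   
   Example of starting Flume: ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console
   
   -Dflume.root.logger=INFO,console sends the log to the console. If it is not set, the log goes to the logs/flume.log file by default. Edit conf/log4j.properties to change the logging configuration.
   
   3. Channels
   
   Memory Channel: events are not persisted to disk, so a machine failure or a Flume restart may lose data. Because it is bounded by memory size, the channel cannot buffer many events.
   
   File Channel: events are buffered on disk, so data is not lost and more events can be buffered. The drawback is lower performance.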
   
   Memory Channel
   
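   If you increase the channel capacity, remember to increase the Java heap space as well, using the -Xmx and optionally the -Xms parameters.
   
   The transactionCapacity property sets how many events are moved in one transaction, both when a source writes to the channel and when a sink reads from it. A higher value lowers the average per-event overhead, but if a transaction fails, the cost of retransmission rises accordingly.
   
   The keep-alive property sets the timeout for writing data to the channel; the timeout is triggered when the channel is full.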
   
   File Channel
   
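   Suitable for use cases where data loss is not acceptable.
   
   If there are multiple File Channels, place them on separate disks to avoid an I/O bottleneck. If several disks are available, dataDirs can list several directories, separated by commas.
   
   Avoid NFS; it is too slow.
   
   The default capacity is 1,000,000 events.
   
   The maxFileSize property sets the size of each log file. Only when a file fills up does the channel check whether old log files can be deleted, and an old file that still holds unprocessed data is not deleted. So with a setting of 2 GB, peak disk usage may reach 4 GB.
   
   The minimumRequiredSpace property sets how much free space must remain on the disk that holds the file channel. If less than this remains, an exception is thrown.
   
   Summary
   
   Comparison: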
   
   
The memory channel offers speed at the cost of data loss in the event of failure.
   
   Alternatively, the file channel provides a more reliable transport, in that it can tolerate agent failures and restarts, at a performance cost.
   
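   There are a few things to weigh when choosing: 1. If you pick the memory channel, what does data loss cost you? If you pick the file channel, how much must be spent on hardware upgrades to keep performance acceptable? 2. If data is lost, is it easy to recover?
   
   Data delivered to HDFS may contain duplicates. There are two ways to deal with them: periodically run a MapReduce job to remove the duplicates, or handle them when the data is used.
   
   4. Sinks and Sink Processors
   
   Many open-source sinks are available. If none fits, you can extend org.apache.flume.sink.AbstractSink and write your own.
   
   The HDFS sink supports many ways of setting file names and file paths.
   
   Data can be stored compressed: agent.sinks.k1.hdfs.codeC=gzip, but if the data is read many times, this hurts performance.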
   
   Event serializer
   
   Converts an event into another format.
   
   An event serializer is the mechanism by which a Flume event is converted into another format for output. 
   
   Sink group
   
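   Multiple sinks can be configured for failover or for load balancing. Setting the sink group's processor.type to failover selects failover.
   
   For load balancing, the selection method can be round_robin or random. round_robin simply means taking turns, i.e. a counter modulo the number of sinks.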
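To make the "modulo, taking turns" description of round_robin concrete, here is a small Python sketch. It only illustrates the selection order and is not Flume's implementation; the sink names k1, k2, and k3 and the function name round_robin are assumptions made for the example.

```python
import itertools


def round_robin(sinks):
    """Yield sinks in turn: the i-th pick is sinks[i % len(sinks)]."""
    for i in itertools.count():
        yield sinks[i % len(sinks)]


picker = round_robin(["k1", "k2", "k3"])
order = [next(picker) for _ in range(7)]
print(order)  # ['k1', 'k2', 'k3', 'k1', 'k2', 'k3', 'k1']
```

The random method would replace the modulo index with a random choice for each pick.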
   
   Compression, data write format...
   
   5. Sources and Channel Selectors
   
   Many source plugins are available. If none fits, you can extend org.apache.flume.source.AbstractSource and write your own.
   
   The problem with using tail
   
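   Earlier versions of Flume had a TailSource, a plugin similar to tail -f, but it was later removed because it loses data easily, and loses it in ways that are hard to notice:
   
   For example:
   
   (1) The application writes to a.log, and Flume reads a.log.
   
   (2) a.log fills up. The application renames it to a.log.1 and writes to a new a.log. Flume has not finished the original file at this point, so it keeps reading a.log.1.
   
   (3) a.log fills up again. The application renames a.log.1 to a.log.2 and a.log to a.log.1, then writes to a new a.log. Flume is now processing a.log.2. When it finishes, it assumes the newest file is a.log and moves on to it, so the contents of a.log.1 are lost.
   
   In addition, tail has no way of knowing whether a log line was written to the channel successfully. If the channel is full, tail does not know, and the data is lost.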
   
   The exec source
   
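   Runs an external process. Note that when Flume restarts, it does not shut down the old process; you have to stop it yourself.
   
   The Spooling directory source
   
   Watches a directory. Note that files must not be renamed, must not be overwritten by a file of the same name, and must not appear half-written. After a file has been transferred it is renamed to xx.COMPLETED, so a scheduled cleanup script is needed to remove these files.
   
   A restart produces duplicate events, because files that were only partly transferred have not been marked as completed.
   
   Syslog
   
   Receives syslog messages; supports TCP and UDP.
   
   Channel Selectors
   
   There are two kinds of selector: replicating and multiplexing.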
   
   The replicating selector writes the same event to all channels in the source's channels list.
   
   For example:
   
   agent.sources.s1.channels=c1 c2 c3
   
   agent.sources.s1.selector.type=replicating
   
   agent.sources.s1.selector.optional=c2 c3
   
   If channels are marked optional, a failed write to an optional channel does not matter; only the write to c1 has to succeed.
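The difference between required and optional channels can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not Flume code: the function replicate, the put callback, and the simulated failure of c2 are all assumptions made for the example.

```python
def replicate(event, channels, optional, put):
    """Write event to every channel; only a required channel can fail the call."""
    failed_optional = []
    for name in channels:
        try:
            put(name, event)
        except Exception:
            if name not in optional:
                raise  # a required channel failed, so the whole write fails
            failed_optional.append(name)  # tolerated: best effort only
    return failed_optional


stored = {"c1": [], "c2": [], "c3": []}


def put(name, event):
    """Hypothetical channel write; c2 is simulated as full."""
    if name == "c2":
        raise RuntimeError("channel c2 is full")
    stored[name].append(event)


# Same layout as the configuration above: c1 required, c2 and c3 optional.
print(replicate("event-1", ["c1", "c2", "c3"], {"c2", "c3"}, put))
```

Here the event still reaches c1 and c3, and the failure of c2 is reported but does not abort the write.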
   
   If you wanted to send different events to different channels, you would use a multiplexing channel selector by setting selector.type to multiplexing.
   
   The multiplexing selector can route events to different channels based on the value of a header field.
   
   For example:
   
   agent.sources.s1.selector.type=multiplexing
   
   agent.sources.s1.selector.header=port
   
   agent.sources.s1.selector.default=c2
   
   agent.sources.s1.selector.mapping.11111=c1 c2
   
   agent.sources.s1.selector.mapping.44444=c2
   
   agent.sources.s1.selector.optional.44444=c3
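The routing that this configuration describes can be sketched in Python as a lookup on the header value. This is a simplified model under stated assumptions, not Flume's selector: the function name select_channels and the dictionary layout are invented for the example, while the mapping, optional, and default values are copied from the configuration above.

```python
def select_channels(headers, header_name, mapping, optional, default):
    """Return (required, optional) channel lists for one event."""
    value = headers.get(header_name)
    # Unmapped or missing header values fall back to the default channels.
    required = mapping.get(value, default)
    # Optional channels are best effort; skip any that are already required.
    extra = [c for c in optional.get(value, []) if c not in required]
    return required, extra


mapping = {"11111": ["c1", "c2"], "44444": ["c2"]}
optional = {"44444": ["c3"]}
default = ["c2"]

for port in ("11111", "44444", "99999"):
    print(port, select_channels({"port": port}, "port", mapping, optional, default))
```

An event with port=44444 must reach c2 and is additionally offered to c3, while an event with an unmapped port goes only to the default channel c2.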
   
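   6. Interceptors, ETL, and Routing
   
   interceptor: runs after the source receives an event and before the event is written to the channel, and is used to modify events.
   
   The one I used is the static interceptor, which adds headers, e.g.: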
   
   agent.sources.s1.interceptors=pos env
   
   agent.sources.s1.interceptors.pos.type=static
   
   agent.sources.s1.interceptors.pos.key=pointOfSale
   
   agent.sources.s1.interceptors.pos.value=US
   
   agent.sources.s1.interceptors.env.type=static
   
   agent.sources.s1.interceptors.env.key=environment
   
   agent.sources.s1.interceptors.env.value=staging
   
   Set type to static, and set key and value to the header you want to add.
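What the two static interceptors above do to an event can be modelled with a short Python sketch. The event is represented as a plain dict, which is an assumption for illustration only, and static_interceptor is a hypothetical helper. The preserve_existing flag mirrors the static interceptor's preserveExisting option, which leaves an already present header untouched.

```python
def static_interceptor(key, value, preserve_existing=True):
    """Build an interceptor that adds a fixed header to every event."""
    def intercept(event):
        headers = event["headers"]
        # Only overwrite an existing header when preserve_existing is off.
        if not (preserve_existing and key in headers):
            headers[key] = value
        return event
    return intercept


# Same chain as the configuration above: pos first, then env.
chain = [
    static_interceptor("pointOfSale", "US"),
    static_interceptor("environment", "staging"),
]

event = {"headers": {"environment": "prod"}, "body": b"hello"}
for intercept in chain:
    event = intercept(event)
print(event["headers"])
```

In this run pointOfSale is added, while the existing environment header keeps its original value.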
   
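   regex_extractor can be used to extract information from the body, and its serializers then add that information to the event headers.
   
   To write a custom interceptor, implement these interfaces: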
   
   org.apache.flume.interceptor.Interceptor
   
   org.apache.flume.interceptor.Interceptor.Builder
   
   
Tiering data flows
   
   Avro is used as the transport protocol to pass events between agents.
   
   Sending a file, e.g.:
   
   ./flume-ng avro-client --filename foo.log --headerFile headers.properties --host collector.example.com --port 42424
   
   Log4J Appender
   
   The Load Balancing Log4J Appender
   
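   Lets the application write its logs directly into Flume, and from there into the channel.
   
   Routing
   
   An interceptor tags events, and a channel selector then routes them to different channels based on the tag.
   
   7. Monitoring Flume
   
   We need to monitor the rate at which data enters the source, the channel fill ratio, and the rate at which the sink reads data.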
   
   Ganglia
   
   The internal HTTP server
   
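   A built-in HTTP server can be started for a simple view of how Flume is running. To start it, add these parameters to the startup command line:
   
   -Dflume.monitoring.type=http -Dflume.monitoring.port=8879, which serves monitoring information on port 8879, e.g.:
http://xxxx:8879/metrics
This shows performance information such as channel usage. During development and testing it can be used to decide how large the file channel needs to be.
   
   The startup parameters above can also be added to the JAVA_OPTS variable in ./conf/flume-env.sh.
   
   
8. There Is No Spoon - The Realities Of Real-time Distributed Data Collection
   
   Some issues to watch out for:
   
   (1) Transfer time versus log time: the time a log entry is received is not the same as the time it was sent.
   
   (2) Time zones: with data centers around the world, take care to use one time zone everywhere, or use UTC by setting -Duser.timezone=UTC.
   
   (3) Disk capacity: large capacity costs a lot of money, so weigh it against the value of the data.
   
   (4) Multiple data centers: deploy a Hadoop cluster in each data center instead of shipping all data to a single one, though this makes computing totals more complicated.
   
   (5) Data usage rights (legal issues).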
   
   
end.

3: Summary 2:

Introduction to Flume

At the time of writing, the latest Flume release was 1.6.0.
Official description:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Basic Flume component architecture

[Figure: Flume-basic]

Flume agent: consists of sources, channels, and sinks. A source receives data sent by an external source and writes it to one or more channels. A channel passively stores the data until a sink consumes it. Within an agent, the source and sink run asynchronously by way of the channel (The source and sink within the given agent run asynchronously with the events staged in the channel.)
Event: defined in Flume as the basic unit of a data flow. An event carries headers that can hold arbitrary attributes.
Client SDK: provides an RPC client interface. Notable variants are the Failover client (a set of agents forms a failover group) and the LoadBalancing RPC client (a set of agents forms a load-balancing group; the balancing strategy can be random, round-robin, or user-defined).

Reliability

Events are removed from a channel only after they have been stored in the channel of the next agent or delivered to the terminal repository.
(The Sink removes an Event from the Channel only after the Event is stored into the Channel of the next agent or stored in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.)
Flume relies on the transactions provided by the channel to guarantee reliable delivery, which ensures end-to-end reliability of the data flow.
(The Sources and Sinks encapsulate the storage/retrieval of the Events in a Transaction provided by the Channel.)
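The rule that an event leaves the channel only after delivery succeeded can be illustrated with a small Python model. It is a sketch of the take, deliver, then commit-or-rollback idea and not Flume's Transaction API: the Channel class, drain_once, and the deliver callback are hypothetical names, and batch_size plays the role of the batch moved in one transaction.

```python
from collections import deque


class Channel:
    """Toy in-memory channel holding events in arrival order."""

    def __init__(self, events):
        self.queue = deque(events)


def drain_once(channel, deliver, batch_size):
    """Take up to batch_size events; put them back if deliver fails."""
    batch = []
    while channel.queue and len(batch) < batch_size:
        batch.append(channel.queue.popleft())
    try:
        deliver(batch)
    except Exception:
        # Rollback: return the events to the head of the queue, original order.
        channel.queue.extendleft(reversed(batch))
        return False
    return True  # Commit: the events are now gone from the channel.


def failing_deliver(batch):
    """Simulates a downstream agent that cannot be reached."""
    raise IOError("next hop unavailable")


channel = Channel(["e1", "e2", "e3"])
received = []

print(drain_once(channel, failing_deliver, 2), list(channel.queue))
print(drain_once(channel, received.extend, 2), list(channel.queue), received)
```

A failed delivery leaves the channel unchanged, so the same events are retried later, which is also why a downstream store can end up with duplicates if a failure happens after part of the batch was written.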

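Transactions

[Figure: flume-transaction]
See the figure above.

Recoverability

Events are staged in the channel, and the channel is responsible for recovery after a failure.
Channels come in two modes: durable (local file) and non-durable (memory).

Flume supports multi-hop flows

Only the source and sink of the agents involved need to be configured.
[Figure: flume-more-hop]

Multiplexing

Flume supports multiplexed delivery: one source can publish to multiple destinations.
[Figure: flume-multiplexing]
Note that a source instance can be configured with multiple channels, but a sink can be attached to only one channel.

Fan out flow

As mentioned above, Flume supports fanning out from one source to multiple channels. There are two ways to do this: replicating and multiplexing.
In replicating mode, events are sent to all configured channels.
In multiplexing mode, events are sent only to the channels that match the rules, which may be one, several, or all of them.

The mode is chosen with selector.type; the default is replicating.
In multiplexing mode the rules must also be specified. They are mainly based on header contents, and events are dispatched to different channels accordingly.
For example: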

agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

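The selector then checks the value of the State header: CA is sent to mem-channel-1, and so on.

Multiple source types

Flume supports many sources, such as the Avro source, Thrift source, exec source (command line, e.g. tail -f filelog), Kafka source, and so on.
Each source instance has its own life cycle, including start(), stop(), and process().

Multiple sink types

Flume supports many sinks, such as the HDFS sink, Hive sink, logger sink, Avro sink, Thrift sink, and so on.
Each sink instance has its own life cycle, including start(), stop(), and process().

Multiple channel types

Flume supports several channels, such as the memory channel, file channel, and Kafka channel.

Flume interceptors

Flume can modify or drop the events it receives, processing them according to configured rules. This can mean adding fields, modifying or replacing content, filtering, and so on.

Configuration polling

The Flume agent keeps checking whether the configuration file has been modified, and reloads it if so.

Considerations when designing a topology

Flume is well suited to writing text log data to HDFS, but the data it carries can take many forms; Flume treats everything it receives as binary data. The topology can be changed, but it is not suited to frequent changes such as daily ones (because reconfiguration takes some thought and overhead.).
Reliability of the data flow: the choice of channel type, durable or in-memory, and what happens when a channel fills up, since that can cause data loss; also whether to use redundant topologies (Whether you use redundant topologies.).
Topology design: whether to use aggregation when there are many sources.
Estimate the data volume and the throughput.

One person's opinion

Pros:

The system architecture is very clearly designed and the components are very loosely coupled, so they can be combined freely as needed;
it also ships with many sources and sinks and is highly extensible.

Cons:

Configuration is not flexible enough, especially when a data flow has to be updated dynamically;
although channels provide transactions, the system's overall error handling is still fairly weak, which makes it a poor fit for scenarios with high data quality requirements;
the cost of onboarding users is also fairly high, and operations are a challenge.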
