Flume Study (3): Flume Configuration Methods

Latest recommended article published on 2024-07-18 21:37:27

匿名啊啊啊

Category: Flume

Article link: https://blog.csdn.net/qq_41851454/article/details/80230364

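This article describes in detail how to configure Apache Flume. It starts with single-agent flow configuration, with examples such as monitoring a directory, reading data from a web server into HDFS, and TCP data transfer. It then explains single-agent multi-flow configuration, which runs several independent flows in one agent. Next it covers multi-agent flows and shows how events are passed between nodes. Finally, it looks at multiplexing flows, including use cases for the replicating and multiplexing modes.

Summary generated by CSDN using AI.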

1. Single-Agent Flow Configuration

1.1 Official documentation

http://flume.apache.org/FlumeUserGuide.html#avro-source

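A flow links a source to a sink through a channel. For a given agent, you list its sources, sinks, and channels, and then point each source and each sink at a channel. A source instance can specify multiple channels, but a sink instance can specify only one channel. The general format therefore has two parts: the component lists, and the source-to-channel and sink-to-channel bindings.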
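The layout described above can be written out as a template. This follows the pattern given in the Flume User Guide. The names in angle brackets are placeholders to be replaced with your own agent and component names, not literal values.

```properties
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# bind the source to one or more channels (note the plural "channels")
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# bind the sink to exactly one channel (note the singular "channel")
<Agent>.sinks.<Sink>.channel = <Channel1>
```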

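Worked example: an agent named agent_foo receives data from an external avro client and sends it to HDFS through a memory channel. The configuration file foo.config lists one avro source, one HDFS sink, and one memory channel, and binds both the source and the sink to that channel.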
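A minimal sketch of what foo.config could contain for this flow, using the component names that appear in the explanation that follows (avro-appserver-src-1, hdfs-sink-1, mem-channel-1). It only defines the flow; the per-component properties such as type are set separately.

```properties
# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set the channel for the source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set the channel for the sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
```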

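Explanation: this makes events flow from avro-appserver-src-1 to hdfs-sink-1 through the memory channel mem-channel-1. When the agent is started with foo.config as its configuration file, it instantiates this flow.

Configuring individual components

After defining the flow, you need to set the properties of each source, sink, and channel. Property values are set separately for each component.

The "type" property must be set for every component so that Flume knows what kind of object to create. Each source, sink, and channel type has its own set of properties that it needs in order to work as intended, and all of them must be set as required. In the previous example, the flow runs from the avro-appserver-src-1 source to hdfs-sink-1 through the memory channel mem-channel-1, so each of these three components needs its own type and type-specific properties.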
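As a sketch of the per-component settings for this flow, modeled on the corresponding example in the Flume User Guide: the bind address, port, channel capacities, and the HDFS path (including the host name namenode) are illustrative assumptions, not values taken from this article.

```properties
# properties of the avro source (bind/port values are illustrative)
agent_foo.sources.avro-appserver-src-1.type = avro
agent_foo.sources.avro-appserver-src-1.bind = localhost
agent_foo.sources.avro-appserver-src-1.port = 10000

# properties of the memory channel (capacities are illustrative)
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of the HDFS sink ("namenode" is a placeholder host name)
agent_foo.sinks.hdfs-sink-1.type = hdfs
agent_foo.sinks.hdfs-sink-1.hdfs.path = hdfs://namenode/flume/webdata
```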

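1.2 Test example (1)

Flow configuration

Single-agent flow configuration

Case 1: use Flume to monitor a directory; when a new file appears in the directory, print its contents to the console.

# File name: sample1.properties

# Configuration:

Create two directories on the Linux system: one to hold the configuration files (flumetest) and one to hold the files to be read (flume).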

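# Monitor the specified directory; when a new file appears, print its contents to the console
# Configure an agent; the agent name can be chosen freely
# Specify the agent's sources, sinks, and channels
# Name the agent's sources, sinks, and channels; the names can be chosen freely
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# Configure the source, addressed by the source name declared for the agent
# Source parameters differ by source type; look them up in the documentation
# Spooling directory source; the flume directory holds the files to be read
a1.sources.s1.type=spooldir
a1.sources.s1.spoolDir=/home/hadoop/apps/apache-flume-1.8.0-bin/flume

# Configure the channel, addressed by the channel name declared for the agent
# Use a memory channel
a1.channels.c1.type=memory

# Configure the sink, addressed by the sink name declared for the agent
# Use a logger sink
a1.sinks.k1.type=logger

# Bindings. Note: the source property is "channels" (with an s), the sink property is "channel" (no s)
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1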

Upload the sample1.properties configuration file to the flumetest directory on the Linux system.

Start Flume with this command:

bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/flumetest/sample1.properties --name a1 -Dflume.root.logger=INFO,console

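--conf: the directory that holds Flume's own configuration files
--conf-file: the agent configuration file for this log collection job
--name: the name of the agent to run
-Dflume.root.logger=INFO,console: print the collected information to the console

Part of the startup log: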

18/05/05 20:28:16 INFO node.AbstractConfigurationProvider: Creating channels  
18/05/05 20:28:16 INFO channel.DefaultChannelFactory: Creating instance of channel c1 type memory  
18/05/05 20:28:16 INFO node.AbstractConfigurationProvider: Created channel c1  
18/05/05 20:28:16 INFO source.DefaultSourceFactory: Creating instance of source s1, type spooldir  
18/05/05 20:28:16 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: logger  
18/05/05 20:28:16 INFO node.AbstractConfigurationProvider: Channel c1 connected to [s1, k1]  
18/05/05 20:28:16 INFO node.Application: Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:Spool Directory source s1: { spoolDir: /home/hadoop/apps/apache-flume-1.8.0-bin/flume } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@101f0f3a counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }  
18/05/05 20:28:16 INFO node.Application: Starting Channel c1  
18/05/05 20:28:16 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.  
18/05/05 20:28:16 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: c1 started  
18/05/05 20:28:16 INFO node.Application: Starting Sink k1  
18/05/05 20:28:16 INFO node.Application: Starting Source s1  
18/05/05 20:28:16 INFO source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /home/hadoop/apps/apache-flume-1.8.0-bin/flume  
18/05/05 20:28:17 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.  
18/05/05 20:28:17 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: s1 started

Create a new file hello.txt on the Linux system:

[hadoop@hadoop02 ~]$ vi hello.txt   
hello  
world

Copy this file into the directory that holds the files to be read (the directory set in the configuration file):

a1.sources.s1.spoolDir=/home/hadoop/apps/apache-flume-1.8.0-bin/flume

Use this command:

[hadoop@hadoop02 ~]$ cp hello.txt ~/apps/apache-flume-1.8.0-bin/flume

Result:

18/05/05 20:30:10 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.  
18/05/05 20:30:10 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/hadoop/apps/apache-flume-1.8.0-bin/flume/hello.txt to /home/hadoop/apps/apache-flume-1.8.0-bin/flume/hello.txt.COMPLETED  
18/05/05 20:30:14 INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F                                  hello }  
18/05/05 20:30:14 INFO sink.LoggerSink: Event: { headers:{} body: 77 6F 72 6C 64                                  world }
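The log above shows hello.txt being renamed to hello.txt.COMPLETED once it has been read, which is the spooling directory source's default behavior. As a hedged sketch, these are the spooldir properties that control it. They are optional additions to sample1.properties that the article itself does not set; the first two lines restate the defaults.

```properties
# suffix appended to a file once it has been fully ingested (default: .COMPLETED)
a1.sources.s1.fileSuffix = .COMPLETED
# "never" keeps the renamed file; "immediate" deletes it after ingestion instead
a1.sources.s1.deletePolicy = never
# add a header with the absolute path of the originating file to every event
a1.sources.s1.fileHeader = true
```

Files placed in the spooling directory should not be modified or have their names reused after being dropped in, since the source expects each file to be complete and uniquely named.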

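1.3 Test case (2)

Case 2: simulate reading data from a web server into HDFS in real time

Introduction to the Exec Source

This case simulates data produced by a web application, so the data generator has to be kept running the whole time.

Create an empty file: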

[hadoop@hadoop02 tomcat]$ touch catalina.out  
[hadoop@hadoop02 tomcat]$ ll  
total 0  
-rw-rw-r--. 1 hadoop hadoop 0 May  6 12:19 catalina.out

Write a small shell loop that appends one line to this file every second:

[hadoop@hadoop02 tomcat]$ while true; do echo `date` >> catalina.out; sleep 1; done

Check the file with this command (the data keeps growing):

[hadoop@hadoop02 tomcat]$ tail -F catalina.out   
Sun May 6 12:24:57 CST 2018  
Sun May 6 12:24:58 CST 2018  
Sun May 6 12:24:59 CST 2018  
Sun May 6 12:25:00 CST 2018  
Sun May 6 12:25:01 CST 2018

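# File name: case_hdfs.properties

# Configuration (everything runs on the same node):

The source reads the data in tomcat/catalina.out. This file is updated continuously, and tail -F delivers each newly appended line as it is written.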

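# Configure an agent; the agent name can be chosen freely
# Specify the agent's sources, sinks, and channels
# Name the agent's sources, sinks, and channels; the names can be chosen freely
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the source, addressed by the source name declared for the agent
# Source parameters differ by source type; look them up in the documentation
# Exec source: run a command and turn each line of its output into an event
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /home/hadoop/tomcat/catalina.out

# Configure the channel, addressed by the channel name declared for the agent
# Use a memory channel
a1.channels.c1.type = memory

# Configure the sink, addressed by the sink name declared for the agent
# Use an HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
# Directory rolling: round the timestamp used in the path down to the minute
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
# Use the agent's local time, since exec source events carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File name prefix and suffix (the sink does not insert a period before the suffix)
a1.sinks.k1.hdfs.filePrefix = taobao
a1.sinks.k1.hdfs.fileSuffix = log
# File rolling: start a new file after 10 seconds, 1024 bytes, or 10 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollCount = 10
# Write the raw event bodies as a plain data stream
a1.sinks.k1.hdfs.fileType = DataStream

# Bind the source to its channel
a1.sources.s1.channels = c1

# Bind the sink to its channel
a1.sinks.k1.channel = c1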
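With all three roll settings active, the HDFS sink starts a new file as soon as any one threshold is reached, which at one event per second produces many small files. As a sketch of an alternative that is not part of the original configuration, setting a threshold to 0 disables that trigger, so rolling can be driven by time alone. The 60-second interval is an illustrative value.

```properties
# roll to a new file every 60 seconds (illustrative value)
a1.sinks.k1.hdfs.rollInterval = 60
# 0 = never roll based on file size
a1.sinks.k1.hdfs.rollSize = 0
# 0 = never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0
```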

Run this command:

bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/flumetest/case_hdfs.properties --name a1 -Dflume.root.logger=INFO,console

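--conf: the directory that holds Flume's own configuration files
--conf-file: the agent configuration file for this log collection job
--name: the name of the agent to run
-Dflume.root.logger=INFO,console: print the collected information to the console

Part of the log output: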

18/05/06 16:09:44 INFO conf.FlumeConfiguration: Processing:k1  
18/05/06 16:09:44 INFO conf.FlumeConfiguration: Processing:k1  
18/05/06 16:09:44 INFO conf.FlumeConfiguration: Processing:k1  
18/05/06 16:09:44 INFO conf.FlumeConfiguration: Processing:k1  
18/05/06 16:09:44 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [a1]  
18/05/06 16:09:44 INFO node.AbstractConfi