Apache Flume Quick Start Guide

Introduction

Apache Flume is a distributed, highly available framework for efficiently collecting, aggregating, and moving large volumes of log data from many different sources into a centralized store.

Apache Flume is not limited to log aggregation: with custom sources it can also transport large volumes of event data such as network traffic, social media activity, email messages, and almost any other data.
Apache Flume currently has two major version lines: 0.9.x and 1.x. The 0.9.x line is the legacy release, known as Flume OG (original generation); the 1.x line is a redesigned rewrite, known as Flume NG (next generation).

System Requirements

  • JDK 1.7 or later
  • Sufficient memory
  • Sufficient disk space
  • Read/write permissions on the directories Flume uses

Data Flow Model

A Flume event is defined as a unit of data flow: a payload of bytes plus an optional set of string attributes. A Flume agent is a JVM process that hosts the components through which events flow from an external source to the next hop (which may be another agent or the final destination).

  • Note: an event is Flume's basic unit of data transfer; one record corresponds to one event. It consists of headers and a body, where the body holds a byte array
    (see the data-flow diagram in the official user guide)

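The note above can be sketched as a tiny data model (a hypothetical Python class for illustration only; Flume's real Event is a Java interface with getHeaders/getBody):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """Sketch of a Flume event: a byte-array body plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)

# One log record becomes one event; the payload always travels as raw bytes,
# while headers carry optional routing/metadata strings.
evt = Event(body="hello".encode(), headers={"host": "app-01"})
```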
Flume Download Links

Download: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1.tar.gz
Documentation: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html

Flume Core Components (the parts of an agent)

  • source: collects data from an external origin. See the full list at http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#flume-sources. Commonly used sources: avro, exec, spooling directory, taildir, kafka, netcat, http, custom

  • channel: sits between the source and the sink, buffering events in transit (a memory channel also reduces disk I/O). See http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#flume-channels. Commonly used channels: memory, file, kafka, custom

  • sink: reads events from the channel and delivers them to the next stage. See http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#flume-sinks. Commonly used sinks: hdfs, logger, avro, kafka, custom

Defining a Data Flow (agent definition)

  • Overview: a simple data flow needs the three components: a source, a channel, and a sink, with the channel connecting the source to the sink. An agent configuration first lists its sources, sinks, and channels, then wires each source and each sink to a channel. Note: a source can write to multiple channels, but a sink reads from exactly one channel.
# Every agent must have a name.
# <Agent> = your agent name
# <Source> = your source name
# <Channel1> = name of channel 1
# <Channel2> = name of channel 2
# <Sink> = your sink name

# List the sources, sinks, and channels of the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# Bind the source to its channels
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# Bind the sink to its channel
<Agent>.sinks.<Sink>.channel = <Channel1>
  • Example:
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
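As noted above, a single source can fan out into several channels. A sketch extending the example with a second channel and a logger sink (mem-channel-2 and logger-sink-1 are illustrative names, not part of the original example):

```
# Hypothetical fan-out: the source replicates every event into two channels,
# each drained by its own sink
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1 logger-sink-1
agent_foo.channels = mem-channel-1 mem-channel-2

agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1 mem-channel-2
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
agent_foo.sinks.logger-sink-1.channel = mem-channel-2
```

By default a source replicates events to all of its channels (Flume's replicating channel selector), so both sinks see every event.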

Hands-On Practice

Listening for data on local port 44444

  • Documentation: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#a-simple-example

  • Create a configuration file named example-conf.properties under the conf directory of the Flume installation:

# Name the agent a1, with source r1, sink k1, and channel c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1


# Listen on a TCP port using the netcat source
# (http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#netcat-tcp-source):
# bind to localhost, port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Set the sink type to logger, i.e. print events to the log output
a1.sinks.k1.type = logger

# Use a memory channel: events from the source are buffered in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • Startup command:
    $ bin/flume-ng agent --conf conf --conf-file conf/example-conf.properties --name a1 -Dflume.root.logger=INFO,console
    Argument breakdown:
  • --conf: directory holding Flume's own configuration files
  • --conf-file: path to the agent configuration file
  • --name: name of the agent to start (must match the name used in the file, here a1)
  • -Dflume.root.logger=INFO,console: log at INFO level to the console
  • Test
    $ telnet localhost 44444
    Lines typed here appear in the agent's console:
# Terminal 1: input
hello
# Terminal 2: agent output
2019-09-11 21:46:39,812 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
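If telnet is not installed, a minimal Python client can play the same role (a sketch; the host and port match the config above). Note how the logger sink prints the event body as hex: the 68 65 6C 6C 6F 0D in the output above is just "hello" plus the trailing carriage return from telnet's CRLF line ending (the netcat source strips the newline).

```python
import socket

def send_line(host, port, text):
    """Send one CRLF-terminated line, mimicking typing into telnet."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(text.encode() + b"\r\n")

# The hex body printed by the logger sink decodes back to the typed text
# plus the carriage return (the 0D byte):
body = bytes.fromhex("68656C6C6F0D")
decoded = body.decode()  # "hello" followed by "\r"
```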

Monitoring appended file content and writing it to HDFS

  • Create a configuration file named example-conf1.properties under the conf directory of the Flume installation:
# Name the agent tailDir-hdfs, with source tailDir-source, sink tailDir-sink, and channel tailDir-channel
tailDir-hdfs.sources = tailDir-source
tailDir-hdfs.sinks = tailDir-sink
tailDir-hdfs.channels = tailDir-channel


# Use the TAILDIR source: watch the file in filegroup example2 for appended lines
tailDir-hdfs.sources.tailDir-source.type = TAILDIR
tailDir-hdfs.sources.tailDir-source.filegroups = example2
tailDir-hdfs.sources.tailDir-source.filegroups.example2 = /home/hadoop/data/flume/example.log


# Set the sink type to hdfs: write events under a time-bucketed path
tailDir-hdfs.sinks.tailDir-sink.type = hdfs
tailDir-hdfs.sinks.tailDir-sink.hdfs.path = /flume/events/%y-%m-%d
tailDir-hdfs.sinks.tailDir-sink.hdfs.filePrefix = events-
tailDir-hdfs.sinks.tailDir-sink.hdfs.round = true
tailDir-hdfs.sinks.tailDir-sink.hdfs.roundValue = 1
tailDir-hdfs.sinks.tailDir-sink.hdfs.roundUnit = second
# Required because the path above uses time escapes (%y-%m-%d) while the
# events carry no timestamp header; commenting this out triggers the
# NullPointerException shown at the end of this section
tailDir-hdfs.sinks.tailDir-sink.hdfs.useLocalTimeStamp = true


# Use a memory channel: events from the source are buffered in memory
tailDir-hdfs.channels.tailDir-channel.type = memory
tailDir-hdfs.channels.tailDir-channel.capacity = 1000
tailDir-hdfs.channels.tailDir-channel.transactionCapacity = 100

# Wire the source and the sink to the channel
tailDir-hdfs.sources.tailDir-source.channels = tailDir-channel
tailDir-hdfs.sinks.tailDir-sink.channel = tailDir-channel
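One more TAILDIR detail worth knowing: the source records how far it has read each file in a JSON position file, which is what lets it resume from the correct offset after a restart. The property below is optional; the value shown is, to the best of my knowledge, the documented default:

```
# Optional: where TAILDIR stores per-file read offsets
tailDir-hdfs.sources.tailDir-source.positionFile = ~/.flume/taildir_position.json
```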
  • Startup command:
    $ bin/flume-ng agent --conf conf --conf-file conf/example-conf1.properties --name tailDir-hdfs -Dflume.root.logger=INFO,console

  • Test: append a line to the watched file; the agent's console shows the HDFS writes

# Terminal 1: input
$ echo 111 >> /home/hadoop/data/flume/example.log
# Terminal 2: agent output
2019-09-12 02:09:50,169 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSSequenceFile.configure(HDFSSequenceFile.java:63)] writeFormat = Writable, UseRawLocalFileSystem = false
2019-09-12 02:09:50,195 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:251)] Creating /flume/event/19-09-12/0209/50/events-.1568268590170.tmp
2019-09-12 02:10:14,851 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:393)] Closing /flume/event/19-09-12/0209/43/events-.1568268583163.tmp
2019-09-12 02:10:14,911 (hdfs-tailDir-sink-call-runner-7) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:655)] Renaming /flume/event/19-09-12/0209/43/events-.1568268583163.tmp to /flume/event/19-09-12/0209/43/events-.1568268583163
2019-09-12 02:10:14,919 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:382)] Writer callback called.
2019-09-12 02:10:20,250 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:393)] Closing /flume/event/19-09-12/0209/50/events-.1568268590170.tmp
2019-09-12 02:10:20,273 (hdfs-tailDir-sink-call-runner-9) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:655)] Renaming /flume/event/19-09-12/0209/50/events-.1568268590170.tmp to /flume/event/19-09-12/0209/50/events-.1568268590170
2019-09-12 02:10:20,282 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:382)] Writer callback called.
  • A possible error (raised when hdfs.useLocalTimeStamp is not set and the events carry no timestamp header):
2019-09-12 01:59:30,612 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
	at org.apache.flume.formatter.output.BucketPath.replaceShorthand(BucketPath.java:251)
	at org.apache.flume.formatter.output.BucketPath.escapeString(BucketPath.java:460)
	at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:368)
	at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
	at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
	at java.lang.Thread.run(Thread.java:748)
2019-09-12 01:59:30,613 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
	at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:451)
	at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
	at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
	at org.apache.flume.formatter.output.BucketPath.replaceShorthand(BucketPath.java:251)
	at org.apache.flume.formatter.output.BucketPath.escapeString(BucketPath.java:460)
	at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:368)
	... 3 more
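This NullPointerException occurs when hdfs.useLocalTimeStamp is left unset: the HDFS path contains time escapes (%y-%m-%d), but TAILDIR events carry no timestamp header for BucketPath to substitute. Besides setting hdfs.useLocalTimeStamp = true on the sink, Flume's built-in timestamp interceptor is another common fix; a sketch against the config above:

```
# Alternative fix: stamp each event with the agent's local time at the source,
# so time escapes in the sink path can be resolved from the event header
tailDir-hdfs.sources.tailDir-source.interceptors = i1
tailDir-hdfs.sources.tailDir-source.interceptors.i1.type = timestamp
```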
  • The generated files can be inspected in HDFS, e.g. with hdfs dfs -ls on the configured path
  • Reference: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html