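1. Introduction to Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.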
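2. Data flow model
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
Flume collects log data from application servers through a source, which sends the data as events to a channel component. Along the way, Flume interceptors can apply simple cleaning and transformation. A single source can send data to multiple channels. A sink takes data from its assigned channel and sends or stores it in a downstream component or file system: it can forward data to the source of the next Flume agent or to Kafka, or write it directly to HDFS, a database, and so on.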
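The source-to-channel-to-sink wiring described above can be sketched as a small fan-out configuration. This is only an illustration, not a recommended layout: the component names (s1, c1, c2, k1, k2, i1) and the temporary file are assumptions, and the logger sinks stand in for whatever downstream targets you actually use.

```shell
# Write an illustrative fan-out configuration to a temporary file.
# All component names below are hypothetical.
FANOUT_CONF="$(mktemp)"
cat > "$FANOUT_CONF" <<'EOF'
a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

# One source feeding two channels; the replicating selector copies every event to both
a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1 c2
a1.sources.s1.selector.type = replicating

# An interceptor modifies events before they reach the channels
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = timestamp

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Each sink drains exactly one channel
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c2
EOF
echo "wrote $FANOUT_CONF"
```

Note the asymmetry in the property names: a source lists its `channels` (plural), while a sink names a single `channel`.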
3. Installation and configuration
3.1 Download the latest version
URL: http://flume.apache.org/download.html
Latest at the time of writing: apache-flume-1.9.0-bin.tar.gz
3.2 Extract the archive
[hadoop@master ~]$ tar -zxvf apache-flume-1.9.0-bin.tar.gz
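3.3 Environment configuration
Set the Java path.
Create Flume's environment file flume-env.sh and the component configuration file flume.conf from the templates (the file names can be changed):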
[hadoop@master ~]$ cd /home/hadoop/apache-flume-1.9.0-bin/conf
[hadoop@master conf]$ cp flume-env.sh.template flume-env.sh
[hadoop@master conf]$ cp flume-conf.properties.template flume.conf
Set JAVA_HOME in flume-env.sh:
[hadoop@master conf]$ vi flume-env.sh
.......
# Enviroment variables can be set here.
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_HOME=/usr/java/jdk1.8.0_131/
............
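Before starting an agent, it can help to confirm that the JAVA_HOME value actually points at a JDK. The snippet below is a minimal sketch: `check_java_home` is a hypothetical helper, not part of Flume, and the default path is simply the one used in the example above.

```shell
# Path taken from the flume-env.sh example above; adjust to your JDK location.
JAVA_HOME="${JAVA_HOME:-/usr/java/jdk1.8.0_131}"

# Hypothetical helper: report whether DIR/bin/java exists and is executable.
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "ok: $1/bin/java"
  else
    echo "missing: $1/bin/java"
  fi
}

check_java_home "$JAVA_HOME"
```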
Edit the component configuration file:
[hadoop@master conf]$ vi flume.conf
[hadoop@master conf]$ cat flume.conf|grep -v ^#
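# Declare the source, channel, and sink of agent a1, named s1, c1, and k1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# Source configuration for agent a1
# The source type is netcat; other options include avro, seq, syslogtcp, http, and so on
a1.sources.s1.type = netcat
# s1 writes to channel c1; multiple channels may be listed
a1.sources.s1.channels = c1
# IP address and port that s1 binds to
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
# No header needed (fileHeader is a spooldir source property; the netcat source does not use it)
a1.sources.s1.fileHeader = false
# Sink k1 configuration
# Sink k1 reads its data from channel c1
a1.sinks.k1.channel = c1
# Sink k1 writes to HDFS
a1.sinks.k1.type = hdfs
# HDFS output path; to partition by time, every event header must carry a timestamp
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume-collection/%Y-%m-%d
a1.sinks.k1.hdfs.maxOpenFiles = 5000
a1.sinks.k1.hdfs.batchSize = 100
# Use the agent's local time to resolve the time escapes, so events need no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write events as an uncompressed plain-text stream
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Roll the file by size, in bytes: 102400 is 100 KB (use 104857600 for 100 MB); rollCount rolls by event count
a1.sinks.k1.hdfs.rollSize = 102400
a1.sinks.k1.hdfs.rollCount = 1000000
# Roll by time interval, in seconds; 0 disables time-based rolling. Hadoop handles large numbers of small files poorly
a1.sinks.k1.hdfs.rollInterval = 30
# Channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100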
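With the configuration in place, the agent can be started with the `flume-ng` launcher. The sketch below only assembles and prints the command, because the agent runs in the foreground. It assumes Flume was extracted to the directory used earlier and that `nc` and the `hdfs` client are available for the follow-up checks shown in the comments.

```shell
# Assumed install location (from the extraction step); override via FLUME_HOME.
FLUME_HOME="${FLUME_HOME:-/home/hadoop/apache-flume-1.9.0-bin}"
AGENT_NAME="a1"   # must match the agent prefix used in flume.conf

# --name selects which agent's properties are loaded from the file;
# -Dflume.root.logger=INFO,console sends the agent log to the terminal.
START_CMD="$FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume.conf --name $AGENT_NAME -Dflume.root.logger=INFO,console"

# Print the command; paste it into a terminal to actually start the agent.
echo "$START_CMD"

# Once the agent is up, from another terminal:
#   echo "hello flume" | nc localhost 44444   # send one test event to the netcat source
#   hdfs dfs -ls /flume-collection/           # list the date-partitioned output directories
```

If the name passed to `--name` does not match the prefix in the properties file, the agent starts with no components configured, so that is the first thing to check when nothing arrives in HDFS.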
Component list: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#component-summary
Component Interface | Type Alias | Implementation Class |
---|---|---|
org.apache.flume.Channel | memory | org.apache.flume.channel.MemoryChannel |
org.apache.flume.Channel | jdbc | org.apache.flume.channel.jdbc.JdbcChannel |
org.apache.flume.Channel | file | org.apache.flume.channel.file.FileChannel |
org.apache.flume.Channel | – | org.apache.flume.channel.PseudoTxnMemoryChannel |
org.apache.flume.Channel | – | org.example.MyChannel |
org.apache.flume.Source | avro | org.apache.flume.source.AvroSource |
org.apache.flume.Source | netcat | org.apache.flume.source.NetcatSource |
org.apache.flume.Source | seq | org.apache.flume.source.SequenceGeneratorSource |
org.apache.flume.Source | exec | org.apache.flume.source.ExecSource |
org.apache.flume.Source | syslogtcp | org.apache.flume.source.SyslogTcpSource |
org.apache.flume.Source | multiport_syslogtcp | org.apache.flume.source.MultiportSyslogTCPSource |
org.apache.flume.Source | syslogudp | org.apache.flume.source.SyslogUDPSource |
org.apache.flume.Source | spooldir | org.apache.flume.source.SpoolDirectorySource |
org.apache.flume.Source | http | org.apache.flume.source.http.HTTPSource |
org.apache.flume.Source | thrift | org.apache.flume.source.ThriftSource |
org.apache.flume.Source | jms | org.apache.flume.source.jms.JMSSource |
org.apache.flume.Source | – | org.apache.flume.source.avroLegacy.AvroLegacySource |
org.apache.flume.Source | – | org.apache.flume.source.thriftLegacy.ThriftLegacySource |
org.apache.flume.Source | – | org.example.MySource |
org.apache.flume.Sink | null | org.apache.flume.sink.NullSink |
org.apache.flume.Sink | logger | org.apache.flume.sink.LoggerSink |
org.apache.flume.Sink | avro | org.apache.flume.sink.AvroSink |
org.apache.flume.Sink | hdfs | org.apache.flume.sink.hdfs.HDFSEventSink |
org.apache.flume.Sink | hbase | org.apache.flume.sink.hbase.HBaseSink |
org.apache.flume.Sink | hbase2 | org.apache.flume.sink.hbase2.HBase2Sink |
org.apache.flume.Sink |