Getting Started with Flume

Author: 会写程序员的代码

Category: hadoop. Tags: big data, flume, kafka

Article link: https://blog.csdn.net/qq_41547580/article/details/103881834

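I. What is Flume?

Flume is a distributed, reliable, and highly available tool for collecting, aggregating, and moving large volumes of log data.
Put simply, Flume is a log collection tool.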

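II. Flume features

1) Flume can efficiently collect log data from many web servers and store it in HDFS or HBase.
(In the tests here, each host in the cluster acts as one server; log files are collected from the different hosts and stored in HDFS.)

2) Fast data handoff.
Flume can quickly hand data gathered from many servers over to Hadoop.

3) Besides log data, Flume can also collect large-scale event data generated by social networking sites such as Facebook and Twitter, and by e-commerce sites such as Amazon.

4) It supports many types of data input
(source types: avro source, exec source, spooling directory, kafka source, http source, and others)
and data output
(sink types: hdfs sink, avro sink, kafka sink, and others).

5) It supports multi-hop flows, fan-in flows, fan-out flows, contextual routing, and so on.
6) It scales horizontally.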

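III. Flume structure

1) Source
Built-in sources include Avro (listens on a port), Thrift, Exec (runs a Linux command), JMS, Spooling Directory (watches a directory), and TailDirSource (added in 1.7; works like tail and can resume from the last read position after a restart). A Kafka source is also available.

2) Interceptor
Adds headers to every event, similar to a "header" object in JSON: {"key":"value"}.
Supported interceptors: timestamp, host (hostname or IP), static (a fixed key-value pair), regex filter (keeps only events that match), and custom.

3) Selector
The channel selector decides how an event sent by the source is written to the channels. Flume provides three selector types: replicating, multiplexing, and custom.

4) Channel
Memory, JDBC, Kafka, File, Custom

5) Interceptor

6) Sink
HDFS, Hive, Logger, Avro, File Roll Sink (local file storage), HBase, ElasticSearch, Kafka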

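IV. What is a Channel?

Concept: a channel acts as Flume's internal message queue, a temporary buffer where events wait in transit.

Function: it stores events that the source has collected but the sink has not yet read, balancing the rate at which the source collects against the rate at which the sink reads.

Safety: channels are thread-safe and transactional. If a source fails to write, it can write again; if a sink fails to read, it can read the same events again.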
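
The transactional behavior described above can be illustrated with a small Python sketch. It is a toy, single-threaded model and not Flume's implementation; the names `ToyChannel`, `put_batch`, `take_batch`, `ChannelFullError`, and `failing_sink` are hypothetical. A batch is either committed in full or not applied at all, and a failed delivery puts the events back so the sink can read them again.

```python
from collections import deque


class ChannelFullError(Exception):
    """Raised when committing a batch would exceed the channel capacity."""


class ToyChannel:
    """Toy stand-in for a channel: a bounded queue with all-or-nothing batches."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put_batch(self, events):
        # Source side: commit the whole batch or reject it untouched.
        if len(self.queue) + len(events) > self.capacity:
            raise ChannelFullError("batch rejected; the source may retry later")
        self.queue.extend(events)

    def take_batch(self, max_events, deliver):
        # Sink side: take events, try to deliver them, roll back on failure.
        count = min(max_events, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        try:
            deliver(batch)
        except Exception:
            # Rollback: return the events to the head in their original order.
            self.queue.extendleft(reversed(batch))
            raise
        return batch


def failing_sink(batch):
    raise IOError("downstream unavailable")


channel = ToyChannel(capacity=3)
channel.put_batch(["e1", "e2"])

try:
    channel.take_batch(2, failing_sink)
except IOError:
    pass  # rolled back: both events are still queued

delivered = []
channel.take_batch(2, delivered.extend)  # the retry succeeds
```

After the failed attempt the queue still holds both events, so the second `take_batch` delivers them in the original order.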

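Common types:
	1) Memory channel
		A) Fast reads and writes, but limited capacity. Events are lost if the Flume process dies or the server stops or restarts. Suitable when resources are plentiful and data loss is acceptable.
		
	2) File channel
		A) Writes events to files on disk. Compared with the Memory channel it has much larger capacity, and events survive a process crash or restart. The data directory can be configured as multiple paths on different disks so that parallel writes improve performance.
		B) Flume appends events sequentially to the end of a File Channel data file. The maxFileSize parameter in the configuration file sets the data file size; when the file being written reaches this limit, Flume creates a new file for subsequent events. Once every event in a closed, read-only data file has been read and the sink has committed the transaction, Flume deletes that file.
	
	3) Kafka channel
		A) Combines the advantages of the Memory channel and the File channel: large capacity and strong fault tolerance.

		B) Usage:
			The log collection tier only needs a source and a Kafka channel.
			The log aggregation tier only needs a Kafka channel and a sink.
		Benefit: fewer processes run in the log collection tier, which lowers memory and disk usage on those servers.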


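Note:
	Question 1: Can the File Channel lose data, and why?
	No. The handoffs from source to channel and from channel to sink are all transactional, so what matters is where the channel keeps its events. A memory channel can lose data when the process dies, while a file channel persists events to disk and does not.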
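
The File channel behavior described above (append to the end of a data file, roll to a new file at a size limit, delete a closed file once it has been fully consumed) can be sketched in Python. This is only an illustration under simplifying assumptions: the class `ToyFileLog`, its methods, and the `log-N` file naming are hypothetical, and checkpoints, transactions, and read tracking are left out.

```python
import os
import tempfile


class ToyFileLog:
    """Toy append-only event log that rolls to a new file at a size limit."""

    def __init__(self, directory, max_file_size):
        self.directory = directory
        self.max_file_size = max_file_size
        self.current_id = 0

    def path(self, file_id):
        return os.path.join(self.directory, f"log-{file_id}")

    def append(self, event):
        record = event + b"\n"
        current = self.path(self.current_id)
        size = os.path.getsize(current) if os.path.exists(current) else 0
        # Roll to a new file when this record would push the file past the limit.
        if size > 0 and size + len(record) > self.max_file_size:
            self.current_id += 1
            current = self.path(self.current_id)
        with open(current, "ab") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before reporting success
        return self.current_id

    def delete_if_consumed(self, file_id):
        # Only closed files (older than the one being written) may be removed.
        if file_id < self.current_id:
            os.remove(self.path(file_id))
            return True
        return False


with tempfile.TemporaryDirectory() as data_dir:
    log = ToyFileLog(data_dir, max_file_size=10)
    ids = [log.append(b"aaaa"), log.append(b"bbbb"), log.append(b"cc")]
    removed = log.delete_if_consumed(0)
    remaining = sorted(os.listdir(data_dir))
```

The first two 5-byte records fill `log-0` exactly; the third would exceed the 10-byte limit, so it goes to `log-1`, and `log-0` can then be deleted.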

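V. What is an interceptor?

Concept: before a source writes an event to the channel, interceptors can process the event in various ways. Several interceptors can sit between a source and a channel, each applying its own rule, for example timestamp, host (hostname or IP), UUID, or regular expression.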
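
An interceptor chain can be pictured as a list of functions applied to each event in order, where any of them may also drop the event. The Python sketch below is an assumption-laden illustration, not Flume's Java interceptor API: events are modeled as plain dicts, and `timestamp`, `static`, `regex_filter`, and `run_chain` are hypothetical helper names.

```python
import re
import time


def timestamp(preserve_existing=False, clock=time.time):
    """Add a 'timestamp' header in milliseconds since the epoch."""
    def apply(event):
        headers = event["headers"]
        if not (preserve_existing and "timestamp" in headers):
            headers["timestamp"] = str(int(clock() * 1000))
        return event
    return apply


def static(key, value):
    """Add a fixed header, keeping any value that is already present."""
    def apply(event):
        event["headers"].setdefault(key, value)
        return event
    return apply


def regex_filter(pattern, exclude_events=False):
    """Keep events whose body matches the pattern (or drop them if exclude_events)."""
    compiled = re.compile(pattern)

    def apply(event):
        matched = compiled.search(event["body"]) is not None
        return event if matched != exclude_events else None
    return apply


def run_chain(interceptors, event):
    # Apply each interceptor in order; None means the event was dropped.
    for interceptor in interceptors:
        event = interceptor(event)
        if event is None:
            return None
    return event


chain = [
    timestamp(clock=lambda: 1.5),      # fixed clock so the result is repeatable
    static("city", "NEW_YORK"),
    regex_filter(r"ERROR"),
]
kept = run_chain(chain, {"headers": {}, "body": "ERROR disk full"})
dropped = run_chain(chain, {"headers": {}, "body": "INFO all good"})
```

Here `kept` carries the two added headers, while `dropped` is `None` because its body does not match the filter.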

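VI. What is a selector?

Concept: the channel selector decides how an event sent by the source is written to the channels. Flume provides three selector types: replicating, multiplexing, and custom.

1) Replicating selector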
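The source copies each event into several channels at once, so different sinks can read the same event from different channels.
For example, to write one stream of log data to both Kafka and HDFS, each event is written to two channels, and sinks of different types read it and send it to different external stores.

2) Multiplexing selector

Used together with interceptors: the selector looks at the value of a given key in the event headers to decide which channel the event is written to.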
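
The difference between the two selectors can be sketched in a few lines of Python. This is a conceptual model only: events are plain dicts, channels are just names, and `replicating_selector` and `multiplexing_selector` are hypothetical helpers rather than Flume classes.

```python
def replicating_selector(channels):
    """Every event goes to every channel."""
    def select(event):
        return list(channels)
    return select


def multiplexing_selector(header, mapping, default):
    """Route by the value of one header; unmapped values go to the default."""
    def select(event):
        value = event["headers"].get(header)
        return list(mapping.get(value, default))
    return select


replicate = replicating_selector(["c1", "c2"])
route = multiplexing_selector("status", {"CZ": ["c1"], "US": ["c2"]}, default=["c1"])

event_us = {"headers": {"status": "US"}, "body": "this is US"}
event_other = {"headers": {"status": "2017-06-13"}, "body": "this is default"}

copies = replicate(event_us)          # both channels receive the event
routed_us = route(event_us)           # only the channel mapped to "US"
routed_other = route(event_other)     # falls back to the default channel
```

With the multiplexing selector, an event whose header value has no mapping (or that lacks the header entirely) ends up in the default channel.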

VII. Example applications

Flume command reference (for further reading)

1) Flume interceptors

Interceptor example 1: Spooling Directory -> Memory Channel -> HDFS Sink

## Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

## Describe the source

# Compared with exec, spooldir is more reliable about not losing data and is close to real time
a1.sources.r1.type = spooldir
# Directory to watch
a1.sources.r1.spoolDir = /home/hadoop/datatest
# Names of the interceptors
a1.sources.r1.interceptors = i1
# Interceptor type
a1.sources.r1.interceptors.i1.type = timestamp

## Describe the channel
# Channel type
a1.channels.c1.type = memory
# Maximum number of events held in memory
a1.channels.c1.capacity = 10000
# Maximum number of events a source puts or a sink takes per transaction; must not exceed capacity
a1.channels.c1.transactionCapacity = 10000
# Percentage of byteCapacity reserved as a buffer for event headers
a1.channels.c1.byteCapacityBufferPercentage = 20
# Maximum total bytes of all events in the channel; defaults to 80% of the maximum JVM heap (the -Xmx startup value)
a1.channels.c1.byteCapacity = 800000

## Describe the sink
# Sink type
a1.sinks.k1.type = hdfs
# HDFS directory path
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# Fixed prefix for the files Flume creates in the HDFS directory
a1.sinks.k1.hdfs.filePrefix = events-
# Whether to round the timestamp down (if true, affects all time-based escape sequences except %t)
a1.sinks.k1.hdfs.round = true
# Round down to the highest multiple of this value (in the unit set by hdfs.roundUnit) that is less than the current time
a1.sinks.k1.hdfs.roundValue = 1
# Unit for rounding down: second, minute, or hour
a1.sinks.k1.hdfs.roundUnit = minute

# File roll policy: a new file is started when any of the thresholds below is reached
# Roll after the current file has been open this long, in seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll after the current file reaches this size, in bytes
a1.sinks.k1.hdfs.rollSize = 2000
# Roll after this many events have been written to the current file
a1.sinks.k1.hdfs.rollCount = 50
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
# Use the local time instead of the timestamp in the event header when expanding date/time escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File format
a1.sinks.k1.hdfs.fileType = DataStream

## Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/spool2_hdfs.conf -n a1 -Dflume.root.logger=INFO,console
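
The effect of hdfs.round, hdfs.roundValue, and hdfs.roundUnit on the directory name can be worked through with a short Python sketch. It assumes the rounding is applied within the chosen field (for example, minute 54 with a value of 10 becomes minute 50) and uses naive datetimes; `round_down` and `bucket_path` are hypothetical helpers, and the path template is the one from the sink configuration above.

```python
from datetime import datetime


def round_down(ts, value, unit):
    """Round ts down to a multiple of `value` within the given unit."""
    if unit == "second":
        return ts.replace(second=ts.second - ts.second % value, microsecond=0)
    if unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % value,
                          second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % value,
                          minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


def bucket_path(ts, value, unit):
    # Same escape sequences as hdfs.path = /flume/events/%y-%m-%d/%H%M/
    return round_down(ts, value, unit).strftime("/flume/events/%y-%m-%d/%H%M/")


event_time = datetime(2020, 1, 8, 11, 54, 34)
per_minute = bucket_path(event_time, 1, "minute")       # one directory per minute
per_ten_minutes = bucket_path(event_time, 10, "minute")  # one directory per 10 minutes
```

With roundValue = 1 and roundUnit = minute, as in the configuration, each minute gets its own directory; a larger roundValue groups several minutes into one.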

Interceptor example 2: several interceptors combined (timestamp, host, and static)
Exec Source -> Memory Channel -> HDFS Sink

## Name the components
a1.sources=r1
a1.channels=c1
a1.sinks=s1

## Describe the source
a1.sources.r1.type=exec
a1.sources.r1.command= tail -F /home/hadoop/hive.log
# Names of the interceptors, applied in this order
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.preserveExisting=true
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.useIP=false
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i2.preserveExisting=true
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = city
a1.sources.r1.interceptors.i3.value = NEW_YORK

## Describe the channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.channels.c1.keep-alive=3
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

## Describe the sink
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = /flume/flumeevents/%y-%m-%d/%H%M/%{city}
a1.sinks.s1.hdfs.filePrefix = %{hostname}-
a1.sinks.s1.hdfs.fileSuffix=.log
a1.sinks.s1.hdfs.inUseSuffix=.tmp
a1.sinks.s1.hdfs.rollInterval=30
a1.sinks.s1.hdfs.rollSize=5000
a1.sinks.s1.hdfs.fileType=DataStream
a1.sinks.s1.hdfs.writeFormat=Text
a1.sinks.s1.hdfs.round = true
a1.sinks.s1.hdfs.roundValue = 1
a1.sinks.s1.hdfs.roundUnit = minute
a1.sinks.s1.hdfs.useLocalTimeStamp=false

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
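
How the headers set by the three interceptors feed into hdfs.path and hdfs.filePrefix can be sketched in Python. The sketch makes several assumptions: only the escapes used in this configuration are handled, a missing header expands to an empty string, and the timestamp is interpreted in UTC to keep the result repeatable (an agent would normally use its own time zone). The function `expand` and the sample header values are hypothetical.

```python
import re
from datetime import datetime, timezone

# %{name} is a header lookup; %y %m %d %H %M are date/time escapes.
ESCAPE = re.compile(r"%\{([^}]+)\}|%([ymdHM])")


def expand(template, headers):
    """Expand header and date/time escapes using the event's headers."""
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)

    def replace(match):
        if match.group(1) is not None:
            return headers.get(match.group(1), "")
        return ts.strftime("%" + match.group(2))

    return ESCAPE.sub(replace, template)


headers = {
    "timestamp": "1578484474000",   # set by the timestamp interceptor (ms)
    "hostname": "slave1",           # set by the host interceptor (hostHeader = hostname)
    "city": "NEW_YORK",             # set by the static interceptor
}
directory = expand("/flume/flumeevents/%y-%m-%d/%H%M/%{city}", headers)
prefix = expand("%{hostname}-", headers)
```

Because useLocalTimeStamp is false in this example, the date and time parts come from the event's own timestamp header rather than from the agent's clock.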

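2) Example: collecting data with Flume and writing it to the log

Example 1
Avro Source -> Memory Channel -> Logger Sink
Avro is a serialization format. The avro source listens on a port and accepts only Avro-serialized data; other formats are rejected.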

## Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

## Describe the source
a1.sources.r1.type = avro
a1.sources.r1.bind = slave1
a1.sources.r1.port = 4141

## Describe the channel
a1.channels.c1.type = memory
# Maximum number of events in the channel
a1.channels.c1.capacity = 10000
# Maximum number of events per transaction
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

## Describe the sink
a1.sinks.k1.type = logger

## Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/avro_logger.conf -n a1 -Dflume.root.logger=INFO,console

Send test data:

flume-ng avro-client -c $FLUME_HOME/conf -H slave1 -p 4141 -F /home/hadoop/datatest/

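Example 2
Exec Source -> Memory Channel -> Logger Sink
Notes:
(A) tail -f follows the open file itself (its inode); once the file is moved or renamed, data written under the original name is no longer picked up.
tail -F follows the file name; as long as the name stays the same, the file keeps being tracked, even if it does not exist for a while.
(B) The exec source is mainly used for real-time collection (typically of log files, since logs roll continuously).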
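
Why the distinction in (A) matters during log rotation can be seen with a small Python experiment. It is a sketch that assumes a POSIX-style filesystem where `st_ino` identifies the file; the function name `inodes_across_rotation` and the file names are made up for the illustration.

```python
import os
import tempfile


def inodes_across_rotation():
    """Rotate a log file by rename and report the inode numbers involved."""
    with tempfile.TemporaryDirectory() as log_dir:
        path = os.path.join(log_dir, "hive.log")
        with open(path, "w") as f:
            f.write("old line\n")
        inode_before = os.stat(path).st_ino

        # Rotation: the old file is renamed and a fresh file takes its name.
        os.rename(path, path + ".1")
        with open(path, "w") as f:
            f.write("new line\n")

        inode_rotated = os.stat(path + ".1").st_ino
        inode_after = os.stat(path).st_ino
        return inode_before, inode_rotated, inode_after


before, rotated, after = inodes_across_rotation()
same_file_moved = before == rotated   # the old inode now lives under hive.log.1
name_points_elsewhere = before != after  # hive.log is a different file now
```

A reader that keeps following the original inode stays attached to the renamed file, while a reader that follows the name reopens the new file, which is the behavior wanted for rotated logs.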

## Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

## Describe the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/hadoop/hive.log

## Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

## Describe the sink
a1.sinks.k1.type = logger

## Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/exec_logger.conf -n a1 -Dflume.root.logger=INFO,console

Start the service that writes the log file (for hive.log, that means starting Hive).

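3) Example: watching a directory with Flume and collecting the data into HDFS

Example 1
Spooling Directory Source -> Memory Channel -> HDFS Sink
Notes on the spooling directory source:
Files in the watched directory must not be modified after they are placed there.
File names in the directory must not be reused.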

## Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

## Describe the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir =/home/hadoop/datatest

## Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

## Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute

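# File roll policy: a new file is started when any of the thresholds below is reached
# Interval after which a new file is started, in seconds
a1.sinks.k1.hdfs.rollInterval = 30
# File size at which a new file is started, in bytes (default 1024)
a1.sinks.k1.hdfs.rollSize = 2000
# Number of events after which a new file is started (default 10)
a1.sinks.k1.hdfs.rollCount = 50
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File format
a1.sinks.k1.hdfs.fileType = DataStream
# (maxBytesToLog: maximum number of body bytes printed by the logger sink; default 16)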

## Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/spool_hdfs.conf -n a1 -Dflume.root.logger=INFO,console

4) Replicating selector example

Example 1:
Exec Source -> Memory Channel 1 and Memory Channel 2 -> HDFS Sink and Logger Sink

a1.sources=r1
a1.channels=c1 c2
a1.sinks=s1 s2

a1.sources.r1.type=exec
a1.sources.r1.command= tail -F /tmp/hadoop/hive.log

## Describe the selector (replicating: every channel receives the same events)
a1.sources.r1.selector.type = replicating
a1.sources.r1.selector.optional = c2

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.channels.c1.keep-alive=3
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.channels.c2.type=memory
a1.channels.c2.capacity=1000
a1.channels.c2.transactionCapacity=100
a1.channels.c2.keep-alive=3
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000

a1.sinks.s1.type = logger

a1.sinks.s2.type = hdfs
a1.sinks.s2.hdfs.path = /flume/repevents/%y-%m-%d/%H%M/
a1.sinks.s2.hdfs.filePrefix = event-
a1.sinks.s2.hdfs.fileSuffix=.log
a1.sinks.s2.hdfs.inUseSuffix=.tmp
a1.sinks.s2.hdfs.rollInterval=50
a1.sinks.s2.hdfs.rollSize=1024
a1.sinks.s2.hdfs.fileType=DataStream
a1.sinks.s2.hdfs.writeFormat=Text
a1.sinks.s2.hdfs.round = true
a1.sinks.s2.hdfs.roundValue = 1
a1.sinks.s2.hdfs.roundUnit = minute
a1.sinks.s2.hdfs.useLocalTimeStamp=true

a1.sources.r1.channels=c1 c2
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c2

Start command:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/rep.conf -n a1 -Dflume.root.logger=INFO,console

5) Multiplexing selector example

Example 1:
HTTP Source -> Memory Channel 1 and Memory Channel 2 -> HDFS Sink and Logger Sink

a1.sources=r1
a1.channels=c1 c2
a1.sinks=s1 s2

a1.sources.r1.type=org.apache.flume.source.http.HTTPSource
a1.sources.r1.port=6666
a1.sources.r1.bind=slave1
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = status
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.channels.c1.keep-alive=3
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.channels.c2.type=memory
a1.channels.c2.capacity=1000
a1.channels.c2.transactionCapacity=100
a1.channels.c2.keep-alive=3
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000

a1.sinks.s1.type = logger

a1.sinks.s2.type = hdfs
a1.sinks.s2.hdfs.path = /flume/repevents/%y-%m-%d/%H%M/
a1.sinks.s2.hdfs.filePrefix = event-
a1.sinks.s2.hdfs.fileSuffix=.log
a1.sinks.s2.hdfs.inUseSuffix=.tmp
a1.sinks.s2.hdfs.rollInterval=10
a1.sinks.s2.hdfs.rollSize=1024
a1.sinks.s2.hdfs.fileType=DataStream
a1.sinks.s2.hdfs.writeFormat=Text
a1.sinks.s2.hdfs.round = true
a1.sinks.s2.hdfs.roundValue = 1
a1.sinks.s2.hdfs.roundUnit = minute
a1.sinks.s2.hdfs.useLocalTimeStamp=true

a1.sources.r1.channels=c1 c2
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c2

Start command:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/mul.conf -n a1 -Dflume.root.logger=INFO,console

Test data:
curl -X POST -d '[{"headers":{"status":"2017-06-13"},"body":"this is default"}]' http://slave1:6666
curl -X POST -d '[{"headers":{"status":"CZ"},"body":"this is CZ"}]' http://slave1:6666
curl -X POST -d '[{"headers":{"status":"US"},"body":"this is US"}]' http://slave1:6666
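
The same test events can also be prepared from a script. The Python sketch below only builds the request, using the JSON shape shown in the curl commands above (a list of objects with "headers" and "body"); `build_request` is a hypothetical helper, and the Content-Type header is an assumption rather than something the original commands set.

```python
import json
import urllib.request


def build_request(url, events):
    """Build a POST request carrying a JSON list of (headers, body) events."""
    payload = json.dumps(
        [{"headers": headers, "body": body} for headers, body in events]
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "http://slave1:6666",
    [
        ({"status": "2017-06-13"}, "this is default"),
        ({"status": "CZ"}, "this is CZ"),
        ({"status": "US"}, "this is US"),
    ],
)
decoded = json.loads(request.data)
```

Passing `request` to `urllib.request.urlopen` would send all three events in one call, provided the agent is running and slave1 resolves from the machine running the script.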

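6) Chained agents example

(From the official documentation) Data collected on the slave1 server is sent over a network port to agent a1 on the slave2 server.
Agent a1 on slave2 receives the data collected by slave1 and stores it in the appropriate location.

Configuration on slave1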

# Name the components on this a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/datatest/flume_test

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave2
a1.sinks.k1.port = 6666
a1.sinks.k1.batch-size = 2

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Configuration on slave2
# Name the components on this a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = slave2
a1.sources.r1.port = 6666

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume on slave2 first:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/flume-1.7.0/conf/avro_logger.conf -n a1 -Dflume.root.logger=INFO,console

Then start Flume on slave1:

flume-ng agent -c $FLUME_HOME/conf -f /opt/soft/apache-flume-1.7.0-bin/conf/tail_avro.conf -n a1 -Dflume.root.logger=INFO,console

7) Example: connecting Flume to Kafka

Example 1:
exec -> memory -> kafka

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
# tail -F follows the file by name
a1.sources.r1.command = tail -F /home/hadoop/time2.txt

a1.sources.r1.channels = c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# Use the Kafka sink type
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = exec2kafka

# Kafka cluster broker addresses
a1.sinks.k1.brokerList = node2:9092,node4:9092,node6:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1

Start Flume:

flume-ng agent -c $FLUME_HOME/conf -f /opt/flume-1.8.0/conf/exec_kafka.conf -n a1 -Dflume.root.logger=INFO,console

Append data to the monitored file (a script can generate the data):

#!/bin/bash
# Append the current date to the monitored file every 2 seconds
while true
do
echo "$(date)" >> /home/hadoop/time2.txt
sleep 2
done

Start a Kafka consumer to view the data:

kafka-console-consumer.sh --from-beginning --topic exec2kafka --zookeeper node2:2181,node4:2181,node6:2181

Example 2:
Collecting data from Kafka with Flume and storing it in HDFS

a1.sources = kafkaSource
a1.channels = memoryChannel
a1.sinks = hdfsSink
# Describe the source
a1.sources.kafkaSource.channels = memoryChannel
a1.sources.kafkaSource.type=org.apache.flume.source.kafka.KafkaSource
a1.sources.kafkaSource.zookeeperConnect=node2:2181,node4:2181,node6:2181
a1.sources.kafkaSource.topic=exec2kafka
a1.sources.kafkaSource.kafka.consumer.timeout.ms=100
# Describe the channel
a1.channels.memoryChannel.type=memory
a1.channels.memoryChannel.capacity=1000
a1.channels.memoryChannel.transactionCapacity=100
# Describe the sink
a1.sinks.hdfsSink.type=hdfs
a1.sinks.hdfsSink.channel = memoryChannel
a1.sinks.hdfsSink.hdfs.path=hdfs://<hostname-or-IP>:9000/kafka/flume-data
a1.sinks.hdfsSink.hdfs.writeFormat=Text
a1.sinks.hdfsSink.hdfs.fileType=DataStream

Note: the Kafka service must already be running.
Start Flume first, then wait for messages to arrive from Kafka:

flume-ng agent -c $FLUME_HOME/conf -f /opt/flume-1.8.0/conf/kafka_hdfs.conf -n a1  -Dflume.root.logger=INFO,console

Send data into Kafka:

kafka-console-producer.sh --broker-list node2:9092 --topic exec2kafka
