Flume

1. Big Data Project Overview

1.1 Projects Across the Learning Cycle

1. Data collection and monitoring system
2. Near-real-time data warehouse construction and user-profile engineering in practice
3. Recommendation system
4. Real-time data warehouse construction

1.2 Overview of the Data Collection and Monitoring System

1.2.1 Learning Objectives

The goal is for students to understand where data actually comes from in production, how to design different collection schemes for different scenarios, and how to define and monitor the metrics of every stage of the collection pipeline, so that data never flows through a black box.
The project also deepens the use of the components from the foundation courses in real scenarios, so that students can apply and adapt what they have learned and design an architecture to fit each scenario.

1.2.2 Technology Stack

1. Nginx
2. Flume
3. Sqoop
4. Azkaban
5. Prometheus
6. Grafana
7. DingTalk

2. Flume Framework Overview

2.1 Introduction to Flume

1. A distributed, reliable, and highly available framework for collecting log data
2. Built around a streaming data-flow architecture
3. Offers tunable reliability and fault-tolerance mechanisms
4. Originally developed by Cloudera
5. Its core components were refactored on October 22, 2011; versions before the refactor are known as Flume OG, and versions after it as Flume NG

2.2 Design Philosophy

(Figure: Flume data-flow design diagram.) The core idea is a pipeline: a source collects data and wraps it into events, a channel buffers the events, and a sink drains them to storage or to the next agent; agents can be chained into larger topologies.

2.3 Flume Architecture (Key Points)

1. Flume's smallest unit of execution is the Agent, and an Agent must contain the three core components: Source, Channel, and Sink.
2. Each running Agent occupies one JVM.
3. Flume's components are:
	-- source: interacts with the data source, collects data, wraps it into events, and passes the events to the channel
	-- channel: buffers the events handed over by the source and then passes them on to the sink
	-- sink: takes events from the channel and sinks them into a storage system or into the source of the next Agent
	-- event: the unit of collected data, an object made up of a header (key-value pairs) and a body (the payload)
	-- flow: the abstraction of how events travel through the pipeline
	-- interceptor: attached to a source or sink; can filter events or modify their contents
	-- selector: attached to a source; routes events to different channels
	-- client: produces the data; runs in its own thread

2.4 Flume Data Flow Models

2.4.1 Classification

Flume data flows fall into two broad categories:
- the single-flow model
- the multi-flow model

2.4.2 Single-Agent Data Flow Model

See the official documentation.

2.4.3 Multi-Flow Models

1. Multiple agents chained in series
2. Multiple agents converging (fan-in)
3. A single agent fanning out into multiple flows
4. The sink-group model

2.4.4 Summary

An Agent can contain multiple Sources, multiple Channels, and multiple Sinks.

One source's data can be fanned out to multiple channels.
One channel can receive events from multiple sources.

One channel's events can flow to multiple sinks.
One sink can only take events from a single channel.
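As a sketch of these wiring rules (the component names here are placeholders), one source replicated into two channels, each drained by its own sink, would be declared like this:

a1.sources = r1
a1.channels = c1 c2
a1.sinks = s1 s2
a1.sources.r1.channels = c1 c2
a1.sinks.s1.channel = c1
a1.sinks.s2.channel = c2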

2.5 Collection Scheme Template

-- Name the components first
agentName.sources = rname1 rname2...
agentName.channels = cname1 cname2....
agentName.sinks = sname1 sname2.....

-- Wire the three core components together
agentName.sources.rname1.channels = cname1 cname2.....
agentName.sinks.sname1.channel = cname1

-- Set the type and other properties of each component
.........
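As a concrete instance of the template, here is a minimal sketch (assuming a netcat source on localhost:6666, a memory channel, and a logger sink; the host and port are placeholders):

a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 6666
a1.channels.c1.type = memory
a1.sinks.s1.type = logger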

2.6 Commonly Used Core Components

1) Common source components

# Avro source:
	avro
# Kafka source:
	org.apache.flume.source.kafka.KafkaSource
# HTTP Source:
	http	
# Exec source:
	exec
# Spooling directory source:
	spooldir
# Thrift source:
	thrift	
# Syslog TCP source:
	syslogtcp
# Syslog UDP Source:
	syslogudp
# JMS source:
	jms

2) Common channel components

# Memory Channel
	memory
# File Channel
	file	
# JDBC Channel
	jdbc
# Kafka Channel
	org.apache.flume.channel.kafka.KafkaChannel

3) Common sink components

# Logger Sink
	logger
# Avro Sink
	avro
# HDFS Sink
	hdfs
# Hive Sink
	hive
# Kafka Sink
	org.apache.flume.sink.kafka.KafkaSink

3. Installing Flume

1) Upload, extract, and rename

[root@qianfeng01 ~]# tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /usr/local/
[root@qianfeng01 ~]# cd /usr/local/
[root@qianfeng01 local]# mv apache-flume-1.8.0-bin/ flume

2) Configure environment variables

[root@qianfeng01 local]# vim /etc/profile
........(omitted)........
#flume environment
export FLUME_HOME=/usr/local/flume
export PATH=$FLUME_HOME/bin:$PATH

[root@qianfeng01 local]# source /etc/profile

3) Configure Flume's environment script

[root@qianfeng01 conf]# cp flume-env.sh.template flume-env.sh
[root@qianfeng01 conf]# vim flume-env.sh
..........(omitted)............
export JAVA_HOME=/usr/local/jdk
..........(omitted)............
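To confirm the installation and the environment variables, you can print the version (flume-ng ships with a version subcommand):

[root@qianfeng01 conf]# flume-ng version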

4. Flume Case Studies

Case 1) avro + memory + logger

1) Edit the collection scheme

# Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1

# Wire them together
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Set the source type and properties
a1.sources.r1.type = avro
a1.sources.r1.bind = qianfeng01
a1.sources.r1.port = 10086

#Set the channel type and properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Set the sink type and properties
a1.sinks.s1.type = logger
a1.sinks.s1.maxBytesToLog = 32

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./avro-mem-logger.conf -n a1 -Dflume.root.logger=INFO,console

3) Test: use avro-client to send data to the host and port being listened on

[root@qianfeng01 ~]# echo "hello world hello world" >> data/flume1.test
[root@qianfeng01 ~]# flume-ng avro-client -c /usr/local/flume/conf -F data/flume1.test -H qianfeng01 -p 10086
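If the wiring is correct, the agent console prints the received event, roughly like the line below (the hex dump is truncated to maxBytesToLog bytes, and the exact output depends on your data):

Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 20 68 65 6C 6C hello world hell }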

Case 2) exec + memory + logger

1) Edit the collection scheme

[root@qianfeng01 flumeconf]# vim exec-mem-logger.properties
# Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1

# Wire them together
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Set the source type and properties
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /root/data/flume1.test
a1.sources.r1.batchSize = 20
a1.sources.r1.batchTimeout = 3000

#Set the channel type and properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Set the sink type and properties
a1.sinks.s1.type = logger
a1.sinks.s1.maxBytesToLog = 32

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./exec-mem-logger.properties -n a1 -Dflume.root.logger=INFO,console

3) Test

[root@qianfeng01 ~]# echo "hello world hello world" >> data/flume1.test
[root@qianfeng01 ~]# ping www.baidu.com >> data/flume1.test 

Case 3) exec + memory + hdfs

1) Edit the collection scheme

[root@qianfeng01 flumeconf]# vim exec-mem-hdfs.properties
# Name the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
# Wire them together
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Set the source type and properties
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /root/data/flume1.test
a1.sources.r1.batchSize = 20
a1.sources.r1.batchTimeout = 3000

#Set the channel type and properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Set the sink type and properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/flumedata/%Y%m%d/%H%M
a1.sinks.s1.hdfs.filePrefix = FlumeData
a1.sinks.s1.hdfs.fileSuffix = .wangcongming
# The next three properties control file rolling on HDFS; meeting any one condition triggers a roll, and a value of 0 disables that condition
a1.sinks.s1.hdfs.rollInterval = 120
a1.sinks.s1.hdfs.rollSize = 100
a1.sinks.s1.hdfs.rollCount = 10
# The next three properties control directory rolling on HDFS; set round to true to enable it, or false to disable it
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
# For the time escape sequences in the directory path to take effect, the following property must be true
a1.sinks.s1.hdfs.useLocalTimeStamp = true
# To write plain text files (DataStream), the write format must be Text
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./exec-mem-hdfs.properties -n a1 -Dflume.root.logger=INFO,console

3) Test

[root@qianfeng01 ~]# echo "hello world hello world" >> data/flume1.test
[root@qianfeng01 ~]# ping www.baidu.com >> data/flume1.test 
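To verify, recursively list the target directory on HDFS; after a roll you should see files with the configured prefix and suffix under the timestamped subdirectories:

[root@qianfeng01 ~]# hdfs dfs -ls -R /flumedata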

Case 4) spool + memory + logger

1) Edit the collection scheme

[root@qianfeng01 flumeconf]# vim spool-mem-logger.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Define the source properties
a1.sources.r1.type=spooldir
#Note: the monitored directory must exist in advance
a1.sources.r1.spoolDir = /root/data/spool
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.deletePolicy=never
a1.sources.r1.fileHeader=false
a1.sources.r1.fileHeaderKey=file
a1.sources.r1.basenameHeader=false
a1.sources.r1.basenameHeaderKey=basename
a1.sources.r1.batchSize=100
a1.sources.r1.inputCharset=UTF-8
#Define the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Define the sink properties
a1.sinks.s1.type = logger
a1.sinks.s1.maxBytesToLog = 32

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./spool-mem-logger.properties -n a1 -Dflume.root.logger=INFO,console

3) Test

[root@qianfeng01 spool]# echo "helloworld" >> f1.txt
[root@qianfeng01 spool]# echo "helloworld" >> f2.txt
[root@qianfeng01 spool]# echo "helloworld" >> f3.txt
[root@qianfeng01 spool]# echo "helloworld" >> f4.txt
[root@qianfeng01 spool]# echo "helloworld" >> f5.txt

Note: new files dropped into the monitored directory must not reuse the names of files that have already been processed.

Case 5) spool + file + hdfs

1) Edit the collection scheme

[root@qianfeng01 flumeconf]# vim spool-file-hdfs.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Define the source properties
a1.sources.r1.type=spooldir
#Note: the monitored directory must exist in advance
a1.sources.r1.spoolDir = /root/data/spool
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.deletePolicy=never
a1.sources.r1.fileHeader=false
a1.sources.r1.fileHeaderKey=file
# Whether to add the file's base name to the event header; if true it is added, and basenameHeaderKey sets the key's name
a1.sources.r1.basenameHeader=true 
a1.sources.r1.basenameHeaderKey=filename
a1.sources.r1.batchSize=100
a1.sources.r1.inputCharset=UTF-8

#Define the channel properties
a1.channels.c1.type = file

#Define the sink properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/flumedata/%Y%m%d/%H%M
#Read the key-value pair from the event header: the key is filename and the value is the file's base name
a1.sinks.s1.hdfs.filePrefix = %{filename}   
a1.sinks.s1.hdfs.fileSuffix = .wangcongming
# The next three properties control file rolling on HDFS; meeting any one condition triggers a roll, and a value of 0 disables that condition
a1.sinks.s1.hdfs.rollInterval = 30
a1.sinks.s1.hdfs.rollSize = 100
a1.sinks.s1.hdfs.rollCount = 10
# The next three properties control directory rolling on HDFS; set round to true to enable it, or false to disable it
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
# For the time escape sequences in the directory path to take effect, the following property must be true
a1.sinks.s1.hdfs.useLocalTimeStamp = true
# To write plain text files (DataStream), the write format must be Text
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./spool-file-hdfs.properties -n a1 -Dflume.root.logger=INFO,console

3) Test

[root@qianfeng01 spool]# echo "helloworld" >> f6.txt
[root@qianfeng01 spool]# echo "helloworld" >> f7.txt
[root@qianfeng01 spool]# echo "helloworld" >> f8.txt
[root@qianfeng01 spool]# echo "helloworld" >> f9.txt
[root@qianfeng01 spool]# echo "helloworld" >> f10.txt

Note: new files dropped into the monitored directory must not reuse the names of files that have already been processed.

Case 6) http + memory + logger

1) Edit the collection scheme

[root@qianfeng01 flumeconf]# vim http-mem-logger.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Define the source properties
a1.sources.r1.type=http
a1.sources.r1.port=10010
a1.sources.r1.bind = qianfeng01
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
a1.sources.r1.handler.nickname=michael


#Define the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Define the sink properties
a1.sinks.s1.type = logger
a1.sinks.s1.maxBytesToLog = 32

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./http-mem-logger.properties -n a1 -Dflume.root.logger=INFO,console

3) Test: send data with curl

curl -X POST -d '[{"headers":{"para1":"aaa","para2":"ccc"},"body":"this is my content"}]' http://qianfeng01:10010

Case 7) syslogtcp + memory + logger

1) Edit the collection scheme

[root@qianfeng01 flumeconf]# vim syslog-mem-logger.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Define the source properties
a1.sources.r1.type=syslogtcp
a1.sources.r1.host=qianfeng01
a1.sources.r1.port=10000

#Define the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Define the sink properties
a1.sinks.s1.type = logger
a1.sinks.s1.maxBytesToLog = 32

2) Start the agent with the collection scheme

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f ./syslog-mem-logger.properties -n a1 -Dflume.root.logger=INFO,console

3) Test: send data with nc

[root@qianfeng01 queueset]# echo "hello world" | nc qianfeng01 10000

Case 8) taildir + memory + hdfs

Similarities and differences between spooldir and taildir:

Similarities:
	-1. Both are reliable sources.
	-2. Both watch files inside a directory.
	-3. The directory must exist in advance.
Differences:
	-1. spooldir renames each file after reading it, appending a suffix.
	-2. File names in the spooldir directory must never repeat.
	-3. spooldir only collects newly added files.
	-4. taildir never renames files; it keeps tailing each file for newly appended data.

1) Write the collection scheme
[root@qianfeng01 flumeconf]#  vim taildir-mem-hdfs.properties
#Name and wire the components
a1.sources=s1
a1.channels=c1
a1.sinks=k1
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

#Set the source properties
a1.sources.s1.type=TAILDIR
a1.sources.s1.filegroups = g1 g2
a1.sources.s1.filegroups.g1 = /root/data1/.*.txt
a1.sources.s1.filegroups.g2 = /root/data2/.*.csv
a1.sources.s1.positionFile = /root/taildir_position.json

#Set the channel properties
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.channels.c1.keep-alive=3
a1.channels.c1.byteCapacityBufferPercentage=20
a1.channels.c1.byteCapacity=800000

#Set the sink properties
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://qianfeng01:8020/flumedata/taildir/%Y-%m-%d-%H-%M
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.rollInterval=30
a1.sinks.k1.hdfs.rollSize=1024
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=2
a1.sinks.k1.hdfs.roundUnit=minute
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

2) Run the agent

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf/ -f ./taildir-mem-hdfs.properties -n a1 -Dflume.root.logger=INFO,console

3) Test

[root@qianfeng01 ~]# mkdir data1
[root@qianfeng01 ~]# mkdir data2
[root@qianfeng01 ~]# echo "helloworld">>data1/a.txt
[root@qianfeng01 ~]# echo "helloworld">>data2/a.csv
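The taildir source records how far it has read each file in the positionFile, which you can inspect; the JSON below is a sketch of the typical format (the inode and pos values here are hypothetical):

[root@qianfeng01 ~]# cat /root/taildir_position.json
[{"inode":134217857,"pos":11,"file":"/root/data1/a.txt"},{"inode":134217858,"pos":11,"file":"/root/data2/a.csv"}]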

5. Interceptors and Selectors

5.1 Using Interceptors

5.1.1 Common Interceptors

# Timestamp Interceptor:
Adds a timestamp key-value pair to each intercepted event's header; the default key is timestamp.
# Host Interceptor:
Adds a host key-value pair to each event's header; the default key is host, and depending on configuration the value is either the IP address or the host name.
# Static Interceptor:
Lets the user add a custom key-value pair to every event's header.
# Regex Filtering Interceptor:
Keeps or excludes events whose body matches a given regular expression.

5.1.2 Case Study 1

1) Requirement

In a single collection scheme, chain three interceptors (timestamp, host, and static) and use the header values they set in the HDFS sink's path.

2) Collection scheme

[root@qianfeng01 ~]# vim /usr/local/flume/flumeconf/interceptors.properties
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Set the source properties
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = qianfeng01
a1.sources.r1.port = 10086
#Declare the interceptors
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.header = timestamp
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.useIP = false
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i2.preserveExisting = true
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.preserveExisting = true
a1.sources.r1.interceptors.i3.key = girlfriend
a1.sources.r1.interceptors.i3.value = canglaoshi


#Set the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Set the sink properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/test/%{hostname}/%Y-%m-%d-%H-%M
# To use the timestamp header set by the timestamp interceptor, the following property must be false
a1.sinks.s1.hdfs.useLocalTimeStamp = false
a1.sinks.s1.hdfs.filePrefix = %{girlfriend}   
a1.sinks.s1.hdfs.fileSuffix = .wcm
a1.sinks.s1.hdfs.rollInterval = 30
a1.sinks.s1.hdfs.rollSize = 0
a1.sinks.s1.hdfs.rollCount = 0
# The next three properties control directory rolling on HDFS; set round to true to enable it, or false to disable it
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
# To write plain text files (DataStream), the write format must be Text
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

3) Start

[root@qianfeng01 ~]# flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/flumeconf/interceptors.properties -n a1 -Dflume.root.logger=INFO,console

4) Test, then check the resulting HDFS path

[root@qianfeng01 ~]# echo "hello" | nc qianfeng01 10086
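After sending the message, the event should land under a directory named for the host, in a file named for the static header, at a path along these lines (the timestamp parts below are illustrative):

[root@qianfeng01 ~]# hdfs dfs -ls -R /test
...
/test/qianfeng01/2021-06-01-10-30/canglaoshi.1622514600000.wcm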

5.1.3 Case Study 2

1) Requirement

Use the regex filtering interceptor to exclude any event whose body starts with a digit.

2) Write the scheme

[root@qianfeng01 ~]# vim /usr/local/flume/flumeconf/regex-interceptor.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

#Set the source properties
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = qianfeng01
a1.sources.r1.port = 10086
#Declare the regex filtering interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
#Do not put quotes around the regular expression
a1.sources.r1.interceptors.i1.regex = ^[0-9].*
a1.sources.r1.interceptors.i1.excludeEvents=true

#Set the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#Set the sink properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/test/%Y-%m-%d-%H-%M
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.filePrefix = Flume
a1.sinks.s1.hdfs.fileSuffix = .wcm
a1.sinks.s1.hdfs.rollInterval = 30
a1.sinks.s1.hdfs.rollSize = 0
a1.sinks.s1.hdfs.rollCount = 0
# The next three properties control directory rolling on HDFS; set round to true to enable it, or false to disable it
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
# To write plain text files (DataStream), the write format must be Text
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

3) Start

[root@qianfeng01 ~]# flume-ng agent -c /usr/local/flume/conf -f /usr/local/flume/flumeconf/regex-interceptor.properties -n a1 -Dflume.root.logger=INFO,console

4) Test

[root@qianfeng01 ~]# echo "helloworld" | nc qianfeng01 10086
[root@qianfeng01 ~]# echo "1helloworld" | nc qianfeng01 10086
[root@qianfeng01 ~]# echo "hellokitty" | nc qianfeng01 10086
[root@qianfeng01 ~]# echo "9hellokitty" | nc qianfeng01 10086
[root@qianfeng01 ~]# echo "java" | nc qianfeng01 10086
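Expected result: only helloworld, hellokitty, and java reach HDFS; the two lines starting with 1 and 9 match ^[0-9].* and are excluded.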

5.1.4 Custom Interceptors

1) Requirement

Store events whose body starts with a digit under /flumedata/number/ on HDFS,
events whose body starts with a letter under /flumedata/character/,
and everything else under /flumedata/others/.

2) Analysis

Among the built-in interceptors, the timestamp and host interceptors cannot implement this logic at all, and while the static interceptor can put a custom key-value pair into each event's header, it cannot inspect what character the event body starts with.

So we write a custom interceptor that analyzes the body of each intercepted event:
if the body starts with a digit, add the header pair "type":"number";
if it starts with a letter, add the header pair "type":"character";
if it starts with any other character, add the header pair "type":"others".

3) Write the custom interceptor

pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.8.0</version>
    </dependency>
</dependencies>

The interceptor's logic:

package com.qf.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;
import java.util.Map;

/**
 * A custom interceptor that:
 * adds the header pair "type":"number" if the body starts with a digit,
 * adds "type":"character" if the body starts with a letter,
 * and adds "type":"others" otherwise.
 *
 * Step 1: implement Flume's Interceptor interface.
 */
public class MyInterceptor implements Interceptor {
    @Override
    public void initialize() {

    }

    /**
     * Override the single-event intercept method.
     * @param event the event that was just intercepted
     * @return the (possibly annotated) event
     */
    @Override
    public Event intercept(Event event) {
        // Get the body
        byte[] body = event.getBody();
        if (body.length == 0) {
            // Guard against empty bodies to avoid an out-of-bounds access
            event.getHeaders().put("type", "others");
        } else if (body[0] >= '0' && body[0] <= '9') {
            // Starts with a digit: put type=number into the header
            event.getHeaders().put("type", "number");
        } else if ((body[0] >= 'A' && body[0] <= 'Z') || (body[0] >= 'a' && body[0] <= 'z')) {
            // Starts with a letter: put type=character into the header
            event.getHeaders().put("type", "character");
        } else {
            event.getHeaders().put("type", "others");
        }
        return event;
    }

    /**
     * Override the batch intercept method.
     * @param events the batch of intercepted events
     * @return the events, each annotated by the single-event method
     */
    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            // Delegate each event to the single-event intercept method
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {

    }
    // Note: the builder must be a static inner class
    public static class MyBuilder implements  Builder{
        /**
         * The framework calls this method to create the interceptor instance.
         * @return a new interceptor instance
         */
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

4) Package the jar and copy it into Flume's lib directory
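A sketch of this step, assuming the Maven build produces a jar named my-interceptor-1.0.jar (the actual artifact name depends on your pom.xml):

[root@qianfeng01 ~]# mvn clean package
[root@qianfeng01 ~]# cp target/my-interceptor-1.0.jar /usr/local/flume/lib/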

5) Write the scheme

[root@qianfeng01 flumeconf]# vim my-interceptor.properties

#list name
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = s1 s2 s3
a1.sources.r1.channels = c1 c2 c3
a1.sinks.s1.channel = c1
a1.sinks.s2.channel = c2
a1.sinks.s3.channel = c3


a1.sources.r1.type = syslogtcp
a1.sources.r1.host = qianfeng01
a1.sources.r1.port = 10086
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type=com.qf.flume.interceptor.MyInterceptor$MyBuilder
a1.sources.r1.selector.type=multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.number = c1
a1.sources.r1.selector.mapping.character = c2
a1.sources.r1.selector.default = c3

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100


#Define the sink properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/flumedata/number/%Y%m%d/%H%M
a1.sinks.s1.hdfs.filePrefix = FlumeData
a1.sinks.s1.hdfs.fileSuffix = .log
a1.sinks.s1.hdfs.rollInterval = 30
a1.sinks.s1.hdfs.rollSize = 0
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

a1.sinks.s2.type = hdfs
a1.sinks.s2.hdfs.path = hdfs://qianfeng01/flumedata/character/%Y%m%d/%H%M
a1.sinks.s2.hdfs.filePrefix = FlumeData
a1.sinks.s2.hdfs.fileSuffix = .log
a1.sinks.s2.hdfs.rollInterval = 30
a1.sinks.s2.hdfs.rollSize = 0
a1.sinks.s2.hdfs.rollCount = 0
a1.sinks.s2.hdfs.round = true  
a1.sinks.s2.hdfs.roundValue = 2
a1.sinks.s2.hdfs.roundUnit = minute
a1.sinks.s2.hdfs.useLocalTimeStamp = true
a1.sinks.s2.hdfs.fileType = DataStream
a1.sinks.s2.hdfs.writeFormat = Text

a1.sinks.s3.type = hdfs
a1.sinks.s3.hdfs.path = hdfs://qianfeng01/flumedata/others/%Y%m%d/%H%M
a1.sinks.s3.hdfs.filePrefix = FlumeData
a1.sinks.s3.hdfs.fileSuffix = .log
a1.sinks.s3.hdfs.rollInterval = 30
a1.sinks.s3.hdfs.rollSize = 0
a1.sinks.s3.hdfs.rollCount = 0
a1.sinks.s3.hdfs.round = true  
a1.sinks.s3.hdfs.roundValue = 2
a1.sinks.s3.hdfs.roundUnit = minute
a1.sinks.s3.hdfs.useLocalTimeStamp = true
a1.sinks.s3.hdfs.fileType = DataStream
a1.sinks.s3.hdfs.writeFormat = Text

6) Start

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f my-interceptor.properties -n a1 -Dflume.root.logger=INFO,console

7) Test

[root@qianfeng02 flumeconf]# echo "1sadfasdfasf" | nc qianfeng01 10086
[root@qianfeng02 flumeconf]# echo "9sadfasdfasf" | nc qianfeng01 10086
[root@qianfeng02 flumeconf]# echo "sadfasdfasf" | nc qianfeng01 10086
[root@qianfeng02 flumeconf]# echo "Sadfasdfasf" | nc qianfeng01 10086
[root@qianfeng02 flumeconf]# echo "-adfasdfasf" | nc qianfeng01 10086
[root@qianfeng02 flumeconf]# echo "@adfasdfasf" | nc qianfeng01 10086
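If the interceptor and the multiplexing selector are working, the six messages are partitioned by their first character; listing the three target directories should confirm it:

[root@qianfeng01 ~]# hdfs dfs -ls -R /flumedata/number /flumedata/character /flumedata/others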

5.2 Using Selectors

5.2.1 Common Selectors

# Replicating Channel Selector (replicating)
Copies each event into every channel attached to the source.

# Multiplexing Channel Selector (multiplexing)
Routes each event to a channel chosen by the value of a designated key in the header.

5.2.2 Case Study: Replicating

1) Write the scheme

[root@qianfeng01 flumeconf]# vim first-replicating.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = s1 s2 s3
a1.sources.r1.channels = c1 c2 c3
a1.sinks.s1.channel = c1
a1.sinks.s2.channel = c2
a1.sinks.s3.channel = c3

#Set the source properties
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = qianfeng01
a1.sources.r1.port = 10086
#Use the replicating selector
a1.sources.r1.selector.type = replicating
# If c3 goes down, the agent keeps running (c3 is marked optional)
a1.sources.r1.selector.optional = c3


#Set the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100

#Set the sink properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/test/s1/%Y-%m-%d-%H-%M
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.filePrefix = Flume
a1.sinks.s1.hdfs.fileSuffix = .wcm
a1.sinks.s1.hdfs.rollInterval = 30
a1.sinks.s1.hdfs.rollSize = 0
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

a1.sinks.s2.type = hdfs
a1.sinks.s2.hdfs.path = hdfs://qianfeng01/test/s2/%Y-%m-%d-%H-%M
a1.sinks.s2.hdfs.useLocalTimeStamp = true
a1.sinks.s2.hdfs.filePrefix = Flume
a1.sinks.s2.hdfs.fileSuffix = .wcm
a1.sinks.s2.hdfs.rollInterval = 30
a1.sinks.s2.hdfs.rollSize = 0
a1.sinks.s2.hdfs.rollCount = 0
a1.sinks.s2.hdfs.round = true  
a1.sinks.s2.hdfs.roundValue = 2
a1.sinks.s2.hdfs.roundUnit = minute
a1.sinks.s2.hdfs.fileType = DataStream
a1.sinks.s2.hdfs.writeFormat = Text

#The third sink should deliver to another agent, so an avro sink is used here
a1.sinks.s3.type = avro
a1.sinks.s3.hostname=qianfeng01
a1.sinks.s3.port = 10087

2) Configure the downstream scheme: the downstream agent's source must be an avro source

[root@qianfeng01 flumeconf]# vim second-avro-mem-logger.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1
a1.sinks = s1
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = qianfeng01
a1.sources.r1.port = 10087

a1.channels.c1.type = memory

a1.sinks.s1.type = logger

3) Start

Note: start the downstream agent first, then the upstream agent.
In a second terminal, start the downstream agent:
[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f second-avro-mem-logger.properties -n a1 -Dflume.root.logger=INFO,console

In the first terminal, start the upstream agent:
[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f first-replicating.properties -n a1 -Dflume.root.logger=INFO,console
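4) Test

To verify replication, send one message and confirm that it appears in the downstream agent's logger output as well as in both HDFS directories (/test/s1 and /test/s2):

[root@qianfeng01 ~]# echo "replicating test" | nc qianfeng01 10086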

5.2.3 Case Study: Multiplexing

1) Requirement:

If the value of the girlfriend key in an event's header is canglaoshi, store the event under a directory named canglaoshi;
if it is bolaoshi, store it under a directory named bolaoshi;
otherwise store it under the others directory.

2) Write the scheme

[root@qianfeng01 flumeconf]# vim multiplexing.properties
#Name and wire the components
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = s1 s2 s3
a1.sources.r1.channels = c1 c2 c3
a1.sinks.s1.channel = c1
a1.sinks.s2.channel = c2
a1.sinks.s3.channel = c3

#Set the source properties
a1.sources.r1.type=http
a1.sources.r1.port=10086
a1.sources.r1.bind = qianfeng01
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
#Use the multiplexing selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = girlfriend
a1.sources.r1.selector.mapping.canglaoshi = c1
a1.sources.r1.selector.mapping.bolaoshi = c2
a1.sources.r1.selector.default = c3


#Set the channel properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100

#Set the sink properties
a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://qianfeng01/%{girlfriend}/%Y-%m-%d-%H-%M
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.hdfs.filePrefix = Flume
a1.sinks.s1.hdfs.fileSuffix = .wcm
a1.sinks.s1.hdfs.rollInterval = 30
a1.sinks.s1.hdfs.rollSize = 0
a1.sinks.s1.hdfs.rollCount = 0
a1.sinks.s1.hdfs.round = true  
a1.sinks.s1.hdfs.roundValue = 2
a1.sinks.s1.hdfs.roundUnit = minute
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text

a1.sinks.s2.type = hdfs
a1.sinks.s2.hdfs.path = hdfs://qianfeng01/%{girlfriend}/%Y-%m-%d-%H-%M
a1.sinks.s2.hdfs.useLocalTimeStamp = true
a1.sinks.s2.hdfs.filePrefix = Flume
a1.sinks.s2.hdfs.fileSuffix = .wcm
a1.sinks.s2.hdfs.rollInterval = 30
a1.sinks.s2.hdfs.rollSize = 0
a1.sinks.s2.hdfs.rollCount = 0
a1.sinks.s2.hdfs.round = true  
a1.sinks.s2.hdfs.roundValue = 2
a1.sinks.s2.hdfs.roundUnit = minute
a1.sinks.s2.hdfs.fileType = DataStream
a1.sinks.s2.hdfs.writeFormat = Text

a1.sinks.s3.type = hdfs
a1.sinks.s3.hdfs.path = hdfs://qianfeng01/others/%Y-%m-%d-%H-%M
a1.sinks.s3.hdfs.useLocalTimeStamp = true
a1.sinks.s3.hdfs.filePrefix = Flume
a1.sinks.s3.hdfs.fileSuffix = .wcm
a1.sinks.s3.hdfs.rollInterval = 30
a1.sinks.s3.hdfs.rollSize = 0
a1.sinks.s3.hdfs.rollCount = 0
a1.sinks.s3.hdfs.round = true  
a1.sinks.s3.hdfs.roundValue = 2
a1.sinks.s3.hdfs.roundUnit = minute
a1.sinks.s3.hdfs.fileType = DataStream
a1.sinks.s3.hdfs.writeFormat = Text

3) Start

[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f multiplexing.properties -n a1 -Dflume.root.logger=INFO,console

4) Test

[root@qianfeng01 ~]# curl -X POST -d '[{"headers":{"girlfriend":"canglaoshi"},"body":"helloworld"}]' http://qianfeng01:10086

[root@qianfeng01 ~]# curl -X POST -d '[{"headers":{"girlfriend":"bolaoshi"},"body":"bolaoshi"}]' http://qianfeng01:10086

[root@qianfeng01 ~]# curl -X POST -d '[{"headers":{"girlfriend":"others"},"body":"others"}]' http://qianfeng01:10086

6. Flume Failover and Load Balancing

6.1 Overview

Flume failover means that when a sink (or the agent downstream of it) goes down, another sink in the same sink group takes over the data.
Flume load balancing means that the sinks in a sink group share the event load as evenly as possible.

6.2 Failover Case Study

1) Write the upstream scheme

[root@qianfeng01 ~]# vim first-processor.properties

#list names
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1


# source
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = qianfeng01
a1.sources.r1.port = 10086

# channel
a1.channels.c1.type = memory

# sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = qianfeng02
a1.sinks.k1.port = 10087

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = qianfeng03
a1.sinks.k2.port = 10088

#Configure the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

2) Downstream scheme on qianfeng02

[root@qianfeng02 flumeconf]# vim second-processor.properties

#list names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


# source
a1.sources.r1.type = avro
a1.sources.r1.bind = qianfeng02
a1.sources.r1.port = 10087

# channel
a1.channels.c1.type = memory

# sink
a1.sinks.k1.type = logger

3) Downstream scheme on qianfeng03

[root@qianfeng03 flumeconf]# vim third-processor.properties

#list names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


# source
a1.sources.r1.type = avro
a1.sources.r1.bind = qianfeng03
a1.sources.r1.port = 10088

# channel
a1.channels.c1.type = memory

# sink
a1.sinks.k1.type = logger

4) Start

Start the two downstream agents first:
[root@qianfeng02 flumeconf]# flume-ng agent -c ../conf -f second-processor.properties -n a1 -Dflume.root.logger=INFO,console
[root@qianfeng03 flumeconf]# flume-ng agent -c ../conf -f third-processor.properties -n a1 -Dflume.root.logger=INFO,console
Then start the upstream agent:
[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f first-processor.properties -n a1 -Dflume.root.logger=INFO,console

5) Test

[root@qianfeng02 ~]# echo "helloworld" | nc qianfeng01 10086

Because k1 has the highest priority, the data shows up on qianfeng02.
To simulate a failure, kill the agent on qianfeng02 with Ctrl+C; the data then shows up on qianfeng03.

6.3 Load Balancing Case Study

1) Write the upstream scheme

[root@qianfeng01 ~]# vim first-processor.properties

#list names
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1


# source
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = qianfeng01
a1.sources.r1.port = 10086

# channel
a1.channels.c1.type = memory

# sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = qianfeng02
a1.sinks.k1.port = 10087

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = qianfeng03
a1.sinks.k2.port = 10088

#Configure the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
# Round-robin
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut= 30000
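Besides round_robin, the load-balancing sink processor also supports a random selector; switching is a one-line change:

a1.sinkgroups.g1.processor.selector = random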

2) Downstream scheme on qianfeng02

[root@qianfeng02 flumeconf]# vim second-processor.properties

#list names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


# source
a1.sources.r1.type = avro
a1.sources.r1.bind = qianfeng02
a1.sources.r1.port = 10087

# channel
a1.channels.c1.type = memory

# sink
a1.sinks.k1.type = logger

3) Downstream scheme on qianfeng03

[root@qianfeng03 flumeconf]# vim third-processor.properties

#list names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


# source
a1.sources.r1.type = avro
a1.sources.r1.bind = qianfeng03
a1.sources.r1.port = 10088

# channel
a1.channels.c1.type = memory

# sink
a1.sinks.k1.type = logger

4) Start

Start the two downstream agents first:
[root@qianfeng02 flumeconf]# flume-ng agent -c ../conf -f second-processor.properties -n a1 -Dflume.root.logger=INFO,console
[root@qianfeng03 flumeconf]# flume-ng agent -c ../conf -f third-processor.properties -n a1 -Dflume.root.logger=INFO,console
Then start the upstream agent:
[root@qianfeng01 flumeconf]# flume-ng agent -c ../conf -f first-processor.properties -n a1 -Dflume.root.logger=INFO,console

5) Test

[root@qianfeng02 ~]# echo "b1" | nc qianfeng01 10086
[root@qianfeng02 ~]# echo "b2" | nc qianfeng01 10086
[root@qianfeng02 ~]# echo "b3" | nc qianfeng01 10086
[root@qianfeng02 ~]# echo "b4" | nc qianfeng01 10086
[root@qianfeng02 ~]# echo "b5" | nc qianfeng01 10086
[root@qianfeng02 ~]# echo "b6" | nc qianfeng01 10086
[root@qianfeng02 ~]# echo "b7" | nc qianfeng01 10086
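With the round_robin selector, messages b1 through b7 should be distributed roughly alternately between the consoles on qianfeng02 and qianfeng03 (backoff can skew the order if a sink is temporarily marked as failed).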

