Flume Interceptors
Interceptors sit between a source and its channel: before the data an agent reads is handed on, the interceptor can add headers to each event, which later stages can use for routing and filtering.
1) Timestamp Interceptor
Intercepts events from the source and adds a timestamp header, e.g. header: {timestamp=12425367272}
Configuring the interceptor:
# interceptor alias
a1.sources.r1.interceptors = i1
# interceptor type
a1.sources.r1.interceptors.i1.type = timestamp
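The timestamp header holds epoch time in milliseconds. A quick way to sanity-check such a value with GNU date (the sample value below is the one from the console run later in this article; output shown in UTC):

```shell
# timestamp headers are epoch milliseconds; divide by 1000 and
# feed GNU date to make the value readable
ts_millis=1572778457556
date -u -d "@$((ts_millis / 1000))" '+%Y-%m-%d %H:%M:%S'
# → 2019-11-03 10:54:17
```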
1. Printing to the console
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_time.conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: print to the console
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_time.conf --name a1 -Dflume.root.logger=INFO,console
One of the resulting events:
Event: { headers:{timestamp=1572778457556} body: 61 61 61 61 aaaa }
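The logger sink prints the body as hex bytes followed by a text preview; 0x61 is ASCII 'a', which is easy to confirm in a shell:

```shell
# reconstruct the body bytes 61 61 61 61 from the event above
printf '\x61\x61\x61\x61\n'
# → aaaa
```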
2. HDFS sink, using the timestamp interceptor
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume_data/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = event-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it (pointing --conf-file at the file this configuration was saved in):
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/<this-config-file> --name a1 -Dflume.root.logger=INFO,console
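The %Y-%m-%d/%H escapes in hdfs.path are resolved from each event's timestamp header, which is why the timestamp interceptor must be attached. The expansion can be previewed with GNU date; the header value is the sample from the console run above, shown in UTC:

```shell
# how /user/flume_data/%Y-%m-%d/%H resolves for timestamp=1572778457556
date -u -d @1572778457 '+/user/flume_data/%Y-%m-%d/%H'
# → /user/flume_data/2019-11-03/10
```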
2) Host Interceptor
Intercepts events from the source and adds the host name or IP of the agent's host as a header.
1. Printing to the console
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_host_conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: print to the console
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/in_host_conf --name a1 -Dflume.root.logger=INFO,console
One of the resulting events:
Event: { headers:{host=192.168.2.101} body: 61 61 61 61 aaaa }
By default the interceptor records the IP. Add the following line to record the host name instead:
a1.sources.r1.interceptors.i1.useIP = false
which yields events like:
Event: { headers:{host=hadoop01} body: 32 30 31 39 2D 31 31 2D 30 33 20 30 31 3A 31 39 2019-11-03 01:19 }
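Roughly speaking, useIP=true records what the local IP lookup returns and useIP=false records the host name, i.e. the same values these commands print:

```shell
hostname                           # host name, used when useIP = false
hostname -i 2>/dev/null || true    # resolved IP, used when useIP = true (flag support varies by OS)
```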
2. Two interceptors chained, with an HDFS sink
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/host_ts_conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach both interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.useIP = false
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume_data/%{host}/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = event-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/host_ts_conf --name a1 -Dflume.root.logger=INFO,console
Resulting file: /user/flume_data/hadoop01/2019-11-03/03/event.1572781570881.log
3) Static Interceptor
The static interceptor attaches a fixed, user-defined header to every event:
header {K, V}, where both the key K and the value V are chosen by you.
It requires type = static plus a key and a value.
Write the following into /home/hadoop/apps/apache-flume-1.8.0-bin/conf/statis_conf:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/zookeeper.out
# attach the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = filename
a1.sources.r1.interceptors.i1.value = zoo
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: print to the console
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/statis_conf --name a1 -Dflume.root.logger=INFO,console
One of the resulting events:
19/11/03 04:32:29 INFO sink.LoggerSink: Event: { headers:{filename=zoo} body: 32 30 31 39 2D 31 31 2D 30 33 20 30 31 3A 31 39 2019-11-03 01:19 }
4) A Combined Example
Task:
There are three log producers A, B and C. A and B are deployed on node hadoop01, C on node hadoop02:
access.log: user activity log, produced by A
nginx.log: nginx server log, produced by B
web.log: web application log, produced by C
Requirement: collect access.log, nginx.log and web.log from A, B and C, aggregate them on the hadoop03 machine, and write them to HDFS
with the following directory layout:
/source/logs/access/<date>/**
/source/logs/nginx/<date>/**
/source/logs/web/20160101/**
Analysis:
A on hadoop01: access.log, at /home/hadoop/apps/apache-flume-1.8.0-bin/conf/access.log
B on hadoop01: nginx.log, at /home/hadoop/apps/apache-flume-1.8.0-bin/conf/nginx.log
C on hadoop02: web.log, at /home/hadoop/apps/apache-flume-1.8.0-bin/conf/web.log
The requirements call for a fan-in deployment: agents in parallel, chained into one collector. Each producer is watched by its own agent, and the three agents monitoring A, B and C use avro sinks. Because the files must land in separate HDFS directories, each of these agents attaches a static interceptor whose key is uniformly filename and whose value is the name of the file it tails. The collector agent on hadoop03 uses an avro source; its bind address and port must match the avro sinks of the A, B and C agents. Finally, the collector defines two more interceptors: a timestamp interceptor and a static interceptor with key type and value log.
(Figure: multi-agent fan-in deployment diagram; original image unavailable)
Implementation:
Agent configuration monitoring machine A:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/apps/apache-flume-1.8.0-bin/conf/access.log
# attach the static interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = filename
a1.sources.r1.interceptors.i1.value = access
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: avro, to the collector on hadoop03
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 45551
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent configuration monitoring machine B:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r2
# channel alias
a1.channels = c2
# sink alias
a1.sinks = k2
# source configuration: where the data comes from
a1.sources.r2.type = exec
a1.sources.r2.command = tail -100 /home/hadoop/apps/apache-flume-1.8.0-bin/conf/nginx.log
# attach the static interceptor
a1.sources.r2.interceptors = i1
a1.sources.r2.interceptors.i1.type = static
a1.sources.r2.interceptors.i1.key = filename
a1.sources.r2.interceptors.i1.value = nginx
# channel configuration: in-memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# sink configuration: avro, to the collector on hadoop03
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 45551
# bind the source and the sink to the channel
a1.sources.r2.channels = c2
a1.sinks.k2.channel = c2
Agent configuration monitoring machine C:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: where the data comes from
a1.sources.r1.type = exec
a1.sources.r1.command = tail -100 /home/hadoop/apps/apache-flume-1.8.0-bin/conf/web.log
# attach the static interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = filename
a1.sources.r1.interceptors.i1.value = web
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: avro, to the collector on hadoop03
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 45551
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Collector configuration on hadoop03, aggregating A, B and C:
# Name this agent's source, channel and sink; a1 is the agent's name
# source alias
a1.sources = r1
# channel alias
a1.channels = c1
# sink alias
a1.sinks = k1
# source configuration: avro
a1.sources.r1.type = avro
# bind address and port here must match the upstream avro sinks
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 45551
# attach the interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = type
a1.sources.r1.interceptors.i2.value = log
# channel configuration: in-memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sink configuration: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume_data/%{type}/%{filename}/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = event-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
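An event tailed from access.log arrives at the collector with headers like {filename=access} from its leaf agent, plus {type=log} and a timestamp added by the collector. Assuming the collector's path uses both headers, e.g. /user/flume_data/%{type}/%{filename}/%Y-%m-%d/%H, the expansion can be previewed with GNU date (hypothetical timestamp, UTC shown):

```shell
# how the path escapes resolve for type=log, filename=access,
# timestamp=1572781570881
type=log; filename=access
date -u -d @1572781570 "+/user/flume_data/$type/$filename/%Y-%m-%d/%H"
# → /user/flume_data/log/access/2019-11-03/11
```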
Start them back to front, collector first:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/collector-conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/access-conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/nginx-conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/web-conf --name a1 -Dflume.root.logger=INFO,console
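For an end-to-end test, the tailed files need some content. A small helper that seeds the three logs with one timestamped line each (LOG_DIR is a stand-in for illustration; the configurations above tail files under apache-flume-1.8.0-bin/conf/):

```shell
# append one timestamped sample line to each of the three logs
LOG_DIR="${LOG_DIR:-/tmp/flume_demo}"
mkdir -p "$LOG_DIR"
for f in access nginx web; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') sample $f event" >> "$LOG_DIR/$f.log"
done
ls "$LOG_DIR"
```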