Flume

塞纳河畔的王子

Original article: https://blog.csdn.net/qq_38078738/article/details/106311599

Apache Flume

I. Overview

http://flume.apache.org/

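Flume is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving log data. Its simple, flexible architecture is based on streaming data flows. Flume is fault tolerant, with multiple failover and recovery mechanisms, and its simple, extensible data model supports online analytic applications.

Architecture

Flume Event

An Event is the unit of data in a flow. Its payload (Body) is one collected record, and its Header can carry optional key-value metadata.

Flume Agent

An Agent is a JVM process that contains three core components (Source, Channel, Sink). It moves data from an external system to a destination where the data is stored.

Agent Source

The Source receives or collects data, wraps each record in an Event (Header[k, v] + Body[one record]), and puts the Event on the Channel.

Agent Channel

The Channel works like a write buffer. It is essentially a queue of Events (first in, first out: FIFO).

Agent Sink

The Sink handles the final processing of the Events in the Channel, writing the collected data to the configured external storage system.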

II. Quick Start

Configuration File Syntax

# example.conf: A single-node Flume configuration
# Name the components on this agent

# The agent has one source, named r1
<agent-name>.sources = r1
# The agent has one sink, named k1
<agent-name>.sinks = k1
# The agent has one channel, named c1
<agent-name>.channels = c1

# Describe/configure the source
# Settings for one source
<agent-name>.sources.r1.type = netcat
<agent-name>.sources.r1.bind = localhost
<agent-name>.sources.r1.port = 44444

# Describe the sink
# Settings for one sink; logger writes the collected data to the console as log entries
<agent-name>.sinks.k1.type = logger

# Use a channel which buffers events in memory
<agent-name>.channels.c1.type = memory
<agent-name>.channels.c1.capacity = 1000
<agent-name>.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
<agent-name>.sources.r1.channels = c1
<agent-name>.sinks.k1.channel = c1

Environment Setup

[root@hadoop ~]# tar -zxf apache-flume-1.7.0-bin.tar.gz -C /usr
[root@hadoop ~]# cd /usr/apache-flume-1.7.0-bin/

Hands-On Practice

Simple Example

Configuration file

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Starting the Flume Agent

The general form of the command that starts an Agent is:

bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

[root@hadoop apache-flume-1.7.0-bin]# bin/flume-ng agent --conf conf --conf-file conf/simple.conf --name a1 -Dflume.root.logger=INFO,console

Enabling the Windows Telnet client

Enable the Telnet Client feature under Control Panel > Programs and Features > Turn Windows features on or off.

Installing the Telnet client on Linux

[root@hadoop ~]# yum install telnet

Sending data to the Flume Source with the Telnet client

C:\Users\Administrator>telnet 192.168.12.129 44444

Common Source Types

Netcat

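Netcat is mostly used in test environments. The source listens on a port, and clients send it data over TCP/IP; each line of text received becomes an event.

Exec

Exec uses the output of a Linux command as the data source.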

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /usr/apache-flume-1.7.0-bin/access.log
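
For reference, a complete agent built around the Exec source might look like the sketch below. It reuses the logger sink and memory channel from the Simple Example. The switch to `tail -F` and the `restart` settings are assumptions layered on the original snippet, not part of it.

```properties
# Sketch: Exec source -> memory channel -> logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Run a command and turn each line of its output into an event
a1.sources.r1.type = exec
# Assumption: -F instead of -f, so tail re-opens the file after it is rotated
a1.sources.r1.command = tail -F /usr/apache-flume-1.7.0-bin/access.log
# Assumption: restart the command if it exits, waiting 10 s between attempts
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Keep in mind that the Exec source offers no delivery guarantee: lines the command emits while the agent is down or the channel is full can be lost.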

Spooling Directory

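The Spooling Directory source uses the contents of the text files in a given directory on the local file system as the data source.

Note: once a file has been fully ingested, it is automatically renamed with the .COMPLETED suffix appended.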

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
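
The renaming behaviour noted above can be tuned. The fragment below is a sketch of a few optional Spooling Directory properties from the Flume 1.7 user guide; the values, and the `.tmp` naming convention in particular, are illustrative assumptions.

```properties
# Sketch: optional Spooling Directory settings (values are illustrative)
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
# Suffix appended to a file once it has been fully ingested (this is the default)
a1.sources.r1.fileSuffix = .COMPLETED
# never (the default) keeps ingested files; immediate deletes them
a1.sources.r1.deletePolicy = never
# Add a header holding the absolute path of the file each event came from
a1.sources.r1.fileHeader = true
# Assumption: writers use a .tmp name while a file is still being written
a1.sources.r1.ignorePattern = ^.*\\.tmp$
```

Files should be complete before they appear in the directory, and file names should not be reused; the source expects files to be immutable once it can see them.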

Avro

Avro is the framework in the Hadoop ecosystem for serializing and deserializing objects.

The Avro Source starts an Avro server that receives data sent by Avro clients, much like the Netcat source.

a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 33333

Use the bundled avro-client to send data to the Avro server:

[root@hadoop apache-flume-1.7.0-bin]# bin/flume-ng avro-client --host 192.168.12.129 --port 33333 --filename /root/splits.txt

Kafka Source (omitted)

Uses the messages in a Kafka topic as the data source.
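
Although the Kafka Source is not covered in detail here, a minimal configuration might look like the following sketch. The property names follow the Flume 1.7 user guide (older releases use different names); the broker address `hadoop:9092` and the topic name `flume-demo` are hypothetical.

```properties
# Sketch: Kafka source -> memory channel -> logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
# Assumption: a Kafka broker reachable at hadoop:9092
a1.sources.r1.kafka.bootstrap.servers = hadoop:9092
# Assumption: hypothetical topic name
a1.sources.r1.kafka.topics = flume-demo
# Consumer group used by the source
a1.sources.r1.kafka.consumer.group.id = flume
# Maximum number of messages written to the channel in one batch
a1.sources.r1.batchSize = 100

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```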

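Common Channel Types

Memory

Stores Events in memory. This is the most commonly used channel.

Note: Events held in memory are lost if the agent process stops or crashes.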

a1.channels.c1.type = memory

JDBC

Stores Events in an embedded Derby database.

Note: the JDBC channel does not support other database products.

a1.channels.c1.type = jdbc

File Channel

Stores Events in files on the local file system.

a1.channels.c1.type = file
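
In practice the File Channel is usually given explicit storage locations. A sketch, assuming hypothetical directories under /root/flume:

```properties
# Sketch: File Channel with explicit directories
a1.channels.c1.type = file
# Assumption: hypothetical path for the checkpoint file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
# Assumption: hypothetical path (comma-separated list allowed) for the event log files
a1.channels.c1.dataDirs = /root/flume/data
```

If these two properties are left out, Flume falls back to directories under ~/.flume/file-channel/.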

Spillable Memory Channel

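A memory channel that spills to disk: once the number of Events held in memory reaches a threshold, further Events are written to disk automatically.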
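
A possible configuration for this channel is sketched below. The property names are taken from the Flume 1.7 user guide, which also marks this channel as experimental; the capacities and directories are illustrative assumptions.

```properties
# Sketch: Spillable Memory Channel
a1.channels.c1.type = SPILLABLEMEMORY
# Events kept in the in-memory queue before spilling starts
a1.channels.c1.memoryCapacity = 10000
# Events that may be held on disk once memory is full
a1.channels.c1.overflowCapacity = 1000000
# Assumption: hypothetical directories for the on-disk overflow
a1.channels.c1.checkpointDir = /root/flume/spill/checkpoint
a1.channels.c1.dataDirs = /root/flume/spill/data
```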

Kafka Channel (omitted)
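
For completeness, a Kafka Channel definition might look like this sketch, which buffers Events in a Kafka topic. Property names follow the Flume 1.7 user guide; the broker address and topic name are hypothetical.

```properties
# Sketch: Kafka Channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
# Assumption: a Kafka broker reachable at hadoop:9092
a1.channels.c1.kafka.bootstrap.servers = hadoop:9092
# Assumption: hypothetical topic used to hold the channel's events
a1.channels.c1.kafka.topic = flume-channel
a1.channels.c1.kafka.consumer.group.id = flume
```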

Common Sink Types

Logger

Writes the data to the console (shown as INFO-level log entries).

HDFS

Persists the data in HDFS.

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
# Round the timestamp down to 10 minutes: one data directory per 10-minute window, shared by all events in that window
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

Exception: Caused by: java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null

Fix: the time escapes in hdfs.path require a timestamp header on every Event. Add one with the timestamp interceptor.

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

By default the HDFS Sink stores the collected data in the SequenceFile format. To store the records as plain text, set hdfs.fileType to DataStream.
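
Putting the HDFS Sink settings together, a plain-text variant could be sketched as follows. The `useLocalTimeStamp` and roll settings are assumptions added for illustration, not part of the original configuration.

```properties
# Sketch: HDFS sink writing plain text
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
# Write event bodies as-is instead of wrapping them in a SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Assumption: use the agent's clock for the time escapes,
# as an alternative to adding the timestamp interceptor
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Assumption: roll files every 60 s only (0 disables size- and count-based rolling)
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```

Without explicit roll settings the sink uses small defaults, which tends to produce many small files in HDFS.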

Avro Sink

Sends data through an Avro client to the specified Avro server. Chaining several Flume Agents this way builds a data collection cluster.

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

File Roll Sink

Stores the data on the local file system.

# Describe the sink
# Note: the target directory must exist before the agent starts
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/fileRoll

Null Sink

A black hole for output: discards every Event it takes from the Channel.
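
The Null Sink needs only its type. A one-line sketch, which can be handy for measuring source and channel throughput without any storage behind them:

```properties
# Sketch: discard everything taken from channel c1
a1.sinks.k1.type = null
a1.sinks.k1.channel = c1
```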

HBase Sink

Stores the collected data in HBase.

a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer

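ElasticSearch Sink

Stores the collected data in an ElasticSearch cluster.

Kafka Sink (omitted)

Stores the collected data in a Kafka cluster. From Kafka the data can feed compute frameworks such as Flink or Spark, or go through data cleansing followed by MapReduce computation.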
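
Although the Kafka Sink is not covered in detail here, a minimal definition might look like this sketch. Property names follow the Flume 1.7 user guide; the broker address and topic name are hypothetical.

```properties
# Sketch: Kafka sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Assumption: a Kafka broker reachable at hadoop:9092
a1.sinks.k1.kafka.bootstrap.servers = hadoop:9092
# Assumption: hypothetical topic that receives the events
a1.sinks.k1.kafka.topic = flume-events
# Number of events sent to Kafka in one batch
a1.sinks.k1.flumeBatchSize = 100
# Wait for the partition leader to acknowledge each write
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.channel = c1
```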

III. Combined Use Cases

Multi-Source Agent Example
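
One way to configure an agent with several sources is sketched below: two sources write to the same channel, and a single sink drains it. The choice of a netcat source plus a spooldir source, and the /root/data path, are assumptions made for illustration.

```properties
# Sketch: two sources feeding one channel
a1.sources = r1 r2
a1.sinks = k1
a1.channels = c1

# Source 1: text lines received over TCP
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444

# Source 2: files dropped into a directory (assumed path)
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /root/data

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Both sources write to c1; the sink reads from c1
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sinks.k1.channel = c1
```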

Interceptors

Interceptors are attached to a Source and decorate or filter Events in the configured order.

Basic Interceptors Example

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = bj
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = timestamp

Regex Filtering Interceptor

Keeps only the Events whose body matches the regular expression (here, lines that start with INFO).

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^INFO.*$
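
The filter can also be inverted. The sketch below uses the interceptor's `excludeEvents` property to drop matching Events instead of keeping them; the DEBUG pattern is an illustrative assumption.

```properties
# Sketch: drop every event whose body starts with DEBUG, keep the rest
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^DEBUG.*$
# true = discard matching events; false (the default) = keep only matching events
a1.sources.r1.interceptors.i1.excludeEvents = true
```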

Regex Extractor Interceptor

Extracts the regex capture groups from the Event body and stores them as Event headers (here, the first word of each line goes into the type header).

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# Note: backslashes in the regex must be doubled in a properties file (for example \\w)
a1.sources.r1.interceptors.i1.regex = ^(\\w*)\\s.*$
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = type

UUID Interceptor

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
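
The UUID interceptor accepts a few optional properties, sketched below with illustrative values. The header name `eventId` and the prefix are assumptions; by default the header is called `id`.

```properties
# Sketch: UUID interceptor with optional settings
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
# Assumption: store the generated UUID in a header named eventId
a1.sources.r1.interceptors.i1.headerName = eventId
# Keep an existing value if the header is already set
a1.sources.r1.interceptors.i1.preserveExisting = true
# Assumption: hypothetical constant prefix prepended to each UUID
a1.sources.r1.interceptors.i1.prefix = bj-
```

The class lives in the Morphline Solr sink module, so that module's jar has to be on the agent's classpath.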

Channel Selector

Lets a Source choose one or more of its channels for each Event according to a preset rule.

Replicating Channel Selector

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.selector.type = replicating

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/fileRoll

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Multiplexing Channel Selector

a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type= regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(\\w*)\\s.*$
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = level

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = level
a1.sources.r1.selector.mapping.ERROR = c1
a1.sources.r1.selector.mapping.INFO = c2
a1.sources.r1.selector.default = c2

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/fileRoll

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Sink Group

Load-Balancing Sink Group

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.12.129
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/dir1

a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/dir2

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
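
Besides load balancing, a sink group can also run in failover mode, where the highest-priority healthy sink receives all Events. The fragment below is a sketch of the sink group lines only, intended to replace the `load_balance` block in the configuration above; the priority and penalty values are illustrative assumptions.

```properties
# Sketch: failover sink group (k1 preferred, k2 as standby)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# The sink with the larger priority value is used first
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper bound, in milliseconds, on the back-off applied to a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
```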

IV. Collecting Nginx Access Logs

Installing the Nginx server

[root@hadoop fileRoll]# yum install gcc-c++ perl-devel pcre-devel openssl-devel zlib-devel

Build and install

[root@hadoop nginx-1.11.1]# tar -zxf nginx-1.11.1.tar.gz -C .
[root@hadoop nginx-1.11.1]# cd nginx-1.11.1
[root@hadoop nginx-1.11.1]# ./configure --prefix=/usr/local/nginx
[root@hadoop nginx-1.11.1]# make && make install
[root@hadoop nginx-1.11.1]# cd /usr/local/nginx/
[root@hadoop nginx]# ll
total 4
drwxr-xr-x. 2 root root 4096 Aug 26 16:42 conf
drwxr-xr-x. 2 root root   40 Aug 26 16:42 html
drwxr-xr-x. 2 root root    6 Aug 26 16:42 logs
drwxr-xr-x. 2 root root   19 Aug 26 16:42 sbin

Starting the Nginx server

[root@hadoop nginx]# sbin/nginx -c conf/nginx.conf
[root@hadoop nginx]# ps -ef | grep nginx
root       6959      1  0 16:45 ?        00:00:00 nginx: master process sbin/nginx -c conf/nginx.conf
nobody     6960   6959  0 16:45 ?        00:00:00 nginx: worker process
root       6984   4177  0 16:45 pts/3    00:00:00 grep --color=auto nginx

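URL: http://hadoop:80

Inspect access.log. System metrics such as PV, UV, the user distribution, and system robustness can be computed from this log file.

[root@hadoop nginx]# more logs/access.log
# client IP   request time   request method   requested resource   status code   response size in bytes   browser info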
192.168.12.1 - - [26/Aug/2019:16:45:57 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
192.168.12.1 - - [26/Aug/2019:16:45:57 +0800] "GET /favicon.ico HTTP/1.1" 404 571 "http://hadoop/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"

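Log Rotation

Logs are normally rotated once a day.

Shell script for log rotation

[root@hadoop nginx]# vi nginx.sh
#!/bin/bash
# Directory that holds the live log files
logs_path="/usr/local/nginx/logs"

# Nginx pid file
pid_path="/usr/local/nginx/logs/nginx.pid"

# Move the current log into the history directory under a time-stamped name
# (the history directory must already exist)
mv ${logs_path}/access.log /usr/local/nginx/history/access_$(date -d "yesterday" +"%Y-%m-%d-%H:%M").log

# Signal the nginx master process (USR1) to reopen its log files
kill -USR1 `cat ${pid_path}`

Make the script executable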

[root@hadoop nginx]# chmod u+x nginx.sh
[root@hadoop nginx]# ll
total 8
drwx------. 2 nobody root    6 Aug 26 16:45 client_body_temp
drwxr-xr-x. 2 root   root 4096 Aug 26 16:42 conf
drwx------. 2 nobody root    6 Aug 26 16:45 fastcgi_temp
drwxr-xr-x. 2 root   root    6 Aug 26 16:57 history
drwxr-xr-x. 2 root   root   40 Aug 26 16:42 html
drwxr-xr-x. 2 root   root   58 Aug 26 16:45 logs
-rwxr--r--. 1 root   root  221 Aug 26 17:02 nginx.sh
drwx------. 2 nobody root    6 Aug 26 16:45 proxy_temp
drwxr-xr-x. 2 root   root   19 Aug 26 16:42 sbin
drwx------. 2 nobody root    6 Aug 26 16:45 scgi_temp
drwx------. 2 nobody root    6 Aug 26 16:45 uwsgi_temp

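Adding a scheduled job on Linux (cron)

# Full 7-field cron expression (as used by Quartz): second minute hour day month weekday year
# Linux crontab uses 5 fields, starting from the minute
# Every 3 minutes:  */3 * * * *
# Once a day at 01:00:  0 1 * * *
[root@hadoop nginx]# crontab -e
*/3 * * * * /usr/local/nginx/nginx.sh

Configuring and Starting the Flume Agent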

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/nginx/history
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/nginx/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Data Cleansing

See: Hadoop

Distributed Computation

See: Hadoop

Data Visualization with Highcharts

See: Highcharts

https://www.highcharts.com.cn/docs

Exercises

1. Plot a line chart of the system's PV and UV.
2. Plot a pie chart of system robustness (the share of each HTTP status code).
3. Plot the system's daily active-visit trend.
4. Optional: plot the user distribution (reverse-resolve each IP address to a geographic location and show the distribution on a map).
