一. Introduction to Apache Flume
Flume streams log data from many different sources into Hadoop or other destinations.
- A reliable, available, and efficient distributed data-collection service
Flume has a simple, flexible architecture based on streaming data flows, with support for fault tolerance, failover, and recovery.
Donated to Apache by Cloudera in 2009; now an Apache top-level project.
二. Flume Architecture
Client: where the data is produced, e.g. a web server
Event: a single unit of data transferred through an Agent; for log data, one event usually corresponds to one line
Agent: an independent JVM process
- Flume is deployed as one or more Agents
- An Agent contains three components: Source, Channel, and Sink
Hello Flume (save the following configuration as h0.conf)
agent.sources = s1
agent.channels = c1
agent.sinks = sk1
#Source: netcat listening on localhost:5678, connected to channel c1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
agent.sources.s1.channels = c1
#Sink: logger, reading from channel c1
agent.sinks.sk1.type = logger
agent.sinks.sk1.channel = c1
#Channel: in-memory
agent.channels.c1.type = memory
bin/flume-ng agent --name agent -f h0.conf -Dflume.root.logger=INFO,console
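Once the agent is running, the netcat source can be tested from another session (assuming nc is installed); every line typed is printed to the console by the logger sink:
nc localhost 5678
hello flume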
Flume Components
- Source
- SourceRunner
- Interceptor
- Channel
- ChannelSelector
- ChannelProcessor
- Sink
- SinkRunner
- SinkProcessor
- SinkSelector
Flume Workflow
An event enters the Agent through a Source, may be modified by Interceptors, is routed by the ChannelSelector/ChannelProcessor into one or more Channels, and is finally drained by the SinkRunner/SinkProcessor through a Sink to the next hop or final destination.
三. Source
1. exec source
- Executes a Linux command and consumes the command's output, e.g. "tail -f"
Property | Default | Description |
---|---|---|
type | - | exec |
command | - | the command to execute, e.g. "tail -f xxx.log" |
shell | - | the shell used to run the command, e.g. "/bin/sh" |
batchSize | 20 | maximum number of lines to batch and send to the channel |
Exercise 1
Use an Exec Source to walk a Linux directory and print the names of all .txt files to the console
- 1. Create the job file practice1:
vi practice1
#name the components
a2.sources = s1
a2.sinks = k1
a2.channels = c1
#source type
a2.sources.s1.type = exec
#sink type
a2.sinks.k1.type = logger
#command to execute
a2.sources.s1.command = /data/ex
#channel configuration
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#wire the source and sink to the channel
a2.sources.s1.channels = c1
a2.sinks.k1.channel = c1
- 2. Write the command script (a shebang is added so the exec source can launch the script directly)
touch /data/ex
echo -e '#!/bin/bash\nfor i in /data/*.txt; do echo $i; done' > /data/ex
chmod 777 /data/ex
- 3. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f practice1 -n a2 -Dflume.root.logger=INFO,console
- 4. The result is as follows
2. spooling directory source
Reads file data from a directory on disk; data is not lost after an agent restart or a failed send, and it can also be used to watch a directory for new files
Property | Default | Description |
---|---|---|
type | - | spooldir |
spoolDir | - | the directory to read files from |
fileSuffix | .COMPLETED | suffix appended to a file once it has been fully read |
deletePolicy | never | what to do with a file after it has been fully read: never or immediate |
Exercise 2: monitor a directory
Requirement:
Use a Spooling Directory Source to watch a directory for new files in real time and print their contents to the console
- 1. Create the job file practice2:
vi practice2
e1.sources = s1
e1.channels = c11
e1.sinks = k1
e1.sinks.k1.type = logger
e1.sources.s1.type = spooldir
e1.sources.s1.spoolDir = /data/p2
e1.channels.c11.type = memory
e1.sources.s1.channels = c11
e1.sinks.k1.channel = c11
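Before starting the agent, make sure the spool directory from the config exists; the spooling directory source fails at startup if it does not:
mkdir -p /data/p2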
- 2. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f practice2 -n e1 -Dflume.root.logger=INFO,console
- 3. The result is as follows
3. http source
- Accepts events via HTTP GET and POST requests
Property | Default | Description |
---|---|---|
type | - | http |
port | - | listening port |
bind | 0.0.0.0 | IP address to bind to |
handler | org.apache.flume.source.http.JSONHandler | fully qualified class name of the handler that turns requests into events |
- 1. Create the job file:
vi http
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = http
a1.sources.s1.port = 5140
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = logger
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f http -n a1 -Dflume.root.logger=INFO,console
- 3. In another session, send JSON with the following command
curl -XPOST localhost:5140 -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello body"}]'
- 4. The Flume console output is as follows
4. avro source
- Listens on an Avro port and receives events from an external Avro client
Property | Default | Description |
---|---|---|
type | - | avro |
bind | - | IP address to bind to |
port | - | port to listen on |
threads | - | maximum number of worker threads |
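A minimal Avro Source definition might look like the following (a sketch; the bind address and port are placeholders); the avro sink example in section 五 shows a complete sender/receiver pair:
a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 4141
a1.sources.s1.channels = c1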
5. taildir source
- Tails a set of files in near real time; the read position of each file is persisted in a JSON position file, so no data is lost when the agent restarts
- 1. Create the job file:
vi taildir
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = TAILDIR
#define the file groups
a1.sources.s1.filegroups = f1 f2
#files/patterns for groups f1 and f2
a1.sources.s1.filegroups.f1 = /data/tail_1/example.log
a1.sources.s1.filegroups.f2 = /data/tail_2/.*log.*
#where to store the position file
a1.sources.s1.positionFile = /data/tail_position/taildir_position.json
#headers added to events from each group
a1.sources.s1.headers.f1.headerKey1 = value1
a1.sources.s1.headers.f2.headerKey1 = value2
a1.sources.s1.headers.f2.headerKey2 = value2-2
a1.sources.s1.fileHeader = true
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = logger
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Create the directories and run the agent
mkdir -p /data/tail_1
mkdir -p /data/tail_2
flume-ng agent -c /opt/install/flume/conf/ -f taildir -n a1 -Dflume.root.logger=INFO,console
- 3. In other sessions, create files and append to them
#session1
tail -f /data/tail_position/taildir_position.json
#session2
touch /data/tail_1/example.log
echo "hello world" >> /data/tail_1/example.log
touch /data/tail_2/hello.log
touch /data/tail_2/test.log.txt
echo "hello spark" >> /data/tail_2/hello.log
echo "hello flume" >> /data/tail_2/hello.log
- 4. The result is as follows:
See the official documentation for more examples.
四. Channel
Memory Channel
- Events are kept in the Java heap. Recommended when a small amount of data loss is acceptable
File Channel
- Events are kept in local files. Highly reliable, but throughput is lower than the Memory Channel (see the sketch below)
Kafka Channel
- Events are stored in a Kafka cluster, which provides both durability and high throughput
JDBC Channel
- Events are stored in a relational database; generally not recommended
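A minimal File Channel configuration might look like the following (a sketch; the checkpoint and data directories are placeholders):
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000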
五. Sink
Responsible for taking events from a Channel and delivering them to the next hop or final destination
1. avro sink
- Acts as an Avro client and sends Avro events to an Avro server (typically an Avro Source)
Property | Default | Description |
---|---|---|
type | - | avro |
hostname | - | IP address or hostname of the Avro server |
port | - | port of the Avro server |
batch-size | 100 | number of events to send in one batch |
- 1. Write the job for the Avro Source (receiving) side:
vi avro_source
#name the components
a2.sources = s1
a2.channels = c1
a2.sinks = sk1
#source
a2.sources.s1.type = avro
a2.sources.s1.bind = localhost
a2.sources.s1.port = 44444
#channel
a2.channels.c1.type = memory
#sink
a2.sinks.sk1.type = logger
#wiring
a2.sources.s1.channels = c1
a2.sinks.sk1.channel = c1
- 2. Write the job for the Avro Sink (sending) side:
vi avro_sink
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /data/customers.csv
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = localhost
a1.sinks.sk1.port = 44444
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 3. Run both agents (the receiving side first)
#session1
flume-ng agent -c /opt/install/flume/conf/ -f avro_source -n a2 -Dflume.root.logger=INFO,console
#session2
flume-ng agent -c /opt/install/flume/conf/ -f avro_sink -n a1 -Dflume.root.logger=INFO,console
- 4. The result is as follows
Exercise 3: tiered collection
Requirement
Create two Agents to implement tiered (multi-hop) collection
Agent-1: use an Exec Source to collect the contents of the "/var/log/secure" log file and send it out through an Avro Sink
Agent-2: use an Avro Source to receive the output of Agent-1 and print it to the console
Hint
Start Agent-2 first, then Agent-1, and think about why (the Avro Sink of Agent-1 can only connect if the Avro Source of Agent-2 is already listening)
- 1) Create the job file:
vi practice3
#name the components
a1.sources = s1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.s1.type = exec
#command to execute
a1.sources.s1.command = tail -f /var/log/secure
#sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 44444
#channel
a1.channels.c1.type = memory
#wiring
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
#name the components
a2.sources = s2
a2.sinks = k2
a2.channels = c2
#source
a2.sources.s2.type = avro
a2.sources.s2.bind = localhost
a2.sources.s2.port = 44444
#sink
a2.sinks.k2.type = logger
#channel
a2.channels.c2.type = memory
#wiring
a2.sources.s2.channels = c2
a2.sinks.k2.channel = c2
- 2) Start the agents
#session1
flume-ng agent -c /opt/install/flume/conf/ -f practice3 -n a2 -Dflume.root.logger=INFO,console
#session2
flume-ng agent -c /opt/install/flume/conf/ -f practice3 -n a1 -Dflume.root.logger=INFO,console
- 3) The result is as follows:
2. HDFS sink
- Writes events to the Hadoop Distributed File System (HDFS)
Property | Default | Description |
---|---|---|
type | - | hdfs |
hdfs.path | - | target directory in HDFS |
hdfs.filePrefix | FlumeData | prefix of the files written by Flume |
hdfs.fileSuffix | - | suffix of the files written by Flume |
- 1. Create the job file:
vi hdfsFlume
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /data/customers.csv
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Run the agent
hdfs dfs -mkdir /flume
flume-ng agent -c /opt/install/flume/conf/ -f hdfsFlume -n a1 -Dflume.root.logger=INFO,console
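The files written by the sink can be listed with (path taken from the config above):
hdfs dfs -ls -R /flume/events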
- 3. The result is as follows
**Exercise 4: collect logs to HDFS**
Requirement
Use an Exec Source to collect the local "/var/log/secure" log file and write it to HDFS under "/var/log/secure" with an HDFS Sink
- 1. Create the job file:
vi practice4
#name the components
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
#source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /var/log/secure
#channel
a1.channels.c1.type = memory
#sink
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /var/log/secure/%y-%m-%d/%H%M/%S
a1.sinks.sk1.hdfs.useLocalTimeStamp = true
#wiring
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
- 2. Run the agent
flume-ng agent -c /opt/install/flume/conf/ -f practice4 -n a1 -Dflume.root.logger=INFO,console
- 3. The result is as follows
3. Hive sink
- Streams events containing delimited text or JSON data directly into a Hive table or partition
- Fields in the incoming event data are mapped to the corresponding columns of the Hive table
Property | Default | Description |
---|---|---|
type | - | hive |
hive.metastore | - | Hive metastore URI |
hive.database | - | Hive database name |
hive.table | - | Hive table name |
serializer | - | The serializer parses fields out of the event and maps them to columns of the Hive table. The choice of serializer depends on the data format. Supported serializers: DELIMITED and JSON |
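A minimal Hive Sink configuration might look like this (a sketch; the metastore URI, database, table, delimiter, and column names are placeholders, and Hive streaming requires a transactional, bucketed target table):
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://localhost:9083
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.fieldnames = id,name,msg
a1.sinks.k1.channel = c1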
4. kafka sink
a1.sources = s1
a1.sinks = k1
a1.channels = c2
#spooling directory source: read CSV files whose names match the pattern
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/kb07file/prodata
a1.sources.s1.deserializer = LINE
a1.sources.s1.deserializer.maxLineLength = 3000
a1.sources.s1.includePattern = train_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
#regex_filter interceptor: drop the CSV header line (starts with "user_id")
a1.sources.s1.interceptors = head_filter
a1.sources.s1.interceptors.head_filter.type = regex_filter
a1.sources.s1.interceptors.head_filter.regex = ^user_id*
a1.sources.s1.interceptors.head_filter.excludeEvents = true
#memory channel
a1.channels.c2.type = memory
a1.channels.c2.capacity = 100000
a1.channels.c2.transactionCapacity = 10000
#Kafka sink: write events to the "train2" topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = 192.168.11.201:9092
a1.sinks.k1.topic = train2
#wiring
a1.sources.s1.channels = c2
a1.sinks.k1.channel = c2
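To verify that events reach Kafka, the topic can be read with the console consumer that ships with Kafka (broker address taken from the config above; on older Kafka versions use --zookeeper instead of --bootstrap-server):
kafka-console-consumer.sh --bootstrap-server 192.168.11.201:9092 --topic train2 --from-beginning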
六. Multi-tier agents (topologies)
Simple chaining (multi-agent flow)
Multiplexing the flow: one source fanned out to several channels (see the sketch below)
Consolidation: merging several sources into one destination
Load balancing and failover
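A multiplexing flow is configured with a channel selector on the source. A minimal sketch (the header name and mapping values are placeholders; the routing example in section 八 uses the same mechanism):
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.gree = c1
a1.sources.r1.selector.mapping.lijia = c2
a1.sources.r1.selector.default = c1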
七. Flume Sink Groups
- A sink group is a logical grouping of sinks
- The behavior of a sink group is determined by its sink processor, which decides how events are routed among the sinks
- There are two kinds of processors: failover and load balancing
#failover
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
#load balancing
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
八. Interceptors
An interceptor can modify or drop events
- It sits between the source and the channel
Built-in interceptors
- HostInterceptor: inserts the host name or IP into the event header
- TimestampInterceptor: inserts a timestamp into the event header
- StaticInterceptor: inserts a fixed key-value pair into the event header
- UUIDInterceptor: inserts a UUID into the event header
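Built-in interceptors are attached to a source by their alias. For example, a timestamp interceptor plus a static interceptor could be configured as follows (a sketch; the key and value are placeholders):
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = datacenter
a1.sources.r1.interceptors.i2.value = dc01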
Custom interceptors
- 1) Write the interceptor code and upload the jar to the $FLUME_HOME/lib directory
package njzb.kb07;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
/**
* @author sun_0128
* @date 2020/08/17
* @Software:IntelliJ IDEA
* @description
*/
//Interceptor example: tag each event with a "type" header based on its body
public class InterceptorDemo implements Interceptor {
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<Event>();
    }

    //Add a "type" header depending on whether the body starts with "gree"
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        if (body.startsWith("gree")) {
            headers.put("type", "gree");
        } else {
            headers.put("type", "lijia");
        }
        return event;
    }

    //Apply the single-event intercept to every event in the batch
    @Override
    public List<Event> intercept(List<Event> list) {
        addHeaderEvents.clear();
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    //Builder used by Flume to instantiate the interceptor (referenced in the config as njzb.kb07.InterceptorDemo$Builder)
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
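Assuming a standard Maven project, packaging and deploying the jar might look like this (the artifact name below is hypothetical):
mvn clean package
cp target/kb07-interceptor-1.0.jar $FLUME_HOME/lib/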
- 2) Write the interceptor job file:
vi lanjieqi.conf
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
#netcat source using the custom interceptor
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 55555
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = njzb.kb07.InterceptorDemo$Builder
#multiplexing selector: route by the "type" header set by the interceptor
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.gree = c1
a1.sources.r1.selector.mapping.lijia = c2
#two memory channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#HDFS sink for events routed to c1 ("gree")
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = gree
a1.sinks.k1.hdfs.fileSuffix = .csv
a1.sinks.k1.hdfs.path = /flume/events/gree/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#a1.sinks.k1.hdfs.batchSize = 640
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollInterval = 3
#HDFS sink for events routed to c2 ("lijia")
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.filePrefix = lijia
a1.sinks.k2.hdfs.fileSuffix = .csv
a1.sinks.k2.hdfs.path = /flume/events/lijia/%Y-%m-%d
a1.sinks.k2.hdfs.useLocalTimeStamp = true
#a1.sinks.k2.hdfs.batchSize = 640
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 100
a1.sinks.k2.hdfs.rollInterval = 3
#wiring
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
- 3) Run the agent:
flume-ng agent -c ../../conf/ -f lanjieqi.conf -n a1 -Dflume.root.logger=INFO,console
- In another session, run
telnet localhost 55555
Type some content; lines starting with "gree" are written to the gree HDFS path, everything else to the lijia path.
- 4) The result is as follows