Flume Log Collection System

yala说

Article link: https://blog.csdn.net/d15514350208/article/details/100019756

For the detailed configuration of each component, see

https://www.cnblogs.com/moonandstar08/p/6284243.html

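1. How it works

The agent is the core role in Flume. Each agent acts as a data courier. An agent can be used on its own, or several agents can be chained together, including in one-to-many arrangements.

1.1 Components

Source: the collection component; it takes data in
Sink: the delivery component; it passes data on to the next agent or to the final storage system
Channel: the transport component; it carries data from the source to the sink
Event: the unit of data flow; agents pass data along as events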

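1.2 Basic configuration

Create your own conf file to hold the configuration rules

Configuration steps
- 1. Name the three components of the agent
  - a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
- 2. Configure the source (here a netcat source listening on port 44444)
  - a1.sources.r1.type = netcat
    a1.sources.r1.bind = 192.168.52.120
    a1.sources.r1.port = 44444
    - Common source types
      - NetCat Source: listens on a bound port (TCP or UDP); test it with telnet ip 44444
      - Avro Source: listens on an Avro service port
      - Exec Source: runs a Unix command and collects the data it writes to standard output
      - spooldir: watches a directory; file names inside it must not be reused, and a file that has been fully collected is renamed with the suffix .COMPLETED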
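- 3. Configure the sink (here a logger sink, which prints events to the log)
  - a1.sinks.k1.type = logger
    - Many other sink types are available, for example
      - hdfs: writes events to HDFS
      - avro: sends events to a given port on a target host
        a1.sinks.k1.channel = c1
        a1.sinks.k1.type = avro
        a1.sinks.k1.hostname = node02
        a1.sinks.k1.port = 52020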
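- 4. Configure the channel
  - a1.channels.c1.type = memory
    # maximum number of events the channel can hold
    a1.channels.c1.capacity = 1000
    # maximum number of events taken from the source or handed to the sink in one transaction
    a1.channels.c1.transactionCapacity = 100
    - Channel storage types
      - File channel: stored on disk; durable but slower
        type: must be file
        checkpointDir: directory for checkpoint data (create it in advance)
        dataDirs: directory for the event data (create it in advance)
        transactionCapacity: maximum number of events per transaction
      - Memory channel: stored in memory; fast, but events are lost when the machine or process goes down
        type: must be memory
        capacity: maximum number of events held in memory
        transactionCapacity: maximum number of events per transaction
      - Spillable Memory Channel: memory and file combined
        Events are held in memory and spill over to files on disk once the in-memory limit is exceeded
        type: SPILLABLEMEMORY
        memoryCapacity: maximum number of events held in memory
        overflowCapacity: maximum number of events held in the overflow files on disk
        checkpointDir: directory for checkpoint data
        dataDirs: directory for the event data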
- 5. Wire the source, channel, and sink together
  - a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
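
Putting the five steps together, a complete single-agent file might look like the sketch below. It reuses the names a1, r1, k1, c1 and the address from the steps above; the file name netcat-logger.conf is only an assumption for illustration.

```properties
# conf/netcat-logger.conf (hypothetical file name)

# 1. Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# 2. Source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.52.120
a1.sources.r1.port = 44444

# 3. Sink: print each event to the Flume log
a1.sinks.k1.type = logger

# 4. Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 5. Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Two details are easy to miss: a source takes channels (plural) while a sink takes channel (singular), and comments belong on their own lines, because the properties format treats any text after a value as part of that value.
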
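Starting Flume
- bin/flume-ng agent -c conf -f conf/xxx.conf -n a1 -Dflume.root.logger=INFO,console
  - -c conf: the directory that holds Flume's own configuration files
  - -f conf/xxx.conf: the collection plan we wrote
  - -n a1: the name of this agent (it must match the name used in the configuration file)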

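Note: when an exception occurs, Flume may stop and has to be restarted. A script can poll the target periodically, check whether its files are still growing to decide whether Flume has stopped working, and restart Flume if it has.

Failover can provide high availability for the tiers that do not collect data. The principle: agents are chained, and while the collecting agent itself has no stand-in, several downstream agents can be configured, with some of them acting as standbys.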
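
As a sketch of the chaining described above, a collecting agent can forward events to a downstream agent through an avro sink, with a file channel so that buffered events survive a restart. The spooling directory source and all three directory paths are assumptions chosen for illustration; node02 and port 52020 come from the avro sink example earlier.

```properties
# Collecting agent: spooling directory -> file channel -> avro sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Watch a directory for completed log files (hypothetical path)
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/flume/spool

# Durable channel on disk (hypothetical paths)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /export/servers/flume/checkpoint
a1.channels.c1.dataDirs = /export/servers/flume/data

# Forward events to the avro source of the downstream agent
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 52020

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The downstream agent would expose an avro source bound to the same host and port.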

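1.3 Flume interceptors

Static interceptor: inserts a user-defined key-value pair into the header of every collected event

Use case: collect several log files and store them in different directories on HDFS

1. Define the sources; three are declared so that several locations are collected
- a1.sources = r1 r2 r3
  a1.sinks = k1
  a1.channels = c1
Define r1, r2, and r3 and add an interceptor to each. r1 is shown below; the others follow the same pattern
- a1.sources.r1.type = exec
  a1.sources.r1.command = tail -F /export/servers/taillogs/access.log
  # interceptor name; use a different name for each source
  a1.sources.r1.interceptors = i1
  # interceptor type
  a1.sources.r1.interceptors.i1.type = static
  # the header key is type and its value is access; the value can later be read with %{type}
  a1.sources.r1.interceptors.i1.key = type
  a1.sources.r1.interceptors.i1.value = access
Define the HDFS path of the sink and use %{type} to make the path dynamic. The %Y%m%d escapes need a timestamp header on each event, for example from a timestamp interceptor or from the sink's hdfs.useLocalTimeStamp setting
- a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = hdfs://192.168.52.100:8020/source/logs/%{type}/%Y%m%d
  a1.sinks.k1.hdfs.filePrefix = events
  a1.sinks.k1.hdfs.fileType = DataStream
  a1.sinks.k1.hdfs.writeFormat = Text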
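
A fuller sketch of this use case, with all three sources feeding one channel and one HDFS sink, could look as follows. Only r1 comes from the text above: the log paths for r2 and r3, the header values order and login, and the memory channel are hypothetical. hdfs.useLocalTimeStamp is added as an assumption so that %Y%m%d can be resolved without a timestamp header.

```properties
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# r1: taken from the example above
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/servers/taillogs/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

# r2: hypothetical second log file
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /export/servers/taillogs/order.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = order

# r3: hypothetical third log file
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /export/servers/taillogs/login.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = login

# Channel (assumed; any channel type would do)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: %{type} is filled from the header set by the static interceptors
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.52.100:8020/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Wiring
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
```

With this layout, events from access.log would land under /source/logs/access/ and the other two under their own directories, each split by day.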

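1.4 Custom interceptors

Steps
- Implement the Interceptor interface, for example AppInterceptor implements Interceptor
- Implement the Interceptor.Builder interface to construct the custom interceptor
- Package the custom interceptor as a jar and put it in Flume's lib directory
- Reference the custom interceptor in the conf file
  - a1.sources.r1.interceptors.i1.type = cn.itcast.xxx.xxx.AppInterceptor$AppInterceptorBuilder
  - a1.sources.r1.interceptors.i1.appId = 1
    # appId is a property defined by the custom interceptor, set to 1 here; leave the line out if the interceptor has no properties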

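High availability

A shell script checks the target files on a schedule and restarts Flume if they have stopped growing
failover
- Example: one agent collects, and two agents write to HDFS
  - The first agent is configured with two sinks
  - #agent1 name
    agent1.channels = c1
    agent1.sources = r1
    agent1.sinks = k1 k2
  - Set up failover and the sink priorities
    - agent1.sinkgroups = g1
      agent1.sinkgroups.g1.sinks = k1 k2
      agent1.sinkgroups.g1.processor.type = failover
      agent1.sinkgroups.g1.processor.priority.k1 = 10
      agent1.sinkgroups.g1.processor.priority.k2 = 1
      # maximum blacklist (backoff) time for a failed sink, in milliseconds; while k1 is penalized, for at most 10 s here, events go to k2
      agent1.sinkgroups.g1.processor.maxpenalty = 10000
  - Source configuration on the other two agents
  - a1.sources.r1.type = avro
    a1.sources.r1.bind = node02
    a1.sources.r1.port = 52020
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = static
    a1.sources.r1.interceptors.i1.key = Collector
    # differs per machine: set it to the name of the host this agent runs on
    a1.sources.r1.interceptors.i1.value = node02
    a1.sources.r1.channels = c1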
Configure the HDFS sink on the two downstream agents. Note: set a sensible file rolling policy to avoid producing too many small files
- a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.filePrefix = access_log
  a1.sinks.k1.hdfs.path = hdfs://node01:8020/flume/failover/
  a1.sinks.k1.hdfs.maxOpenFiles = 5000
  # number of events written per batch before flushing to HDFS; the default is 100
  a1.sinks.k1.hdfs.batchSize = 100
  a1.sinks.k1.hdfs.fileType = DataStream
  a1.sinks.k1.hdfs.writeFormat = Text
  # roll to a new file when the current one reaches 102400 bytes (100 KB)
  a1.sinks.k1.hdfs.rollSize = 102400
  # roll to a new file when the number of events reaches 1000000
  a1.sinks.k1.hdfs.rollCount = 1000000
  # roll to a new file every 60 seconds; the default is 30, and 0 disables time-based rolling
  a1.sinks.k1.hdfs.rollInterval = 60
  # round the event timestamp down to a multiple of 10 minutes; this affects time escape sequences in hdfs.path
  a1.sinks.k1.hdfs.round = true
  a1.sinks.k1.hdfs.roundValue = 10
  a1.sinks.k1.hdfs.roundUnit = minute
  a1.sinks.k1.hdfs.useLocalTimeStamp = true
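
The failover pieces for the collecting agent can be assembled into one file, roughly as sketched below. The exec source, the memory channel, the second host node03, and the reuse of port 52020 on both hosts are assumptions for illustration; the sink group settings follow the failover example above.

```properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1 k2

# Source (assumed): tail a log file
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /export/servers/taillogs/access.log
agent1.sources.r1.channels = c1

# Channel (assumed)
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# Two avro sinks, one per downstream agent; both read from the same channel
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
agent1.sinks.k1.channel = c1

agent1.sinks.k2.type = avro
# node03 is a hypothetical second downstream host
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
agent1.sinks.k2.channel = c1

# Sink group: k1 is preferred (higher priority), k2 is the standby
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
```

The sink group only takes effect when it is declared with agent1.sinkgroups and lists its member sinks; without those two lines the processor settings are ignored.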

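Load balancing

[Figure: architecture diagram]
Main configuration
- processor.type = load_balance; the rest is similar to failover
  - a1.sinkgroups.g1.processor.type = load_balance
    # put a failed sink on a blacklist and retry it after a timeout; if it fails again the wait grows exponentially until it reaches the maximum
    a1.sinkgroups.g1.processor.backoff = true
    # send to the sinks in round-robin order
    a1.sinkgroups.g1.processor.selector = round_robin
    # maximum blacklist time, in milliseconds
    a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000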
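
For comparison with failover, a load-balancing sink group could be declared as in the sketch below. The sinks k1 and k2 are assumed to be defined elsewhere in the same file (for instance as two avro sinks pointing at different hosts); only the processor settings come from the text above.

```properties
# k1 and k2 are assumed to be fully configured elsewhere in this file
a1.sinks = k1 k2

# Declare the group and its members
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Spread events across the members instead of preferring one
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
```

Setting the selector to random instead of round_robin picks the sinks in random order.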
