Flume Configuration File Summary

不言尘世

Category: Big Data

Original link: https://blog.csdn.net/student__software/article/details/81407168


Component naming, option 1: use this for a single source and a single sink

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

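Component naming, option 2: use this for a single source with multiple sinks. This layout is the basis for modes such as replication and load balancing.

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1 k2
    a1.channels = c1 c2
    # Replicate the data flow to every channel (load balancing is a separate mode)
    a1.sources.r1.selector.type = replicating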
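The heading above mentions load balancing. In Flume that mode is not set on the channel selector but on a sink group. A minimal sketch, assuming both sinks drain the same channel c1; the group name g1 is a placeholder chosen for this example:

```properties
# Hypothetical sink group g1 (the name is an assumption for this sketch)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# Spread events across k1 and k2 instead of copying them to both
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
# Temporarily skip a sink that has failed
a1.sinkgroups.g1.processor.backoff = true

# In this mode both sinks read from one shared channel
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
```

With the replicating selector, every channel receives a copy of each event; with a load_balance sink group, each event is delivered by only one of the sinks.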

I. Source configuration

1. netcat source: bind sets the IP address or hostname to listen on, and port sets the port

    # Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

2. Reading a file with the exec source: put the command in command; if you use tail, remember to use the uppercase -F

    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
    a1.sources.r1.shell = /bin/bash -c

3. Reading a directory with the spooldir source: be sure to ignore .tmp files

    # Describe/configure the source
    a1.sources.r1.type = spooldir
    # Directory to watch
    a1.sources.r1.spoolDir = /opt/module/flume/upload
    # Suffix appended to a file once it has been uploaded
    a1.sources.r1.fileSuffix = .COMPLETED
    a1.sources.r1.fileHeader = true
    # Ignore every file ending in .tmp; do not upload it
    a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

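4. The most commonly used source, avro: bind is the host that receives the data; port is not arbitrary, it must be the port that the upstream avro sink sends to

    # Describe/configure the source
    a1.sources.r1.type = avro
    a1.sources.r1.bind = hadoop102
    a1.sources.r1.port = 4141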

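II. Sink configuration

There are many sinks to choose from. The commonly used ones are hdfs, kafka, hbase, and logger. avro is also an option, mainly for handing events to the next Flume agent.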

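1. hdfs sink:

    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
    # Prefix for uploaded files
    a1.sinks.k1.hdfs.filePrefix = logs-
    # Whether to roll directories by time (round the timestamp down)
    a1.sinks.k1.hdfs.round = true
    # How many time units pass before a new directory is created
    a1.sinks.k1.hdfs.roundValue = 1
    # The time unit used for rounding
    a1.sinks.k1.hdfs.roundUnit = hour
    # Whether to use the local timestamp
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a1.sinks.k1.hdfs.batchSize = 1000
    # File type; compression is supported (DataStream writes uncompressed data)
    a1.sinks.k1.hdfs.fileType = DataStream
    # How often to roll to a new file, in seconds
    a1.sinks.k1.hdfs.rollInterval = 600
    # Roll the file at this size in bytes (just under 128 MB)
    a1.sinks.k1.hdfs.rollSize = 134217700
    # 0 means file rolling does not depend on the number of events
    a1.sinks.k1.hdfs.rollCount = 0
    # Minimum number of block replicas
    a1.sinks.k1.hdfs.minBlockReplicas = 1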

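2. avro sink: hostname is the host or IP to send to; port is the port the receiver listens on. Any free port works, as long as the receiving avro source uses the same one.

    a1.sinks.k2.type = avro
    a1.sinks.k2.hostname = hadoop102
    a1.sinks.k2.port = 4142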
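To make the port relationship concrete, here is a sketch of the two sides of an avro hop. The downstream agent name a2 is hypothetical and not part of the original configuration; in practice the two agents would normally live in separate configuration files.

```properties
# Upstream agent a1 (sender): pushes events to hadoop102:4142
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Downstream agent a2 (receiver, assumed to run on hadoop102);
# a2 is a placeholder name for this sketch
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
# Must equal the port the upstream avro sink sends to
a2.sources.r1.port = 4142
```

The sink's hostname and port and the source's bind and port describe the same endpoint, which is why the source's port cannot be chosen independently.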

3. logger sink: to see the output printed in the console, add -Dflume.root.logger=INFO,console to the start command

    # Describe the sink
    a1.sinks.k1.type = logger

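4. hbase: to be continued

III. Channel configuration

Channels are mainly of two kinds: memory channel and file channel. If you need speed and can tolerate losing some data, use the memory channel; it is also the usual choice.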

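1. memory channel: 1000 is the capacity of the queue, and 100 is the maximum number of events per transaction, for example the most a sink can take at once

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100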
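When this channel feeds a sink that takes events in batches, the channel's transactionCapacity generally needs to be at least as large as the sink's batch size, otherwise the sink cannot take a full batch in one transaction. A sketch, assuming the hdfs sink from section II with hdfs.batchSize = 1000; the numbers are illustrative, not tuned values:

```properties
a1.channels.c1.type = memory
# Total events the channel can hold; kept well above a single batch
a1.channels.c1.capacity = 10000
# Events per transaction; matched here to a1.sinks.k1.hdfs.batchSize = 1000
a1.channels.c1.transactionCapacity = 1000
```

capacity must stay greater than or equal to transactionCapacity, and a larger capacity means more events are lost if the agent dies while the channel is full.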

2. file channel: see the documentation
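For orientation, a minimal file channel sketch. The two directory paths are placeholders assumed for this example, not values from the original post:

```properties
# File channel: events are persisted to disk, so they survive an agent restart
a1.channels.c1.type = file
# Placeholder directories (assumptions for this sketch)
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume/data
```

The trade-off against the memory channel is the one described above: the file channel is slower but does not drop buffered events when the agent stops.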

IV. Binding the channel to the source and sink

1. Single source and single sink

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
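Putting the pieces together, a complete single-source, single-sink agent can be assembled from the fragments in this post. This sketch combines the netcat source, the memory channel, and the logger sink:

```properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen for lines of text on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: write each event to the log
a1.sinks.k1.type = logger

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# (a source takes a list: "channels"; a sink takes exactly one: "channel")
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Saved under a file name of your choosing and started with -Dflume.root.logger=INFO,console, the agent should print each line sent to port 44444 in the console.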

2. Single source and multiple sinks

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2
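As a sketch of the full fan-out layout, the following combines fragments from this post: an exec source replicated into two memory channels, one drained by an hdfs sink and the other by an avro sink. The settings for c2 are assumed to mirror c1, and only a few of the hdfs options are repeated here.

```properties
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Copy every event into both channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# k1 writes to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

# k2 forwards to the next Flume agent
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# c2 is assumed to use the same sizing as c1
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```

Each sink gets its own channel so that a slow sink only backs up its own copy of the stream.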

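The start command is: bin/flume-ng agent --conf conf/ --name a1 --conf-file job/flume-dir-hdfs.conf

Or, with short options: bin/flume-ng agent -n a1 -c conf/ -f job/flume-dir-hdfs.conf. The name a1 is not arbitrary: it must match the agent name used as the prefix in the configuration file.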
