Flume Key Points Summary

「miraitowa」

Category: Flume. Tags: big data, flume

Article link: https://blog.csdn.net/weixin_45557389/article/details/107725472


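1. Flume Overview

1.1 What Flume Is

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. It was originally developed by Cloudera and is now an Apache project.

Flume is built on a streaming architecture, which keeps it flexible and simple.

It is a collection tool: it gathers data from scattered sources (databases, logs) into one place (such as HDFS).

Flume's main use is reading data from a server's local disk in real time and writing it to HDFS.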

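1.2 Flume Basic Architecture

[Figure: Flume basic architecture]

Agent

An Agent is a JVM process that moves data from source to destination in the form of events.

An Agent has three main components: Source, Channel, and Sink.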

Source

A Source is the component that receives data into the Flume Agent.

Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
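
Sink

A Sink continuously polls the Channel for events, removes them in batches, and either writes those batches to a storage or indexing system or forwards them to another Flume Agent.

Sink destinations include hdfs, logger, avro, thrift, irc, file, HBase, solr, and custom sinks.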
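
Channel

A Channel is a buffer between the Source and the Sink, so the two can run at different rates.

Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Flume ships with several Channel types, including Memory Channel, File Channel, and Kafka Channel:
- Memory Channel is an in-memory queue. It suits cases where losing data is acceptable. If data loss matters, do not use it: a crashed process, a machine failure, or a restart loses whatever was in the queue.
- File Channel writes every event to disk, so data survives a process shutdown or a machine failure.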
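
As a rough illustration of the durable option, the fragment below switches a channel to File Channel. It is only a sketch: the agent and channel names (a1, c1) and both directory paths are placeholders chosen for this example, and the directories need to be writable by the user that runs Flume.

```properties
# Sketch only: a1/c1 and the paths below are assumed placeholder names
a1.channels = c1
# Persist events on disk instead of keeping them in memory
a1.channels.c1.type = file
# Directory where the channel keeps its checkpoint (assumed path)
a1.channels.c1.checkpointDir = /data/flume/checkpoint
# Comma-separated list of directories for the event log files (assumed path)
a1.channels.c1.dataDirs = /data/flume/data
```

The trade-off is throughput: every event goes through disk, so File Channel is generally slower than Memory Channel.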
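
Event

The Event is Flume's basic unit of data transfer; data travels from source to destination as Events. An Event consists of a Header and a Body. The Header holds the event's attributes as key-value pairs, and the Body holds the data itself as a byte array.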
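
Interceptors

Flume lets interceptors inspect and process events in flight. An interceptor must implement the org.apache.flume.interceptor.Interceptor interface, and it can modify or even drop events as the developer specifies. Flume also supports interceptor chains: several interceptors are combined, and each event passes through them in the configured order.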
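
A custom interceptor is attached to a source by name, the same way as the built-in ones. In the sketch below, com.example.flume.MaskInterceptor is a hypothetical class assumed to implement the Interceptor interface. Flume creates custom interceptors through a nested Builder class, so the type points at that Builder. The agent, source, and interceptor names are placeholders as well.

```properties
# Sketch only: a1, r1, i1, i2 and the com.example class are assumed names
# Interceptors run in the order listed here: i1 first, then i2
a1.sources.r1.interceptors = i1 i2
# Built-in interceptor, referenced by its alias
a1.sources.r1.interceptors.i1.type = timestamp
# Custom interceptor, referenced by the fully qualified name of its Builder
a1.sources.r1.interceptors.i2.type = com.example.flume.MaskInterceptor$Builder
```

The jar that contains the custom class has to be on the agent's classpath, for example in Flume's lib directory.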
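
Channel Selectors

Channel Selectors apply when a source sends events to more than one channel. The two common types are replicating (the default) and multiplexing. Replicating copies every event to all of the channels, while multiplexing compares an event header against the configured values and sends the event to the channel mapped to the matching value.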
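
A multiplexing selector might be configured roughly as follows. This is a sketch under assumptions: the header name logtype, its values access and error, and the component names are all invented for the example.

```properties
# Sketch only: a1, r1, c1..c3, the header name "logtype" and its values are assumed
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
# Route by the value of one event header instead of copying to every channel
a1.sources.r1.selector.type = multiplexing
# Header whose value decides the route (assumed name)
a1.sources.r1.selector.header = logtype
# Events with logtype=access go to c1, events with logtype=error go to c2
a1.sources.r1.selector.mapping.access = c1
a1.sources.r1.selector.mapping.error = c2
# Events that match no mapping go to c3
a1.sources.r1.selector.default = c3
```

The routing header has to be set before the selector runs, for instance by an interceptor on the same source.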

Sink Processors

Several sinks can be grouped into one unit (a sink group). A Sink Processor can load-balance across all sinks in the group, or fail over from one sink to another when a sink fails temporarily.
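
For comparison with the failover setup used later in this article, a load-balancing sink group could look like the sketch below. The names a1, g1, k1, and k2 are placeholders, and the sinks themselves are assumed to be defined elsewhere in the same file.

```properties
# Sketch only: a1, g1, k1 and k2 are assumed names; k1 and k2 must be defined as sinks
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# Spread events across the sinks in the group
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily skip a sink that has failed instead of retrying it at once
a1.sinkgroups.g1.processor.backoff = true
# Selection strategy: round_robin or random
a1.sinkgroups.g1.processor.selector = round_robin
```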

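2. Installing and Deploying Flume

             node7-1  node7-2  node7-3  node7-4
web_server   √        √
flume_1                        √
flume_2                                 √

Unpack the archive and rename the directory.
Edit the configuration file:

Rename flume-env.sh.template under conf to flume-env.sh.

Add the JAVA_HOME, HADOOP_HOME, and HIVE_HOME paths.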

3. Getting Started Example

Install the netcat tool

yum -y install netcat

Check whether port 44444 is already in use

netstat -tunlp | grep 44444

Create the Flume Agent configuration file flume-hw.properties

 # Example Flume configuration
 # Name the components on this agent
 # A .properties file is a Java configuration file: the key is left of "=", the value is right of it.
 # Every key starts with a1, the agent name, which you can choose freely.
 # r1: the name of a1's source
 a1.sources = r1
 # k1: the name of a1's sink
 a1.sinks = k1
 # c1: the name of a1's channel
 a1.channels = c1

 # Describe/configure the source
 # Key layout: a1 (agent name).sources.r1 (source name); several sources can be configured
 # type must be one of the values listed in the Flume documentation
 # a1's source type is netcat (listens on a port)
 a1.sources.r1.type = netcat
 # Host that a1 listens on
 a1.sources.r1.bind = localhost
 # Port that a1 listens on
 a1.sources.r1.port = 44444

 # Describe the sink
 # a1's sink type is logger, which writes events to the log (the console here)
 a1.sinks.k1.type = logger

 # Use a channel which buffers events in memory
 # a1's channel type is memory
 a1.channels.c1.type = memory
 # The channel holds at most 1000 events
 a1.channels.c1.capacity = 1000
 # At most 100 events are put or taken in a single transaction
 a1.channels.c1.transactionCapacity = 100

 # Bind the source and sink to the channel
 # Connect r1 to c1
 a1.sources.r1.channels = c1
 # Connect k1 to c1
 a1.sinks.k1.channel = c1

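Start command (server side)

The a1 after --name must match the agent name in the configuration file.

nohup bin/flume-ng agent --conf conf --conf-file conf/flume-hw.properties --name a1 -Dflume.root.logger=INFO,console &

Start the client

telnet localhost 44444

Telnet sends data to a given server and port, which also shows whether that IP and port are reachable. CentOS does not install it by default:

yum -y install telnet

ping only checks whether the IP is reachable.

To let any host connect to Flume (so that node7-1, node7-2, node7-3, and node7-4 can all connect),

change the following in conf/flume-hw.properties: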

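# Describe/configure the source
# Key layout: a1 (agent name).sources.r1 (source name); several sources can be configured
# type must be one of the values listed in the Flume documentation
# a1's source type is netcat (listens on a port)
a1.sources.r1.type = netcat
# Listen on all network interfaces
a1.sources.r1.bind = 0.0.0.0
# Port that a1 listens on
a1.sources.r1.port = 44444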

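4. Source

Avro: a data serialization system. In Java, an object created with new must implement the Serializable interface before it can be stored outside memory. Avro is required for Flume's high-availability setup, where agents pass events to one another.

spooldir: watches a directory on disk and ingests new files that appear in it. Files should be complete when they are placed there and must not be modified afterwards.

Create the Flume Agent configuration file flume-source.properties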

 # Example Flume configuration
 # Name the components on this agent
 # A .properties file is a Java configuration file: the key is left of "=", the value is right of it.
 # Every key starts with a1, the agent name, which you can choose freely.
 # r1: the name of a1's source
 a1.sources = r1
 # k1: the name of a1's sink
 a1.sinks = k1
 # c1: the name of a1's channel
 a1.channels = c1

 # Describe/configure the source
 # Source type: spooling directory
 a1.sources.r1.type = spooldir
 # Directory to watch; it must already exist
 a1.sources.r1.spoolDir = /root/flume/
 # Suffix appended to a file once it has been fully ingested
 a1.sources.r1.fileSuffix = .ok
 # Delete completed files immediately; the default is never (keep them)
 # a1.sources.r1.deletePolicy = immediate
 # Add a header that stores the file's absolute path
 a1.sources.r1.fileHeader = true
 # Add a header that stores the file's base name
 a1.sources.r1.basenameHeader = true
 # Only process .txt files in this directory
 a1.sources.r1.includePattern = ^[\\w]+\\.txt$

 # Describe the sink
 # a1's sink type is logger, which writes events to the log (the console here)
 a1.sinks.k1.type = logger

 # Use a channel which buffers events in memory
 # a1's channel type is memory
 a1.channels.c1.type = memory
 # The channel holds at most 1000000 events
 a1.channels.c1.capacity = 1000000
 # At most 1000000 events are put or taken in a single transaction
 a1.channels.c1.transactionCapacity = 1000000

 # Bind the source and sink to the channel
 # Connect r1 to c1
 a1.sources.r1.channels = c1
 # Connect k1 to c1
 a1.sinks.k1.channel = c1

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-source.properties --name a1 -Dflume.root.logger=INFO,console

View the log (this applies when the agent was started in the background with nohup)

tail -f nohup.out

5. Sink

Create the Flume Agent configuration file flume-sink-hdfs.properties

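 # Example Flume configuration
 # Name the components on this agent
 # A .properties file is a Java configuration file: the key is left of "=", the value is right of it.
 # Every key starts with a1, the agent name, which you can choose freely.
 # r1: the name of a1's source
 a1.sources = r1
 # k1: the name of a1's sink
 a1.sinks = k1
 # c1: the name of a1's channel
 a1.channels = c1

 # Describe/configure the source
 # Source type: spooling directory
 a1.sources.r1.type = spooldir
 # Directory to watch; it must already exist
 a1.sources.r1.spoolDir = /root/flume/
 # Suffix appended to a file once it has been fully ingested
 a1.sources.r1.fileSuffix = .ok
 # Delete completed files immediately; the default is never (keep them)
 # a1.sources.r1.deletePolicy = immediate
 # Add a header that stores the file's absolute path
 a1.sources.r1.fileHeader = true
 # Add a header that stores the file's base name
 a1.sources.r1.basenameHeader = true
 # Only process .txt files in this directory
 a1.sources.r1.includePattern = ^[\\w]+\\.txt$

 # Describe the sink
 # Sink type: hdfs
 a1.sinks.k1.type = hdfs
 # HDFS target path; it must point at the active NameNode
 a1.sinks.k1.hdfs.path = hdfs://node7-1:8020/flume/%Y-%m-%d/
 # Prefix for files written to HDFS
 a1.sinks.k1.hdfs.filePrefix = event
 # Suffix for files written to HDFS
 a1.sinks.k1.hdfs.fileSuffix = .txt
 # hdfs.inUsePrefix / hdfs.inUseSuffix: prefix and suffix of in-progress temporary files
 # hdfs.codeC: compression codec
 # File type; DataStream writes uncompressed output (CompressedStream enables compression)
 a1.sinks.k1.hdfs.fileType = DataStream
 # Write records as text
 a1.sinks.k1.hdfs.writeFormat = Text
 # Use the local time, rather than a timestamp header, to resolve %Y-%m-%d in the path
 a1.sinks.k1.hdfs.useLocalTimeStamp = true
 # Timestamp rounding: the time used for the path escapes is rounded down, which tolerates small clock differences between servers
 # Enable rounding down of the timestamp
 a1.sinks.k1.hdfs.round = true
 # Round down to a multiple of this value (20 seconds here)
 a1.sinks.k1.hdfs.roundValue = 20
 # Unit of the rounding value
 a1.sinks.k1.hdfs.roundUnit = second
 # Rolling means closing the current HDFS file and starting a new one
 # Three triggers follow; tune them so that the output is not split into many small files
 # Roll after this many seconds; 0 disables the trigger
 a1.sinks.k1.hdfs.rollInterval = 30
 # Roll after this many bytes (1024 = 1 KB); 0 disables the trigger
 a1.sinks.k1.hdfs.rollSize = 1024
 # Roll after this many events; 0 disables the trigger
 a1.sinks.k1.hdfs.rollCount = 0

 # Use a channel which buffers events in memory
 # a1's channel type is memory
 a1.channels.c1.type = memory
 # The channel holds at most 1000000 events
 a1.channels.c1.capacity = 1000000
 # At most 1000000 events are put or taken in a single transaction
 a1.channels.c1.transactionCapacity = 1000000

 # Bind the source and sink to the channel
 # Connect r1 to c1
 a1.sources.r1.channels = c1
 # Connect k1 to c1
 a1.sinks.k1.channel = c1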

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-sink-hdfs.properties --name a1 -Dflume.root.logger=INFO,console
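
The hdfs.codeC option mentioned in the configuration comments can be sketched as follows. This is a variation on the sink above, not part of the original setup: it assumes gzip output is wanted and that the codec is available on the cluster.

```properties
# Sketch only: a variation on sink k1 of agent a1 that writes compressed files
a1.sinks.k1.type = hdfs
# CompressedStream makes the sink compress its output; it requires hdfs.codeC
a1.sinks.k1.hdfs.fileType = CompressedStream
# Compression codec (gzip assumed here)
a1.sinks.k1.hdfs.codeC = gzip
# A matching suffix so the files are recognizable (assumed choice)
a1.sinks.k1.hdfs.fileSuffix = .gz
```

Only these keys differ from the uncompressed configuration; the path, rolling, and channel settings stay the same.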

Create the Flume Agent configuration file flume-sink-hive.properties

 # Example Flume configuration
 # Name the components on this agent
 # A .properties file is a Java configuration file: the key is left of "=", the value is right of it.
 # Every key starts with a1, the agent name, which you can choose freely.
 # r1: the name of a1's source
 a1.sources = r1
 # k1: the name of a1's sink
 a1.sinks = k1
 # c1: the name of a1's channel
 a1.channels = c1

 # Describe/configure the source
 # Source type: spooling directory
 a1.sources.r1.type = spooldir
 # Directory to watch; it must already exist
 a1.sources.r1.spoolDir = /root/flume/
 # Suffix appended to a file once it has been fully ingested
 a1.sources.r1.fileSuffix = .ok
 # Delete completed files immediately; the default is never (keep them)
 # a1.sources.r1.deletePolicy = immediate
 # Add a header that stores the file's absolute path
 a1.sources.r1.fileHeader = true
 # Add a header that stores the file's base name
 a1.sources.r1.basenameHeader = true
 # Only process .txt files in this directory
 a1.sources.r1.includePattern = ^[\\w]+\\.txt$

 # Describe the sink
 # The collected data is text logs. Create a table in Hive first; writing through this sink has the same effect as loading the log records into that table
 # Sink type: hive
 a1.sinks.k1.type = hive
 # Hive metastore URI
 a1.sinks.k1.hive.metastore = thrift://node7-4:9083
 # Hive database
 a1.sinks.k1.hive.database = mydata
 # Hive table name (the table must already exist in Hive)
 a1.sinks.k1.hive.table = flume_table
 # Partition values, comma-separated in the order of the table's partition columns; partitioning is optional
 a1.sinks.k1.hive.partition = %Y-%m-%d
 # Use the local time to resolve the escape sequences
 a1.sinks.k1.useLocalTimeStamp = true
 # The input is delimited text (use JSON for JSON input)
 a1.sinks.k1.serializer = DELIMITED
 # Delimiter between columns
 a1.sinks.k1.serializer.delimiter = ,
 # Table columns that the input fields map to, in input order
 a1.sinks.k1.serializer.fieldnames= id,name,createtime

 # Use a channel which buffers events in memory
 # a1's channel type is memory
 a1.channels.c1.type = memory
 # The channel holds at most 1000000 events
 a1.channels.c1.capacity = 1000000
 # At most 1000000 events are put or taken in a single transaction
 a1.channels.c1.transactionCapacity = 1000000

 # Bind the source and sink to the channel
 # Connect r1 to c1
 a1.sources.r1.channels = c1
 # Connect k1 to c1
 a1.sinks.k1.channel = c1

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-sink-hive.properties --name a1 -Dflume.root.logger=INFO,console

Create the Flume Agent configuration file flume-sink-hbase2.properties

 # Example Flume configuration
 # Name the components on this agent
 # A .properties file is a Java configuration file: the key is left of "=", the value is right of it.
 # Every key starts with a1, the agent name, which you can choose freely.
 # r1: the name of a1's source
 a1.sources = r1
 # k1: the name of a1's sink
 a1.sinks = k1
 # c1: the name of a1's channel
 a1.channels = c1

 # Describe/configure the source
 # Source type: spooling directory
 a1.sources.r1.type = spooldir
 # Directory to watch; it must already exist
 a1.sources.r1.spoolDir = /root/flume/
 # Suffix appended to a file once it has been fully ingested
 a1.sources.r1.fileSuffix = .ok
 # Delete completed files immediately; the default is never (keep them)
 # a1.sources.r1.deletePolicy = immediate
 # Add a header that stores the file's absolute path
 a1.sources.r1.fileHeader = true
 # Add a header that stores the file's base name
 a1.sources.r1.basenameHeader = true
 # Only process .txt files in this directory
 a1.sources.r1.includePattern = ^[\\w]+\\.txt$

 # Describe the sink
 # Sink type: hbase2
 a1.sinks.k1.type = hbase2
 # HBase table name, including the namespace
 a1.sinks.k1.table = mydata:flume_table
 # HBase column family
 a1.sinks.k1.columnFamily = cf
 # ZooKeeper quorum
 a1.sinks.k1.zookeeperQuorum = node7-1:2181,node7-2:2181,node7-3:2181
 # Serializer; the default is org.apache.flume.sink.hbase2.SimpleHBase2EventSerializer
 a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.RegexHBase2EventSerializer

 # Use a channel which buffers events in memory
 # a1's channel type is memory
 a1.channels.c1.type = memory
 # The channel holds at most 1000000 events
 a1.channels.c1.capacity = 1000000
 # At most 1000000 events are put or taken in a single transaction
 a1.channels.c1.transactionCapacity = 1000000

 # Bind the source and sink to the channel
 # Connect r1 to c1
 a1.sources.r1.channels = c1
 # Connect k1 to c1
 a1.sinks.k1.channel = c1

Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-sink-hbase2.properties --name a1 -Dflume.root.logger=INFO,console
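
RegexHBase2EventSerializer splits each event body with a regular expression and writes the capture groups to columns. The fragment below is a minimal sketch, assuming comma-separated lines with three fields. The regex and the column names id, name, and createtime are illustrative choices, not part of the original configuration.

```properties
# Sketch only: assumes each event body looks like "1,alice,2020-08-01"
a1.sinks.k1.serializer = org.apache.flume.sink.hbase2.RegexHBase2EventSerializer
# One capture group per column (no backslashes, so no properties-file escaping is needed)
a1.sinks.k1.serializer.regex = ^([^,]+),([^,]+),([^,]+)$
# Column qualifiers, matched to the capture groups in order (assumed names)
a1.sinks.k1.serializer.colNames = id,name,createtime
```

If both options are left out, the serializer's defaults apply; as far as the Flume documentation describes them, they store the whole body in a single column.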

6. Interceptors

Create the Flume Agent configuration file flume-interceptors.properties

 # Example Flume configuration
 # Name the components on this agent
 # A .properties file is a Java configuration file: the key is left of "=", the value is right of it.
 # Every key starts with a1, the agent name, which you can choose freely.
 # r1: the name of a1's source
 a1.sources = r1
 # k1: the name of a1's sink
 a1.sinks = k1
 # c1: the name of a1's channel
 a1.channels = c1

 # Describe/configure the source
 # Key layout: a1 (agent name).sources.r1 (source name); several sources can be configured
 # type must be one of the values listed in the Flume documentation
 # a1's source type is netcat (listens on a port)
 a1.sources.r1.type = netcat
 # Listen on all network interfaces
 a1.sources.r1.bind = 0.0.0.0
 # Port that a1 listens on
 a1.sources.r1.port = 44444

 # Interceptors, applied in the order listed
 a1.sources.r1.interceptors = i1 i2 sta1 seaRe
 # Timestamp interceptor
 a1.sources.r1.interceptors.i1.type = timestamp
 a1.sources.r1.interceptors.i1.preserveExisting = true
 # Host interceptor
 a1.sources.r1.interceptors.i2.type = host
 a1.sources.r1.interceptors.i2.useIP = false
 # Static interceptor; Chinese characters are supported
 a1.sources.r1.interceptors.sta1.type = static
 a1.sources.r1.interceptors.sta1.key = mykey
 a1.sources.r1.interceptors.sta1.value = myval汉字
 # Search-and-replace interceptor; Chinese characters are supported
 a1.sources.r1.interceptors.seaRe.type = search_replace
 a1.sources.r1.interceptors.seaRe.searchPattern = body
 a1.sources.r1.interceptors.seaRe.replaceString = mybody

 # Describe the sink
 # a1's sink type is logger, which writes events to the log (the console here)
 a1.sinks.k1.type = logger

 # Use a channel which buffers events in memory
 # a1's channel type is memory
 a1.channels.c1.type = memory
 # The channel holds at most 1000 events
 a1.channels.c1.capacity = 1000
 # At most 100 events are put or taken in a single transaction
 a1.channels.c1.transactionCapacity = 100

 # Bind the source and sink to the channel
 # Connect r1 to c1
 a1.sources.r1.channels = c1
 # Connect k1 to c1
 a1.sinks.k1.channel = c1

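Start Flume

bin/flume-ng agent --conf conf --conf-file conf/flume-interceptors.properties --name a1 -Dflume.root.logger=INFO,console

7. High Availability

7.1 High-Availability Architecture

[Figure: high-availability architecture]

The web servers collect data and report it to flume_1, which then writes it to HDFS.

If flume_1 goes down, all data is reported to flume_2 instead.

The path drawn as a dashed line in the figure is not allowed: HDFS holds a company's most confidential data, and exposing it directly to outside hosts would be unsafe.

7.2 Configuring High Availability

Create the Flume Agent configuration file flume-web-server.properties (on node7-1 and node7-2)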

 # Name the sources, sinks, and channels
 a1.sources = r1
 a1.sinks = k1 k2
 a1.channels = c1

 # Configure the source
 a1.sources.r1.type = spooldir
 # spoolDir: directory to watch
 a1.sources.r1.spoolDir = /root/test
 # Add a header that stores the file's absolute path
 a1.sources.r1.fileHeader = true
 # Only process .log files in this directory
 a1.sources.r1.includePattern = ^[\\w-_]+\\.log$

 # Sink 1
 a1.sinks.k1.type = avro
 # Host name
 a1.sinks.k1.hostname = node7-3
 # Port
 a1.sinks.k1.port = 52020
 # Sink 2
 a1.sinks.k2.type = avro
 # Host name
 a1.sinks.k2.hostname = node7-4
 # Port
 a1.sinks.k2.port = 52020

 # Configure the channel
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000000
 a1.channels.c1.transactionCapacity = 1000000

 # Sink group: several sinks can be combined into one group
 a1.sinkgroups = g1
 # Sinks that belong to the group
 a1.sinkgroups.g1.sinks = k1 k2
 # Failover processor (similar in purpose to HDFS's ZKFC, the ZooKeeper failover controller)
 # If node7-3 fails, node7-4 takes over automatically
 a1.sinkgroups.g1.processor.type = failover
 # Priorities (a larger number means a higher priority)
 a1.sinkgroups.g1.processor.priority.k1 = 10
 a1.sinkgroups.g1.processor.priority.k2 = 5
 # Maximum backoff period for a failed sink: 10 seconds (the unit is milliseconds)
 a1.sinkgroups.g1.processor.maxpenalty = 10000

 # Bind the source to the channel
 a1.sources.r1.channels = c1
 # Bind the sinks to the channel
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c1

Create the Flume Agent configuration file flume-first.properties (on node7-3)

 # Agent names must not be repeated
 # Name the source, sinks, and channels
 a2.sources = r1
 a2.sinks = k1
 a2.channels = c1

 # avro: data serialization system
 a2.sources.r1.type = avro
 # Host name to bind to
 a2.sources.r1.bind = node7-3
 # Port
 a2.sources.r1.port = 52020
 # Interceptor: adds a header to every event, like "headers": {"key": "value"} in JSON
 a2.sources.r1.interceptors = i1
 a2.sources.r1.interceptors.i1.type = static
 a2.sources.r1.interceptors.i1.key = Collector
 a2.sources.r1.interceptors.i1.value = node7-3

 # Configure the sink
 a2.sinks.k1.type = hdfs
 # HDFS path
 a2.sinks.k1.hdfs.path = hdfs://node7-1:8020/flume/%Y-%m-%d/
 # Write records as text
 a2.sinks.k1.hdfs.writeFormat = Text
 # Required here: the time escapes in the path need a timestamp, and the events carry no timestamp header
 a2.sinks.k1.hdfs.useLocalTimeStamp = true
 # File prefix
 a2.sinks.k1.hdfs.filePrefix = %H-%M-%S
 # hdfs.inUsePrefix / hdfs.inUseSuffix: prefix and suffix of in-progress temporary files
 # hdfs.codeC: compression codec
 # Write the content as is, without compression
 a2.sinks.k1.hdfs.fileType = DataStream
 # File suffix
 a2.sinks.k1.hdfs.fileSuffix = .txt
 # Roll to a new file every 10 seconds
 a2.sinks.k1.hdfs.rollInterval = 10
 # Roll by file size in bytes; 0 disables the trigger
 a2.sinks.k1.hdfs.rollSize = 0
 # Roll by number of events; 0 disables the trigger
 a2.sinks.k1.hdfs.rollCount = 0

 # Configure the channel
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000000
 a2.channels.c1.transactionCapacity = 1000000

 # Bind the source to the channel
 a2.sources.r1.channels = c1
 # Bind the sink to the channel
 a2.sinks.k1.channel = c1

Create the Flume Agent configuration file flume-second.properties (on node7-4)

 # Agent names must not be repeated
 # Name the source, sinks, and channels
 a2.sources = r1
 a2.sinks = k1
 a2.channels = c1

 # Configure the source
 a2.sources.r1.type = avro
 # Host name to bind to
 a2.sources.r1.bind = node7-4
 # Port
 a2.sources.r1.port = 52020
 # Interceptor: adds a header to every event, like "headers": {"key": "value"} in JSON
 a2.sources.r1.interceptors = i1
 a2.sources.r1.interceptors.i1.type = static
 a2.sources.r1.interceptors.i1.key = Collector
 a2.sources.r1.interceptors.i1.value = node7-4

 # Configure the sink
 a2.sinks.k1.type = hdfs
 # HDFS path
 a2.sinks.k1.hdfs.path = hdfs://node7-1:8020/flume/%Y-%m-%d/
 # Write records as text
 a2.sinks.k1.hdfs.writeFormat = Text
 # Required here: the time escapes in the path need a timestamp, and the events carry no timestamp header
 a2.sinks.k1.hdfs.useLocalTimeStamp = true
 # File prefix
 a2.sinks.k1.hdfs.filePrefix=%H-%M-%S
 # hdfs.inUsePrefix / hdfs.inUseSuffix: prefix and suffix of in-progress temporary files
 # hdfs.codeC: compression codec
 # Write the content as is, without compression
 a2.sinks.k1.hdfs.fileType = DataStream
 # File suffix
 a2.sinks.k1.hdfs.fileSuffix = .txt
 # Roll to a new file every 10 seconds
 a2.sinks.k1.hdfs.rollInterval = 10
 # Roll by file size in bytes; 0 disables the trigger
 a2.sinks.k1.hdfs.rollSize = 0
 # Roll by number of events; 0 disables the trigger
 a2.sinks.k1.hdfs.rollCount = 0

 # Configure the channel
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100

 # Bind the source and the sink to the channel
 a2.sources.r1.channels = c1
 a2.sinks.k1.channel = c1

Startup

Start Hadoop first (node7-1 must be the active NameNode).

bin/flume-ng agent --conf conf --conf-file conf/flume-first.properties --name a2 -Dflume.root.logger=INFO,console

bin/flume-ng agent --conf conf --conf-file conf/flume-second.properties --name a2 -Dflume.root.logger=INFO,console

The watched directory must already exist.

bin/flume-ng agent --conf conf --conf-file conf/flume-web-server.properties --name a1 -Dflume.root.logger=INFO,console
