Flume Basics

YuBx

Categories: Big Data, flume. Tags: flume, big data

Article link: https://blog.csdn.net/Yubingx/article/details/109457265

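I. Overview

Flume is a highly available, highly reliable system for collecting, aggregating, and transporting massive volumes of log data.

Flume is built on a streaming architecture and is simple and flexible.

Flume's most common job is to read data from files on a server's local disk and write it to HDFS.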

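II. Basic Architecture of Flume

source: receives and collects log data. Commonly used types include exec, spooldir, netcat, kafka, taildir, and avro.
channel: buffers the collected data. The main types are memory channel and file channel. The former buffers in memory, so events can be lost if the process or machine goes down; the latter is backed by disk, so events survive a system crash.
sink: writes the data out. Common types include HDFS, Kafka, avro, logger, and file (file_roll).
put transaction: the push side. Events are first added to the putList buffer. On commit, Flume checks whether the channel's in-memory queue has enough room to absorb them, and rolls the data back if it does not.
take transaction: the pull side. The sink takes events from the channel into the takeList buffer. If delivery succeeds, the takeList is cleared; if an exception occurs, the events in the takeList are returned to the channel's in-memory queue.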
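
The put and take transactions described above can be modelled in a few lines of Python. This is only a toy sketch to make the commit and rollback paths concrete, not Flume's actual implementation; the class names (MemoryChannelModel, PutTransaction, TakeTransaction, ChannelFullError) are made up for illustration.

```python
from collections import deque


class ChannelFullError(Exception):
    """Raised when a commit would exceed the channel's capacity."""


class MemoryChannelModel:
    """Toy model of a memory channel (hypothetical, not Flume code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()


class PutTransaction:
    def __init__(self, channel):
        self.channel = channel
        self.put_list = []  # events staged by the source

    def put(self, event):
        self.put_list.append(event)

    def commit(self):
        free = self.channel.capacity - len(self.channel.queue)
        if len(self.put_list) > free:
            self.rollback()
            raise ChannelFullError("not enough room in the channel queue")
        self.channel.queue.extend(self.put_list)
        self.put_list.clear()

    def rollback(self):
        # Staged events are dropped; the source is expected to retry.
        self.put_list.clear()


class TakeTransaction:
    def __init__(self, channel):
        self.channel = channel
        self.take_list = []  # events handed to the sink, not yet confirmed

    def take(self):
        if not self.channel.queue:
            return None
        event = self.channel.queue.popleft()
        self.take_list.append(event)
        return event

    def commit(self):
        # Delivery succeeded, so the buffered events can be forgotten.
        self.take_list.clear()

    def rollback(self):
        # Delivery failed: put the events back at the head, in original order.
        self.channel.queue.extendleft(reversed(self.take_list))
        self.take_list.clear()
```

In this model a failed delivery leaves the channel exactly as it was before the take, which is why a sink failure can lead to duplicates downstream but not to loss inside the channel.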

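III. How a Flume Agent Works Internally

The source receives event data.
Interceptors, if you have written and configured any, process the events.
The processed events are handed to the channel selector.
The channel selector decides which channel or channels should store each event.
Each sink then takes events from its channel and delivers them to its destination.

Note: there are two kinds of selector. Replicating copies every event coming from the source into every channel. Multiplexing selectively routes each event from the source to specific channels, based on a header value.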
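
The difference between the two selectors can be sketched as two small functions. This is an illustrative model only; the event layout, the header name "type", and the channel names c1 and c2 are assumptions chosen for the example.

```python
def replicating(event, channels):
    """Replicating: every channel receives a copy of the event."""
    return list(channels)


def multiplexing(event, header, mapping, default):
    """Multiplexing: pick channels by the value of one event header."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


# Hypothetical event and routing table.
event = {"headers": {"type": "error"}, "body": b"disk full"}
channels = ["c1", "c2"]
mapping = {"error": ["c1"], "access": ["c2"]}

print(replicating(event, channels))
print(multiplexing(event, "type", mapping, default=["c2"]))
```

An event whose header value has no mapping falls through to the default channel list in this sketch.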

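IV. Writing Flume conf Files, with Examples

Basic steps:
1. Name the source, channel, and sink components.
2. Write the detailed configuration of each source, channel, and sink.
3. Define how the sources, channels, and sinks are connected.

Example: monitoring a port and printing to the console (flume-telnet-logger.conf)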

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
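
The three steps (name, configure, bind) can be checked mechanically. The following Python sketch parses a Flume-style properties text and reports sources or sinks that are bound to undeclared channels. It is a hypothetical helper written for illustration, not part of Flume, and it covers only the binding rules visible in the example config.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems in the source/sink-to-channel bindings."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        # A source may be bound to several channels (plural key).
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} has no channel")
        problems += [f"source {source}: unknown channel {c}"
                     for c in bound if c not in declared]
    for sink in props.get(f"{agent}.sinks", "").split():
        # A sink is bound to exactly one channel (singular key).
        bound = props.get(f"{agent}.sinks.{sink}.channel", "")
        if bound not in declared:
            problems.append(f"sink {sink}: unknown or missing channel {bound!r}")
    return problems


CONF = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(check_wiring(parse_properties(CONF), "a1"))
```

Note the asymmetry the check relies on: a source uses the plural key `channels`, while a sink uses the singular key `channel`.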

Example: streaming a local file to HDFS in real time (flume-file-hdfs.conf)

To write to HDFS, Flume must have the relevant Hadoop jars available.

Copy commons-configuration-1.6.jar,
hadoop-auth-2.7.2.jar,
hadoop-common-2.7.2.jar,
hadoop-hdfs-2.7.2.jar,
commons-io-2.4.jar, and
htrace-core-3.1.0-incubating.jar
into the /opt/module/flume/lib directory.

Configuration

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop101:9000/flume/%Y%m%d/%H
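# Prefix for the uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll folders based on time (round the timestamp down)
a2.sinks.k2.hdfs.round = true
# How many time units pass before a new folder is created
a2.sinks.k2.hdfs.roundValue = 1
# Time unit for roundValue
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true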
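# Number of events to accumulate before flushing to HDFS; must not exceed the channel's transactionCapacity (100 in this agent)
a2.sinks.k2.hdfs.batchSize = 100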
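# File type; compression is also supported
a2.sinks.k2.hdfs.fileType = DataStream
# How often (in seconds) a new file is created
a2.sinks.k2.hdfs.rollInterval = 600
# File size (in bytes) that triggers a roll
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1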

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Example: streaming files from a directory to HDFS in real time (flume-dir-hdfs.conf)

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore every file ending in .tmp; such files are not uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop101:9000/flume/upload/%Y%m%d/%H
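# Prefix for the uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll folders based on time (round the timestamp down)
a3.sinks.k3.hdfs.round = true
# How many time units pass before a new folder is created
a3.sinks.k3.hdfs.roundValue = 1
# Time unit for roundValue
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is also supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often (in seconds) a new file is created
a3.sinks.k3.hdfs.rollInterval = 600
# File size (in bytes) that triggers a roll, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1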

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

V. Monitoring Flume

Flume can be monitored with Ganglia.

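VI. Advanced Flume: A Custom MySQLSource

Official documentation: https://flume.apache.org/FlumeDeveloperGuide.html#source
Extend the AbstractSource class and implement the Configurable and PollableSource interfaces.
Implement the following methods:
1. getBackOffSleepIncrement() // not used for now
2. getMaxBackOffSleepInterval() // not used for now
3. configure(Context context) // read the settings from the context
4. process() // fetch the data, wrap it in Events, and write them to the Channel; this method is called in a loop. Fetching from MySQL involves fairly complex logic, so a dedicated class, SQLSourceHelper, handles the interaction with MySQL.
5. stop() // release the related resources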
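
To show how the two back-off methods relate to the repeated calls to process(), here is a small Python model of a polling loop. It reflects my understanding that the runner sleeps for the number of consecutive BACKOFF results times the increment, capped at the maximum interval; treat that formula as an assumption to verify against your Flume version. The names run_source and backoff_delay are invented for this sketch.

```python
READY, BACKOFF = "READY", "BACKOFF"


def backoff_delay(consecutive_backoffs, increment_ms, max_interval_ms):
    """Delay grows linearly with consecutive BACKOFF results, up to a cap."""
    return min(consecutive_backoffs * increment_ms, max_interval_ms)


def run_source(process, increment_ms, max_interval_ms, iterations,
               sleep=lambda ms: None):
    """Call process() repeatedly and back off while it reports no new data.

    Returns the list of delays that were requested, for inspection.
    The default sleep is a no-op so the model runs instantly.
    """
    consecutive = 0
    delays = []
    for _ in range(iterations):
        if process() == BACKOFF:
            consecutive += 1
            delay = backoff_delay(consecutive, increment_ms, max_interval_ms)
            delays.append(delay)
            sleep(delay)
        else:
            # New data arrived, so the back-off counter starts over.
            consecutive = 0
    return delays


# Hypothetical sequence of process() results.
statuses = iter([BACKOFF, BACKOFF, BACKOFF, READY, BACKOFF])
delays = run_source(lambda: next(statuses), increment_ms=1000,
                    max_interval_ms=2500, iterations=5)
print(delays)
```

In a source that polls a database table, returning BACKOFF when a query finds no new rows keeps the loop from hammering MySQL.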

Implementation

Add the pom dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
</dependencies>

Add jdbc.properties and log4j.properties to the classpath (the first block below is jdbc.properties, the second is log4j.properties)

dbDriver=com.mysql.jdbc.Driver
dbUrl=jdbc:mysql://hadoop101:3306/mysqlsource?useUnicode=true&characterEncoding=utf-8
dbUser=root
dbPassword=123456

#--------console-----------
log4j.rootLogger=info,myconsole,myfile
log4j.appender.myconsole=org.apache.log4j.ConsoleAppender
log4j.appender.myconsole.layout=org.apache.log4j.SimpleLayout
#log4j.appender.myconsole.layout.ConversionPattern =%d [%t] %-5p [%c] - %m%n

#log4j.rootLogger=error,myfile
log4j.appender.myfile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.myfile.File=/tmp/flume.log
log4j.appender.myfile.layout=org.apache.log4j.PatternLayout
log4j.appender.myfile.layout.ConversionPattern =%d [%t] %-5p [%c] - %m%n

Prepare the configuration file (mysql.conf)

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.test.SQLSource
a1.sources.r1.connection.url = jdbc:mysql://192.168.1.101:3306/mysqlsource
a1.sources.r1.connection.user = root
a1.sources.r1.connection.password = 123456
a1.sources.r1.table = student
a1.sources.r1.columns.to.select = *
#a1.sources.r1.incremental.column.name = id
#a1.sources.r1.incremental.value = 0
a1.sources.r1.run.query.delay=5000

# Describe the sink
a1.sinks.k1.type = logger

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

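VII. Flume Parameter Tuning

source:
1. Adding more sources increases the capacity to read data.
2. batchSize sets how many events are shipped to the channel at once; raising it moderately improves the source's throughput when moving events into the channel.
channel:
1. memory channel gives the best performance; file channel gives the best fault tolerance.
2. With file channel, pointing dataDirs at directories on several different disks improves performance.
3. capacity sets the maximum number of events the channel can hold.
4. transactionCapacity sets the maximum number of events a source writes to the channel per transaction and the maximum a sink reads from the channel per transaction. (Note: transactionCapacity must be at least as large as the batchSize of both the source and the sink.)
sink:
1. Adding sinks, within reason, increases the capacity to consume events.
2. batchSize sets how many events are read from the channel at once; raising it moderately improves the sink's throughput when moving events out of the channel.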
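
The sizing rule in the note can be written as a quick sanity check. The sketch below is a hypothetical helper, not a Flume API. Besides the batchSize rule from the text, it also checks that transactionCapacity does not exceed capacity, which is a constraint I am adding from my knowledge of the memory channel rather than from this article. The component names and numbers are made up.

```python
def check_batch_sizes(capacity, transaction_capacity, batch_sizes):
    """Return readable problems for one channel and its attached components.

    batch_sizes maps a source or sink name to its batchSize.
    """
    problems = []
    if transaction_capacity > capacity:
        problems.append("transactionCapacity exceeds capacity")
    for name, batch in batch_sizes.items():
        if batch > transaction_capacity:
            problems.append(
                f"{name}: batchSize {batch} exceeds "
                f"transactionCapacity {transaction_capacity}"
            )
    return problems


# Hypothetical settings: one sink asks for more events than a transaction allows.
print(check_batch_sizes(1000, 100, {"sinkA": 1000, "sinkB": 100}))
```

A batchSize equal to transactionCapacity passes the check; only a larger value is flagged.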

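VIII. Flume's Transaction Mechanism

Flume's transaction mechanism is similar to a database's: Flume uses two independent transactions, one for delivering events from the Source to the Channel and one for delivering them from the Channel to the Sink.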
