Flume: Common Sinks and Use Cases

韩家小志

Article link: https://blog.csdn.net/qq_46893497/article/details/111086950

Common Sinks and Use Cases

1. Function
2. logger sink
3. hdfs sink
4. File Roll sink
5. avro sink
6. kafka sink

1. Function

A sink takes the events out of the Channel and delivers them to the destination.

2. logger sink

Function: writes each event to the log
Use case: testing
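
A minimal sketch of a logger sink definition, assuming the agent, sink, and channel names (a1, k1, c1) used in the configurations in this article. Only the sink part is shown; the source and channel definitions are left out.

```properties
# logger sink: writes each event to the agent's own log
a1.sinks.k1.type = logger
# take events from channel c1 (assumed to be defined elsewhere in the file)
a1.sinks.k1.channel = c1
```

When the agent is started with -Dflume.root.logger=INFO,console, the logged events show up on the console, which is why this sink is convenient for testing.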

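3. hdfs sink

http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.14.0/FlumeUserGuide.html#hdfs-sink

Function: writes data to HDFS; this is the sink used most often in production
Use cases
- Real-time collection into HDFS, feeding the offline (batch) architecture
- Data flow: Flume (data collection layer) -> HDFS (ETL) -> Hive (data warehouse)
Why not have Flume write directly into Hive?
- Is it possible? Yes
- But the target Hive table must meet certain requirements
- The Hive table must be bucketed
- The Hive table must be stored as ORC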
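
For reference, a Hive sink definition might look roughly like the sketch below. The property names follow the Hive sink section of the Flume User Guide; the metastore URI, database, table, and field names are hypothetical, and the target table is assumed to already exist as a bucketed ORC table.

```properties
# Hive sink sketch; all values below are illustrative assumptions
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
# hypothetical metastore URI
a1.sinks.k1.hive.metastore = thrift://node1:9083
# hypothetical database and table; the table must be bucketed and stored as ORC
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = hive_log
# parse each event body as delimited text (the default delimiter is a comma)
a1.sinks.k1.serializer = DELIMITED
# hypothetical column names that the delimited fields map to
a1.sinks.k1.serializer.fieldnames = id,msg
```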

Requirement 1: continuously collect the Hive log into HDFS

Start HDFS and Hive
Copy the config file

cp hive-mem-console.properties hive-mem-hdfs.properties

Edit the config file

# define sourceName/channelName/sinkName for the agent 
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# define the s1
a1.sources.s1.type = exec
a1.sources.s1.command  = tail -f /export/servers/hive-1.1.0-cdh5.14.0/logs/hive.log


# define the c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# def the k1
a1.sinks.k1.type = hdfs
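# HDFS directory to write to
a1.sinks.k1.hdfs.path = /flume/hdfs/normal
# prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix = hiveLog
# suffix of the generated file names
a1.sinks.k1.hdfs.fileSuffix = .log
# type of file written to HDFS: plain, uncompressed data stream
a1.sinks.k1.hdfs.fileType = DataStream

# bind the source and the sink to the channel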
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Test run

bin/flume-ng agent -c conf/ -f userCase/hive-mem-hdfs.properties  -n a1 -Dflume.root.logger=INFO,console

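Question 1: how do we control the size of the files generated on HDFS? Ideally each file is roughly one block in size, which is best for storage.
- HDFS is not suited to storing small files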

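Requirement 2: read the Hive log and collect it into HDFS, rolling files at a fixed size

Default size: rollSize = 1024 bytes = 1 KB
- Problem: the default rolls a file at 1 KB, yet in practice each file ends up larger than 1 KB
- Reason: the event is the smallest unit of data transfer in Flume
  - If the file is still under 1 KB after an event is written, the next event is written as well
  - Writing that next event pushes the file past 1 KB
In production, files are usually about one block in size
- Because whole events overshoot the threshold
- We generally recommend setting the byte count for about 125 MB
  - If it is set to the byte count for exactly 128 MB, the overshoot pushes the file slightly past one 128 MB block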
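
As a sketch of that recommendation, assuming a 128 MB HDFS block size and the sink name k1 used in this article, the size-based roll settings could look like this (125 MB = 125 * 1024 * 1024 = 131072000 bytes):

```properties
# roll a new file once about 125 MB has been written
a1.sinks.k1.hdfs.rollSize = 131072000
# 0 disables time-based rolling, so only the size threshold applies
a1.sinks.k1.hdfs.rollInterval = 0
# 0 disables event-count-based rolling
a1.sinks.k1.hdfs.rollCount = 0
```

The time and count triggers are switched off here because either of them would otherwise roll the file long before it reaches the size threshold.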
The test below uses 10 KB as the roll size
Copy the config file

cp hive-mem-hdfs.properties hive-mem-size.properties

Edit the config file

# define sourceName/channelName/sinkName for the agent 
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# define the s1
a1.sources.s1.type = exec
a1.sources.s1.command  = tail -f /export/servers/hive-1.1.0-cdh5.14.0/logs/hive.log


# define the c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# def the k1
a1.sinks.k1.type = hdfs
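# HDFS directory to write to
a1.sinks.k1.hdfs.path = /flume/hdfs/size
# prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix = hiveLog
# suffix of the generated file names
a1.sinks.k1.hdfs.fileSuffix = .log
# type of file written to HDFS: plain, uncompressed data stream
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 10 KB
a1.sinks.k1.hdfs.rollSize = 10240
# roll by time interval in seconds; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
# roll by number of events; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount = 0

# bind the source and the sink to the channel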
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Test run

bin/flume-ng agent -c conf/ -f userCase/hive-mem-size.properties -n a1 -Dflume.root.logger=INFO,console

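Question 2: suppose data is collected into HDFS, run through ETL, and then loaded into a Hive table, and the Hive table is an external partitioned table as usual. How do we partition according to the data?
Hive partitions are usually by time. Can the data be partitioned by time at collection, so that data from different times is written to different HDFS directories? Hive can then load each directory directly as a partition.
In practice, we do not use the hive sink to collect data directly into Hive
We use the HDFS sink in place of the Hive sink
step1: use Flume to collect the data into an HDFS directory
step2: create a table in Hive and point its location at that HDFS directory
- Note: the Hive table's field delimiter must match the delimiter used in the data files
- In Hive we generally recommend partitioned tables, which reduce the input a query has to read
- A partitioned table is stored in HDFS as one directory per partition
- How can Flume produce one directory per partition?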

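Requirement 3: Flume collects the Hive log and creates time-based partition directories on HDFS according to the collection time

Copy the config file

cp hive-mem-size.properties hive-mem-part.properties

Edit the config file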

# define sourceName/channelName/sinkName for the agent 
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# define the s1
a1.sources.s1.type = exec
a1.sources.s1.command  = tail -f /export/servers/hive-1.1.0-cdh5.14.0/logs/hive.log


# define the c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# def the k1
a1.sinks.k1.type = hdfs
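# HDFS directory to write to; %Y%m%d and %H expand to the date and the hour
a1.sinks.k1.hdfs.path = /flume/hdfs/part/daystr=%Y%m%d/hourstr=%H
# prefix of the generated file names
a1.sinks.k1.hdfs.filePrefix = hiveLog
# suffix of the generated file names
a1.sinks.k1.hdfs.fileSuffix = .log
# type of file written to HDFS: plain, uncompressed data stream
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 10 KB
a1.sinks.k1.hdfs.rollSize = 10240
# roll by time interval in seconds; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
# roll by number of events; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount = 0
# use the agent's local time, not a timestamp event header, to fill in the time escapes in the path
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# bind the source and the sink to the channel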
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Test run

bin/flume-ng agent -c conf/ -f userCase/hive-mem-part.properties -n a1 -Dflume.root.logger=INFO,console

4. File Roll sink

Function: writes the collected data to the local file system
Use case: keeping a local backup of the collected data
Required properties

type: the component type name; must be file_roll
sink.directory: the directory where files will be stored
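
A minimal sketch of a File Roll sink using those two required properties, assuming the a1/k1/c1 names from the earlier configurations. The backup directory is a hypothetical path and is assumed to exist before the agent starts; sink.rollInterval is optional and shown only for illustration.

```properties
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
# hypothetical local backup directory, assumed to exist already
a1.sinks.k1.sink.directory = /tmp/flume/backup
# optional: start a new file every 60 seconds (the default is 30; 0 disables rolling)
a1.sinks.k1.sink.rollInterval = 60
```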

5. avro sink

Usually paired with an avro source
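
The pairing can be sketched as two agents, where the first agent's avro sink sends events to the second agent's avro source. The agent name a2, the host name node2, and port 4545 are assumptions for illustration only.

```properties
# agent a1 (upstream): avro sink sends events to the downstream agent
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
# hypothetical host and port where the downstream avro source listens
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 4545

# agent a2 (downstream, running on node2): avro source receives those events
a2.sources.s1.type = avro
a2.sources.s1.bind = 0.0.0.0
a2.sources.s1.port = 4545
a2.sources.s1.channels = c1
```

The sink's hostname and port have to match the address the downstream source binds to; the two agents would normally live in separate config files on their own machines.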

6. kafka sink

Sends data to Kafka
Collects data into Kafka in real time, feeding the real-time architecture
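
A minimal sketch of a Kafka sink definition, assuming the a1/k1/c1 names used earlier. The broker address and topic name are hypothetical. The property names shown are the kafka.-prefixed ones from newer Flume releases; the original Apache Flume 1.6 Kafka sink uses brokerList and topic instead, so check the user guide for the version in use.

```properties
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.channel = c1
# hypothetical broker address (brokerList in the original Apache Flume 1.6 sink)
a1.sinks.k1.kafka.bootstrap.servers = node1:9092
# hypothetical topic name (topic in the original Apache Flume 1.6 sink)
a1.sinks.k1.kafka.topic = flume-events
```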
