Flume Basics for Beginners, Part 1

爱上口袋的天空

Article link: https://blog.csdn.net/K_520_W/article/details/100602050

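1: Introduction

2: Flume roles

3: The Flume transfer process

4: Installing Flume

Upload the installation package to the Linux host and extract it into the target directory
Rename the configuration file
Edit the flume-env.sh file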
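
The three installation steps above might look like the following on a typical Linux host. This is a minimal sketch rather than the author's exact commands: the archive name `apache-flume-1.7.0-bin.tar.gz`, the `/opt/software` upload location, and the JDK path are assumptions, while `/opt/module/apache-flume-1.7.0-bin` is the install directory used later in this post. The helper name `configure_flume_env` is hypothetical.

```shell
# Hypothetical helper: rename the env template and record JAVA_HOME in it.
# $1 = Flume install directory, $2 = JDK directory (both are assumptions).
configure_flume_env() {
  flume_home="$1"
  java_home="$2"
  # Flume ships conf/flume-env.sh.template; rename it only if it is still there.
  if [ -f "$flume_home/conf/flume-env.sh.template" ]; then
    mv "$flume_home/conf/flume-env.sh.template" "$flume_home/conf/flume-env.sh"
  fi
  # Append the JAVA_HOME export so the Flume launcher can find the JDK.
  echo "export JAVA_HOME=$java_home" >> "$flume_home/conf/flume-env.sh"
}

# Example usage (all paths are assumptions, adjust them to your machine):
#   tar -zxf /opt/software/apache-flume-1.7.0-bin.tar.gz -C /opt/module/
#   configure_flume_env /opt/module/apache-flume-1.7.0-bin /opt/module/jdk1.8.0_144
```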

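5: Case 1: Monitoring port data

Requirement: Flume listens on a port in one console; a second console sends messages, and the listening side displays them in real time
Create the Flume agent configuration file job_flume_telnet.conf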
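
The contents of job_flume_telnet.conf are not shown in this post. A minimal sketch, assuming the standard netcat source and logger sink from the Flume user guide, could be written out like this. The agent name `a1`, the component names `r1`, `k1`, `c1`, and the `job` directory are assumptions chosen to match the naming used in the later cases.

```shell
# Directory for job files; ./job is an assumption, override JOB_DIR if needed.
JOB_DIR="${JOB_DIR:-./job}"
mkdir -p "$JOB_DIR"

# Write a netcat -> memory channel -> logger pipeline for an agent named a1.
cat > "$JOB_DIR/job_flume_telnet.conf" <<'EOF'
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# The netcat source listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# The logger sink prints every event to the agent's log
a1.sinks.k1.type = logger

# Buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
```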
Check whether port 44444 is already in use
Start Flume first so that it listens on the port
Use telnet to send content to port 44444 on the local machine
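
The three steps above can be sketched as a small wrapper around the `flume-ng` launcher. The helper name `start_agent` is hypothetical, the agent name `a1` and the `job/` path are assumptions, and `FLUME_HOME` defaults to the install directory used elsewhere in this post.

```shell
# Install directory of Flume; override FLUME_HOME if yours differs.
FLUME_HOME="${FLUME_HOME:-/opt/module/apache-flume-1.7.0-bin}"

# Hypothetical helper: run one Flume agent in the foreground with console logging.
# $1 = agent name as used in the config file, $2 = path to the config file.
start_agent() {
  "$FLUME_HOME/bin/flume-ng" agent \
    --conf "$FLUME_HOME/conf" \
    --name "$1" \
    --conf-file "$2" \
    -Dflume.root.logger=INFO,console
}

# Example session (agent name and paths are assumptions):
#   netstat -tunlp | grep 44444      # no output means the port is free
#   start_agent a1 job/job_flume_telnet.conf
#   telnet localhost 44444           # in a second console; type a line, press Enter
```

Each line typed into the telnet session should then appear as an event in the console where the agent is running.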

6: Case 2: Reading a local file into HDFS in real time

Requirement: monitor the Hive log in real time and upload it to HDFS
Copy the Hadoop-related JARs into Flume's /opt/module/apache-flume-1.7.0-bin/lib/ directory:
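
The list of JARs is not included in this post. As a rough sketch, assuming a Hadoop 2.x layout under `$HADOOP_HOME/share/hadoop`, the copy could be scripted as below. The set of file name patterns is an assumption based on what the HDFS sink commonly needs on Hadoop 2.x and may differ for your Hadoop version; the helper name `copy_hadoop_jars` is hypothetical.

```shell
# Hypothetical helper: copy the Hadoop client JARs the HDFS sink depends on.
# $1 = Hadoop install directory, $2 = Flume lib directory.
copy_hadoop_jars() {
  hadoop_home="$1"
  flume_lib="$2"
  # Assumed JAR set for Hadoop 2.x; check your own version before relying on it.
  for pattern in 'hadoop-common-[0-9]*.jar' 'hadoop-hdfs-[0-9]*.jar' \
                 'hadoop-auth-[0-9]*.jar' 'commons-configuration-*.jar' \
                 'commons-io-*.jar' 'htrace-core-*.jar'; do
    # Skip test and source artifacts that share the same name prefix.
    find "$hadoop_home/share/hadoop" -name "$pattern" \
      ! -name '*tests*' ! -name '*sources*' \
      -exec cp {} "$flume_lib/" \;
  done
}

# Example usage (the Hadoop path is an assumption):
#   copy_hadoop_jars /opt/module/hadoop-2.7.2 /opt/module/apache-flume-1.7.0-bin/lib
```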

Create the file job_flume_2hdfs.conf

Contents:

# Name the agent a2, its source r2, its sink k2 (an HDFS sink) and its channel c2
a2.sources = r2 
a2.sinks = k2 
a2.channels = c2 

# Describe/configure the source 
# The exec source runs a command and turns each line of its output into an event
a2.sources.r2.type = exec
# The command to run
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
# The shell used to invoke the command
a2.sources.r2.shell = /bin/bash -c
 
# Describe the sink 
a2.sinks.k2.type = hdfs 
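# Target path on HDFS
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
# Prefix for the names of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to round the timestamp down, which controls how often a new directory is started
a2.sinks.k2.hdfs.round = true
# Round down to a multiple of this many time units
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a2.sinks.k2.hdfs.useLocalTimeStamp = true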
# Number of events to accumulate before flushing to HDFS;
# kept at or below the channel's transactionCapacity (100)
a2.sinks.k2.hdfs.batchSize = 100
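# File type; DataStream writes events uncompressed (compression is also supported)
a2.sinks.k2.hdfs.fileType = DataStream
# How often to roll to a new file (in seconds)
a2.sinks.k2.hdfs.rollInterval = 600
# File size that triggers a roll (in bytes, just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of replicas per HDFS block
a2.sinks.k2.hdfs.minBlockReplicas = 1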
 
# Use a channel which buffers events in memory 
# The channel keeps events in memory: at most 1000 events in total, at most 100 per transaction
a2.channels.c2.type = memory 
a2.channels.c2.capacity = 1000 
a2.channels.c2.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
# Wire the source and sink to the channel: a source can feed several channels, but a sink reads from exactly one
a2.sources.r2.channels = c2 
a2.sinks.k2.channel = c2
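
To see where files land, it can help to expand the escape sequences in `hdfs.path` by hand. The sketch below prints the directory that `/flume/%Y%m%d/%H` would resolve to for the current hour. It assumes `useLocalTimeStamp = true` and that this shell runs on a host with the same clock and time zone as the agent; the helper name `flume_hdfs_dir` is hypothetical.

```shell
# Hypothetical helper: print the HDFS directory for the current hour,
# mirroring the %Y%m%d/%H escapes used in hdfs.path.
flume_hdfs_dir() {
  date +"/flume/%Y%m%d/%H"
}

# Example usage once the agent has written some data:
#   hdfs dfs -ls "$(flume_hdfs_dir)"
```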

Start the HDFS cluster
Run the agent with this configuration

Next, open another console and run some Hive operations:

Check HDFS in the browser:

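7: Case 3: Reading files from a directory into HDFS in real time

Requirement: use Flume to watch an entire directory for files

Create the configuration file job_flume_dir.conf

Contents:

# Name the agent a3, its source r3, its sink k3 (an HDFS sink) and its channel c3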
a3.sources = r3 
a3.sinks = k3 
a3.channels = c3 
 
# Describe/configure the source 
a3.sources.r3.type = spooldir 
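# Directory to watch
a3.sources.r3.spoolDir = /opt/module/apache-flume-1.7.0-bin/upload
# Suffix appended to a file once it has been fully ingested
a3.sources.r3.fileSuffix = .COMPLETED
# Add a header that stores the absolute path of the source file
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; they are not uploaded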
a3.sources.r3.ignorePattern = ([^ ]*\.tmp) 
 
# Describe the sink 
a3.sinks.k3.type = hdfs 
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H 
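# Prefix for the names of uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down, which controls how often a new directory is started
a3.sinks.k3.hdfs.round = true
# Round down to a multiple of this many time units
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes events uncompressed (compression is also supported)
a3.sinks.k3.hdfs.fileType = DataStream
# How often to roll to a new file (in seconds)
a3.sinks.k3.hdfs.rollInterval = 600
# File size that triggers a roll (in bytes, roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of replicas per HDFS block
a3.sinks.k3.hdfs.minBlockReplicas = 1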
 
# Use a channel which buffers events in memory 
a3.channels.c3.type = memory 
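# Maximum number of events stored in the channel
a3.channels.c3.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
a3.channels.c3.transactionCapacity = 100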
 
# Bind the source and sink to the channel 
a3.sources.r3.channels = c3 
a3.sinks.k3.channel = c3
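
The spooling directory source expects files to be complete and unchanged once they appear in the watched directory. One way to respect that, sketched below, is to copy a file in under a `.tmp` name (which the `ignorePattern` above skips) and rename it only when the copy has finished. The helper name `spool_file` is hypothetical, and the sketch assumes file names without spaces, since the pattern `([^ ]*\.tmp)` does not match names that contain them.

```shell
# Hypothetical helper: deliver a file into the spooling directory in two steps.
# $1 = file to deliver, $2 = spooling directory (spoolDir in the config).
spool_file() {
  src="$1"
  spool_dir="$2"
  name=$(basename "$src")
  # Copy under a .tmp name first; the ignorePattern makes Flume skip it.
  cp "$src" "$spool_dir/$name.tmp"
  # Rename once the copy is complete so Flume only ever sees a finished file.
  mv "$spool_dir/$name.tmp" "$spool_dir/$name"
}

# Example usage (the log file path is an assumption):
#   spool_file /tmp/app.log /opt/module/apache-flume-1.7.0-bin/upload
```

After Flume has ingested the file, it is renamed with the `.COMPLETED` suffix configured above, so each delivered file needs a name that has not been used in the directory before.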