Getting Started with Big Data: Flume (Part 1), Installation Guide and Examples

Latest recommended article published on 2023-10-13 21:35:06

许中宝

Category: Big Data. Tags: flume, big data

Article link: https://blog.csdn.net/qq_18675693/article/details/119268417

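Flume (Part 1): Installation Guide and Examples

Installation

Download apache-flume-1.9.0-bin.tar.gz and unpack it.
Configure JAVA_HOME. Flume reads conf/flume-env.sh rather than the template, so copy the template first and add the export line to the copy:
cp conf/flume-env.sh.template conf/flume-env.sh
vi conf/flume-env.sh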
```
export JAVA_HOME=/opt/module/jdk1.8.0_144
```

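Examples

Official example: monitoring port data (netcat-logger)

See the official documentation for the original example.
This example relies on netcat; install it first if it is missing.

# Install
yum install -y nc

# Start a server (only to check that nc works; stop it before starting Flume, which binds the same port)
nc -l localhost 44444

# Start a client; the two ends can now exchange text
nc  localhost 44444

Flume configuration file: we create it under the Flume home directory as job/flume-netcat-logger.conf (create the job directory first if it does not exist).
vi job/flume-netcat-logger.conf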

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the agent

[hadoop@hadoop101 apache-flume-1.9.0-bin]$ bin/flume-ng agent --conf conf/ \
 --conf-file job/flume-netcat-logger.conf \
 --name a1 \
 -Dflume.root.logger=INFO,console

Flume is now listening for TCP data on localhost:44444. Start an nc client and try it out.
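To try the running agent without typing into an interactive client, a small helper can push one line at a time. This is only a sketch: the function name send_line is made up for illustration, and the -w timeout flag is assumed to behave similarly across the nc variants you may have installed.

```shell
# Hypothetical helper (not part of Flume): send one line to the netcat source.
# Assumes the agent from job/flume-netcat-logger.conf is listening on localhost:44444.
send_line() {
  # printf appends the newline that the netcat source uses to delimit events
  printf '%s\n' "$1" | nc -w 2 localhost 44444
}

# Usage, once the agent is up:
#   send_line "hello flume"
```

If everything is wired correctly, the agent console should log an event whose body contains the text that was sent.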

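Monitoring a single appended file in real time (exec-hdfs)

The exec source runs tail -f xxx-file to watch a single file for changes.
source: exec
sink: logger first, then hdfs in the advanced version
vi job/flume-exec-logger.conf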

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command= tail -f /opt/module/apache-flume-1.9.0-bin/demo/tail-log.txt

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Create the file

mkdir demo
touch demo/tail-log.txt

Start the agent

bin/flume-ng agent -c conf/ \
-f job/flume-exec-logger.conf \
-n a1 \
-Dflume.root.logger=INFO,console

Advanced version

Now switch the sink from logger to hdfs.
The configuration file becomes:
vi job/flume-exec-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command= tail -f /opt/module/apache-flume-1.9.0-bin/demo/tail-log.txt

# Describe the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://hadoop101:8020/flume/%Y%m%d/%H
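# Whether to round the timestamp down, which controls how directories roll over time
a1.sinks.k1.hdfs.round = true
# How many time units pass before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp instead of a timestamp from the event header
# Directory rolling needs a timestamp; without this option the event header must carry one
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File name prefix
a1.sinks.k1.hdfs.filePrefix=log-
# Interval after which the current file is rolled, in seconds; 0 disables this trigger
a1.sinks.k1.hdfs.rollInterval=30
# File size that triggers a roll, in bytes; 0 disables this trigger. The default is 1024
# In production this is usually set slightly below the block size, so a roll happens about once per block
a1.sinks.k1.hdfs.rollSize=134217700
# Number of events written before a roll; 0 disables this trigger
a1.sinks.k1.hdfs.rollCount=0
# Batch size: number of events accumulated before they are written to HDFS
a1.sinks.k1.hdfs.batchSize=100
# File type. DataStream means no compression, and hdfs.codeC must not be set
a1.sinks.k1.hdfs.fileType = DataStream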

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
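The rollSize value in this configuration, 134217700, sits just under a 128 MiB block. A quick check of the arithmetic, assuming the cluster uses a 128 MiB HDFS block size (the article does not state the actual block size, so treat it as an assumption):

```shell
# Assumption: the HDFS block size is 128 MiB (not stated in the article).
block_size=$((128 * 1024 * 1024))
roll_size=134217700   # value of a1.sinks.k1.hdfs.rollSize above
margin=$((block_size - roll_size))
echo "block=$block_size rollSize=$roll_size margin=$margin bytes"
```

Because rollInterval=30 is also set, whichever trigger fires first rolls the file, so under light traffic files will usually roll on the 30-second timer long before they approach this size.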

If the following error appears:

Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
        at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
        at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:221)
        at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:572)
        at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:412)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)

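This error typically points to a Guava version conflict: Flume 1.9.0 bundles an older guava jar in its lib/ directory than the one Hadoop 3.x expects. Refer to the linked article for the fix.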

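Remaining problems

Only one file can be monitored.
What happens if tail -f dies while running? Restart it? Two problems can follow. By default tail -f first prints the last 10 lines of the file, so those lines are uploaded again. Some readers may suggest adding -n 0, but then any lines written while tail was down are lost. What is really missing here is the ability to resume from the last read position.

Follow-up question: what is the difference between tail -F and tail -f?
See the linked reference.

tail -f: follows the file by inode. It stops picking up new data once the file is moved (including renamed) or deleted, and it does not follow a file re-created under the same name.
tail -F: follows the file by name. It also loses the file when it is deleted or moved, but it keeps retrying and resumes once a file with the same name is created again.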
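The inode behaviour behind this difference can be observed without running tail at all. The sketch below simulates a log rotation in a temporary directory (the file names are made up for illustration) and prints the inode at each step:

```shell
# Sketch: a rename keeps the inode, while a re-created file gets a new one.
workdir=$(mktemp -d)
log="$workdir/app.log"

echo "line 1" > "$log"
inode_before=$(ls -i "$log" | awk '{print $1}')

# Simulate rotation: move the old file aside, then create a new file with the old name.
mv "$log" "$log.1"
inode_rotated=$(ls -i "$log.1" | awk '{print $1}')
echo "line 2" > "$log"
inode_new=$(ls -i "$log" | awk '{print $1}')

echo "before=$inode_before rotated=$inode_rotated new=$inode_new"
```

The rotated file keeps the original inode, which is the file tail -f stays attached to, while the new app.log has a different inode that only a name-based follower such as tail -F will pick up.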

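Monitoring multiple appended files in real time (taildir)

It removes the exec tail -f limitation of monitoring only one file.
It also supports resuming from the last read position: the positionFile parameter names a file that records how far each file has been read.

source: taildir (see the linked documentation)
Configuration file:
vi job/flume-taildir-hdfs.conf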

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /opt/module/apache-flume-1.9.0-bin/demo/test01/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/apache-flume-1.9.0-bin/demo/test01/tail-01.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /opt/module/apache-flume-1.9.0-bin/demo/test01/tail-02.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000

# Describe the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://hadoop101:8020/flume/%Y%m%d/%H
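# Whether to round the timestamp down, which controls how directories roll over time
a1.sinks.k1.hdfs.round = true
# How many time units pass before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp instead of a timestamp from the event header when building the HDFS path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File name prefix
a1.sinks.k1.hdfs.filePrefix=log-
# Interval after which the current file is rolled, in seconds; 0 disables this trigger
a1.sinks.k1.hdfs.rollInterval=30
# File size that triggers a roll, in bytes; 0 disables this trigger. The default is 1024
# In production this is usually set slightly below the block size, so a roll happens about once per block
a1.sinks.k1.hdfs.rollSize=134217700
# Number of events written before a roll; 0 disables this trigger
a1.sinks.k1.hdfs.rollCount=0
# Batch size: number of events accumulated before they are written to HDFS
a1.sinks.k1.hdfs.batchSize=100
# File type. DataStream means no compression, and hdfs.codeC must not be set
a1.sinks.k1.hdfs.fileType = DataStream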

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
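The %Y%m%d/%H escapes in hdfs.path use the same letters as the date command, so the directory an event would land in right now can be previewed locally. This is a rough sketch: it assumes useLocalTimeStamp = true and that the agent host and the current shell share the same clock and time zone.

```shell
# Preview the time bucket that hdfs.path = .../flume/%Y%m%d/%H resolves to at this moment.
bucket=$(date +%Y%m%d/%H)
echo "hdfs://hadoop101:8020/flume/$bucket"
```

A new directory therefore appears once per hour, and files inside it roll according to the rollInterval, rollSize and rollCount settings.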

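About Taildir:
The Taildir Source maintains a position file in JSON format and periodically writes to it the latest read offset of every tailed file, which is what makes resuming possible.

Note:
taildir does not identify a file by inode alone. In testing, after renaming a file and continuing to append to the renamed file, the new content was not picked up. After renaming the file back to its original name (including the path), so that it matched the entry in the position file again, the content that had not been uploaded was uploaded automatically.
Summary: taildir decides whether a file has new content based on the file name and path together with the inode.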
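To make the position file less abstract, the sketch below writes a sample file and pulls the recorded offset out of it. The inode and pos values are invented for illustration, and the field layout (inode, pos, file) reflects what Flume 1.9 is understood to write; inspect your own taildir_position.json to confirm.

```shell
# Sketch with made-up values: one JSON object per tailed file.
workdir=$(mktemp -d)
pos_file="$workdir/taildir_position.json"
cat > "$pos_file" <<'EOF'
[{"inode":123456,"pos":42,"file":"/opt/module/apache-flume-1.9.0-bin/demo/test01/tail-01.txt"}]
EOF

# Extract the byte offset from which reading would resume after a restart.
pos=$(sed -n 's/.*"pos":\([0-9]*\).*/\1/p' "$pos_file")
echo "resume offset: $pos"
```

Each entry records both the inode and the path, which matches the rename behaviour described in the note above.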

Monitoring multiple new files in a directory in real time (spooldir-hdfs)

source: spooldir

vi job/flume-dir-hdfs.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/module/apache-flume-1.9.0-bin/demo/upload
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.fileHeader = true
# Ignore every file whose name ends in .tmp; such files are not uploaded
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://hadoop101:8020/flume/%Y%m%d/%H
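# Whether to round the timestamp down, which controls how directories roll over time
a1.sinks.k1.hdfs.round = true
# How many time units pass before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp instead of a timestamp from the event header when building the HDFS path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File name prefix
a1.sinks.k1.hdfs.filePrefix=log-
# Interval after which the current file is rolled, in seconds; 0 disables this trigger
a1.sinks.k1.hdfs.rollInterval=30
# File size that triggers a roll, in bytes; 0 disables this trigger. The default is 1024
# In production this is usually set slightly below the block size, so a roll happens about once per block
a1.sinks.k1.hdfs.rollSize=134217700
# Number of events written before a roll; 0 disables this trigger
a1.sinks.k1.hdfs.rollCount=0
# Batch size: number of events accumulated before they are written to HDFS
a1.sinks.k1.hdfs.batchSize=100
# File type. DataStream means no compression, and hdfs.codeC must not be set
a1.sinks.k1.hdfs.fileType = DataStream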

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
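The spooling directory source expects a file to be complete and unchanged once it becomes visible, so a common way to feed it is to write under a .tmp name (skipped by the ignorePattern above) and rename when done. The sketch below uses a temporary directory as a stand-in for demo/upload, and the file names are made up for illustration:

```shell
# Sketch: write under a .tmp name first, then rename so the source only sees finished files.
spool_dir=$(mktemp -d)   # stand-in for /opt/module/apache-flume-1.9.0-bin/demo/upload
tmp_name="$spool_dir/events-001.log.tmp"
final_name="$spool_dir/events-001.log"

printf 'first event\nsecond event\n' > "$tmp_name"   # ignored while it ends in .tmp
mv "$tmp_name" "$final_name"                          # rename within the same directory
ls "$spool_dir"
```

After Flume has consumed the file, it is renamed with the configured fileSuffix, .COMPLETED in this configuration.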
