Flume + Hadoop Big Data Collection Deployment

Introduction

In big data processing, collecting log data is the first step toward analysis. Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large volumes of log data into a centralized data store. This article walks through using Flume to collect log data and upload it to the Hadoop Distributed File System (HDFS).

About Flume

Apache Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. Built on a streaming architecture, it is flexible and simple, able to read data from a server's local disk in real time and write it to HDFS.

System Requirements

  • Hadoop 2.8.0
    Baidu Netdisk link: https://pan.baidu.com/s/16VZGWk4kdiJ6GYxDP5BUew
    Extraction code: j9fa
  • Flume 1.9.0
    Baidu Netdisk link: https://pan.baidu.com/s/1eLLKeQWaMvPjSJziEewfVA
    Extraction code: 3q2s
  • CentOS 7
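This walkthrough assumes Hadoop and Flume are unpacked under /opt/server with JDK 1.8, matching the paths that appear in the startup log later on. A minimal sketch of the environment variables under that assumption:

# Append to /etc/profile, then run `source /etc/profile`.
# Paths are assumptions taken from the startup log shown later in this article.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
export HADOOP_HOME=/opt/server/hadoop-2.8.0
export FLUME_HOME=/opt/server/flume
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$FLUME_HOME/bin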

Flume Configuration Structure

A Flume configuration file defines where the data flow comes from and where it goes. A basic example is an agent that collects data from a local port and writes it to the console.

[Figure: Flume agent architecture]

As the architecture shows, a Flume agent has three main parts: the source, the channel, and the sink.
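As a concrete sketch of that basic "local port to console" example (assuming the netcat source and logger sink, the usual way to build it):

# Minimal agent: netcat source -> memory channel -> logger (console) sink.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# netcat source: listens for lines on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# logger sink: prints events to the console at INFO level
a1.sinks.k1.type = logger

# memory channel buffers events between source and sink
a1.channels.c1.type = memory

# wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1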

Configuring Flume

Create a flume-hdfs.conf file under /opt/server/flume/conf:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source. The exec source ignores bind/port properties
# (those belong to the netcat source), so only the command is needed.
# Keep comments on their own lines: Flume does not strip inline comments,
# so anything after the value would become part of the command.
a1.sources.r1.type = exec
# tail -F streams every new line appended to the log file.
a1.sources.r1.command = tail -F /var/log/flume-test.log

# Configure the sink.
a1.sinks.k1.type = hdfs
# "zhang" is the hostname (run `hostname` to see yours). The /flume/logs/
# path does not need to be created by hand; Flume creates it automatically.
a1.sinks.k1.hdfs.path = hdfs://zhang:9000/flume/logs/

# Configure the channel.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel. Note "channels" (plural) on the
# source: a source can feed several channels, but a sink drains exactly one.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
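With only hdfs.path set, the sink falls back to its defaults: it writes SequenceFiles and rolls to a new file every 30 seconds, 1024 bytes, or 10 events, whichever comes first, which is why the HDFS listing later in this article shows many small FlumeData files. A sketch of the usual tuning properties (the values here are illustrative, not from the original setup):

# Optional HDFS sink tuning (illustrative values).
a1.sinks.k1.hdfs.fileType = DataStream   # plain text instead of SequenceFile
a1.sinks.k1.hdfs.filePrefix = applog     # file name prefix instead of FlumeData
a1.sinks.k1.hdfs.rollInterval = 600      # roll every 10 minutes...
a1.sinks.k1.hdfs.rollSize = 134217728    # ...or at 128 MB, whichever comes first
a1.sinks.k1.hdfs.rollCount = 0           # never roll by event count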

Start Hadoop HDFS

start-dfs.sh
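A quick sanity check is to list the running JVM processes; the exact set depends on your cluster layout, but a single-node setup should show at least the NameNode and DataNode:

# Expect NameNode, DataNode, and (on a single node) SecondaryNameNode.
jps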

If HDFS started normally, the web UI looks like this:

[Figure: HDFS NameNode web UI]

You can also check with hdfs dfs -ls /:

[root@localhost 192 conf]# hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - root supergroup          0 2024-07-13 08:09 /flume
-rw-r--r--   1 root supergroup          0 2024-07-12 19:53 /test.txt

If the files are listed, HDFS is working.

Start Flume

Run the start command from the directory containing flume-hdfs.conf. Here --conf points at the configuration directory, --conf-file names the agent configuration file, and --name must match the agent name (a1) used inside it. (The console output below was captured from a run using the file channel variant shown in the appendix, hence the FileChannel messages.)

[root@localhost 192 conf]# flume-ng agent --conf ./ --conf-file flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
Info: Sourcing environment configuration script /opt/server/flume/conf/flume-env.sh
Info: Including Hadoop libraries found via (/opt/server/hadoop-2.8.0/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/lib/jvm/jdk1.8.0_65/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/opt/server/flume/conf:/opt/server/flume/lib/*:/opt/server/hadoop-2.8.0/etc/hadoop:/opt/server/hadoop-2.8.0/share/hadoop/common/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/common/*:/opt/server/hadoop-2.8.0/share/hadoop/hdfs:/opt/server/hadoop-2.8.0/share/hadoop/hdfs/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/hdfs/*:/opt/server/hadoop-2.8.0/share/hadoop/yarn/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/yarn/*:/opt/server/hadoop-2.8.0/share/hadoop/mapreduce/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/mapreduce/*:/opt/server/hadoop-2.8.0/contrib/capacity-scheduler/*.jar:/lib/*' -Djava.library.path=:/opt/server/hadoop-2.8.0/lib/native org.apache.flume.node.Application --conf-file flume-hdfs.conf --name a1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/server/flume/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/server/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2024-07-14 05:36:25,840 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:62)] Configuration provider starting
2024-07-14 05:36:25,850 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:138)] Reloading configuration file:flume-hdfs.conf
2024-07-14 05:36:25,861 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:c1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1117)] Added sinks: k1 Agent: a1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:k1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:k1
2024-07-14 05:36:25,863 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:k1
2024-07-14 05:36:25,863 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,863 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:c1
2024-07-14 05:36:25,863 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:c1
2024-07-14 05:36:25,863 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateConfigFilterSet(FlumeConfiguration.java:623)] Agent configuration for 'a1' has no configfilters.
2024-07-14 05:36:25,908 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:163)] Post-validation flume configuration contains configuration for agents: [a1]
2024-07-14 05:36:25,909 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:151)] Creating channels
2024-07-14 05:36:25,916 (conf-file-poller-0) [INFO - org.apache.flume.channel.DefaultChannelFactory.create(DefaultChannelFactory.java:42)] Creating instance of channel c1 type file
2024-07-14 05:36:26,038 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:205)] Created channel c1
2024-07-14 05:36:26,039 (conf-file-poller-0) [INFO - org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:41)] Creating instance of source r1, type exec
2024-07-14 05:36:26,044 (conf-file-poller-0) [INFO - org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:42)] Creating instance of sink: k1, type: hdfs
2024-07-14 05:36:26,092 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:120)] Channel c1 connected to [r1, k1]
2024-07-14 05:36:26,110 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:162)] Starting new configuration:{ sourceRunners:{r1=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource{name:r1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@61cd9609 counterGroup:{ name:null counters:{} } }} channels:{c1=FileChannel c1 { dataDirs: [/root/.flume/file-channel/data] }} }
2024-07-14 05:36:26,118 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:169)] Starting Channel c1
2024-07-14 05:36:26,137 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Waiting for channel: c1 to start. Sleeping for 500 ms
2024-07-14 05:36:26,137 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.FileChannel.start(FileChannel.java:278)] Starting FileChannel c1 { dataDirs: [/root/.flume/file-channel/data] }...
2024-07-14 05:36:26,358 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2024-07-14 05:36:26,359 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: CHANNEL, name: c1 started
2024-07-14 05:36:26,376 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.<init>(Log.java:356)] Encryption is not enabled
2024-07-14 05:36:26,384 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.replay(Log.java:406)] Replay started
2024-07-14 05:36:26,396 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.replay(Log.java:418)] Found NextFileID 1, from [/root/.flume/file-channel/data/log-1]
2024-07-14 05:36:26,462 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>(EventQueueBackingStoreFileV3.java:55)] Starting up with /root/.flume/file-channel/checkpoint/checkpoint and /root/.flume/file-channel/checkpoint/checkpoint.meta
2024-07-14 05:36:26,463 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>(EventQueueBackingStoreFileV3.java:59)] Reading checkpoint metadata from /root/.flume/file-channel/checkpoint/checkpoint.meta
2024-07-14 05:36:26,686 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.FlumeEventQueue.<init>(FlumeEventQueue.java:115)] QueueSet population inserting 0 took 0
2024-07-14 05:36:26,699 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.replay(Log.java:457)] Last Checkpoint Sun Jul 14 05:36:22 CST 2024, queue depth = 0
2024-07-14 05:36:26,701 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.doReplay(Log.java:542)] Replaying logs with v2 replay logic
2024-07-14 05:36:26,703 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.ReplayHandler.replayLog(ReplayHandler.java:249)] Starting replay of [/root/.flume/file-channel/data/log-1]
2024-07-14 05:36:26,704 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.ReplayHandler.replayLog(ReplayHandler.java:262)] Replaying /root/.flume/file-channel/data/log-1
2024-07-14 05:36:26,717 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.tools.DirectMemoryUtils.getDefaultDirectMemorySize(DirectMemoryUtils.java:112)] Unable to get maxDirectMemory from VM: NoSuchMethodException: sun.misc.VM.maxDirectMemory(null)
2024-07-14 05:36:26,719 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.tools.DirectMemoryUtils.allocate(DirectMemoryUtils.java:48)] Direct Memory Allocation:  Allocation = 1048576, Allocated = 0, MaxDirectMemorySize = 20316160, Remaining = 20316160
2024-07-14 05:36:26,843 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.LogFile$SequentialReader.skipToLastCheckpointPosition(LogFile.java:658)] fast-forward to checkpoint position: 1655
2024-07-14 05:36:26,848 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.LogFile$SequentialReader.next(LogFile.java:683)] Encountered EOF at 1655 in /root/.flume/file-channel/data/log-1
2024-07-14 05:36:26,848 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.ReplayHandler.replayLog(ReplayHandler.java:345)] read: 0, put: 0, take: 0, rollback: 0, commit: 0, skip: 0, eventCount:0
2024-07-14 05:36:26,848 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.FlumeEventQueue.replayComplete(FlumeEventQueue.java:417)] Search Count = 0, Search Time = 0, Copy Count = 0, Copy Time = 0
2024-07-14 05:36:26,854 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.replay(Log.java:505)] Rolling /root/.flume/file-channel/data
2024-07-14 05:36:26,854 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.roll(Log.java:990)] Roll start /root/.flume/file-channel/data
2024-07-14 05:36:26,855 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.LogFile$Writer.<init>(LogFile.java:220)] Opened /root/.flume/file-channel/data/log-2
2024-07-14 05:36:26,872 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.roll(Log.java:1006)] Roll end
2024-07-14 05:36:26,872 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint(EventQueueBackingStoreFile.java:230)] Start checkpoint for /root/.flume/file-channel/checkpoint/checkpoint, elements to sync = 0
2024-07-14 05:36:26,876 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint(EventQueueBackingStoreFile.java:255)] Updating checkpoint metadata: logWriteOrderID: 1720906586485, queueSize: 0, queueHead: 9
2024-07-14 05:36:26,936 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.Log.writeCheckpoint(Log.java:1065)] Updated checkpoint for file: /root/.flume/file-channel/data/log-2 position: 0 logWriteOrderID: 1720906586485
2024-07-14 05:36:26,936 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.FileChannel.start(FileChannel.java:289)] Queue Size after replay: 0 [channel=c1]
2024-07-14 05:36:26,938 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:196)] Starting Sink k1
2024-07-14 05:36:26,939 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
2024-07-14 05:36:26,940 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SINK, name: k1 started
2024-07-14 05:36:26,957 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:207)] Starting Source r1
2024-07-14 05:36:26,957 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:170)] Exec source starting with command: tail -F /var/log/flume-test.log
2024-07-14 05:36:26,958 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: SOURCE, name: r1: Successfully registered new MBean.
2024-07-14 05:36:26,958 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SOURCE, name: r1 started
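The agent stays in the foreground, so keep it running and open a second terminal for the test that follows. For unattended use, one common approach (an assumption, not part of the original walkthrough) is to background it:

# Run the agent in the background and capture its output in a file.
nohup flume-ng agent --conf ./ --conf-file flume-hdfs.conf --name a1 \
  > flume-agent.out 2>&1 &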

Append a couple of lines to flume-test.log:

[root@zhang conf]# echo hello world! 2024 >> /var/log/flume-test.log
[root@zhang conf]# echo hello world! 2024 >> /var/log/flume-test.log
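For a steadier stream of test data, a simple generator loop works as well (a hypothetical helper, not from the original walkthrough):

# Append a timestamped line every second; stop with Ctrl+C.
while true; do
  echo "test event $(date +%s)" >> /var/log/flume-test.log
  sleep 1
done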

New log lines then appear in the Flume console window:

[Figure: Flume console output after appending the lines]

Verify the Files Reached HDFS

With the Flume agent running, list the files on HDFS to verify the data was uploaded:

[root@zhang conf]# hdfs dfs -ls /flume/logs/
Found 16 items
-rw-r--r--   1 root supergroup        499 2024-07-13 08:09 /flume/logs/FlumeData.1720829337367
-rw-r--r--   1 root supergroup        152 2024-07-13 08:09 /flume/logs/FlumeData.1720829337368
-rw-r--r--   1 root supergroup        223 2024-07-13 08:11 /flume/logs/FlumeData.1720829454692
-rw-r--r--   1 root supergroup        152 2024-07-13 08:12 /flume/logs/FlumeData.1720829497524
-rw-r--r--   1 root supergroup        209 2024-07-13 08:13 /flume/logs/FlumeData.1720829603294
-rw-r--r--   1 root supergroup        209 2024-07-13 11:32 /flume/logs/FlumeData.1720841536911
-rw-r--r--   1 root supergroup        152 2024-07-13 11:33 /flume/logs/FlumeData.1720841576880
-rw-r--r--   1 root supergroup        152 2024-07-13 11:39 /flume/logs/FlumeData.1720841955226
-rw-r--r--   1 root supergroup        209 2024-07-13 12:06 /flume/logs/FlumeData.1720843583557
-rw-r--r--   1 root supergroup        152 2024-07-13 12:14 /flume/logs/FlumeData.1720844061038
-rw-r--r--   1 root supergroup        152 2024-07-13 16:19 /flume/logs/FlumeData.1720858723069
-rw-r--r--   1 root supergroup        499 2024-07-13 16:23 /flume/logs/FlumeData.1720858983761
-rw-r--r--   1 root supergroup        152 2024-07-13 16:23 /flume/logs/FlumeData.1720858983762
-rw-r--r--   1 root supergroup        499 2024-07-14 05:15 /flume/logs/FlumeData.1720905309221
-rw-r--r--   1 root supergroup        152 2024-07-14 05:15 /flume/logs/FlumeData.1720905309222
-rw-r--r--   1 root supergroup        280 2024-07-14 05:16 /flume/logs/FlumeData.1720905354484

All transferred files appear under /flume/logs/.

[Figure: HDFS file listing for /flume/logs/]

They can also be viewed in the web UI:

[Figure: HDFS web UI browsing /flume/logs/]
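To inspect a file's contents, cat it directly. Note that with the sink's default hdfs.fileType (SequenceFile) the text is wrapped in binary framing; setting hdfs.fileType = DataStream, as sketched earlier, yields plain text:

# Replace the timestamp with a file name from your own listing.
hdfs dfs -cat /flume/logs/FlumeData.1720905354484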

Additional Notes

The configuration above uses a memory channel, but it can be swapped for a file channel. The drawback of the memory channel is that buffered events are lost if the agent is interrupted; the file channel persists events to disk instead.

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source (exec source; bind/port properties do not apply).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/flume-test.log

# Configure the sink.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://zhang:9000/flume/logs/

# Configure the channel: file instead of memory.
a1.channels.c1.type = file
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
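By default the file channel keeps its checkpoint and data under /root/.flume/file-channel/, which is exactly what the startup log above shows. To place them explicitly, the usual properties are (the paths here are illustrative, not from the original setup):

# Optional: explicit file channel directories (illustrative paths).
a1.channels.c1.checkpointDir = /opt/server/flume/checkpoint
a1.channels.c1.dataDirs = /opt/server/flume/data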

Save and exit with :wq!, then test again:

[root@zhang conf]# echo hello world! 2024 >> /var/log/flume-test.log
[root@zhang conf]# 

The live Flume log shows:

[Figure: Flume console output with the file channel]

And in the web UI:

[Figure: HDFS web UI]

Deployment is now complete and the test succeeded.
Reference: Flume+Hadoop: Build Your Big Data Processing Pipeline
