Big Data: Flume

平揽星尘

Article link: https://blog.csdn.net/qq_15764943/article/details/90518845

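Introduction

1. Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data.
2. Flume can collect source data in many forms, such as files, directories, socket packets, and Kafka, and it can deliver (sink) the collected data to many external storage systems, such as HDFS, HBase, Hive, and Kafka.
3. Ordinary collection requirements can be met with simple Flume configuration.
4. Flume also offers good custom extensibility for special scenarios, so it suits most everyday data collection work.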

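Architecture

[Figure: Flume architecture diagram]

Examples

Listening on a Network Port

Create a new configuration file (the collection plan) in Flume's conf directory.
The agent runs a local network listener. Other clients send data to Flume in real time with netcat (192.168.237.131, port 8888), and the agent writes the events to its log.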

# Name the agent and its source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Define the source
a1.sources.r1.type = netcat
# Bind to all interfaces so that other hosts can connect;
# binding to localhost would accept local connections only
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8888

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Define the sink
a1.sinks.k1.type = logger

# Wire up the source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

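Collecting a Directory into HDFS

1. Requirement
New files keep appearing in a particular directory on a server. Whenever a new file appears, it must be collected into HDFS.

2. Analysis
Based on the requirement, first define the following three elements:

Data source component (source), watching a file directory: spooldir
1. It watches a directory and collects the contents of every new file that appears in it.
2. Once a file has been fully collected, the agent renames it by appending the suffix .COMPLETED.
3. A file name must never be reused within the watched directory.
Sink component (sink), the HDFS file system: hdfs sink
Channel component (channel): either a file channel or a memory channel can be used

3. Flume configuration file
cd /export/servers/apache-flume-1.8.0-bin/conf
mkdir -p /export/servers/dirfile
vim spooldir.conf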

# Name the components on this agent 
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## Note: never drop a file into the watched directory under a name that was already used
a1.sources.r1.type = spooldir 
a1.sources.r1.spoolDir = /export/servers/dirfile 
a1.sources.r1.fileHeader = true 
# Describe the sink 
a1.sinks.k1.type = hdfs 
a1.sinks.k1.channel = c1 
a1.sinks.k1.hdfs.path = hdfs://node01:8020/spooldir/files/%y-%m-%d/%H%M/ 
a1.sinks.k1.hdfs.filePrefix = events- 
a1.sinks.k1.hdfs.round = true 
a1.sinks.k1.hdfs.roundValue = 10 
a1.sinks.k1.hdfs.roundUnit = minute 
a1.sinks.k1.hdfs.rollInterval = 3 
a1.sinks.k1.hdfs.rollSize = 20 
a1.sinks.k1.hdfs.rollCount = 5 
a1.sinks.k1.hdfs.batchSize = 1 
a1.sinks.k1.hdfs.useLocalTimeStamp = true 
# File type to generate. The default is SequenceFile; DataStream writes plain text.
a1.sinks.k1.hdfs.fileType = DataStream 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

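Channel parameters
capacity: the maximum number of events the channel can hold
transactionCapacity: the maximum number of events taken from a source or delivered to a sink in one transaction
keep-alive: the timeout, in seconds, for adding an event to the channel or removing one from it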
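
One constraint follows from these definitions: a memory channel's transactionCapacity cannot exceed its capacity, and Flume refuses to configure a channel that violates this. The sketch below is a hypothetical pre-flight check (check_channel_caps is not part of Flume) that reads both values from a configuration file. When a key is absent it falls back to 100, which is assumed here to be the memory-channel default for both keys.

```shell
# check_channel_caps <conf-file> <agent> <channel>
# Hypothetical helper, not part of Flume. Returns 0 when
# transactionCapacity <= capacity for the given channel.
check_channel_caps() (
    conf="$1"
    prefix="$2.channels.$3"

    # Print the value of the last "key = value" line whose key matches exactly.
    get() {
        awk -F= -v key="$1" '
            { k = $1; gsub(/[ \t]/, "", k) }
            k == key { v = $2; gsub(/[ \t\r]/, "", v); val = v }
            END { print val }
        ' "$conf"
    }

    cap="$(get "$prefix.capacity")"
    tx="$(get "$prefix.transactionCapacity")"
    # Assumed defaults for a memory channel: 100 for both keys.
    cap="${cap:-100}"
    tx="${tx:-100}"

    if [ "$tx" -le "$cap" ]; then
        echo "OK: transactionCapacity=$tx capacity=$cap"
    else
        echo "ERROR: transactionCapacity=$tx exceeds capacity=$cap" >&2
        return 1
    fi
)
```

Example call, using the file and names from this section: check_channel_caps conf/spooldir.conf a1 c1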

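4. Start Flume
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console

5. Upload files to the watched directory
cd /export/servers/dirfile

Note: upload different files into this directory and make sure that no two files share a name.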
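
Because the spooldir source treats a reused file name as an error, and a file should no longer change once it sits in the watched directory, it helps to stage each upload elsewhere and then move it in under a unique name. A minimal sketch, assuming a hypothetical helper drop_into_spool and a staging directory on the same filesystem as the watched directory (both are illustrations, not part of Flume):

```shell
# drop_into_spool <source-file> <spool-dir> <staging-dir>
# Hypothetical helper, not part of Flume. Copies the file into the staging
# directory, then moves it into the spool directory under a name that is
# not yet used there. Prints the chosen name.
drop_into_spool() (
    src="$1"; spool="$2"; staging="$3"
    base="$(basename "$src")"
    stamp="$(date +%Y%m%d%H%M%S)"

    # Append a counter until the name is free, also checking the name the
    # file would carry after Flume marks it with the .COMPLETED suffix.
    n=0
    target="$base.$stamp.$n"
    while [ -e "$spool/$target" ] || [ -e "$spool/$target.COMPLETED" ]; do
        n=$((n + 1))
        target="$base.$stamp.$n"
    done

    cp "$src" "$staging/$target" || return 1
    # mv within one filesystem is a rename, so the Flume source never
    # sees a half-written file.
    mv "$staging/$target" "$spool/$target" || return 1
    printf '%s\n' "$target"
)
```

A call might look like drop_into_spool /tmp/app.log /export/servers/dirfile /export/servers/dirfile_staging, where the staging path is an assumption. The sketch does not guard against several uploaders running at the same moment.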

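Collecting a File into HDFS

1. Requirement
For example, a business system writes its logs with log4j and the log keeps growing. The data appended to the log file must be collected into HDFS in real time.

2. Analysis
Based on the requirement, first define the following three elements:

Source, watching a file for appended content: exec 'tail -F file'.
Sink, the HDFS file system: hdfs sink.
Channel, the link between source and sink: either a file channel or a memory channel can be used.

3. Create the configuration file
cd /export/servers/apache-flume-1.8.0-bin/conf
vim tail-file.conf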

agent1.sources = source1 
agent1.sinks = sink1 
agent1.channels = channel1 
 
# Describe/configure tail -F source1 
agent1.sources.source1.type = exec 
agent1.sources.source1.command = tail -F /export/servers/taillogs/access_log 
agent1.sources.source1.channels = channel1 
 
 
# Describe sink1 
agent1.sinks.sink1.type = hdfs 
#a1.sinks.k1.channel = c1 
agent1.sinks.sink1.hdfs.path = hdfs://node01:8020/weblog/flume-collection/%y-%m-%d/%H-% 
agent1.sinks.sink1.hdfs.filePrefix = access_log 
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000 
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
 
agent1.sinks.sink1.hdfs.round = true 
agent1.sinks.sink1.hdfs.roundValue = 10 
agent1.sinks.sink1.hdfs.roundUnit = minute 
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true 
 
# Use a channel which buffers events in memory 
agent1.channels.channel1.type = memory 
agent1.channels.channel1.keep-alive = 120 
agent1.channels.channel1.capacity = 500000 
agent1.channels.channel1.transactionCapacity = 600 
 
# Bind the source and sink to the channel 
agent1.sources.source1.channels = channel1 
agent1.sinks.sink1.channel = channel1

4. Start Flume
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -c conf -f conf/tail-file.conf -n agent1 -Dflume.root.logger=INFO,console

5. Write a shell script that keeps appending to the file
mkdir -p /export/servers/shells/
cd /export/servers/shells/
vim tail-file.sh

#!/bin/bash 
while true 
do  
	date >> /export/servers/taillogs/access_log;   
	sleep 0.5; 
done

6. Start the script

# Create the log directory
mkdir -p /export/servers/taillogs
# Start the script
sh /export/servers/shells/tail-file.sh

Monitoring a File

# Launch command: bin/flume-ng agent -n a2 -f /home/hadoop/a2.conf -c conf -Dflume.root.logger=INFO,console
# Name the agent and its source, channel, and sink
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Define the source: follow a log file
a2.sources.r1.type = exec
a2.sources.r1.command = tail -F /home/hadoop/a.log

# Define the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Define the sink: write to the log
a2.sinks.k1.type = logger

# Wire up the source, channel, and sink
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

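Chaining Agents

1. Requirement
[Figure: requirement diagram for chaining two agents]
2. Analysis
The first agent collects the data in a file and sends it over the network to the second agent.
The second agent receives the data sent by the first agent and saves it to HDFS.

3. Install Flume on node02
Copy the unpacked Flume directory from node03 to node02:

cd /export/servers
scp -r apache-flume-1.8.0-bin/ node02:$PWD

4. Configure Flume on node02
Write the Flume configuration on node02:
cd /export/servers/apache-flume-1.8.0-bin/conf
vim tail-avro-avro-logger.conf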

# Name the components on this agent 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
# Describe/configure the source 
a1.sources.r1.type = exec 
a1.sources.r1.command = tail -F /export/servers/taillogs/access_log 
a1.sources.r1.channels = c1 
# Describe the sink 
## The avro sink is the sender of the data
a1.sinks.k1.type = avro 
a1.sinks.k1.channel = c1 
a1.sinks.k1.hostname = 192.168.174.120 
a1.sinks.k1.port = 4141 
a1.sinks.k1.batch-size = 10 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

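5. Provide the script that writes data to the file
The script and data directories on node03 can simply be copied to node02. Run the following on node03:
cd /export/servers
scp -r shells/ taillogs/ node02:$PWD

6. Flume configuration file on node03
Write the Flume configuration file on node03:
cd /export/servers/apache-flume-1.8.0-bin/conf
vim avro-hdfs.conf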

# Name the components on this agent 
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## The avro source is the receiving service
a1.sources.r1.type = avro 
a1.sources.r1.channels = c1 
a1.sources.r1.bind = 192.168.174.120 
a1.sources.r1.port = 4141 
# Describe the sink 
a1.sinks.k1.type = hdfs 
a1.sinks.k1.hdfs.path = hdfs://node01:8020/av/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events- 
a1.sinks.k1.hdfs.round = true 
a1.sinks.k1.hdfs.roundValue = 10 
a1.sinks.k1.hdfs.roundUnit = minute 
a1.sinks.k1.hdfs.rollInterval = 3 
a1.sinks.k1.hdfs.rollSize = 20 
a1.sinks.k1.hdfs.rollCount = 5 
a1.sinks.k1.hdfs.batchSize = 1 
a1.sinks.k1.hdfs.useLocalTimeStamp = true 
# File type to generate. The default is SequenceFile; DataStream writes plain text.
a1.sinks.k1.hdfs.fileType = DataStream 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

7. Start in order

Start the Flume process on node03:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

Start the Flume process on node02:
cd /export/servers/apache-flume-1.8.0-bin/
bin/flume-ng agent -c conf -f conf/tail-avro-avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

Start the shell script on node02 to generate data:
cd /export/servers/shells
sh tail-file.sh

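High Availability

With the single-node Flume NG setup complete, we now build a highly available Flume NG cluster. The architecture is shown below:
[Figure: high-availability Flume NG cluster architecture]

Installing and Configuring node01

Copy the Flume installation directory and the two directories used for generating files from node03 to node01.

Run the following commands on node03:

cd /export/servers
scp -r apache-flume-1.8.0-bin/ node01:$PWD
scp -r shells/ taillogs/ node01:$PWD

Write the agent configuration file on node01:

cd /export/servers/apache-flume-1.8.0-bin/conf
vim agent.conf
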

#agent1 name 
agent1.channels = c1 
agent1.sources = r1 
agent1.sinks = k1 k2 
# ##set group
agent1.sinkgroups = g1
 
agent1.sources.r1.channels = c1 
agent1.sources.r1.type = exec 
agent1.sources.r1.command = tail -F /export/servers/taillogs/access_log 
# ##set channel 
agent1.channels.c1.type = memory 
agent1.channels.c1.capacity = 1000 
agent1.channels.c1.transactionCapacity = 100 
# ## set sink1 
agent1.sinks.k1.channel = c1 
agent1.sinks.k1.type = avro 
agent1.sinks.k1.hostname = node02 
agent1.sinks.k1.port = 52020 
# ## set sink2 
agent1.sinks.k2.channel = c1 
agent1.sinks.k2.type = avro 
agent1.sinks.k2.hostname = node03 
agent1.sinks.k2.port = 52020 
# ##set sink group 
agent1.sinkgroups.g1.sinks = k1 k2 
# ##set failover 
agent1.sinkgroups.g1.processor.type = failover 
agent1.sinkgroups.g1.processor.priority.k1 = 10 
agent1.sinkgroups.g1.processor.priority.k2 = 1 
agent1.sinkgroups.g1.processor.maxpenalty = 10000

Configuring the Flume Collectors on node02 and node03

Edit the configuration file on node02:

cd /export/servers/apache-flume-1.8.0-bin/conf
vim collector.conf

#set Agent name 
a1.sources = r1
a1.channels = c1
a1.sinks = k1 
# ##set channel 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
# ## other node,nna to nns 
a1.sources.r1.type = avro 
a1.sources.r1.bind = node02 
a1.sources.r1.port = 52020 
a1.sources.r1.channels = c1 
# ##set sink to hdfs 
a1.sinks.k1.type=hdfs 
a1.sinks.k1.hdfs.path= hdfs://node01:8020/flume/failover/ 
a1.sinks.k1.hdfs.fileType=DataStream 
a1.sinks.k1.hdfs.writeFormat=TEXT 
a1.sinks.k1.hdfs.rollInterval=10 
a1.sinks.k1.channel=c1 
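a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
# The date escapes in filePrefix need a timestamp; use the collector's local time
a1.sinks.k1.hdfs.useLocalTimeStamp=true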

Edit the configuration file on node03:

cd /export/servers/apache-flume-1.8.0-bin/conf
vim collector.conf

#set Agent name 
a1.sources = r1 
a1.channels = c1 
a1.sinks = k1 
# ##set channel 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
# ## other node,nna to nns 
a1.sources.r1.type = avro 
a1.sources.r1.bind = node03 
a1.sources.r1.port = 52020 
a1.sources.r1.channels = c1 
# ##set sink to hdfs 
a1.sinks.k1.type=hdfs 
a1.sinks.k1.hdfs.path= hdfs://node01:8020/flume/failover/ 
a1.sinks.k1.hdfs.fileType=DataStream 
a1.sinks.k1.hdfs.writeFormat=TEXT 
a1.sinks.k1.hdfs.rollInterval=10 
a1.sinks.k1.channel=c1 
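a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
# The date escapes in filePrefix need a timestamp; use the collector's local time
a1.sinks.k1.hdfs.useLocalTimeStamp=true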

Start in order

Start Flume on node03:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

Start Flume on node02:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -n a1 -c conf -f conf/collector.conf -Dflume.root.logger=DEBUG,console

Start Flume on node01:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -n agent1 -c conf -f conf/agent.conf -Dflume.root.logger=DEBUG,console

Start the file-generating script on node01:
cd /export/servers/shells
sh tail-file.sh

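Failover Test

Next we test the high availability (failover) of the Flume NG cluster. The scenario is as follows. We upload files on the Agent1 node. Because Collector1 is configured with a higher priority than Collector2, Collector1 collects the data and uploads it to the storage system first. We then kill Collector1, at which point Collector2 takes over collecting and uploading the logs. After that, we manually restore the Flume service on Collector1 and upload files on Agent1 again, and Collector1 resumes the work at its higher priority. The steps are:

1. Collector1 uploads first.
2. Preview the uploaded log content in the HDFS cluster.
3. Collector1 goes down and Collector2 takes over uploading.
4. Restart the Collector1 service; Collector1 regains upload priority.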
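
The behaviour in this test follows from a simple selection rule: among the sinks that are currently alive, the failover processor uses the one with the highest priority. The sketch below models only that rule. The function pick_sink and its "name priority state" input format are hypothetical, and the real processor additionally keeps failed sinks in a cool-down period bounded by maxpenalty.

```shell
# pick_sink: reads lines of "<name> <priority> <up|down>" on stdin and
# prints the name of the live sink with the highest priority (nothing if
# every sink is down). Hypothetical model, not Flume code.
pick_sink() {
    awk '
        $3 == "up" && (best == "" || $2 + 0 > max) { best = $1; max = $2 + 0 }
        END { if (best != "") print best }
    '
}

# Both collectors alive: k1 (priority 10) wins over k2 (priority 1).
printf 'k1 10 up\nk2 1 up\n' | pick_sink      # prints k1
# Collector1 killed: k2 takes over.
printf 'k1 10 down\nk2 1 up\n' | pick_sink    # prints k2
```

The priorities 10 and 1 are the values set for k1 and k2 in agent.conf.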

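Load Balancing in Flume

Load balancing addresses the situation where a single machine (a single process) cannot handle all requests. The Load balancing Sink Processor provides this capability. In the figure below, Agent1 is a routing node that distributes the events buffered in the channel evenly across several sink components, and each sink component connects to a separate agent. An example configuration follows.
[Figure: load-balancing topology, with Agent1 routing events to several downstream agents]
Here we use three machines to simulate Flume load balancing.
The three machines are planned as follows:

node01: collects data and sends it to node02 and node03
node02: receives part of the data from node01
node03: receives part of the data from node01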
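
With a round-robin selector, successive deliveries go to the sinks in turn. A toy model of that rotation, assuming one event per input line (the function round_robin is hypothetical; the real processor picks a sink each time it runs and can back off from sinks that fail):

```shell
# round_robin <sink>... : reads events on stdin (one per line) and prints
# "<sink> <event>", cycling through the sinks in order. Hypothetical model.
round_robin() {
    awk -v sinks="$*" '
        BEGIN { n = split(sinks, s, " ") }
        { print s[(NR - 1) % n + 1], $0 }
    '
}

printf 'e1\ne2\ne3\ne4\n' | round_robin k1 k2
# prints:
# k1 e1
# k2 e2
# k1 e3
# k2 e4
```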

Flume configuration for node01

node01 configuration:

cd /export/servers/apache-flume-1.8.0-bin/conf
vim load_banlancer_client.conf

# agent name 
a1.channels = c1 
a1.sources = r1 
a1.sinks = k1 k2 

# set group
a1.sinkgroups = g1 
 
# set channel 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
a1.sources.r1.channels = c1 
a1.sources.r1.type = exec 
a1.sources.r1.command = tail -F /export/servers/taillogs/access_log 
 
# set sink1 
a1.sinks.k1.channel = c1 
a1.sinks.k1.type = avro 
a1.sinks.k1.hostname = node02 
a1.sinks.k1.port = 52020 
 
# set sink2 
a1.sinks.k2.channel = c1 
a1.sinks.k2.type = avro 
a1.sinks.k2.hostname = node03 
a1.sinks.k2.port = 52020 
 
# set sink group 
a1.sinkgroups.g1.sinks = k1 k2 
 
# set load balancing
a1.sinkgroups.g1.processor.type = load_balance 
a1.sinkgroups.g1.processor.backoff = true 
a1.sinkgroups.g1.processor.selector = round_robin 
a1.sinkgroups.g1.processor.selector.maxTimeOut=10000

Flume configuration for node02

cd /export/servers/apache-flume-1.8.0-bin/conf
vim load_banlancer_server.conf

# Name the components on this agent 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
 
# Describe/configure the source 
a1.sources.r1.type = avro 
a1.sources.r1.channels = c1 
a1.sources.r1.bind = node02 
a1.sources.r1.port = 52020 
 
# Describe the sink 
a1.sinks.k1.type = logger 
  
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

Flume configuration for node03

node03 configuration:

cd /export/servers/apache-flume-1.8.0-bin/conf
vim load_banlancer_server.conf

# Name the components on this agent 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 

# Describe/configure the source 
a1.sources.r1.type = avro 
a1.sources.r1.channels = c1 
a1.sources.r1.bind = node03 
a1.sources.r1.port = 52020 

# Describe the sink 
a1.sinks.k1.type = logger 

# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 

# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

Starting the Flume Services

Start Flume on node03:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=INFO,console

Start Flume on node02:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_server.conf -Dflume.root.logger=INFO,console

Start Flume on node01:
cd /export/servers/apache-flume-1.8.0-bin
bin/flume-ng agent -n a1 -c conf -f conf/load_banlancer_client.conf -Dflume.root.logger=INFO,console

Run the script on node01 to generate data:

cd /export/servers/shells
sh tail-file.sh

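A Flume Case Study

Scenario

Two log servers, A and B, produce logs in real time. The main log types are access.log, nginx.log, and web.log.
The requirement:

Collect access.log, nginx.log, and web.log from machines A and B onto machine C, then store them together in HDFS. The required directory layout in HDFS is:

/source/logs/access/20180101/**
/source/logs/nginx/20180101/**
/source/logs/web/20180101/**

Scenario Analysis

[Figure: scenario analysis diagram]

Data Flow Analysis

[Figure: data flow diagram]

Implementation

Server A has the IP 192.168.174.100
Server B has the IP 192.168.174.110
Server C has the IP 192.168.174.120

Configuration file for the collecting agents
Write the Flume configuration file on node01 and node02:

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim exec_source_avro_sink.conf
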
# Name the components on this agent
a1.sources = r1 r2 r3 
a1.sinks = k1 
a1.channels = c1 

# Describe/configure the source 
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /export/servers/taillogs/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
## The static interceptor inserts a user-defined key-value pair
## into the header of every collected event
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /export/servers/taillogs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /export/servers/taillogs/web.log
a1.sources.r3.interceptors = i3 
a1.sources.r3.interceptors.i3.type = static 
a1.sources.r3.interceptors.i3.key = type 
a1.sources.r3.interceptors.i3.value = web 
 
# Describe the sink 
a1.sinks.k1.type = avro 
a1.sinks.k1.hostname = node03 
a1.sinks.k1.port = 41414 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 20000 
a1.channels.c1.transactionCapacity = 10000 
 
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1 
a1.sources.r3.channels = c1 
a1.sinks.k1.channel = c1

Server-side configuration file
Write the Flume configuration file on node03:

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
vim avro_source_hdfs_sink.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.174.120
a1.sources.r1.port = 41414

# Add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# Define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.174.100:8020/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# Use local time for the time escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0

# Roll files by time (seconds)
a1.sinks.k1.hdfs.rollInterval = 30

# Roll files by size (bytes)
a1.sinks.k1.hdfs.rollSize = 10485760

# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 10000

# Number of threads Flume uses for HDFS operations (create, write, and so on)
a1.sinks.k1.hdfs.threadsPoolSize = 10

# Timeout for HDFS operations (milliseconds)
a1.sinks.k1.hdfs.callTimeout = 30000

# Wire up the source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
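
The sink path in this configuration is resolved per event: %{type} is replaced by the value of the type header that the static interceptor sets on the collecting agents, and %Y%m%d by the date. A rough shell equivalent, for illustration only (hdfs_dir is a hypothetical function, and it uses the current local date, mirroring useLocalTimeStamp = true):

```shell
# hdfs_dir <type>: prints the directory that an event carrying the given
# "type" header would land in today. Hypothetical illustration of how the
# %{type} and %Y%m%d escapes expand; not Flume code.
hdfs_dir() {
    printf '/source/logs/%s/%s\n' "$1" "$(date +%Y%m%d)"
}

hdfs_dir access
hdfs_dir nginx
hdfs_dir web
```

The three calls correspond to the three values (access, nginx, web) assigned in exec_source_avro_sink.conf, which is how one sink produces the three required directory trees.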

Log-generation script for the collecting agents
Write a shell script on node01 and node02 to simulate data generation:

cd /export/servers/shells
vim server.sh

#!/bin/bash
while true
do
    date >> /export/servers/taillogs/access.log
    date >> /export/servers/taillogs/web.log
    date >> /export/servers/taillogs/nginx.log
    sleep 0.5
done

Start the services in order
Start Flume on node03 to receive the data:

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin
bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console

Start Flume on node01 and node02 to monitor the logs:

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin
bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console

Start the log-generation script on node01 and node02:

cd /export/servers/shells
sh server.sh
