Getting Started with Flume

Flume
  A framework developed by Cloudera, Flume is an excellent log-collection component, similar to Logstash. It is typically deployed as an agent on application servers to collect local log files.
  It collects data in (near) real time.
  
  Typical log sources: apache/nginx
  
  Typical pipeline: Kafka/Flume + Storm/Spark (Spark Streaming)
  
  Web Server -> Source -> Channel -> Sink -> HDFS
                -----------Agent----------
  Source  Collects data; this is where the data stream is produced, and it passes the stream on to the Channel.
  Channel Connects the Source and the Sink; works like a queue.
  Sink    Pulls data from the Channel and writes it to the destination, which can be the next agent's Source, HDFS, or HBase.

  • source: the data input end; it specifies where Flume collects its data (stream) from. Flume supports many source types, e.g. "Avro Source" (RPC-like, receives data entities sent by remote Avro clients), "Thrift Source" (data sent by Thrift clients), "Exec Source" (lines returned by a Linux command), "Kafka Source", "Syslog Source", "HTTP Source", and so on.

    This article mainly uses two of them, Spooling Directory and Taildir. Taildir is new in Flume 1.7; before that, tail-like behavior had to be emulated with an "Exec Source" or implemented in custom code (see the sketch after this list).

  • channel: simply put, a buffer for the data stream. Several sources can feed one channel, and inside the channel data can be cached, spilled to temporary storage, and traffic-shaped. Flume currently provides "Memory Channel" (data held in a bounded amount of memory), "JDBC Channel" (data staged in a database so it can be recovered), "Kafka Channel" (staged in Kafka), and "File Channel" (staged in local files). All channels except Memory support persistence, which guards against message loss and absorbs traffic bursts during failure recovery, or when the sink is offline or absent.

  • sink: the output end of the stream. Each channel can be wired to a sink, and each sink targets one type of storage. Commonly used sink types include "HDFS Sink" (stores data in HDFS), "Hive Sink", "Logger Sink" (a special case that writes events at INFO level to the console, usually for testing), "Avro Sink", "Thrift Sink", "File Roll Sink" (rolls data into the local file system), and so on.
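  For reference, a minimal sketch of a Taildir source definition (requires Flume 1.7 or later, so it is not available in the flume-1.5.0-cdh5.3.6 build used below; the agent/source names and paths are illustrative):

  ###define a taildir source (Flume >= 1.7)
  a1.sources.r1.type = TAILDIR
  # JSON file where Flume records the last read position of each tailed file
  a1.sources.r1.positionFile = /opt/flume/taildir_position.json
  a1.sources.r1.filegroups = f1
  # tail every .log file under this directory
  a1.sources.r1.filegroups.f1 = /opt/logs/.*\\.log
  a1.sources.r1.channels = c1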

  Runs wherever the logs are
  OS: Linux
  Requires a JVM/JDK
  Lightweight (comparable to e.g. ZooKeeper)
 
Installation
  vi flume-env.sh
  export JAVA_HOME=/opt/jdk1.8.0_171
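  A quick way to verify the installation after setting JAVA_HOME is to print the version info:
  bin/flume-ng version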
  

# Main command-line options
[root@hadoop-senior01 flume-1.5.0-cdh5.3.6]# bin/flume-ng 

Usage: bin/flume-ng <command> [options]...

commands:
  agent                     run a Flume agent

global options:
  --conf,-c <conf>          use configs in <conf> directory
  -Dproperty=value          sets a Java system property value

agent options:
  --name,-n <name>          the name of this agent (required)
  --conf-file,-f <file>     specify a config file (required if -z missing)

  
-----------------------------------------------------------------------------------
Case 1: netcat source -> memory channel -> logger sink
bin/flume-ng agent --conf conf --name agent-test --conf-file test.conf

bin/flume-ng agent -c conf -n agent-test -f test.conf

# Write the config file: vi a1.conf
# The configuration file needs to define the sources,
# the channels and the sinks.

###define agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

###define sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop-senior01.zhangbk.com
a1.sources.r1.port = 44444

###define channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

###define sink
a1.sinks.k1.type = logger

###bind the sources and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Run the agent:
bin/flume-ng agent \
-c conf \
-n a1 \
-f conf/a1.conf \
-Dflume.root.logger=DEBUG,console
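
Once the agent is running, you can push test data into the netcat source with telnet. A minimal session sketch, assuming the host and port configured in a1.conf above:

telnet hadoop-senior01.zhangbk.com 44444
hello flume
# each line typed is sent as one event; the agent console should log something like:
#   Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65       hello flume }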


Test: telnet ip port, then type data; each line becomes one event (as sketched above).
-----------------------------------------------------------------------------------
Case 2: collect a running log (Hive's log file)
  source: tail of /opt/hive-0.13.1-cdh5.3.6/logs/hive.log
  channel: memory channel
  sink: HDFS, under
    /user/flume/hive-logs/
  handled by a single Flume agent
  

# The configuration file needs to define the sources,
# the channels and the sinks.

###define agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2

###define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/hive-0.13.1-cdh5.3.6/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

###define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

###define sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.fileType = DataStream 
a2.sinks.k2.hdfs.path = hdfs://hadoop-senior01.zhangbk.com:8020/user/flume/hive-logs/
#a2.sinks.k2.hdfs.path = hdfs://ns1/user/flume/hive-logs/
a2.sinks.k2.hdfs.writeFormat = Text
a2.sinks.k2.hdfs.batchSize = 10

###bind the sources and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
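
Optionally (not part of the config above), the HDFS sink's roll settings control how often a new file is opened on HDFS; with the defaults (roll every 30 s, 1 KB, or 10 events) you tend to get many small files. A sketch of time/size based rolling instead:
#a2.sinks.k2.hdfs.rollInterval = 600
#a2.sinks.k2.hdfs.rollSize = 134217728
#a2.sinks.k2.hdfs.rollCount = 0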

Run the agent:
bin/flume-ng agent \
-c conf \
-n a2 \
-f conf/flume-tail.conf \
-Dflume.root.logger=DEBUG,console

Note: writing to HDFS requires the following Hadoop jars on Flume's classpath:
/opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/hdfs/hadoop-hdfs-2.5.0-cdh5.3.6.jar
/opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/hadoop-common-2.5.0-cdh5.3.6.jar
/opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/lib/hadoop-auth-2.5.0-cdh5.3.6.jar
/opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/lib/commons-configuration-1.6.jar
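
One way to get them on the classpath (a sketch; alternatively FLUME_CLASSPATH can be set in flume-env.sh) is to copy them into Flume's lib directory:

cp /opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/hdfs/hadoop-hdfs-2.5.0-cdh5.3.6.jar \
   /opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/hadoop-common-2.5.0-cdh5.3.6.jar \
   /opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/lib/hadoop-auth-2.5.0-cdh5.3.6.jar \
   /opt/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/lib/commons-configuration-1.6.jar \
   /opt/flume-1.5.0-cdh5.3.6/lib/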
--------------------------------------------------------------------------------------
  
Case 3: monitor a directory of log files (spooling directory source)
  the application's log4j rolls its log files by size (see the sketch below)
  File Channel for durable buffering
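
A minimal log4j 1.x sketch for the producing application's appender (hypothetical paths; any mechanism that places finished log files in the spooled directory works). The actively written file keeps the .log suffix, which the ignorePattern below excludes, so only completed rolled files are collected:

# log4j.properties of the application producing the logs
log4j.rootLogger=INFO, R
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=/opt/flume-1.5.0-cdh5.3.6/spoollogs/app.log
# roll to app.log.1, app.log.2, ... once the file reaches 10 MB
log4j.appender.R.MaxFileSize=10MB
log4j.appender.R.MaxBackupIndex=5
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d %p %c - %m%n

One caveat: the spooling directory source expects files not to be renamed or modified after it starts reading them, so in practice completed files are often rolled outside the spool directory and then moved in by a separate step (e.g. a cron job).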
  

# The configuration file needs to define the sources,
# the channels and the sinks.

###define agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3

###define sources
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/flume-1.5.0-cdh5.3.6/spoollogs
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.fileSuffix = .delete

###define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /opt/flume-1.5.0-cdh5.3.6/filechannel/checkpoint
a3.channels.c3.dataDirs = /opt/flume-1.5.0-cdh5.3.6/filechannel/data

###define sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.path = hdfs://hadoop-senior01.zhangbk.com:8020/user/flume/splogs/%Y%m%d
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#a3.sinks.k3.hdfs.path = hdfs://ns1/user/flume/hive-logs/
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10

###bind the sources and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Run the agent:
bin/flume-ng agent \
-c conf \
-n a3 \
-f conf/flume-app.conf \
-Dflume.root.logger=DEBUG,console
-------------------------------------------------------------------------------------------------
