A Beginner's Data Warehouse Diary, Day 1: Flume

Author: 兰翎翡竹

Tags: flume

Article link: https://blog.csdn.net/qq_42515611/article/details/118944548

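Key Flume concepts:
Event: Flume's wrapper around data. One record is one event, and an event consists of headers and a body.
Agent: a Flume process, made up of Source, Channel, and Sink components.
Source: determines which kind of data source the data is received from (file, TCP, HTTP).
Channel: the conduit that carries data while it is in transit (memory, file).
Sink: determines where the data is written (HDFS, MySQL, Kafka).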
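
To make the Event idea concrete, here is a minimal Python sketch of an event as a header map plus a byte body. The `Event` class and the `make_event` helper are illustrative names chosen for this sketch, not Flume's Java API, and the `host` header is just an example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Toy stand-in for a Flume event: string headers plus an opaque byte body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def make_event(line: str, **headers: str) -> Event:
    # One record (here, one line of text) becomes exactly one event.
    return Event(body=line.encode("utf-8"), headers=dict(headers))


event = make_event("hello world", host="sh01")
print(event.headers, event.body)
```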

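Components and their roles
- Client:
   The client produces the data and runs in its own thread.
- Event:
   A unit of data made up of headers and a body. (Events can be log records, Avro objects, and so on.)
- Flow:
   An abstraction of an event's movement from its origin to its destination.
- Agent:
   An independent Flume process running in a JVM, containing Source, Channel, and Sink components.
   Typically one agent runs per machine, and a single agent can contain multiple sources and sinks.
- Source:
   The data collection component. A source receives data from the client and passes it to a channel.
- Channel:
   The conduit that receives data from the source and buffers it until a sink takes it.
- Sink:
   Pulls data from the channel and delivers it to a persistent store or to the next agent.
- Selector:
   The channel selector works on the source side and decides which channel or channels an event is sent to.
- Interceptor:
   Flume lets interceptors inspect, modify, or drop events, and interceptors can be chained. They are configured on the source and run before events enter the channel.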
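
The way these components hand data to each other can be sketched as a toy pipeline. This is only a simplified model under stated assumptions: a `queue.Queue` stands in for the channel, and `run_source`, `run_sink`, and `timestamp_interceptor` are hypothetical helper names, not Flume classes.

```python
import queue
import time
from typing import Callable, Dict, List, Tuple

Event = Tuple[Dict[str, str], bytes]  # (headers, body)


def timestamp_interceptor(event: Event) -> Event:
    # Interceptors run on the source side, before the event reaches the channel.
    headers, body = event
    return ({**headers, "timestamp": str(int(time.time() * 1000))}, body)


def run_source(lines: List[str], channel: "queue.Queue[Event]",
               interceptors: List[Callable[[Event], Event]]) -> None:
    # The source turns raw records into events and puts them on the channel.
    for line in lines:
        event: Event = ({}, line.encode("utf-8"))
        for intercept in interceptors:
            event = intercept(event)
        channel.put(event)


def run_sink(channel: "queue.Queue[Event]") -> List[bytes]:
    # The sink pulls events off the channel until it is empty.
    delivered = []
    while not channel.empty():
        _headers, body = channel.get()
        delivered.append(body)
    return delivered


channel: "queue.Queue[Event]" = queue.Queue(maxsize=1000)
run_source(["a", "b"], channel, [timestamp_interceptor])
delivered = run_sink(channel)
print(delivered)
```

The channel is passive in this model: the source pushes into it and the sink pulls from it, which is why a slow sink shows up as a filling channel rather than as a blocked source.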

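Several principles matter in the data flow model Flume provides.
Source --> Channel
1. A single source can feed multiple channels, using either a replicating or a multiplexing selector.
2. Multiple sources can write to a single channel.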
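
The difference between the two selector modes can be sketched in a few lines of Python. The function names, the `level` header, and the channel names are assumptions made for illustration; they only model the routing decision, not Flume's implementation.

```python
from typing import Dict, List


def replicating(channels: List[str]) -> List[str]:
    # Replicating: every event is copied to all configured channels.
    return list(channels)


def multiplexing(headers: Dict[str, str], header: str,
                 mapping: Dict[str, List[str]], default: List[str]) -> List[str]:
    # Multiplexing: the value of one header picks the target channels;
    # events with an unmapped or missing value go to the default channels.
    return mapping.get(headers.get(header, ""), default)


mapping = {"error": ["c1"], "info": ["c2"]}
print(replicating(["c1", "c2"]))
print(multiplexing({"level": "error"}, "level", mapping, ["c2"]))
print(multiplexing({}, "level", mapping, ["c2"]))
```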
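Channel --> Sink
1. Multiple sinks can be combined into a sink group that takes data from a channel, which enables load balancing or failover.
2. Multiple sinks can also take data from a single channel.
3. A single sink can take data from only one channel.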
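
As a rough illustration of what a sink group does, the sketch below models failover (try the highest-priority sink first) and a round-robin form of load balancing. All names here (`failover_deliver`, `round_robin`, `broken`, the sink names and priorities) are hypothetical and only mimic the behaviour described above.

```python
from itertools import cycle
from typing import Callable, Dict, List

Sink = Callable[[bytes], None]


def failover_deliver(event: bytes, sinks: Dict[str, Sink],
                     priority: Dict[str, int]) -> str:
    # Failover: try sinks from highest to lowest priority; first success wins.
    for name in sorted(sinks, key=lambda n: priority[n], reverse=True):
        try:
            sinks[name](event)
            return name
        except IOError:
            continue  # this sink is down, fall through to the next one
    raise IOError("all sinks failed")


def round_robin(names: List[str]):
    # Load balancing (round-robin flavour): rotate through the sinks.
    return cycle(names)


def broken(_event: bytes) -> None:
    raise IOError("sink k1 is unreachable")


received: List[bytes] = []
used = failover_deliver(b"hello", {"k1": broken, "k2": received.append},
                        {"k1": 10, "k2": 5})
picker = round_robin(["k1", "k2"])
order = [next(picker) for _ in range(4)]
print(used, received, order)
```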

Worked examples:

1. Flume Hello World
exec+memory+console agent (a Flume flow application)
#flume-helloworld.conf

#Create the .conf configuration file in whatever path you like
[root@sh01 ~]# vi flume-helloworld.conf
#agent
#a1 prefixes every property and is the agent name
#Keep comments on their own lines: a trailing "#..." becomes part of the property value
#agent.source
#name the source
a1.sources = r1
a1.sources.r1.type = exec
#file to monitor
a1.sources.r1.command = tail -f /Users/ly/tmp/2102.txt
#a source can feed several channels, separated by spaces
a1.sources.r1.channels = c1

#agent.channel
#name the channel
a1.channels = c1
#or file
a1.channels.c1.type = memory

#agent.sink
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Start the agent

[root@sh01 ~]# flume-ng agent -c /usr/local/flume1.8/conf -f /root/flumeconf/flume-helloworld.conf  -n a1 -Dflume.root.logger=INFO,console

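Parameter              Purpose                                                                         Example
--conf or -c           Configuration directory, containing flume-env.sh and the log4j configuration    --conf …/conf
--conf-file or -f      Path of the agent configuration file                                            --conf-file …/conf/flume.conf
--name or -n           Agent name                                                                      --name a1
-z                     ZooKeeper connection string                                                     -z zkhost:2181,zkhost1:2181
-p                     Storage path prefix in ZooKeeper                                                -p /flume
-Dflume.root.logger    Print the startup log to the current console                                    -Dflume.root.logger=INFO,console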
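
Since each option must be a separate argument, it can help to assemble the command from a list rather than by hand. The sketch below does that in Python; `flume_agent_command` is a hypothetical helper, and it only uses the options listed in the table together with the paths from this article.

```python
import shlex
from typing import List


def flume_agent_command(conf_dir: str, conf_file: str, agent_name: str,
                        log_to_console: bool = True) -> List[str]:
    # Assemble the argv for `flume-ng agent` from the options in the table.
    argv = ["flume-ng", "agent", "--conf", conf_dir,
            "--conf-file", conf_file, "--name", agent_name]
    if log_to_console:
        argv.append("-Dflume.root.logger=INFO,console")
    return argv


argv = flume_agent_command("/usr/local/flume1.8/conf",
                           "/root/flumeconf/flume-helloworld.conf", "a1")
print(shlex.join(argv))
```

Building the list first makes it hard to accidentally glue the configuration file name to the next flag, and the agent name passed here has to match the prefix used inside the configuration file.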

Test data:

[root@sh01 ~]# echo 'hello world' >>/Users/ly/tmp/2102.txt

Result: the agent's console shows the new line as an event printed by the logger sink.

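2. avro+memory+logger
Avro Source: listens on a specified Avro port and receives whatever an Avro client sends to it. As soon as an application sends a file through that port, the source picks up the file's contents. The output destination here is the logger sink.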

[root@sh01~]# vi avro-logger.conf
#names of the components
a1.sources=avro-sour1
a1.channels=mem-chan1
a1.sinks=logger-sink1

#properties of the source
a1.sources.avro-sour1.type=avro
#address to bind; hadoop01 works here because I mapped it to the host's IP
a1.sources.avro-sour1.bind=hadoop01
a1.sources.avro-sour1.port=9999

#properties of the channel
a1.channels.mem-chan1.type=memory

#properties of the sink
a1.sinks.logger-sink1.type=logger
a1.sinks.logger-sink1.maxBytesToLog=100

#bind the components together
a1.sources.avro-sour1.channels=mem-chan1
a1.sinks.logger-sink1.channel=mem-chan1

Start the agent

[root@sh01 ~]# flume-ng agent -c /usr/local/flume1.8/conf -f /root/flumeconf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

Test data

Create test.data
[root@sh01~]# date >> test.data
[root@sh01~]# flume-ng avro-client -c /usr/local/flume1.8/conf/ -H hadoop01 -p 9999 -F test.data

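3. exec+memory+hdfs
Exec Source: runs a specified command and uses that command's output as its data source.
#tail -F file is the usual choice: whenever an application appends to the log file, the source picks up the newest content
memory: the channel carrying the data is a memory channel
hdfs: the output destination is HDFS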

[root@sh01~]#  vi exec-hdfs.conf
a1.sources=r1
a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /root/flumedata/test.data

a1.sinks=k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://hadoop01:8020/flume/tailout/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix=events-
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=second
a1.sinks.k1.hdfs.rollInterval=3
a1.sinks.k1.hdfs.rollSize=20
a1.sinks.k1.hdfs.rollCount=5
a1.sinks.k1.hdfs.batchSize=1
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.fileType=DataStream


a1.channels=c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
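
The three roll settings above are independent triggers, and the sink closes the current file when any one of them is reached. A minimal sketch of that decision, with defaults taken from this configuration (`rollInterval=3`, `rollSize=20`, `rollCount=5`); `should_roll` is a hypothetical name, and the real sink handles the time trigger with a timer rather than a check like this.

```python
def should_roll(open_seconds: float, bytes_written: int, events_written: int,
                roll_interval: int = 3, roll_size: int = 20,
                roll_count: int = 5) -> bool:
    # A value of 0 disables that trigger; otherwise any one trigger is enough.
    by_time = roll_interval > 0 and open_seconds >= roll_interval
    by_size = roll_size > 0 and bytes_written >= roll_size
    by_count = roll_count > 0 and events_written >= roll_count
    return by_time or by_size or by_count


print(should_roll(0.5, 25, 1))   # size trigger
print(should_roll(0.5, 10, 2))   # nothing reached yet
```

With values this small, nearly every event or two produces a new HDFS file, which is convenient for a demo but would create a large number of tiny files in real use.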

Start HDFS first

Start the agent

[root@sh01 ~]# flume-ng agent -c /usr/local/flume1.8/conf -f /root/flumeconf/exec-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

Test data

Write to the test.data file that the source is tailing
[root@sh01 ~]# ping hadoop01 >> /root/flumedata/test.data

Test result:

Look for the output files in HDFS
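
To work out which directory to look in, the time escapes in `hdfs.path` can be expanded by hand. The Python sketch below approximates the rounding and expansion for the settings in this example (`roundValue=10`, `roundUnit=second`); the timestamp is an arbitrary example, and `round_down` and `expand_path` are illustrative helpers rather than Flume functions.

```python
from datetime import datetime


def round_down(ts: datetime, value: int, unit: str) -> datetime:
    # Mimic hdfs.round: drop the timestamp to the last multiple of `value` units.
    if unit == "second":
        return ts.replace(second=ts.second - ts.second % value, microsecond=0)
    if unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % value,
                          second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % value,
                          minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


def expand_path(template: str, ts: datetime) -> str:
    # %y, %m, %d, %H and %M mean the same thing in strftime as in the sink path.
    return ts.strftime(template)


ts = datetime(2021, 7, 20, 14, 57, 16)
rounded = round_down(ts, 10, "second")
path = expand_path("hdfs://hadoop01:8020/flume/tailout/%y-%m-%d/%H%M/", rounded)
print(path)
```

Because the path only goes down to the minute, rounding to 10 seconds does not change the directory name; a new directory appears once per minute. Rounding by `minute` instead would group the output into 10-minute directories.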

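4. exec+memory+logger

Exec Source: runs a specified command and uses that command's output as its data source.

#tail -F file is the usual choice: whenever an application appends to the log file, the source picks up the newest content

logger: events are written out as log entries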

[root@sh01~]#   vi exec-logger.conf
a2.sources = r1 
a2.channels = c1
a2.sinks = s1

a2.sources.r1.type = exec
a2.sources.r1.command = tail -F /home/flume/log.01

a2.channels.c1.type=memory
a2.channels.c1.capacity=1000
a2.channels.c1.transactionCapacity=100
a2.channels.c1.keep-alive=3
a2.channels.c1.byteCapacityBufferPercentage=20
a2.channels.c1.byteCapacity=800000

a2.sinks.s1.type=logger
a2.sinks.s1.maxBytesToLog=30

a2.sources.r1.channels=c1
a2.sinks.s1.channel=c1
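
Two of the numbers in this configuration are easier to read with a little arithmetic. The sketch below assumes that `byteCapacityBufferPercentage` reserves that share of `byteCapacity` for headers and that `maxBytesToLog` caps how much of each body the logger sink prints; the helper names are hypothetical and the calculation is a simplification of what the memory channel does internally.

```python
def usable_body_bytes(byte_capacity: int, buffer_percentage: int) -> int:
    # Reserve buffer_percentage percent of byteCapacity for headers;
    # the remainder is what event bodies may occupy.
    return byte_capacity * (100 - buffer_percentage) // 100


def truncate_for_log(body: bytes, max_bytes_to_log: int) -> bytes:
    # The logger sink prints at most max_bytes_to_log bytes of each body.
    return body[:max_bytes_to_log]


usable = usable_body_bytes(800000, 20)
shown = truncate_for_log(b"x" * 50, 30)
print(usable, len(shown))
```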

Start the agent

[root@sh01 ~]# flume-ng agent -c /usr/local/flume1.8/conf -f /root/flumeconf/exec-logger.conf -n a2 -Dflume.root.logger=INFO,console

Test data

[root@sh01 ~]# echo "nice" >> /home/flume/log.01

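5. spool+file+hdfs
spool: the source is a spooling directory, and any file that lands in the directory is ingested. The file channel stages the events on disk, and the final destination is HDFS.
As long as files keep arriving in the directory, all of their data ends up in HDFS as well.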
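
The spooling behaviour can be modelled in a few lines of Python: read each new file line by line, then rename it so it is not picked up again. This is a simplified sketch, not Flume's implementation; `drain_spool_dir` is a hypothetical name, and the `.COMPLETED` suffix is assumed to match the source's default marker for finished files.

```python
import os
import tempfile
from typing import List


def drain_spool_dir(spool_dir: str, done_suffix: str = ".COMPLETED") -> List[bytes]:
    # Read every not-yet-processed file, emit one event body per line,
    # then rename the file so it is never ingested twice.
    bodies: List[bytes] = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if name.endswith(done_suffix) or not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            bodies.extend(line.rstrip(b"\n") for line in handle)
        os.rename(path, path + done_suffix)
    return bodies


with tempfile.TemporaryDirectory() as spool:
    with open(os.path.join(spool, "a.log"), "wb") as handle:
        handle.write(b"nice\nday\n")
    first = drain_spool_dir(spool)
    second = drain_spool_dir(spool)  # nothing new: a.log was renamed
    remaining = sorted(os.listdir(spool))
print(first, second, remaining)
```

This also shows why files should be complete before they are moved into the directory: once a file has been read and renamed, anything appended to it later is never collected.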

[root@sh01~]#  vi spool-hdfs.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/flume/input/2020/01/


a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/flume/checkpoint
a1.channels.c1.dataDirs = /home/flume/data


a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:8020/flume/spooldir
a1.sinks.k1.hdfs.filePrefix = 
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileSuffix= .log
a1.sinks.k1.hdfs.rollInterval=60
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text


a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start HDFS first

Start the agent

Note: create the directory /home/flume/input/2020/01/ first, otherwise the agent may fail to start
[root@sh01 ~]# flume-ng agent -c /usr/local/flume1.8/conf -f /root/flumeconf/spool-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

Test data

Copy a finished file into the spooling directory
[root@sh01 ~]# cp /home/flume/log.01 /home/flume/input/2020/01/
