Getting Started with Flume

Article link: https://blog.csdn.net/learner_up/article/details/106013598

Official user guide: http://flume.apache.org/FlumeUserGuide.html

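Flume core concepts

Client: produces the data and runs in its own thread.

Agent: a Java process that runs on a log-collection node (a server) and contains three core components: source, channel, and sink.

Source: collects data from the client and can handle log data of many types and formats. The data is wrapped in events and passed to the channel.

Channel: buffers the collected data by holding the events handed over by the source until a sink removes them.

Sink: takes events out of the channel and sends them to a destination. Destinations include HDFS, logger, Avro, Thrift, IPC, file, HBase, Solr, and custom sinks.

Event: an event wraps the data being transported and is the basic unit of both data transfer and transactions in Flume. What flows through the pipeline is events, and transactional guarantees are made at the event level. For a text file, an event is typically one line. An event travels from the source to the channel and then to the sink. It consists of a body, which is a byte array carrying the payload (for example a single line of a text file, that is, the log record Flume collected), and optional headers that carry metadata. An event is the smallest complete unit of data: it arrives from an external source and leaves for an external destination.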
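To make the relationship between these components concrete, here is a minimal single-agent sketch, modeled on the netcat-to-logger example in the Flume User Guide. The agent name a1, the component names r1, c1, and k1, the port, and the capacity values are illustrative assumptions, not part of the original article.

```properties
# Name the components of this agent (all names are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens on a TCP port and turns each line of text into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: writes each event to the agent's log
a1.sinks.k1.type = logger

# Wiring: a source can feed several channels, a sink drains exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: the source property is plural (channels) while the sink property is singular (channel).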

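Partitioning: Flume event data is usually partitioned by time. This is configured through the HDFS sink's path property, agent1.sinks.sink.hdfs.path, using subdirectories whose names contain time-format escape sequences. For example, agent1.sinks.sink.hdfs.path=/temp/flume/year=%Y/month=%m/day=%d partitions the data by day. The partition an event is written to is determined by the timestamp in the event's header. By default events carry no timestamp header; one can be added with an interceptor.

The full list of format escape sequences is given in the HDFS Sink section of the Flume User Guide.

Interceptor: an interceptor is a component that can modify or drop events in the event stream. Interceptors are attached to a source and process events before they are passed to the channel.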

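# Add a timestamp interceptor to source1; it adds a timestamp header to every event the source produces
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp

With multiple tiers of Flume agents, the time an event was created can differ from the time it is written. To have the HDFS sink use the write time (its own local clock) instead of the timestamp header, set agent1.sinks.sink1.hdfs.useLocalTimeStamp to true.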
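The partitioning and timestamp options described above can be put together as in the following sketch. The sink name sink1 and the base path /tmp/flume are assumptions chosen to match the later examples; the two options are alternatives, so the second one is left commented out.

```properties
# Partition HDFS output by day using escape sequences in hdfs.path
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume/year=%Y/month=%m/day=%d

# Option 1: stamp each event at the source with a timestamp interceptor,
# so the partition reflects when the event was created
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp

# Option 2: ignore the header and use the HDFS sink's local clock,
# so the partition reflects when the event was written
# agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
```

With either option, an event whose timestamp falls on 9 May 2020 would be written under /tmp/flume/year=2020/month=05/day=09.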

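Transactions: Flume uses two separate transactions, one for delivering events from the source to the channel and one for delivering them from the channel to the sink.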
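Each of these transactions usually carries a batch of events rather than a single one. The sketch below shows where the batch sizes are set for a spooling directory source, a file channel, and an Avro sink. The component names follow the later examples and the values are illustrative assumptions.

```properties
# Source -> channel transaction: the spooling directory source puts up to
# batchSize events into the channel per transaction
agent1.sources.source1.type = spooldir
agent1.sources.source1.batchSize = 100

# The channel caps how many events a single transaction may carry
agent1.channels.channel1.type = file
agent1.channels.channel1.transactionCapacity = 10000

# Channel -> sink transaction: the Avro sink takes up to batch-size events
# from the channel and sends them together
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.batch-size = 100
```

The source and sink batch sizes should not exceed the channel's transactionCapacity, otherwise a batch cannot fit into one transaction.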

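Reliability: Flume guarantees that every event produced by a source reaches the sink at least once. The same event may therefore arrive more than once, for example when a transaction fails and is retried. Duplicates can be produced at either the source or the sink.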

Fan-out: delivering events from one source to multiple channels, and therefore to multiple sinks.
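A fan-out configuration might look like the following sketch, in which one spooling directory source feeds both an HDFS sink and a logger sink. The component names, paths, and the choice of sinks are assumptions made for illustration.

```properties
agent1.sources = source1
agent1.channels = channel1a channel1b
agent1.sinks = sink1a sink1b

# One source writes every event to both channels
agent1.sources.source1.channels = channel1a channel1b
# replicating is the default selector; shown here for clarity
agent1.sources.source1.selector.type = replicating
# A failed write to an optional channel does not fail the source's transaction
agent1.sources.source1.selector.optional = channel1b

# Each sink drains its own channel
agent1.sinks.sink1a.channel = channel1a
agent1.sinks.sink1b.channel = channel1b

agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir

agent1.sinks.sink1a.type = hdfs
agent1.sinks.sink1a.hdfs.path = /tmp/flume
agent1.sinks.sink1b.type = logger

agent1.channels.channel1a.type = file
agent1.channels.channel1b.type = memory
```

Pairing a durable file channel with the HDFS sink and a memory channel with the logger sink reflects a common trade-off: the main delivery path survives an agent restart, while the optional diagnostic path is allowed to lose events.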

A two-tier Flume agent configuration using a spooling directory source and an HDFS sink:

# First tier agent
# Name the agent's components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Wire the components together
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

# Describe the source
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir
# Describe the sink
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = localhost
agent1.sinks.sink1.port = 10000

# Describe the channel
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir=/tmp/agent1/file-channel/checkpoint
agent1.channels.channel1.dataDirs=/tmp/agent1/file-channel/data

# Second tier agent

agent2.sources = source2
agent2.sinks = sink2
agent2.channels = channel2

agent2.sources.source2.channels = channel2
agent2.sinks.sink2.channel = channel2

agent2.sources.source2.type = avro
agent2.sources.source2.bind = localhost
agent2.sources.source2.port = 10000

agent2.sinks.sink2.type = hdfs
# HDFS directory and file name format
agent2.sinks.sink2.hdfs.path = /tmp/flume
agent2.sinks.sink2.hdfs.filePrefix = events
agent2.sinks.sink2.hdfs.fileSuffix = .log
agent2.sinks.sink2.hdfs.fileType = DataStream

agent2.channels.channel2.type = file
agent2.channels.channel2.checkpointDir=/tmp/agent2/file-channel/checkpoint
agent2.channels.channel2.dataDirs=/tmp/agent2/file-channel/data

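Startup commands:

The two agents must be started separately. In the commands below, --conf-file is the path to the agent's properties file and --conf is the directory that holds Flume's general configuration.

Start agent1 first:

% flume-ng agent --conf-file <path to the agent properties file> --name agent1 --conf $FLUME_HOME/conf -Dflume.root.logger=INFO,console

Then start agent2:

% flume-ng agent --conf-file <path to the agent properties file> --name agent2 --conf $FLUME_HOME/conf -Dflume.root.logger=INFO,console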

Using a sink group for load balancing or failover

# First tier agent

agent1.sources = source1
# Multiple sinks are separated by spaces
agent1.sinks = sink1a sink1b
# Name the sink group
agent1.sinkgroups = sinkgroup1
agent1.channels = channel1

agent1.sources.source1.channels = channel1
agent1.sinks.sink1a.channel = channel1
agent1.sinks.sink1b.channel = channel1

agent1.sinkgroups.sinkgroup1.sinks = sink1a sink1b
# Processor type: load_balance picks sinks in round-robin order by default (change with processor.selector)
agent1.sinkgroups.sinkgroup1.processor.type = load_balance
agent1.sinkgroups.sinkgroup1.processor.backoff = true

agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir

agent1.sinks.sink1a.type = avro
agent1.sinks.sink1a.hostname = localhost
agent1.sinks.sink1a.port = 10000

agent1.sinks.sink1b.type = avro
agent1.sinks.sink1b.hostname = localhost
agent1.sinks.sink1b.port = 10001

agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir=/tmp/agent1/file-channel/checkpoint
agent1.channels.channel1.dataDirs=/tmp/agent1/file-channel/data

# Second tier agent

agent2a.sources = source2a
agent2a.sinks = sink2a
agent2a.channels = channel2a

agent2a.sources.source2a.channels = channel2a
agent2a.sinks.sink2a.channel = channel2a

agent2a.sources.source2a.type = avro
agent2a.sources.source2a.bind = localhost
agent2a.sources.source2a.port = 10000

agent2a.sinks.sink2a.type = hdfs
agent2a.sinks.sink2a.hdfs.path = /tmp/flume
agent2a.sinks.sink2a.hdfs.filePrefix = events-a
agent2a.sinks.sink2a.hdfs.fileSuffix = .log
agent2a.sinks.sink2a.hdfs.fileType = DataStream

agent2a.channels.channel2a.type = file
agent2a.channels.channel2a.checkpointDir=/tmp/agent2a/file-channel/checkpoint
agent2a.channels.channel2a.dataDirs=/tmp/agent2a/file-channel/data

# Second tier agent (running on a different port number)

agent2b.sources = source2b
agent2b.sinks = sink2b
agent2b.channels = channel2b

agent2b.sources.source2b.channels = channel2b
agent2b.sinks.sink2b.channel = channel2b

agent2b.sources.source2b.type = avro
agent2b.sources.source2b.bind = localhost
agent2b.sources.source2b.port = 10001

agent2b.sinks.sink2b.type = hdfs
agent2b.sinks.sink2b.hdfs.path = /tmp/flume
agent2b.sinks.sink2b.hdfs.filePrefix = events-b
agent2b.sinks.sink2b.hdfs.fileSuffix = .log
agent2b.sinks.sink2b.hdfs.fileType = DataStream

agent2b.channels.channel2b.type = file
agent2b.channels.channel2b.checkpointDir=/tmp/agent2b/file-channel/checkpoint
agent2b.channels.channel2b.dataDirs=/tmp/agent2b/file-channel/data
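The configuration above uses the load_balance processor. For failover, only the sink group section of agent1 would need to change, roughly as in the sketch below. The priority and penalty values are illustrative assumptions.

```properties
agent1.sinkgroups = sinkgroup1
agent1.sinkgroups.sinkgroup1.sinks = sink1a sink1b

# failover sends all events to the highest-priority sink that is available
agent1.sinkgroups.sinkgroup1.processor.type = failover
# A larger number means a higher priority, so sink1a is preferred
agent1.sinkgroups.sinkgroup1.processor.priority.sink1a = 10
agent1.sinkgroups.sinkgroup1.processor.priority.sink1b = 5
# Upper limit, in milliseconds, on the backoff period of a failed sink
agent1.sinkgroups.sinkgroup1.processor.maxpenalty = 10000
```

With this setup sink1b stays idle until sink1a fails, whereas load_balance keeps both second-tier agents busy.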

Using Kafka as the sink:

# Name the agent's components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Spooling directory source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /tmp/spoolDir

# The original omitted the channel definition; a memory channel is assumed here
a1.channels.c1.type = memory

# Kafka sink
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy