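I. Overview of Flume
1. Function
Data collection: Flume gathers data from one place and copies it to another. In big data work, it is used to copy the various data sets that need processing onto the big data platform.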
2. Architecture
Flume's architecture is shown in the diagram below:
2.1 Agent
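An Agent is a JVM process that carries data from its origin to its destination in the form of events.
An Agent is made up of three parts: Source, Channel, and Sink. Typically each machine runs one Agent, and a single Agent can contain multiple Sources, Channels, and Sinks.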
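To make the "multiple Sources, Channels, and Sinks" point concrete, here is a minimal sketch of one Agent that declares two of each. The component names (source_a, channel_a, and so on), the ports, and the choice of the netcat source, memory channel, and logger sink are illustrative assumptions, not part of the original setup.

```properties
# Multiple components are listed as space-separated names
agent_name.sources = source_a source_b
agent_name.channels = channel_a channel_b
agent_name.sinks = sink_a sink_b

# Two netcat sources listening on different (assumed) ports
agent_name.sources.source_a.type = netcat
agent_name.sources.source_a.bind = localhost
agent_name.sources.source_a.port = 44444
agent_name.sources.source_a.channels = channel_a

agent_name.sources.source_b.type = netcat
agent_name.sources.source_b.bind = localhost
agent_name.sources.source_b.port = 44445
agent_name.sources.source_b.channels = channel_b

# One in-memory channel per flow
agent_name.channels.channel_a.type = memory
agent_name.channels.channel_b.type = memory

# Each sink drains exactly one channel
agent_name.sinks.sink_a.type = logger
agent_name.sinks.sink_a.channel = channel_a
agent_name.sinks.sink_b.type = logger
agent_name.sinks.sink_b.channel = channel_b
```

Note the asymmetry in the property names: a Source can write to several Channels (`channels`, plural), while a Sink reads from a single Channel (`channel`, singular).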
2.1.1 Source: receives data into the Agent.
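As an illustration of a Source, the sketch below uses Flume's TAILDIR source to follow log files and pass newly appended lines to a Channel. The file pattern and the position-file path are hypothetical, chosen only for the example; other source types are declared the same way with their own properties.

```properties
agent_name.sources = source_name
agent_name.channels = channel_name

# TAILDIR source: tails every file that matches the file group pattern
agent_name.sources.source_name.type = TAILDIR
# JSON file in which the source records how far it has read in each file (assumed path)
agent_name.sources.source_name.positionFile = /tmp/flume/taildir_position.json
agent_name.sources.source_name.filegroups = f1
# Directory is literal; the file name part is a regular expression (assumed path)
agent_name.sources.source_name.filegroups.f1 = /tmp/app/logs/.*log.*
agent_name.sources.source_name.channels = channel_name

# Channel that receives the events
agent_name.channels.channel_name.type = memory
```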
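2.1.2 Channel: a buffer that sits between the Source and the Sink. The available types are File Channel, Memory Channel, and Kafka Channel.
File Channel: writes events to disk, which makes it safe and reliable. Committed events survive an agent restart or a machine crash as long as the disk itself is intact. The cost is lower throughput than a memory-based channel.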
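# Name the components on this Agent
agent_name.sources = source_name
agent_name.channels = channel_name
agent_name.sinks = sink_name
# source (XXX is a placeholder for the source type)
agent_name.sources.source_name.type = XXX
# channel
# The channel holds at most 3000000 events; one transaction can put or take up to 20000 events.
# Data is stored under ${log_path}/dataTest1 and ${log_path}/dataTest2, and the checkpoint under
# ${exec_log_path}/stat_info_checkpointDir. ${log_path} and ${exec_log_path} are placeholders here; substitute real directories.
# A backup checkpoint is enabled and written to /test/flume/backup/checkpoint.
agent_name.channels.channel_name.type = file
agent_name.channels.channel_name.dataDirs = ${log_path}/dataTest1,${log_path}/dataTest2
agent_name.channels.channel_name.checkpointDir = ${exec_log_path}/stat_info_checkpointDir
agent_name.channels.channel_name.useDualCheckpoints = true
agent_name.channels.channel_name.backupCheckpointDir = /test/flume/backup/checkpoint
# Maximum number of events the file channel can hold
agent_name.channels.channel_name.capacity = 3000000
# Maximum number of events written or read in one transaction (default 10000)
agent_name.channels.channel_name.transactionCapacity = 20000
# Timeout, in seconds, when waiting to put or take an event on the channel (default 3)
agent_name.channels.channel_name.keep-alive = 5
# sink
agent_name.sinks.sink_name.type = hdfs
# Wire source | channel | sink together
agent_name.sources.source_name.channels = channel_name
agent_name.sinks.sink_name.channel = channel_name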
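The examples in this section only set the sink type. An HDFS sink also needs at least a target path, so a fuller sink section might look like the sketch below. The NameNode address, directory layout, and roll thresholds are assumptions chosen for illustration and should be adapted to the actual cluster.

```properties
agent_name.sinks.sink_name.type = hdfs
# Target directory (assumed); %Y-%m-%d is expanded from the event timestamp
agent_name.sinks.sink_name.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent_name.sinks.sink_name.hdfs.filePrefix = events
# Use the agent's local time for the escape sequences instead of a timestamp header
agent_name.sinks.sink_name.hdfs.useLocalTimeStamp = true
# Write plain text rather than the default SequenceFile
agent_name.sinks.sink_name.hdfs.fileType = DataStream
agent_name.sinks.sink_name.hdfs.writeFormat = Text
# Roll to a new file every 600 seconds or 128 MB, whichever comes first
agent_name.sinks.sink_name.hdfs.rollInterval = 600
agent_name.sinks.sink_name.hdfs.rollSize = 134217728
# 0 disables rolling by event count
agent_name.sinks.sink_name.hdfs.rollCount = 0
# Events written per flush; keep this at or below the channel's transactionCapacity
agent_name.sinks.sink_name.hdfs.batchSize = 1000
agent_name.sinks.sink_name.channel = channel_name
```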
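Memory Channel: keeps events in memory, so reads and writes are fast, but events still in the channel are lost if the Flume process dies or the machine goes down.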
# Name the components on this Agent
agent_name.sources = source_name
agent_name.channels = channel_name
agent_name.sinks = sink_name
# source (XXX is a placeholder for the source type)
agent_name.sources.source_name.type = XXX
# channel
# The channel holds at most 1000 events; one transaction can put or take up to 1000 events.
agent_name.channels.channel_name.type = memory
agent_name.channels.channel_name.capacity = 1000
agent_name.channels.channel_name.transactionCapacity = 1000
# sink
agent_name.sinks.sink_name.type = hdfs
# Wire source | channel | sink together
agent_name.sources.source_name.channels = channel_name
agent_name.sinks.sink_name.channel = channel_name
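Since a Memory Channel holds events on the JVM heap, it can help to bound it by bytes as well as by event count. The sketch below adds the memory channel's byteCapacity, byteCapacityBufferPercentage, and keep-alive settings; the specific values are illustrative assumptions and should be sized against the agent's heap.

```properties
agent_name.channels.channel_name.type = memory
agent_name.channels.channel_name.capacity = 1000
agent_name.channels.channel_name.transactionCapacity = 1000
# Cap the total size of event bodies held in the channel at about 64 MB (assumed value)
agent_name.channels.channel_name.byteCapacity = 67108864
# Reserve 20 percent of byteCapacity as headroom for event headers
agent_name.channels.channel_name.byteCapacityBufferPercentage = 20
# Seconds to wait for a put or take before timing out
agent_name.channels.channel_name.keep-alive = 5
```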
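Kafka Channel: uses Kafka as the Channel. It offers larger storage capacity and stronger fault tolerance, combining the advantages of the other two Channels and improving data collection performance.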
# Name the components on this Agent
agent_name.sources = source_name
agent_name.channels = channel_name
agent_name.sinks = sink_name
# source (XXX is a placeholder for the source type)
agent_name.sources.source_name.type = XXX
# channel
agent_name.channels.channel_name.type = org.apache.flume.channel.kafka.KafkaChannel
# Kafka broker addresses (host:port), not ZooKeeper addresses
agent_name.channels.channel_name.kafka.bootstrap.servers = zkServer01:9092,zkServer02:9092
agent_name.channels.channel_name.kafka.topic = test_topic_01
agent_name.channels.channel_name.kafka.consumer.group.id = test_01
# sink
agent_name.sinks.sink_name.type = hdfs
# Wire source | channel | sink together
agent_name.sources.source_name.channels = channel_name
agent_name.sinks.sink_name.channel = channel_name
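Because the Kafka Channel is backed by a Kafka topic, it can also be run with a Sink but no Source, so that records produced to the topic by other applications flow straight to the Sink. The sketch below assumes the topic holds plain records rather than Flume-serialized events, hence parseAsFlumeEvent = false. The broker addresses, topic, and group id are reused from the example above, and the HDFS path is a hypothetical value.

```properties
# No source is declared: the channel reads directly from the Kafka topic
agent_name.channels = channel_name
agent_name.sinks = sink_name

agent_name.channels.channel_name.type = org.apache.flume.channel.kafka.KafkaChannel
agent_name.channels.channel_name.kafka.bootstrap.servers = zkServer01:9092,zkServer02:9092
agent_name.channels.channel_name.kafka.topic = test_topic_01
agent_name.channels.channel_name.kafka.consumer.group.id = test_01
# Records in the topic are plain payloads, not Avro-wrapped Flume events
agent_name.channels.channel_name.parseAsFlumeEvent = false
# Start from the earliest offset when the group has no committed offset
agent_name.channels.channel_name.kafka.consumer.auto.offset.reset = earliest

agent_name.sinks.sink_name.type = hdfs
# Assumed target path
agent_name.sinks.sink_name.hdfs.path = hdfs://namenode:8020/flume/test_topic_01
agent_name.sinks.sink_name.channel = channel_name
```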