Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data. It has a simple, flexible architecture based on streaming data flows; it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms; and it uses a simple, extensible data model that supports online analytic applications.
Flume architecture (the name literally means "water channel")
--------------------
1. Flume event
header + payload (byte array)
2. Flume agent
An independent daemon process that receives data from a client (via a source) and forwards it to a sink or to another agent.
3. The three core Flume components
source [Avro, Thrift, Twitter...] //receives data from a data generator and passes it on, as Flume events, to one or more channels
channel [JDBC, File system, Memory...] //temporarily holds the events handed over by the source, buffering them until a sink consumes them; it is the bridge between source and sink
sink [HDFS, HBase, Avro, Kafka...] //pulls events from the channel and delivers them to their destination, e.g. HDFS or HBase; a sink's destination can be another agent or a central store
Note: a Flume agent can have one or more sources, channels, and sinks.
Other components: interceptors //interceptors: inspect and modify events between the source and the channel
channel selectors //channel selectors: decide which channel carries the data when there are multiple channels. Two types: a. Replicating (default) channel selector, which copies every event into each channel; b. Multiplexing channel selector, which routes each event to a channel based on its header values
sink processors //sink processors: pick a specific sink from a sink group to invoke; they can provide a failover path for a sink, or load balancing across several sinks
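The wiring of these components can be sketched with a minimal single-agent configuration; the agent name a1 and component names r1/c1/k1 below are illustrative placeholders, not from any example in these notes.

```properties
# minimal pipeline: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Note that a source lists its `channels` (plural, since it can fan out to several), while a sink names exactly one `channel`.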
Channel
1. Memory channel (basic / non-durable): fast, at the cost of losing data on failure ------->suited to system metrics and other loss-tolerant data
type = memory
capacity = 100 default event capacity; if you raise it, you also need to raise the Java heap size via "-Xmx"
transactionCapacity = 100 the maximum number of events a source can write to the channel in one transaction (and likewise the maximum a sink can take from it in one transaction); the larger the value, the faster events move per transaction, but the more data the source and sink must roll back on failure
byteCapacityBufferPercentage/byteCapacity byte-based ways of measuring the channel's capacity; if event sizes vary widely, use these two parameters to tune the capacity
keep-alive how long a writer waits before giving up when the channel is full; while it waits, data arriving at the source is blocked, which can cause upstream agents to back off and eventually drop events (size this value based on peak traffic and planned maintenance)
Example: channel configuration:
#####hdfsChannel13#####
agent.channels.hdfsChannel13.type = memory
agent.channels.hdfsChannel13.capacity = 500000000
agent.channels.hdfsChannel13.transactionCapacity = 50000000
agent.channels.hdfsChannel13.byteCapacity = 10000000000
agent.channels.hdfsChannel13.keep-alive = 100
#####kfkChannel#####
agent.channels.kfkChannel.type = memory
agent.channels.kfkChannel.capacity = 500000000
agent.channels.kfkChannel.transactionCapacity = 50000000
agent.channels.kfkChannel.byteCapacity = 10000000000
agent.channels.kfkChannel.keep-alive = 100
#####hdfsChannel#####
agent.channels.hdfsChannel.type = memory
agent.channels.hdfsChannel.capacity = 500000000
agent.channels.hdfsChannel.transactionCapacity = 50000000
agent.channels.hdfsChannel.byteCapacity = 10000000000
agent.channels.hdfsChannel.keep-alive = 100
2. File channel (basic / durable): gives stronger delivery guarantees, tolerating agent failures and restarts, at the cost of performance ------->suited to flows that cannot tolerate data loss
type = file
checkpointDir/dataDir where the agent stores its data locally; the defaults are fine, but if an agent has several file channels you must give each channel its own absolute paths
Note: when using a file channel, be sure the HADOOP_PREFIX and JAVA_HOME environment variables are set.
capacity = 100 size this with the ingest rate in mind on one hand, and the recovery time needed after a failure on the other; ultimately it is bounded by disk space
keep-alive similar to the memory channel's: the total time a source will wait when trying to write to a full channel
transactionCapacity the maximum number of events allowed in one transaction; the higher the value, the more internal resources it consumes
checkpointInterval milliseconds between checkpoints (written to checkpointDir); must not be set below 1000 ms
maxFileSize the maximum size of a data log file; when a file reaches this size, writing rolls over to a new one. For a low-throughput channel you can lower this value to save some disk space; the right value depends on disk space and on how many channels the agent hosts
minimumRequiredSpace disk space that must remain free (not used for log files); the smaller the value, the higher the disk utilization
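As a sketch, a durable file channel with explicit paths might be configured as follows; the paths and channel name are illustrative (note the real data property is `dataDirs`, plural, and can take a comma-separated list).

```properties
# a durable file channel; each file channel needs its own checkpoint and data dirs
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /data/flume/checkpoint
agent.channels.fileChannel.dataDirs = /data/flume/data
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 10000
agent.channels.fileChannel.checkpointInterval = 30000
```

Putting checkpointDir and dataDirs on separate physical disks can reduce I/O contention between checkpointing and event logging.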
HDFS Sink
type = hdfs
path the HDFS directory the sink writes to; like most Hadoop file paths, it can be given in three ways: an absolute path, an absolute path including the server name, or a relative path
channel the channel the sink will take data from
filePrefix/fileSuffix set the file name: a leading string, followed by the file's start timestamp, ending with the given suffix
maxOpenFiles the maximum number of open files, 5000 by default; beyond the limit, the oldest still-open file is closed. Every open file adds overhead both at the operating-system level and on HDFS
round/roundValue/roundUnit round event timestamps down to a time granularity (hours, minutes, or seconds), keeping those elements in the file path
timeZone the time zone used when resolving the escape sequences (default: the local machine's time zone)
inUsePrefix = _ /inUseSuffix = set the temporary-file prefix to an underscore "_" and the suffix to empty (instead of the default .tmp), so in-progress files are not consumed before they are closed
rollInterval/rollCount/rollSize by default, Flume rolls the file every 30 seconds, or every 10 events, or every 1024 bytes. The three roll mechanisms can be combined; do not disable all of them, or all your data will be squeezed into a single file
batchSize how many events the sink reads from the channel in one transaction; if the channel holds a lot of data, raising this above the default reduces per-transaction overhead and improves throughput
codeC the compression codec to use, e.g. gzip
fileType the output file format: SequenceFile (sequence file), DataStream (plain data stream), or CompressedStream (compressed stream; requires codeC)
callTimeout how long the HDFS sink waits for HDFS to report success (or failure)
idleTimeout how long to wait before automatically closing an idle file (rarely needed), since rollInterval closes files on each roll anyway and no new file is opened while the channel is idle
threadsPoolSize the number of worker threads, 10 by default; this is the maximum number of files that can be written concurrently. Do not set it too high, or you may overwhelm HDFS
rollTimerPoolSize the number of worker threads that handle the timeouts defined by idleTimeout; can usually be ignored
Example: HDFS sink configuration:
######hdfsSink13######
agent.sinks.hdfsSink13.type=hdfs
agent.sinks.hdfsSink13.hdfs.path=hdfs://IP/DATA/%Y%m%d
agent.sinks.hdfsSink13.hdfs.filePrefix=cs_data_13
agent.sinks.hdfsSink13.hdfs.rollInterval = 1800
agent.sinks.hdfsSink13.hdfs.rollSize = 380000000
agent.sinks.hdfsSink13.hdfs.rollCount = 0
agent.sinks.hdfsSink13.hdfs.batchSize = 400000
agent.sinks.hdfsSink13.hdfs.callTimeout=100000
agent.sinks.hdfsSink13.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink13.channel=hdfsChannel13
#agent.sinks.hdfsSink13.hdfs.codeC=gzip
#agent.sinks.hdfsSink13.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink13.hdfs.fileType=DataStream
#####hdfsSink1#####
agent.sinks.hdfsSink1.type=hdfs
agent.sinks.hdfsSink1.hdfs.path=hdfs://nameservice/DATA/%Y%m%d
agent.sinks.hdfsSink1.hdfs.filePrefix=cs_data_1
agent.sinks.hdfsSink1.hdfs.rollInterval = 1800
agent.sinks.hdfsSink1.hdfs.rollSize = 380000000
agent.sinks.hdfsSink1.hdfs.rollCount = 0
agent.sinks.hdfsSink1.hdfs.batchSize = 400000
agent.sinks.hdfsSink1.hdfs.callTimeout=100000
agent.sinks.hdfsSink1.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink1.channel=hdfsChannel
#agent.sinks.hdfsSink1.hdfs.codeC=gzip
#agent.sinks.hdfsSink1.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink1.hdfs.fileType=DataStream
#####hdfsSink3#####
agent.sinks.hdfsSink3.type=hdfs
agent.sinks.hdfsSink3.hdfs.path=hdfs://nameservice/DATA/%Y%m%d
agent.sinks.hdfsSink3.hdfs.filePrefix=cs_data_3
agent.sinks.hdfsSink3.hdfs.rollInterval = 1800
agent.sinks.hdfsSink3.hdfs.rollSize = 380000000
agent.sinks.hdfsSink3.hdfs.rollCount = 0
agent.sinks.hdfsSink3.hdfs.batchSize = 400000
agent.sinks.hdfsSink3.hdfs.callTimeout=100000
agent.sinks.hdfsSink3.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink3.channel=hdfsChannel
#agent.sinks.hdfsSink3.hdfs.codeC=gzip
#agent.sinks.hdfsSink3.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink3.hdfs.fileType=DataStream
#####hdfsSink2#####
agent.sinks.hdfsSink2.type=hdfs
agent.sinks.hdfsSink2.hdfs.path=hdfs://nameservice/DATA/%Y%m%d
agent.sinks.hdfsSink2.hdfs.filePrefix=cs_data_2
agent.sinks.hdfsSink2.hdfs.rollInterval = 1800
agent.sinks.hdfsSink2.hdfs.rollSize = 380000000
agent.sinks.hdfsSink2.hdfs.rollCount = 0
agent.sinks.hdfsSink2.hdfs.batchSize = 400000
agent.sinks.hdfsSink2.hdfs.callTimeout=100000
agent.sinks.hdfsSink2.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink2.channel=hdfsChannel
#agent.sinks.hdfsSink2.hdfs.codeC=gzip
#agent.sinks.hdfsSink2.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink2.hdfs.fileType=DataStream
#####hdfsSink4#####
agent.sinks.hdfsSink4.type=hdfs
agent.sinks.hdfsSink4.hdfs.path=hdfs://nameservice/DATA/%Y%m%d
agent.sinks.hdfsSink4.hdfs.filePrefix=cs_data_4
agent.sinks.hdfsSink4.hdfs.rollInterval = 1800
agent.sinks.hdfsSink4.hdfs.rollSize = 380000000
agent.sinks.hdfsSink4.hdfs.rollCount = 0
agent.sinks.hdfsSink4.hdfs.batchSize = 400000
agent.sinks.hdfsSink4.hdfs.callTimeout=100000
agent.sinks.hdfsSink4.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink4.channel=hdfsChannel
#agent.sinks.hdfsSink4.hdfs.codeC=gzip
#agent.sinks.hdfsSink4.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink4.hdfs.fileType=DataStream
#####hdfsSink5#####
agent.sinks.hdfsSink5.type=hdfs
agent.sinks.hdfsSink5.hdfs.path=hdfs://nameservice/DATA/%Y%m%d
agent.sinks.hdfsSink5.hdfs.filePrefix=cs_data_5
agent.sinks.hdfsSink5.hdfs.rollInterval = 1800
agent.sinks.hdfsSink5.hdfs.rollSize = 380000000
agent.sinks.hdfsSink5.hdfs.rollCount = 0
agent.sinks.hdfsSink5.hdfs.batchSize = 400000
agent.sinks.hdfsSink5.hdfs.callTimeout=100000
agent.sinks.hdfsSink5.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink5.channel=hdfsChannel
#agent.sinks.hdfsSink5.hdfs.codeC=gzip
#agent.sinks.hdfsSink5.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink5.hdfs.fileType=DataStream
#####hdfsSink6#####
agent.sinks.hdfsSink6.type=hdfs
agent.sinks.hdfsSink6.hdfs.path=hdfs://nameservice1/DATA/%Y%m%d
agent.sinks.hdfsSink6.hdfs.filePrefix=cs_data_6
agent.sinks.hdfsSink6.hdfs.rollInterval = 1800
agent.sinks.hdfsSink6.hdfs.rollSize = 380000000
agent.sinks.hdfsSink6.hdfs.rollCount = 0
agent.sinks.hdfsSink6.hdfs.batchSize = 400000
agent.sinks.hdfsSink6.hdfs.callTimeout=100000
agent.sinks.hdfsSink6.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink6.channel=hdfsChannel
#agent.sinks.hdfsSink6.hdfs.codeC=gzip
#agent.sinks.hdfsSink6.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink6.hdfs.fileType=DataStream
#####hdfsSink7#####
agent.sinks.hdfsSink7.type=hdfs
agent.sinks.hdfsSink7.hdfs.path=hdfs://nameservice1/DATA/%Y%m%d
agent.sinks.hdfsSink7.hdfs.filePrefix=cs_data_7
agent.sinks.hdfsSink7.hdfs.rollInterval = 1800
agent.sinks.hdfsSink7.hdfs.rollSize = 380000000
agent.sinks.hdfsSink7.hdfs.rollCount = 0
agent.sinks.hdfsSink7.hdfs.batchSize = 400000
agent.sinks.hdfsSink7.hdfs.callTimeout=100000
agent.sinks.hdfsSink7.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink7.channel=hdfsChannel
#agent.sinks.hdfsSink7.hdfs.codeC=gzip
#agent.sinks.hdfsSink7.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink7.hdfs.fileType=DataStream
#####hdfsSink8#####
agent.sinks.hdfsSink8.type=hdfs
agent.sinks.hdfsSink8.hdfs.path=hdfs://nameservice1/DATA/%Y%m%d
agent.sinks.hdfsSink8.hdfs.filePrefix=cs_data_8
agent.sinks.hdfsSink8.hdfs.rollInterval = 1800
agent.sinks.hdfsSink8.hdfs.rollSize = 380000000
agent.sinks.hdfsSink8.hdfs.rollCount = 0
agent.sinks.hdfsSink8.hdfs.batchSize = 400000
agent.sinks.hdfsSink8.hdfs.callTimeout=100000
agent.sinks.hdfsSink8.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink8.channel=hdfsChannel
#agent.sinks.hdfsSink8.hdfs.codeC=gzip
#agent.sinks.hdfsSink8.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink8.hdfs.fileType=DataStream
#####hdfsSink9#####
agent.sinks.hdfsSink9.type=hdfs
agent.sinks.hdfsSink9.hdfs.path=hdfs://nameservice1/DATA/%Y%m%d
agent.sinks.hdfsSink9.hdfs.filePrefix=cs_data_9
agent.sinks.hdfsSink9.hdfs.rollInterval = 1800
agent.sinks.hdfsSink9.hdfs.rollSize = 380000000
agent.sinks.hdfsSink9.hdfs.rollCount = 0
agent.sinks.hdfsSink9.hdfs.batchSize = 400000
agent.sinks.hdfsSink9.hdfs.callTimeout=100000
agent.sinks.hdfsSink9.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink9.channel=hdfsChannel
#agent.sinks.hdfsSink9.hdfs.codeC=gzip
#agent.sinks.hdfsSink9.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink9.hdfs.fileType=DataStream
#####hdfsSink10#####
agent.sinks.hdfsSink10.type=hdfs
agent.sinks.hdfsSink10.hdfs.path=hdfs://nameservice1/DATA/%Y%m%d
agent.sinks.hdfsSink10.hdfs.filePrefix=cs_data_10
agent.sinks.hdfsSink10.hdfs.rollInterval = 1800
agent.sinks.hdfsSink10.hdfs.rollSize = 380000000
agent.sinks.hdfsSink10.hdfs.rollCount = 0
agent.sinks.hdfsSink10.hdfs.batchSize = 400000
agent.sinks.hdfsSink10.hdfs.callTimeout=100000
agent.sinks.hdfsSink10.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink10.channel=hdfsChannel
#agent.sinks.hdfsSink10.hdfs.codeC=gzip
#agent.sinks.hdfsSink10.hdfs.fileType=CompressedStream
agent.sinks.hdfsSink10.hdfs.fileType=DataStream
Sinkgroups
To remove single points of failure from the data pipeline, events can be sent to different sinks using a load-balancing or failover strategy.
Load-balancing sink group
type = load_balance load balancing; a round-robin policy will be used
backoff = true controls exponential backoff when a sink throws an exception on retry. false: after a sink fails, Flume simply tries again based on the round-robin or random selection mechanism; true: the wait after each failure doubles, from one second up to a cap of roughly 18 hours (2^16 seconds)
selector = round_robin choose the round-robin policy
Example: sink group configuration:
#######hdfssinkgroups###########
agent.sinkgroups.hdfssinkgroups.sinks = hdfsSink1 hdfsSink2 hdfsSink3 hdfsSink4 hdfsSink5 hdfsSink6 hdfsSink7 hdfsSink8 hdfsSink9 hdfsSink10
agent.sinkgroups.hdfssinkgroups.processor.type = load_balance
agent.sinkgroups.hdfssinkgroups.processor.backoff = true
agent.sinkgroups.hdfssinkgroups.processor.selector = round_robin
Failover sink group
type = failover failover: when a sink fails, try another sink
priority priority setting; the higher the number, the higher the priority. Sinks with equal priorities are tried in an unspecified order
maxpenalty caps how long a failed sink in the group stays in the "penalty box": after its first failure it is disabled for one second before being tried again, and each subsequent failure doubles that wait, up to the maxpenalty cap
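A failover sink group might be sketched as follows; the group and sink names are illustrative, and with these priorities hdfsSinkA is used first, with hdfsSinkB taking over when it fails.

```properties
# failover sink group: higher priority number = preferred sink
agent.sinkgroups = failgroup
agent.sinkgroups.failgroup.sinks = hdfsSinkA hdfsSinkB
agent.sinkgroups.failgroup.processor.type = failover
agent.sinkgroups.failgroup.processor.priority.hdfsSinkA = 10
agent.sinkgroups.failgroup.processor.priority.hdfsSinkB = 5
agent.sinkgroups.failgroup.processor.maxpenalty = 10000
```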
Source
Spooling directory source tracks which files have already been turned into Flume events and which still need processing
type = spooldir you need a separate process to clean the spool directory of old files Flume has marked as sent, or the directory will eventually fill the disk
spoolDir the spool directory
fileSuffix = .do once a file has been fully transferred, it is renamed with a ".do" suffix
fileHeaderKey changes the header key under which the file name is stored
batchSize how many events are written to the channel per transaction; raising it can improve throughput, at the cost of larger transactions (and therefore larger rollbacks)
bufferMaxLines sets the size of the in-memory buffer used when reading a file; it is multiplied by the maxBufferLineLength value
maxBufferLineLength if your records are short, consider shrinking maxBufferLineLength while raising bufferMaxLines, but make sure no event is larger than the maxBufferLineLength setting
Summary: whatever mechanism you use to create new files in the spooling directory, every file name must be unique.
Note: restarts and errors will create duplicate events for any file in the spooling directory not yet marked as complete.
Example: source configuration:
#####dirsource#####
agent.sources.dirsource.channels = hdfsChannel kfkChannel hdfsChannel13
agent.sources.dirsource.type = spooldir
agent.sources.dirsource.spoolDir =/data
agent.sources.dirsource.ignorePattern =.*(\\.tmp|\\.ctr)$
agent.sources.dirsource.inputCharset=latin1
agent.sources.dirsource.fileSuffix=.do
agent.sources.dirsource.bufferMaxLineLength =1000000
agent.sources.dirsource.deserializer.maxLineLength=1000000
#agent.sources.dirsource.deletePolicy=immediate
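The source above fans out to three channels using the default (replicating) selector, which copies every event into each channel. A multiplexing selector that routes on a header instead might be sketched as follows; the header name "datatype" and its values are illustrative assumptions, not from the configuration above.

```properties
# route events by the value of the "datatype" header
agent.sources.dirsource.selector.type = multiplexing
agent.sources.dirsource.selector.header = datatype
agent.sources.dirsource.selector.mapping.metrics = kfkChannel
agent.sources.dirsource.selector.mapping.logs = hdfsChannel
agent.sources.dirsource.selector.default = hdfsChannel13
```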
exec source provides a way to run a command outside Flume and feed its output into Flume as events
type = exec
channels every source must name the channel(s) it writes to, as a space-separated list of channel names
command tells Flume what command to hand to the operating system
restart whether to restart the command if it exits
restartThrottle how long to wait before restarting the command
logStdErr whether to capture the command's stderr output as well
batchSize the number of events handled per transaction
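A common use of the exec source is tailing a log file; the sketch below assumes an illustrative source name, file path, and channel name.

```properties
# tail a log file into Flume as events
agent.sources.tailsource.type = exec
agent.sources.tailsource.command = tail -F /var/log/app/app.log
agent.sources.tailsource.restart = true
agent.sources.tailsource.restartThrottle = 10000
agent.sources.tailsource.logStdErr = true
agent.sources.tailsource.batchSize = 100
agent.sources.tailsource.channels = memoryChannel
```

Note that the exec source cannot guarantee delivery: if the channel fills up or the agent dies, data produced by the command in the meantime is lost, which is one reason the spooling directory source is preferred for reliable file ingestion.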
## Single-node Flume agent
# edit conf/flume-conf.properties
# memory channel called channel1
producer.channels.channel1.type = memory
producer.channels.channel1.capacity = 10000
producer.channels.channel1.transactionCapacity = 1000
# Define a spooling-directory source called source1
# and connect it to channel1.
producer.sources.source1.channels = channel1
producer.sources.source1.type = spooldir
producer.sources.source1.spoolDir =/opt/cur_day/ ####directory scanned for files
producer.sources.source1.ignorePattern =^(.)*\\.txt$ ###ignore files ending in .txt
producer.sources.source1.batchSize =200
producer.sources.source1.inputCharset=UTF-8
producer.sources.source1.fileSuffix=.done ##suffix added to fully scanned files
# Define a Kafka sink that publishes all events it receives
# and connect it to the other end of the same channel.
producer.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
producer.sinks.sink1.topic =cqcn_flume_livereport ##topic name
producer.sinks.sink1.brokerList =hadoop41:9092,hadoop42:9092,hadoop43:9092 ##Kafka cluster
producer.sinks.sink1.channel = channel1
producer.sinks.sink1.batchSize = 50
# Finally, now that we've defined all of our components, tell
# the agent (here named "producer") which ones we want to activate.
producer.channels = channel1
producer.sources = source1
producer.sinks = sink1
## Edit log4j.properties to set the log location
flume.log.dir=/data12/flume_log
mkdir -p /data12/flume_log
Example: Flume into HDFS
### Note: writing to HDFS requires a Hadoop environment on the host running Flume (unpack the Hadoop jars into a directory and add them to the environment variables)
# component names: source, channel, and sink
agent.sources = source1
agent.channels = memoryChannel
agent.sinks = sink1
#source
agent.sources.source1.type = avro
agent.sources.source1.bind = hadoop105
agent.sources.source1.port = 23004
agent.sources.source1.channels = memoryChannel
#channels
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.channels.memoryChannel.transactionCapacity = 1000
agent.channels.memoryChannel.keep-alive = 100
#sink
agent.sinks.sink1.channel = memoryChannel
agent.sinks.sink1.type=hdfs
agent.sinks.sink1.hdfs.path=hdfs://10.45.47.103:8020/data/%y-%m-%d/%H%M%S
agent.sinks.sink1.hdfs.fileType=DataStream
agent.sinks.sink1.hdfs.writeFormat=TEXT
agent.sinks.sink1.hdfs.useLocalTimeStamp=true
agent.sinks.sink1.hdfs.filePrefix=events-
# The following three parameters work together; defaults are 30 (seconds), 1024 (bytes), 10 (events),
# and whichever condition is met first triggers the roll.
# rollSize is in bytes, rollInterval in seconds, rollCount in records (lines).
agent.sinks.sink1.hdfs.rollSize=100000
agent.sinks.sink1.hdfs.rollInterval=3000
agent.sinks.sink1.hdfs.rollCount=100
## Start Flume
nohup /opt/flume/bin/flume-ng agent --conf /opt/flume/conf/ -f /opt/flume/conf/flume-conf.properties -n producer &