Flume High-Concurrency Optimization (2): Streamlining the Architecture
As you saw in the previous post, tuning Flume itself was a sizeable step forward. When we reviewed the pipeline afterwards, though, we found that the data passed through quite a few unnecessary steps and some of our processing was redundant. Deciding exactly where to trim then became its own question, and this post walks through the key places to streamline and the effect of doing so.
As usual, recall the architecture from the previous post:
It is easy to spot one performance issue: when events fan out from the main port, three downstream ports feed ES, with Kafka in between to give the data a good buffer. But those three intermediate Flume agents are redundant. We can have the first-tier Flume agent produce the outputs itself, writing directly to Kafka instead of to Avro ports. Here is the architecture after optimization:
Configuration:
- balance.sources = source1
- balance.sinks = k1 k2 k3 k4
- balance.channels = channel1
- # Describe/configure source1
- balance.sources.source1.type = avro
- balance.sources.source1.bind = 192.168.10.83
- balance.sources.source1.port = 12300
- #define sinkgroups
- balance.sinkgroups=g1
- balance.sinkgroups.g1.sinks=k1 k2 k3 k4
- balance.sinkgroups.g1.processor.type=load_balance
- balance.sinkgroups.g1.processor.backoff=true
- balance.sinkgroups.g1.processor.selector=round_robin
- #define the sink 1
- balance.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
- balance.sinks.k1.topic = ulog
- balance.sinks.k1.brokerList = 192.168.10.83:9092,192.168.10.84:9092
- balance.sinks.k1.requiredAcks = 1
- balance.sinks.k1.batchSize = 10000
- #define the sink 2
- balance.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
- balance.sinks.k2.topic = ulog
- balance.sinks.k2.brokerList = 192.168.10.83:9092,192.168.10.84:9092
- balance.sinks.k2.requiredAcks = 1
- balance.sinks.k2.batchSize = 10000
- #define the sink 3
- balance.sinks.k3.type = org.apache.flume.sink.kafka.KafkaSink
- balance.sinks.k3.topic = ulog
- balance.sinks.k3.brokerList = 192.168.10.83:9092,192.168.10.84:9092
- balance.sinks.k3.requiredAcks = 1
- balance.sinks.k3.batchSize = 10000
- #define the sink 4
- balance.sinks.k4.type = org.apache.flume.sink.kafka.KafkaSink
- balance.sinks.k4.topic = ulog
- balance.sinks.k4.brokerList = 192.168.10.83:9092,192.168.10.84:9092
- balance.sinks.k4.requiredAcks = 1
- balance.sinks.k4.batchSize = 10000
- # Use a file channel, which persists events to disk
- balance.channels.channel1.type = file
- balance.channels.channel1.checkpointDir = /export/data/flume/flume-1.6.0/dataeckPoint/balance
- balance.channels.channel1.useDualCheckpoints = true
- balance.channels.channel1.backupCheckpointDir = /export/data/flume/flume-1.6.0/data/bakcheckPoint/balance
- balance.channels.channel1.dataDirs =/export/data/flume/flume-1.6.0/data/balance
- balance.channels.channel1.transactionCapacity = 10000
- balance.channels.channel1.checkpointInterval = 30000
- balance.channels.channel1.maxFileSize = 2146435071
- balance.channels.channel1.minimumRequiredSpace = 524288000
- balance.channels.channel1.capacity = 1000000
- balance.channels.channel1.keep-alive=3
- # Bind the source and sink to the channel
- balance.sources.source1.channels = channel1
- balance.sinks.k1.channel = channel1
- balance.sinks.k2.channel=channel1
- balance.sinks.k3.channel=channel1
- balance.sinks.k4.channel=channel1
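The `g1` sink group above sets `processor.type=load_balance` with `selector=round_robin` and `backoff=true`: events are handed to the four Kafka sinks in turn, and a sink that fails is temporarily skipped. The following is a minimal sketch of that selection behavior; the class and method names are hypothetical illustrations, not Flume's actual implementation.

```python
import itertools
import time


class RoundRobinLoadBalancer:
    """Toy sketch of Flume's load_balance sink processor with
    selector=round_robin and backoff=true: cycle through the sinks,
    temporarily skipping ("backing off") any sink that recently failed.
    The names here are illustrative, not Flume's real classes."""

    def __init__(self, sinks, backoff_seconds=2.0):
        self.sinks = list(sinks)
        self.backoff_seconds = backoff_seconds
        self.blacklist = {}  # sink -> time until which it is skipped
        self._cycle = itertools.cycle(self.sinks)

    def next_sink(self, now=None):
        now = time.monotonic() if now is None else now
        # Try each sink at most once per call; skip backed-off sinks.
        for _ in range(len(self.sinks)):
            sink = next(self._cycle)
            if self.blacklist.get(sink, 0) <= now:
                return sink
        raise RuntimeError("all sinks are currently backing off")

    def mark_failed(self, sink, now=None):
        now = time.monotonic() if now is None else now
        self.blacklist[sink] = now + self.backoff_seconds


lb = RoundRobinLoadBalancer(["k1", "k2", "k3", "k4"])
order = [lb.next_sink(now=0.0) for _ in range(4)]  # k1, k2, k3, k4 in turn
lb.mark_failed("k1", now=0.0)                      # k1 backs off for 2s
after_fail = lb.next_sink(now=0.0)                 # skips k1, returns k2
```

With `backoff=true`, a Kafka broker hiccup on one sink does not stall the whole group; the other sinks keep draining the channel while the failed one sits out its backoff window.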
This takes us from five Flume agents down to two, which relieves much of the load bottleneck on Flume itself. Even without the sink group, Flume would still process events with multiple threads; with this change, the bottleneck is confined to two nodes. Our next optimization target is Flume's file channel, because after this architecture runs for a while we see heavy system I/O.
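The file channel's I/O cost comes from the fact that it is durable: every event is appended to an on-disk data log, and the consumer position is periodically flushed to a checkpoint file, so each event touches the disk more than once over its lifetime. The following is a toy sketch of that write-ahead pattern, purely to illustrate where the I/O goes; it is not Flume's actual file channel implementation.

```python
import json
import os
import tempfile


class MiniFileChannel:
    """Toy sketch (not Flume's implementation) of a file-backed channel:
    put() appends the event to a data log, and checkpoint() persists the
    consumer position, so every event costs at least two disk writes."""

    def __init__(self, data_dir):
        self.log_path = os.path.join(data_dir, "log.jsonl")
        self.ckpt_path = os.path.join(data_dir, "checkpoint")
        self.read_pos = 0  # byte offset of the next event to take

    def put(self, event):
        with open(self.log_path, "a") as f:
            f.write(json.dumps(event) + "\n")  # disk write #1: the event

    def take(self):
        with open(self.log_path) as f:
            f.seek(self.read_pos)
            line = f.readline()
            self.read_pos = f.tell()
        return json.loads(line) if line else None

    def checkpoint(self):
        with open(self.ckpt_path, "w") as f:
            f.write(str(self.read_pos))  # disk write #2: consumer position


d = tempfile.mkdtemp()
ch = MiniFileChannel(d)
ch.put({"body": "event-1"})
first = ch.take()  # reads the event back from disk
ch.checkpoint()    # persists the read position
```

This is why `checkpointInterval`, `useDualCheckpoints`, and the placement of `dataDirs` on separate disks matter so much for file channel throughput: every event is paid for in disk writes, not just memory.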
Summary:
Sometimes we add components for the sake of load balancing, and that is fine. But when the load balancer's distribution side, the transport medium, or the storage medium itself becomes the bottleneck, it pays to think in terms of multi-threading instead; that opens up far more room.