A Round of ELK Throughput Tuning
Problem One
- Recently we noticed that logs were showing up in Kibana very slowly, and often could not be found at all. Since all log collection was funneled through a single Logstash instance for collecting and filtering, we suspected that Logstash throughput had become the bottleneck. A quick look confirmed it: we had indeed hit the ceiling.
Optimization Process
- After going through the complete Logstash configuration file, a few parameters needed adjusting:
# Number of pipeline worker threads; the official recommendation is to match the number of CPU cores
pipeline.workers: 24
# Number of threads used for the output stage
pipeline.output.workers: 24
# Number of events per batch
pipeline.batch.size: 3000
# Batch delay in milliseconds
pipeline.batch.delay: 5
P.S. Our ES cluster holds a fairly large amount of data (>28 TB), so pick concrete values that suit your own production environment.
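To confirm that the new settings are actually in effect after restarting Logstash, its monitoring API can be queried with a few lines of Python using the requests library. This is only a sketch under assumptions: it presumes the monitoring API listens on the default port 9600 and uses the 5.x-era endpoint /_node/pipeline (later releases expose /_node/pipelines instead).

import requests

# Assumed Logstash monitoring API address (default port 9600)
info = requests.get("http://localhost:9600/_node/pipeline").json()
pipeline = info["pipeline"]
print("workers:    ", pipeline["workers"])
print("batch_size: ", pipeline["batch_size"])
print("batch_delay:", pipeline["batch_delay"])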
Optimization Results
- ES throughput rose from 9,817 events/s to 41,183 events/s (one way to measure this is sketched below).
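For reference, the indexing rate can be estimated by sampling the cluster-wide indexing counter twice and dividing by the interval. The sketch below is one possible measurement, not necessarily how the numbers above were collected; the cluster address is an assumption.

import time
import requests

ES = "http://localhost:9200"  # assumed cluster address

def indexed_docs():
    # index_total is a monotonically increasing count of indexing operations
    stats = requests.get(ES + "/_stats/indexing").json()
    return stats["_all"]["primaries"]["indexing"]["index_total"]

start = indexed_docs()
time.sleep(60)
end = indexed_docs()
print("indexing rate: %.0f docs/s" % ((end - start) / 60.0))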
Problem Two
- While going through the Logstash logs, we saw a large number of errors like the following:
[2017-03-18T09:46:21,043][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$6@6918cf2e on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@55337655[Running, pool size = 24, active threads = 24, queued tasks = 50, completed tasks = 1767887463]]"})
[2017-03-18T09:46:21,043][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
- Checking the official documentation confirmed that ES indexing had hit a bottleneck:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException with the Java client), which is the way that Elasticsearch tells you that it cannot keep up with the current indexing rate. When it happens, you should pause indexing a bit before trying again, ideally with randomized exponential backoff.
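On the client side, that advice translates into retrying rejected bulk requests with randomized exponential backoff. The following Python sketch is not part of the original setup; the cluster address, index name, and sample document are made up, and in ES 5.x per-item rejections can also appear inside a 200 response body, which a real client would check as well.

import random
import time
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Two-line NDJSON body: the action metadata line, then the document itself
bulk_body = (
    '{"index": {"_index": "logstash-test", "_type": "log"}}\n'
    '{"message": "hello", "@timestamp": "2017-03-18T09:46:21Z"}\n'
)

def bulk_with_backoff(body, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(
            ES + "/_bulk",
            data=body,
            headers={"Content-Type": "application/x-ndjson"},
        )
        if resp.status_code != 429:
            return resp
        # Randomized exponential backoff: ~1s, ~2s, ~4s, ... plus jitter
        pause = (2 ** attempt) + random.random()
        print("got 429, backing off for %.1fs" % pause)
        time.sleep(pause)
    raise RuntimeError("bulk request kept being rejected")

bulk_with_backoff(bulk_body)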
Our first instinct was to raise the ES thread pool settings, but the official docs say "Don't Touch These Settings!". So what now? The recommended route is instead to adjust the Logstash parameter pipeline.batch.size.
- Since ES 5.0, Elasticsearch keeps the bulk, flush, get, index, search, and other thread pools completely separate, so indexing load does not drag down the performance of other operations.
Let's check the current state of the ES thread pools:
GET _nodes/stats/thread_pool?pretty
which returns something like:
{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "dev-elasticstack5.0",
  "nodes": {
    "nnfCv8FrSh-p223gsbJVMA": {
      "timestamp": 1489804973926,
      "name": "node-3",
      "transport_address": "192.168.3.***:9301",
      "host": "192.168.3.***",
      "ip": "192.168.3.***:9301",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "r1"
      },
      "thread_pool": {
        "bulk": {
          "threads": 24,
          "queue": 214,
          "active": 24,
          "rejected": 30804543,
          "largest": 24,
          "completed": 1047606679
        },
        ......
        "watcher": {
          "threads": 0,
          "queue": 0,
          "active": 0,
          "rejected": 0,
          "largest": 0,
          "completed": 0
        }
      }
    }
  }
}
Here the "bulk" thread pool has 24 threads, all 24 of them active, meaning every thread is busy; the queue holds 214 tasks and the rejected counter stands at 30,804,543. So there is the problem: all threads are busy, and once the queue fills up, any further bulk writes are rejected, which is how the rejection count climbed to 30,804,543.
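Rather than reading the full nodes-stats JSON by hand, the same counters can be watched continuously through the _cat API. The Python sketch below rests on assumptions: the cluster address is made up, and the pool is named bulk in ES 5.x (it was renamed to write in 6.3+).

import time
import requests

ES = "http://localhost:9200"   # assumed cluster address
POOL = "bulk"                  # renamed to "write" in ES 6.3+

def rejected_per_node():
    # _cat/thread_pool returns one row per node for the selected pool
    rows = requests.get(
        ES + "/_cat/thread_pool/" + POOL,
        params={"format": "json", "h": "node_name,active,queue,rejected"},
    ).json()
    return {row["node_name"]: int(row["rejected"]) for row in rows}

previous = rejected_per_node()
while True:
    time.sleep(30)
    current = rejected_per_node()
    for node, rejected in current.items():
        delta = rejected - previous.get(node, rejected)
        if delta > 0:
            print("%s rejected %d bulk tasks in the last 30s" % (node, delta))
    previous = current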
Optimization Plan
- The problem is identified, so how do we fix it? The official advice is to increase the batch size and adjust the batch delay. With a larger batch.size, ES receives fewer (but larger) bulk requests for the same volume of events, and indexing keeps up much more comfortably.
vim /etc/logstash/logstash.yml
# Tuned pipeline settings
pipeline.workers: 24
pipeline.output.workers: 24
pipeline.batch.size: 10000
pipeline.batch.delay: 10
As a rule of thumb, set workers/output.workers equal to the number of CPU cores, and grow batch.size/batch.delay step by step against your real data volume to find the optimal values.
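One crude way to hunt for that optimum is to time bulk requests of increasing size against a scratch index and watch where the per-document cost stops improving. This is a hypothetical benchmark, not taken from this deployment; the address and index name are assumptions.

import json
import time
import requests

ES = "http://localhost:9200"   # assumed cluster address
INDEX = "bulk-benchmark"       # scratch index used only for this test

def bulk_payload(n_docs):
    # Build an NDJSON body containing n_docs tiny documents
    lines = []
    for i in range(n_docs):
        lines.append(json.dumps({"index": {"_index": INDEX, "_type": "log"}}))
        lines.append(json.dumps({"message": "benchmark doc %d" % i}))
    return "\n".join(lines) + "\n"

for batch_size in (1000, 3000, 5000, 10000):
    body = bulk_payload(batch_size)
    started = time.time()
    requests.post(ES + "/_bulk", data=body,
                  headers={"Content-Type": "application/x-ndjson"})
    elapsed = time.time() - started
    print("batch %5d: %.2fs, %.0f docs/s" % (batch_size, elapsed, batch_size / elapsed))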
With all that done, the world is quiet again.