After multiple ES nodes are started, ES begins electing a master node and syncing shard data to the new nodes. While this is happening, the ES node log reports GC overhead and the Logstash log throws the following errors:
[2018-01-18T08:05:44,583][INFO ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][765841] overhead, spent [262ms] collecting
[2018-01-18T08:07:17,853][INFO ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][766113] overhead, spent [444ms] collecting
[2018-01-18T08:10:54,285][INFO ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][766568] overhead, spent [270ms] collecting
[2018-01-18T08:18:16,306][INFO ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][766590] overhead, spent [375ms] collecting
[2018-01-18T08:18:55,840][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503
({"type"=>"unavailable_shards_exception", "reason"=>"[twitter_news][0] primary shard is not active Timeout: [1m],
request: [BulkShardRequest to [twitter_news] containing [1] requests]"})>
[2018-01-18T08:18:55,840][ERROR][logstash.outputs.elasticsearch] Retrying individual actions>
[2018-01-18T08:18:55,841][ERROR][logstash.outputs.elasticsearch] Action
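To understand why a primary shard is not active, the cluster allocation explain API reports the allocation status of the first unassigned shard it finds. A minimal diagnostic sketch, assuming the same node address used below:
curl 'http://10.0.7.220:9200/_cluster/allocation/explain?pretty'
The response explains whether and why the shard can or cannot be allocated.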
Now check the ES cluster state: curl http://10.0.7.220:9200/_cluster/health?pretty
{
"cluster_name" : "Bond_ELK",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 641,
"active_shards" : 1282,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 45.0
}
Note that the cluster status is "status" : "red", i.e. the cluster is unhealthy.
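To see exactly which shards are behind the red status, the _cat/shards API lists per-shard allocation state; a sketch, again assuming node 10.0.7.220:
curl 'http://10.0.7.220:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep -v STARTED
Any shard not in the STARTED state (e.g. UNASSIGNED or INITIALIZING) is a candidate for further investigation.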
Also check the status of each index in the cluster: curl http://10.0.7.220:9200/_cat/indices
(10.0.7.220 is the IP address of one of the ES nodes, and 9200 is the ES HTTP service port)
green open logstash-2017.09.10 Lo4z4egNRMGu7qrKYWM35w 5 1 77152 0 218.8mb 109.4mb
green open logstash-2017.10.04 rABy9W2MQmaUmuGiT8QYnQ 5 1 63638 0 89mb 44.5mb
green open logstash-2017.12.20 As0qvTxcTHSW5enaZ9i9Gg 5 1 190670 0 214.8mb 107.4mb
green open logstash-2017.09.09 V0OO7JPJQPmdPeLDJS15Uw 5 1 123109 0 331.3mb 165.6mb
green open logstash-2017.11.18 jLh97NBWSyWYZ8E0UEIZpw 5 1 1646106 0 2.7gb 1.3gb
green open logstash-2017.11.11 BjA78HyzRuycpUQ701giqg 5 1 1401268 0 2.1gb 1gb
green open logstash-2017.12.24 U47kSs37Tw6Umt_ElE3mvg 5 1 463518 0 618.4mb 309.2mb
green open logstash-2017.11.09 R5nYBGDzSlKK2MxE855i8g 5 1 537955 0 872.2mb 436.1mb
green open logstash-2017.10.22 mSh5vwMMSBOA1XqxuEAsqw 5 1 328375 0 509.8mb 254.9mb
green open logstash-2017.09.13 CdOl9OasRtS1kZNbekrBdA 5 1 115972 0 163.4mb 81.7mb
green open logstash-2018.01.15 tGT6NEJpQTWqK9e86BiuRQ 5 1 148796 0 206.8mb 103.5mb
green open logstash-2017.11.01 8F4VFNhJRtSt0eQtQOmKmw 5 1 323805 0 452.8mb 226.4mb
green open logstash-2017.11.02 8c59nl75RPiXCnw2vmkDFQ 5 1 417596 0 685.5mb 342.7mb
green open logstash-2017.09.22 pGS8fBFLS0CervHlE9_lkA 5 1 372848 0 572.2mb 286.1mb
green open logstash-2017.12.02 VleMOwNUTGmjHFmAQXTSBA 5 1 628957 0 1.2gb 638.3mb
green open logstash-2017.10.31 9ke66J_2RpOMa-181TVYwg 5 1 152957 0 221.4mb 110.7mb
green open logstash-2017.10.19 U4vbt88oRMyWhSKcZM8K4Q 5 1 191099 0 280.4mb 140.2mb
red open logstash-2017.10.08 fz9MKG0qQ2OrQTbFaixLMg 5 1 203432 0 0mb 0mb
green open logstash-2017.11.10 7H0fs5DwTE-m8BMxKEYCtQ 5 1 767469 0 1.2gb 626.2mb
green open logstash-2017.11.22 006C00fJR7ynfy54gIL1Mw 5 1 345869 0 573.1mb 286.5mb
red open logstash-2017.10.13 2wgSB3yKSyi8rf58M5_ODA 5 1 340665 0 0mb 0mb
green open logstash-2017.12.29 C7gjv7ImQXWIlLfib3sr9A 5 1 503307 0 584.1mb 292mb
green open logstash-2017.09.26 UYxrsJiNT4uuI1ED5X_JvQ 5 1 121005 0 178.7mb 89.3mb
green open logstash-2017.11.17 DkfdybaxTNap75Z_ebhYGA 5 1 802889 0 1.2gb 651.9mb
green open logstash-2018.01.04 7plCP47OQYeX0PGogzgVwg 5 1 307646 0 357.2mb 178.6mb
The indices logstash-2017.10.08 and logstash-2017.10.13 are found to be in a red (abnormal) state.
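Instead of scanning the full listing, _cat/indices can also filter by health directly; a sketch with the same node address:
curl 'http://10.0.7.220:9200/_cat/indices?v&health=red'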
Solution:
Delete the abnormal (red) indices so that the cluster can first return to normal operation; note, however, that the out-of-sync data in the deleted indices will be lost.
Likewise, execute the following in Kibana's Dev Tools:
DELETE /logstash-2017.10.08,logstash-2017.10.13
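If Kibana is not available, the same deletion can be issued with curl against any ES node; a sketch using the node address from above:
curl -XDELETE 'http://10.0.7.220:9200/logstash-2017.10.08,logstash-2017.10.13'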
Afterwards, the ES cluster status recovers from "red" to "green" or "yellow" (depending on the actual situation) and normal shard replication resumes; "active_shards_percent_as_number" climbs back as shards are reallocated, and once its value reaches 100 the shards are fully in sync:
{
"cluster_name" : "Bond_ELK",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 641,
"active_shards" : 1282,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 56.0
}
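Reallocation can take a while on large indices; one way to watch progress is to poll the health endpoint until the active-shard percentage reaches 100. A minimal sketch (the 10-second interval is arbitrary; same node address as above):
# Print the active-shard percentage every 10 seconds; stop with Ctrl-C once it reaches 100
while true; do
  curl -s 'http://10.0.7.220:9200/_cluster/health?pretty' | grep active_shards_percent_as_number
  sleep 10
done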