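Elasticsearch stuck in yellow: recovering unassigned replica shards (unassigned_shards) after a node restart
Problem
A three-node Elasticsearch 5.1 cluster lost two of its nodes for some reason. After they were restarted promptly, the cluster health went from red to yellow and then stayed yellow.
Analysis
Call the cluster health API: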
GET /_cluster/health
----------------------
{
"cluster_name": "es",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 3,
"number_of_data_nodes": 3,
"active_primary_shards": 37,
"active_shards": 71,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 3,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 95.94594594594594
}
Three shards are unassigned (unassigned_shards is 3).
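As a quick sanity check on the health response, the percentage can be recomputed from the shard counters. This is a minimal sketch, not how Elasticsearch itself computes the value: it assumes the total is active plus initializing plus unassigned shards, which happens to match this snapshot. The function name is hypothetical.

```python
def active_shards_percent(health: dict) -> float:
    """Recompute the active-shard percentage from a /_cluster/health response.

    Assumption: total shards = active + initializing + unassigned.
    """
    active = health["active_shards"]
    total = active + health["initializing_shards"] + health["unassigned_shards"]
    # Guard against division by zero on a cluster without shards.
    return 100.0 * active / total if total else 100.0


# Counters copied from the health response above.
health = {
    "active_shards": 71,
    "initializing_shards": 0,
    "unassigned_shards": 3,
}
print(active_shards_percent(health))
```

With 71 active and 3 unassigned shards, the result agrees with the reported active_shards_percent_as_number of about 95.95.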
Call the shard listing API:
GET /_cat/shards
my_index 4 p STARTED 7795827 10.7gb 192.168.0.205 node-205-1
my_index 4 r UNASSIGNED
my_index 3 p STARTED 7801305 12.4gb 192.168.0.205 node-205-1
my_index 3 r UNASSIGNED
my_index 2 r STARTED 7797142 10.6gb 192.168.0.149 node-149-1
my_index 2 p STARTED 7797211 10.6gb 192.168.0.173 node-173-1
my_index 1 p STARTED 7801554 11.4gb 192.168.0.205 node-205-1
my_index 1 r UNASSIGNED
my_index 0 r STARTED 7795061 10.8gb 192.168.0.149 node-149-1
my_index 0 p STARTED 7795107 10.8gb 192.168.0.173 node-173-1
All primary shards are fine, but the replica shards of shards 1, 3 and 4 are unassigned.
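On a cluster with many indices, scanning this listing by eye gets tedious. The sketch below filters the text output for unassigned replicas. It assumes the default _cat/shards column order (index, shard, prirep, state, ...), and the function name is made up for illustration.

```python
def unassigned_replicas(cat_shards_text: str) -> list:
    """Return (index, shard) pairs for replica rows in state UNASSIGNED.

    Assumes the default _cat/shards column order:
    index, shard, prirep, state, ...
    """
    result = []
    for line in cat_shards_text.splitlines():
        cols = line.split()
        if len(cols) < 4:
            continue  # skip blank or malformed rows
        index, shard, prirep, state = cols[:4]
        if prirep == "r" and state == "UNASSIGNED":
            result.append((index, int(shard)))
    return result


# A few rows copied from the listing above.
sample = """\
my_index 4 p STARTED 7795827 10.7gb 192.168.0.205 node-205-1
my_index 4 r UNASSIGNED
my_index 3 r UNASSIGNED
my_index 2 r STARTED 7797142 10.6gb 192.168.0.149 node-149-1
"""
print(unassigned_replicas(sample))
```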
Elasticsearch normally allocates shards automatically, so first check whether allocation is enabled:
GET /_cluster/settings
-----------------------
{
"persistent": {},
"transient": {
"cluster": {
"routing": {
"allocation": {
"enable": "all"
}
}
}
}
}
Automatic allocation is enabled, so it is odd that these three replica shards were never assigned.
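Reading the setting out of the nested response can be scripted. The sketch below assumes the nested (non-flat) response shape shown above and relies on two facts: transient settings take precedence over persistent ones, and the default for cluster.routing.allocation.enable is "all". The helper name is hypothetical.

```python
def allocation_enable(settings: dict) -> str:
    """Return the effective cluster.routing.allocation.enable value.

    Transient settings take precedence over persistent ones; when neither
    sets the value, the Elasticsearch default "all" is returned.
    """
    for scope in ("transient", "persistent"):
        node = settings.get(scope, {})
        # Walk down the nested keys; stop as soon as one is missing.
        for key in ("cluster", "routing", "allocation", "enable"):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None:
            return node
    return "all"


# The response shown above.
response = {
    "persistent": {},
    "transient": {"cluster": {"routing": {"allocation": {"enable": "all"}}}},
}
print(allocation_enable(response))
```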
So I logged in to the machine and checked the Elasticsearch log, which showed the following failure:
[2017-11-21T15:43:54,799][WARN ][o.e.i.c.IndicesClusterStateService] [node-149-1] [[my_index][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [my_index][4]: Recovery failed from {node-205-1}{WPC32CtxTtOiTPuCseqF8g}{AyjHnVtwSnik2Rcu_SQg8A}{192.168.0.205}{192.168.0.205:9300} into {node-149-1}{fa9ZVqyXSHKhYHvAhr8x6w}{ECrBVQS_QPOXtc9E0is9Tw}{192.168.0.149}{192.168.0.149:9300}
......
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
......
Caused by: java.lang.IllegalStateException: try to recover [my_index][4] from primary shard with sync id but number of docs differ: 7728586 (node-205-1, primary) vs 7728583(node-149-1)
......
So Elasticsearch did try to recover the replicas; the recovery itself failed.
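The innermost exception carries the useful detail: the primary and the replica copy share a sync id, yet their document counts differ, so the recovery is refused. A small sketch for pulling the two counts out of such a message follows. The regular expression is written against the exact message format in the log above and may not match other versions; the names are hypothetical.

```python
import re

# Pattern for the "number of docs differ" message seen in the log above.
DOC_MISMATCH = re.compile(
    r"number of docs differ: (\d+) \((.+?), primary\) vs (\d+)\s*\((.+?)\)"
)


def doc_count_gap(message: str):
    """Return (primary_docs, replica_docs, difference), or None if no match."""
    m = DOC_MISMATCH.search(message)
    if m is None:
        return None
    primary_docs, replica_docs = int(m.group(1)), int(m.group(3))
    return primary_docs, replica_docs, primary_docs - replica_docs


msg = (
    "try to recover [my_index][4] from primary shard with sync id but "
    "number of docs differ: 7728586 (node-205-1, primary) vs 7728583(node-149-1)"
)
print(doc_count_gap(msg))
```

For shard 4 the replica copy on node-149-1 is three documents behind the primary on node-205-1.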
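I had no idea what had happened, so I searched Google.
I tried several approaches, for example:
- A Stack Overflow answer suggested recovering with /_cluster/reroute. My situation seemed different from the asker's, though: all of my primary shards were healthy, so the /_cluster/reroute call returned an error: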
POST /_cluster/reroute
{
"commands" : [ {
"allocate_empty_primary" : {
"index" : "my_index",
"shard" : 1,
"node" : "node-149-1",
"accept_data_loss":true
}
}]
}
-----------------------------
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[node-205-1][192.168.0.205:9300][cluster:admin/reroute]"
}
],
"type": "illegal_argument_exception",
"reason": "[allocate_empty_primary] primary [my_index][1] is already assigned"
},
"status": 400
}
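The error says the primary of shard 1 is already assigned. The allocate_empty_primary command only applies to primaries, so this approach does not work for unassigned replicas.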
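For reference, the reroute API also has an allocate_replica command that targets unassigned replicas rather than primaries. The sketch below only builds the request body for it. It was not tried in this incident, and the allocation would presumably go through the same recovery and could fail with the same doc-count mismatch, so treat it as an illustration of the command shape rather than a fix. The function name is hypothetical.

```python
import json


def allocate_replica_body(index: str, shard: int, node: str) -> dict:
    """Build a POST /_cluster/reroute body that allocates one replica shard."""
    return {
        "commands": [
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        ]
    }


# Same index, shard and node as in the failed request above.
print(json.dumps(allocate_replica_body("my_index", 1, "node-149-1"), indent=2))
```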
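I found an article, 解决elasticsearch集群Unassigned Shards 无法reroute的问题 (roughly, "Fixing unassigned shards that cannot be rerouted in an Elasticsearch cluster"). The situation there is similar to mine, and the author solved it by reindexing. I could not reindex, though, because this is a hot index in active use.
Then I came across another article, It’s RED!!! How do I recover unassigned elasticsearch cluster shards?, which gave me an idea.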
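Solution
All primary shards are healthy and only replica shards are broken, so forcibly dropping the replicas and letting Elasticsearch rebuild them from the primaries should fix it.
First, set the number of replicas of the affected index to 0: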
PUT /my_index/_settings
{
"index" : {
"number_of_replicas" : 0
}
}
--------------------------
{
"acknowledged": true
}
Now check the shard status again:
GET /_cat/shards
my_index 4 p STARTED 7795827 10.7gb 192.168.0.205 node-205-1
my_index 3 p STARTED 7801305 12.4gb 192.168.0.205 node-205-1
my_index 2 p STARTED 7797211 10.6gb 192.168.0.173 node-173-1
my_index 1 p STARTED 7801554 11.4gb 192.168.0.205 node-205-1
my_index 0 p STARTED 7795107 10.8gb 192.168.0.173 node-173-1
The replica shards are gone.
Next, restore the replica count:
PUT /my_index/_settings
{
"index" : {
"number_of_replicas" : 1
}
}
--------------------------
{
"acknowledged": true
}
Once the replica shards had been allocated automatically, the cluster went back to green.
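The whole procedure can be scripted. The following is a minimal sketch using only the Python standard library. It assumes an unsecured cluster reachable at http://localhost:9200, and both function names are made up for illustration. The wait uses the wait_for_status and timeout parameters of the cluster health API.

```python
import json
import urllib.request


def http_send(method, path, body=None, base_url="http://localhost:9200"):
    """Send one JSON request to Elasticsearch and return the parsed response.

    Assumption: no authentication or TLS; adjust base_url for your cluster.
    """
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def rebuild_replicas(index, replicas=1, send=http_send, timeout="60s"):
    """Drop all replicas of `index`, re-create them, and wait for green.

    Returns the status reported by the final health call.
    """
    settings_path = "/" + index + "/_settings"
    # Step 1: remove every replica copy, including the broken ones.
    send("PUT", settings_path, {"index": {"number_of_replicas": 0}})
    # Step 2: ask for replicas again; they are rebuilt from the primaries.
    send("PUT", settings_path, {"index": {"number_of_replicas": replicas}})
    # Step 3: block until the index is green or the timeout expires.
    # Depending on the version, a timed-out wait may come back as an HTTP
    # error status, which urlopen raises as HTTPError.
    health = send(
        "GET",
        "/_cluster/health/" + index + "?wait_for_status=green&timeout=" + timeout,
    )
    return health.get("status")


# Example call against a live cluster (not executed here):
# rebuild_replicas("my_index")
```

While the replica count is 0 the index has no redundancy, so a node failure in that window would lose data. The `send` parameter exists so the HTTP layer can be swapped out, for instance to add authentication or to dry-run the sequence.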