ES cluster master election takes too long
Problem: in an environment with an unstable network, an ES node occasionally drops out of the cluster for a short time and only rejoins about one minute later. During that minute, every indexing request sent to that node fails because there is no master node. In other words, master election takes more than a minute, which is far too long to be acceptable.
Solution: checking the configuration turned up discovery.zen.ping_timeout: 60s. This parameter governs the master-election time: after a node joins the cluster it has to wait at least 60 seconds before the election can start, so the value needs to be lowered.
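A minimal sketch of the change (3s is the Elasticsearch default; the right value depends on how unstable the network actually is); set it in elasticsearch.yml on each master-eligible node and restart:
discovery.zen.ping_timeout: 3s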
Reference: https://elasticsearch.cn/question/4199
Querying ES7 with ElasticSearch-head reports an error
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}
Solution:
1. Go into the head installation directory;
2. Edit vendor.js; there are two places to change:
① Line 6886: contentType: "application/x-www-form-urlencoded"
change it to
contentType: "application/json;charset=UTF-8"
② Line 7573: var inspectData = s.contentType === "application/x-www-form-urlencoded" &&
change it to
var inspectData = s.contentType === "application/json;charset=UTF-8" &&
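If editing by hand is inconvenient, the same two changes can be made with sed (a sketch, assuming head's files live under elasticsearch-head/_site and that the string only occurs in those two places; check with grep and back up the file first):
cd elasticsearch-head/_site        # adjust to your install path
grep -n 'application/x-www-form-urlencoded' vendor.js   # expect the two lines mentioned above
cp vendor.js vendor.js.bak
sed -i 's#application/x-www-form-urlencoded#application/json;charset=UTF-8#g' vendor.js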
ES fails to start because of a corrupted state (.st) file
ElasticsearchException[failed to read [id:21, legacy:false, file:/mnt/disk1/data/elasticsearch2/elasticsearch/nodes/0/indices/people3/_state/state-21.st]]; nested: IOException[failed to read [id:21, legacy:false, file:/mnt/disk1/data/elasticsearch2/elasticsearch/nodes/0/indices/people3/_state/state-21.st]]; nested: IllegalStateException[class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)];
at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:163)
at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:309)
at org.elasticsearch.gateway.MetaStateService.loadIndexState(MetaStateService.java:112)
at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:97)
at org.elasticsearch.gateway.GatewayMetaState.loadMetaState(GatewayMetaState.java:99)
at org.elasticsearch.gateway.GatewayMetaState.pre20Upgrade(GatewayMetaState.java:225)
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:50)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:886)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:201)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:879)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:46)
at org.elasticsearch.node.Node.<init>(Node.java:213)
at org.elasticsearch.node.Node.<init>(Node.java:140)
at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:178)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:270)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:35)
Caused by: java.io.IOException: failed to read [id:21, legacy:false, file:/mnt/disk1/data/elasticsearch2/elasticsearch/nodes/0/indices/people3/_state/state-21.st]
at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:304)
... 32 more
Solution:
Use the following command to find the corrupted (zero-length) state files and delete them:
find /mnt/disk*/data/elasticsearch2/elasticsearch/nodes/0/indices/ | grep state | grep ".st" | xargs ls -l | awk '{if($5==0)print $0}' | awk '{print $9}'| xargs rm -rf
After deleting them, restart ES.
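An equivalent, slightly simpler form that lets find do the size test itself (a sketch against the same data path; print first, review the list, then delete):
find /mnt/disk*/data/elasticsearch2/elasticsearch/nodes/0/indices/ -name "*.st" -size 0 -print
find /mnt/disk*/data/elasticsearch2/elasticsearch/nodes/0/indices/ -name "*.st" -size 0 -delete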
Shard recovery failure caused by the translog
Recovery failed from {hdh146}{Pu9ZuWyvQ0yBJxFLs2-_hg}{18.126.51.146}{18.126.51.146:9300} into {hdh150}{jJm265igTrGgN2y8CdyFPA}{18.126.51.150}{18.126.51.150:9300}]; nested: RemoteTransportException[[hdh146][18.126.51.146:9300][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[can't increment translog [283] channel ref count];
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:258)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$1100(RecoveryTarget.java:69)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:508)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This kind of exception is resolved simply by restarting the node that reports the error.
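After the restart, shard recovery can be watched with the cat recovery API until the affected shards are back (assuming the cluster is reachable on localhost:9200):
curl 'http://localhost:9200/_cat/recovery?v'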
ES process killed, nothing unusual in the ES logs
grep -i kill /var/log/messages
The system log shows that the machine ran out of memory and the OOM killer terminated the ES process. ES itself is not at fault here: the system simply did not have enough memory, and the OOM killer's policy is to kill the process occupying the most memory.
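The kernel ring buffer usually carries the same evidence and is quicker to check (assuming the kill happened since the last reboot):
dmesg | grep -i -E 'out of memory|oom-killer|killed process'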
Setting the disk usage watermarks
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "80%",
"cluster.routing.allocation.disk.watermark.high": "90%",
}
}
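The same settings can be applied with curl where Kibana is not available (assuming the cluster listens on localhost:9200; newer ES versions also require the Content-Type header):
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "80%",
"cluster.routing.allocation.disk.watermark.high": "90%"
}
}'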
Some shards stay UNASSIGNED after a cluster restart
First try _cluster/reroute?retry_failed=true to re-attempt allocation; if that does not help, use the steps below.
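A minimal form of that retry command (assuming the cluster is reachable on localhost:9200):
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'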
Manually allocate an unassigned shard to a chosen node ("allow_primary": true allows the shard to be allocated as a primary):
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
"commands":[{
"allocate":{
"index":"filebeat-ali-hk-fd-tss1",
"shard":1,
"node":"ali-hk-ops-elk1",
allow_primary" : true (允许该分片做主分片)
}
}]
}'
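The shards that still need this can be listed first, so the index name, shard number, and p/r flag can be plugged into the command above (assuming localhost:9200):
curl -XGET 'http://localhost:9200/_cat/shards' | grep UNASSIGNED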
With several hundred unassigned shards it is impractical to run that command by hand one shard at a time, so I wrote a small batch script, shown below for reference only:
#!/bin/bash
# batch assign shards
address="http://10.3.69.137:9200"
nodenum=24
# dump the list of unassigned shards (index, shard, prirep, ...) to a working file
curl -XGET "$address/_cat/shards" | grep UNASSIGNED > sourceFile
count=1
offset=0
node=0
cat sourceFile | while read line
do
echo "Line $count: $line"
count=$[ $count + 1 ]
# the first three columns of a _cat/shards line are: index, shard number, p/r flag
for token in $line
do
offset=$[ $offset + 1 ]
case "$offset" in
1) index=$token;;
2) shard=$token;;
3) primary=$token;;
esac
done
offset=0
# pick a target node round-robin; node names are assumed to be node001, node002, ...
num1=$(( $node % $nodenum + 1 ))
echo $num1
num2=` printf "%03d" $num1`
nodename=node$num2
# only primary shards ("p") are force-allocated here
if [ "$primary" = "p" ]
then
node=$[ $node + 1 ]
curl -XPOST ''$address'/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"'$index'","shard":'$shard',"node":"'$nodename'","allow_primary" : true }}]}'
fi
done
Parameters you need to adjust before running the script:
- address: your own ip:port;
- nodenum: the number of nodes in the cluster;
- the target node names are assumed to follow the pattern node001, node002, ... (nodename=node$num2); change this if your nodes are named differently.
Reference: https://www.cnblogs.com/sunny3096/articles/7155044.html
The same ES request returns different results on different runs
If the results are sorted by a timestamp field and two documents share the same timestamp value, the request is served at random by either the primary shard or a replica, and the primary and the replica are not guaranteed to return those two documents in the same order, so repeating the same request can yield different results. This is the "bouncing results" problem; see https://www.elastic.co/guide/en/elasticsearch/guide/master/_search_options.html
To fix it, add a preference parameter to the query URL; see
https://www.elastic.co/guide/en/elasticsearch/reference/master/search-request-preference.html
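For example, using a stable string such as a user or session id as the preference value keeps repeated searches on the same shard copies (the index name and id below are illustrative):
curl 'http://localhost:9200/my_index/_search?preference=user_12345' -d '
{
"query": { "match_all": {} },
"sort": [ { "timestamp": "desc" } ]
}'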
Restarting a large cluster
Temporary restart of cluster nodes
A cluster restart may be needed when a configuration change only takes effect after a restart, or when the cluster has hit a serious error it cannot recover from.
Before restarting a node, temporarily disable automatic shard allocation by setting cluster.routing.allocation.enable to none. Otherwise, once the node stops, its shards are reallocated to other nodes; after the node comes back it has to wait for those shards to finish RECOVERING elsewhere before they start RELOCATING back, i.e. the shards are rebuilt on other nodes and then moved back again, which wastes a lot of time.
First, disable automatic allocation:
curl -XPUT http://127.0.0.1:9200/_cluster/settings -d '{
"transient" : {
"cluster.routing.allocation.enable" : "none"
}
}'
Then restart the cluster.
Once the cluster is back up, restore the setting:
curl -XPUT http://127.0.0.1:9200/_cluster/settings -d '{
"transient" : {
"cluster.routing.allocation.enable" : "all"
}
}'
Switching the netty version used by ES
# switch to netty3
./bin/elasticsearch -Ehttp.type=netty3