Fixing NoNodeAvailableException in Cassandra Clients Caused by a Protocol Negotiation Error

Latest recommended article published on 2024-08-15 16:57:02

吕淮子

Tags: cassandra

Original link: https://zhuanlan.zhihu.com/p/345623263

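We recently noticed that our Cassandra client application starts throwing NoNodeAvailableException after running for a while. It does not recover on its own; only restarting the client fixes it.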

In some cases the log also contains this warning: This is likely a gossip or snitch issue, this node will be ignored

com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
	at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
	at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
	at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
	at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
	at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
	at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:53)

A Google search turned up no solution for a similar problem, so I had to work it out myself.

The log entries just before the error looked like this:

2021-01-15 14:46:11.163  WARN 1 --- [     s0-admin-0] c.d.o.d.internal.core.pool.ChannelPool   : [s0|cassandra-contactpoints.default.svc.cluster.local/172.16.21.133:9042]  Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=DataStax Java driver for Apache Cassandra(R), DRIVER_VERSION=4.6.1, CLIENT_ID=ab768486-6b17-4280-af3d-7ea42acd6913}): failed to send request (io.netty.channel.StacklessClosedChannelException))
2021-01-15 14:46:13.306  WARN 1 --- [     s0-admin-0] c.d.o.d.internal.core.pool.ChannelPool   : [s0|cassandra-contactpoints.default.svc.cluster.local/172.16.21.133:9042]  Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=DataStax Java driver for Apache Cassandra(R), DRIVER_VERSION=4.6.1, CLIENT_ID=ab768486-6b17-4280-af3d-7ea42acd6913}): failed to send request (io.netty.channel.StacklessClosedChannelException))
2021-01-15 14:46:16.708  WARN 1 --- [     s0-admin-0] c.d.o.d.internal.core.pool.ChannelPool   : [s0|cassandra-contactpoints.default.svc.cluster.local/172.16.21.133:9042]  Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=DataStax Java driver for Apache Cassandra(R), DRIVER_VERSION=4.6.1, CLIENT_ID=ab768486-6b17-4280-af3d-7ea42acd6913}): failed to send request (io.netty.channel.StacklessClosedChannelException))
2021-01-15 14:46:28.596  WARN 1 --- [     s0-admin-0] c.d.o.d.internal.core.pool.ChannelPool   : [s0|cassandra-contactpoints.default.svc.cluster.local/172.16.21.133:9042]  Error while opening new channel (ConnectionInitException: [s0|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=DataStax Java driver for Apache Cassandra(R), DRIVER_VERSION=4.6.1, CLIENT_ID=ab768486-6b17-4280-af3d-7ea42acd6913}): failed to send request (java.nio.channels.NotYetConnectedException))

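The node 172.16.21.133 was unreachable. Checking the state of the Cassandra cluster showed that one node had restarted. Our Cassandra runs on a Kubernetes cluster, where a restarted node is assigned a new IP, which explains why the old address no longer accepts connections. But the cluster has three nodes, so why did the client not connect to the other two? Reading further back in the client log, I found these entries from client startup: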

2021-01-14 23:28:39.184  INFO 1 --- [           main] c.d.o.d.internal.core.ContactPoints      : Contact point cassandra-contactpoints.default.svc.cluster.local:9042 resolves to multiple addresses, will use them all ([cassandra-contactpoints.default.svc.cluster.local/172.16.11.40, cassandra-contactpoints.default.svc.cluster.local/172.16.20.55, cassandra-contactpoints.default.svc.cluster.local/172.16.21.133])
2021-01-14 23:28:39.497  INFO 1 --- [           main] c.d.o.d.i.core.DefaultMavenCoordinates   : DataStax Java driver for Apache Cassandra(R) (com.datastax.oss:java-driver-core) version 4.6.1
2021-01-14 23:28:41.420  INFO 1 --- [     s0-admin-0] c.d.oss.driver.internal.core.time.Clock  : Using native clock for microsecond precision
2021-01-14 23:28:41.909  INFO 1 --- [        s0-io-0] c.d.o.d.i.core.channel.ChannelFactory    : [s0] Failed to connect with protocol DSE_V2, retrying with DSE_V1
2021-01-14 23:28:41.919  INFO 1 --- [        s0-io-1] c.d.o.d.i.core.channel.ChannelFactory    : [s0] Failed to connect with protocol DSE_V1, retrying with V4
2021-01-14 23:28:41.930  INFO 1 --- [        s0-io-0] c.d.o.d.i.core.channel.ChannelFactory    : [s0] Failed to connect with protocol V4, retrying with V3
2021-01-14 23:28:41.959  INFO 1 --- [        s0-io-1] c.d.oss.driver.api.core.uuid.Uuids       : PID obtained through native call to getpid(): 1
2021-01-14 23:28:42.066  WARN 1 --- [        s0-io-1] c.d.o.d.i.c.m.DefaultTopologyMonitor     : [s0] Found invalid row in system.peers for peer: /172.16.11.82. This is likely a gossip or snitch issue, this node will be ignored.
2021-01-14 23:28:42.195  INFO 1 --- [     s0-admin-0] c.d.o.d.i.core.session.DefaultSession    : [s0] Negotiated protocol version V3 for the initial contact point, but other nodes only support V4, downgrading
2021-01-14 23:28:42.230  WARN 1 --- [        s0-io-1] c.d.o.d.i.c.m.SchemaAgreementChecker     : [s0] Missing schema_version in system.peers row for cd12a2ee-e22f-4f40-9e3c-7cad37c98131, excluding from schema agreement check
2021-01-14 23:28:42.598  WARN 1 --- [     s0-admin-1] c.d.o.d.internal.core.pool.ChannelPool   : [s0|cassandra-contactpoints.default.svc.cluster.local/172.16.11.40:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [cassandra-contactpoints.default.svc.cluster.local/172.16.11.40:9042] Host does not support protocol version V4)
2021-01-14 23:28:42.616  WARN 1 --- [     s0-admin-0] c.d.o.d.internal.core.pool.ChannelPool   : [s0|cassandra-contactpoints.default.svc.cluster.local/172.16.20.55:9042] Fatal error while initializing pool, forcing the node down (UnsupportedProtocolVersionException: [cassandra-contactpoints.default.svc.cluster.local/172.16.20.55:9042] Host does not support protocol version V4)

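At startup the client obtained the IPs of all three nodes, but during protocol negotiation it dropped two of them and kept only 172.16.21.133. Once 172.16.21.133 restarted, the client lost all contact with the cluster.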
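The effect can be illustrated with a small toy model. This is only a sketch, not the driver's real data structures: the class and method names are hypothetical, and the highest version each node accepted (3, 3 and 4) is an assumption inferred from the startup log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the client's node pool; not the driver's real data structures. */
final class NodePoolSketch {

  /** Returns the nodes that accept the session's protocol version, in insertion order. */
  static List<String> usableNodes(Map<String, Integer> maxVersionByNode, int sessionVersion) {
    List<String> usable = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : maxVersionByNode.entrySet()) {
      // A node that fails this check corresponds to "forcing the node down" in the log.
      if (entry.getValue() >= sessionVersion) {
        usable.add(entry.getKey());
      }
    }
    return usable;
  }

  /** The three contact points from the log, with the assumed highest version each accepted. */
  static Map<String, Integer> clusterFromLog() {
    Map<String, Integer> nodes = new LinkedHashMap<>();
    nodes.put("172.16.11.40", 3);
    nodes.put("172.16.20.55", 3);
    nodes.put("172.16.21.133", 4);
    return nodes;
  }

  public static void main(String[] args) {
    // The session ended up on V4, so only one node stays usable.
    List<String> usable = usableNodes(clusterFromLog(), 4);
    System.out.println(usable); // [172.16.21.133]

    // That node restarts and comes back with a new IP: nothing is left to query.
    usable.remove("172.16.21.133");
    System.out.println(usable.isEmpty()); // true
  }
}
```

With a session version of 3 the same model keeps all three nodes, which is why the choice of protocol version matters here.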

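All nodes in our Cassandra cluster run the same version, so why did both v3 and v4 show up during protocol negotiation?

Searching for how the client negotiates the protocol led me to the article DataStax Java Driver - Native protocol. The negotiation works roughly as follows: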

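Connect to the first node and negotiate a version with it; call this version 1.
Query the system.peers table to obtain the highest version supported by all the other nodes; call this version 2.
If version 2 is lower than version 1, close the connection and reconnect to the cluster using version 2; otherwise, connect to the remaining nodes using version 1.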

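Multiple versions normally coexist only during a rolling upgrade of the Cassandra cluster. The negotiation logic above ensures that the client talks to the cluster with the highest protocol version that every node supports, so it can still connect to all nodes during the upgrade and the upgrade goes smoothly. Our client log, however, does not match this logic. According to the log, the version negotiated on the initial connection was v3 and the version derived from system.peers was v4. Since v4 is higher than v3, the logic above says v3 should be used, yet the log shows that the protocol finally used was v4. Does the client code differ from the documentation? I looked at the client source and found the relevant code in com.datastax.oss.driver.internal.core.session.DefaultSession: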

private void afterInitialNodeListRefresh(CqlIdentifier keyspace) {
      try {
        boolean protocolWasForced =
            context.getConfig().getDefaultProfile().isDefined(DefaultDriverOption.PROTOCOL_VERSION);
        if (!protocolWasForced) {
          ProtocolVersion currentVersion = context.getProtocolVersion();
          ProtocolVersion bestVersion =
              context
                  .getProtocolVersionRegistry()
                  .highestCommon(metadataManager.getMetadata().getNodes().values());
          if (!currentVersion.equals(bestVersion)) {
            LOG.info(
                "[{}] Negotiated protocol version {} for the initial contact point, "
                    + "but other nodes only support {}, downgrading",
                logPrefix,
                currentVersion,
                bestVersion);
            context.getChannelFactory().setProtocolVersion(bestVersion);

            // Note that, with the default topology monitor, the control connection is already
            // connected with currentVersion at this point. This doesn't really matter because none
            // of the control queries use any protocol-dependent feature.
            // Keep going as-is, the control connection might switch to the "correct" version later
            // if it reconnects to another node.
          }
        }
        metadataManager
            .refreshSchema(null, false, true)
            .whenComplete(
                (metadata, error) -> {
                  if (error != null) {
                    Loggers.warnWithException(
                        LOG,
                        "[{}] Unexpected error while refreshing schema during initialization, "
                            + "keeping previous version",
                        logPrefix,
                        error);
                  }
                  afterInitialSchemaRefresh(keyspace);
                });
      } catch (Throwable throwable) {
        initFuture.completeExceptionally(throwable);
      }
    }

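In this code currentVersion is version 1 and bestVersion is version 2. The code never compares which of the two is higher; it only checks whether they are equal, and if they differ it connects to all nodes with bestVersion. This does not match the documented logic.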
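The difference between the two rules can be made concrete with a minimal sketch. The enum and the two method names are hypothetical and only model the decision itself; asImplemented mirrors the equals() check in the excerpt above, while documented follows the steps from the article.

```java
/** Simplified model of native protocol versions, declared from lowest to highest. */
enum ProtoVersion { V3, V4 }

final class NegotiationSketch {

  /** Documented rule: switch only when the other nodes' best version is lower. */
  static ProtoVersion documented(ProtoVersion currentVersion, ProtoVersion bestVersion) {
    return bestVersion.compareTo(currentVersion) < 0 ? bestVersion : currentVersion;
  }

  /** Rule in the excerpt: any difference replaces the current version with bestVersion. */
  static ProtoVersion asImplemented(ProtoVersion currentVersion, ProtoVersion bestVersion) {
    if (!currentVersion.equals(bestVersion)) {
      return bestVersion;
    }
    return currentVersion;
  }

  public static void main(String[] args) {
    // Values from the startup log: initial contact point on V3, other nodes on V4.
    System.out.println(documented(ProtoVersion.V3, ProtoVersion.V4));    // V3
    System.out.println(asImplemented(ProtoVersion.V3, ProtoVersion.V4)); // V4
  }
}
```

The two rules agree whenever bestVersion is lower than or equal to currentVersion; they diverge only in the case seen in our log, where bestVersion is higher.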

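To summarize, the client actually connects with the highest protocol version supported by all of the other n-1 nodes, where n is the number of nodes in the cluster. During a rolling upgrade, if most nodes have already been upgraded when a client connects, the client will use the higher protocol version and will be unable to connect to the nodes that have not been upgraded yet. This leaves the Cassandra load unbalanced and may affect the stability of Cassandra and of every client, so I would consider it a client bug.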

Back to our own problem: all our Cassandra nodes run 3.x and should therefore all speak protocol v4, so how did v3 appear during negotiation?

The log also contains this entry: "Found invalid row in system.peers for peer: /172.16.11.82. This is likely a gossip or snitch issue, this node will be ignored." Searching for it led to the article Why am I getting "Found invalid row in system.peers"?, which says that when a node's IP changes, system.peers may still hold the old IP, and the stale row has to be deleted manually.

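So I connected to every node with cqlsh, which prints the protocol version of the connection. Two nodes reported v3 and one reported v4. On both v3 nodes, system.peers contained a row for an IP that no longer exists, with every column other than the IP empty; the v4 node had no such row. After I deleted the rows manually and restarted the nodes, cqlsh reported v4 on all of them.

It seems that system.peers affects not only the protocol the client uses but also the protocol a server node offers. I could not find any documentation to support this, so treat it as a working assumption. As for when system.peers ends up with rows for nonexistent IPs, the documentation says a restart may cause it, not that it always does. The restart after my cleanup did not reproduce it. I do not know whether it will recur, or whether the nodes would fall back to v3 if it did, so I will keep watching.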
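To make "a row with only the IP filled in" concrete, here is a hedged sketch of a validity check over one system.peers row represented as a plain map. The column names are those of the system.peers table in Cassandra 3.x, but the set of columns treated as required is an assumption for illustration, not the driver's exact rule, and the class name and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Flags system.peers rows that look stale; the required-column list is an assumption. */
final class PeerRowCheck {

  // Assumption: a usable peer row needs all of these columns to be non-null.
  static final String[] REQUIRED = {"host_id", "rpc_address", "data_center", "rack", "tokens"};

  /** Returns the required columns that are absent or null in the given row. */
  static List<String> missingColumns(Map<String, Object> row) {
    List<String> missing = new ArrayList<>();
    for (String column : REQUIRED) {
      if (row.get(column) == null) {
        missing.add(column);
      }
    }
    return missing;
  }

  /** A row like the one found on the v3 nodes: only the peer IP is set. */
  static Map<String, Object> staleRow() {
    Map<String, Object> row = new HashMap<>();
    row.put("peer", "172.16.11.82");
    return row;
  }

  public static void main(String[] args) {
    List<String> missing = missingColumns(staleRow());
    if (!missing.isEmpty()) {
      System.out.println("Stale row for 172.16.11.82, missing: " + missing);
    }
  }
}
```

In practice the map would be filled from the result of querying system.peers on each node; a row that fails the check is a candidate for manual deletion as described above.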

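After fixing the Cassandra cluster, restart all client applications so that they refresh their cached node list. Then check the client logs to confirm that the protocol is v4 and that no node has been dropped.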