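During an Elasticsearch master election, the nodes exchange their votes with each other through pings.
The implementing class is UnicastZenPing. Its constructor reads discovery.zen.ping.unicast.hosts from the configuration
to record the addresses of the other nodes.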
if (DISCOVERY_ZEN_PING_UNICAST_HOSTS_SETTING.exists(settings)) {
    configuredHosts = DISCOVERY_ZEN_PING_UNICAST_HOSTS_SETTING.get(settings);
    // we only limit to 1 addresses, makes no sense to ping 100 ports
    limitPortCounts = LIMIT_FOREIGN_PORTS_COUNT;
}
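When a node is about to take part in an election and needs information about the other nodes, it obtains that information through the ping() method.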
@Override
public void ping(final Consumer<PingCollection> resultsConsumer, final TimeValue duration) {
    ping(resultsConsumer, duration, duration);
}
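The resultsConsumer argument is in practice CompletableFuture::complete; it handles the result once a ping operation has finished.
Inside ping(), the nodes of the cluster are first resolved through resolveHostsLists().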
final List<DiscoveryNode> seedNodes;
try {
    seedNodes = resolveHostsLists(
        unicastZenPingExecutorService,
        logger,
        configuredHosts,
        limitPortCounts,
        transportService,
        UNICAST_NODE_PREFIX,
        resolveTimeout);
} catch (InterruptedException e) {
    throw new RuntimeException(e);
}
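resolveHostsLists() creates one Callable per configured host and submits them all through invokeAll() with a resolve timeout, so that the hosts are resolved concurrently into TransportAddress instances that transportService can use.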
final List<Callable<TransportAddress[]>> callables =
    hosts
        .stream()
        .map(hn -> (Callable<TransportAddress[]>) () -> transportService.addressesFromString(hn, limitPortCounts))
        .collect(Collectors.toList());
final List<Future<TransportAddress[]>> futures =
    executorService.invokeAll(callables, resolveTimeout.nanos(), TimeUnit.NANOSECONDS);
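Next it collects the local node's publish address and bound addresses, then iterates over all the futures to wait for resolution to finish. Each resolved address that is not one of the local publish or bound addresses is wrapped in a DiscoveryNode and added to the result list.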
localAddresses.add(transportService.boundAddress().publishAddress());
localAddresses.addAll(Arrays.asList(transportService.boundAddress().boundAddresses()));
// ExecutorService#invokeAll guarantees that the futures are returned in the iteration order of the tasks so we can associate the
// hostname with the corresponding task by iterating together
final Iterator<String> it = hosts.iterator();
for (final Future<TransportAddress[]> future : futures) {
    final String hostname = it.next();
    if (!future.isCancelled()) {
        assert future.isDone();
        try {
            final TransportAddress[] addresses = future.get();
            logger.trace("resolved host [{}] to {}", hostname, addresses);
            for (int addressId = 0; addressId < addresses.length; addressId++) {
                final TransportAddress address = addresses[addressId];
                // no point in pinging ourselves
                if (localAddresses.contains(address) == false) {
                    discoveryNodes.add(
                        new DiscoveryNode(
                            nodeId_prefix + hostname + "_" + addressId + "#",
                            address,
                            emptyMap(),
                            emptySet(),
                            Version.CURRENT.minimumCompatibilityVersion()));
                }
            }
        } catch (final ExecutionException e) {
            assert e.getCause() != null;
            final String message = "failed to resolve host [" + hostname + "]";
            logger.warn(message, e.getCause());
        }
    } else {
        logger.warn("timed out after [{}] resolving host [{}]", resolveTimeout, hostname);
    }
}
return discoveryNodes;
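To make the resolve step easier to try out in isolation, here is a minimal sketch that uses only the JDK. It is not the Elasticsearch implementation: addresses are plain strings, the resolver function is a hypothetical stand-in for transportService.addressesFromString(), and the class and method names are made up for illustration. It shows the same three ideas: resolve concurrently with invokeAll() and a timeout, skip hosts that timed out or failed, and filter out the local addresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for resolveHostsLists(); addresses are plain strings here.
final class ResolveHostsSketch {

    static List<String> resolve(List<String> hosts,
                                Function<String, List<String>> resolver, // assumed resolver, not an ES API
                                Set<String> localAddresses,
                                long timeoutMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, hosts.size()));
        try {
            List<Callable<List<String>>> callables = hosts.stream()
                .map(h -> (Callable<List<String>>) () -> resolver.apply(h))
                .collect(Collectors.toList());
            // invokeAll returns the futures in task order and cancels tasks still running at the timeout
            List<Future<List<String>>> futures =
                executor.invokeAll(callables, timeoutMillis, TimeUnit.MILLISECONDS);
            List<String> resolved = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                if (future.isCancelled()) {
                    continue; // this host timed out
                }
                try {
                    for (String address : future.get()) {
                        // no point in pinging ourselves
                        if (!localAddresses.contains(address)) {
                            resolved.add(address);
                        }
                    }
                } catch (ExecutionException e) {
                    // resolution failed for this host; keep the others
                }
            }
            return resolved;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A host that fails or times out only costs its own entry; the remaining hosts are still returned, which matches the warn-and-continue behaviour of the snippet above.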
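Once the configured hosts have been resolved, more nodes can be discovered through other nodes in the cluster. After that, every master-eligible node in the clusterState (that is, every node whose master attribute is true) is also added to seedNodes.
A ConnectionProfile is then built. It fixes the connection type used for pings to REG and sets the connect and handshake timeouts, which default to 3 seconds.
Then the abstraction for a single ping operation, a PingingRound, is constructed.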
final ConnectionProfile connectionProfile =
    ConnectionProfile.buildSingleChannelProfile(TransportRequestOptions.Type.REG, requestDuration, requestDuration);
final PingingRound pingingRound = new PingingRound(pingingRoundIdGenerator.incrementAndGet(), seedNodes, resultsConsumer,
    nodes.getLocalNode(), connectionProfile);
activePingingRounds.put(pingingRound.id(), pingingRound);
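A PingingRound represents one ping operation of the local node. It holds the connection type and timeouts of the operation, the nodes to ping, and the id of the round.
Next a pingSender is built. It sends ping requests to all the nodes through sendPings() three times, at fixed intervals.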
final AbstractRunnable pingSender = new AbstractRunnable() {
    @Override
    public void onFailure(Exception e) {
        if (e instanceof AlreadyClosedException == false) {
            logger.warn("unexpected error while pinging", e);
        }
    }

    @Override
    protected void doRun() throws Exception {
        sendPings(requestDuration, pingingRound);
    }
};
threadPool.generic().execute(pingSender);
threadPool.schedule(TimeValue.timeValueMillis(scheduleDuration.millis() / 3), ThreadPool.Names.GENERIC, pingSender);
threadPool.schedule(TimeValue.timeValueMillis(scheduleDuration.millis() / 3 * 2), ThreadPool.Names.GENERIC, pingSender);
With the default duration of 3 seconds, pingSender's sendPings() runs once per second: immediately, after 1 second, and after 2 seconds.
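The timing is easy to check by hand. The small sketch below (class and method names are hypothetical, not part of Elasticsearch) repeats the integer arithmetic from the scheduling calls above and returns the three start delays in milliseconds.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the delay arithmetic of the three pingSender runs.
final class PingScheduleSketch {

    // First run is immediate, the other two use scheduleDuration / 3 and scheduleDuration / 3 * 2.
    static List<Long> pingDelaysMillis(long scheduleDurationMillis) {
        return Arrays.asList(0L, scheduleDurationMillis / 3, scheduleDurationMillis / 3 * 2);
    }

    public static void main(String[] args) {
        System.out.println(pingDelaysMillis(3000)); // prints [0, 1000, 2000]
    }
}
```

Because the division happens before the multiplication, a duration that is not a multiple of 3 is rounded down first, for example 1000 ms gives delays of 0, 333 and 666 ms.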
final UnicastPingRequest pingRequest = new UnicastPingRequest();
pingRequest.id = pingingRound.id();
pingRequest.timeout = timeout;
ClusterState lastState = contextProvider.clusterState();
pingRequest.pingResponse = createPingResponse(lastState);

Set<DiscoveryNode> nodesFromResponses = temporalResponses.stream().map(pingResponse -> {
    assert clusterName.equals(pingResponse.clusterName()) :
        "got a ping request from a different cluster. expected " + clusterName + " got " + pingResponse.clusterName();
    return pingResponse.node();
}).collect(Collectors.toSet());

// dedup by address
final Map<TransportAddress, DiscoveryNode> uniqueNodesByAddress =
    Stream.concat(pingingRound.getSeedNodes().stream(), nodesFromResponses.stream())
        .collect(Collectors.toMap(DiscoveryNode::getAddress, Function.identity(), (n1, n2) -> n1));

// resolve what we can via the latest cluster state
final Set<DiscoveryNode> nodesToPing = uniqueNodesByAddress.values().stream()
    .map(node -> {
        DiscoveryNode foundNode = lastState.nodes().findByAddress(node.getAddress());
        if (foundNode == null) {
            return node;
        } else {
            return foundNode;
        }
    }).collect(Collectors.toSet());

nodesToPing.forEach(node -> sendPingRequestToNode(node, timeout, pingingRound, pingRequest));
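sendPings() starts from the seedNodes obtained earlier and adds the nodes of the same cluster that have recently sent ping messages to the current node (taken from temporalResponses).
It also builds a pingRequest whose id is the id of the pingingRound, so three pingRequests share this id. The current node's own data, together with the master of its cluster (null if none has been elected yet), is packed into a pingResponse and carried inside the pingRequest.
It then iterates over the resulting set of nodes and calls sendPingRequestToNode() for each one to send the pingRequest for this pingingRound to the target node.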
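The way the target set is built can be sketched with plain collections. In the sketch below the Node class, the string addresses and the address-to-node map that stands in for the cluster state lookup are all assumptions made for illustration. It keeps the first node seen per address (seed nodes win over nodes learned from incoming pings) and then swaps in the cluster state's view of a node when one exists for that address.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, simplified version of how sendPings() picks the nodes to ping.
final class NodesToPingSketch {

    // Minimal stand-in for DiscoveryNode: an id and a transport address.
    static final class Node {
        final String id;
        final String address;

        Node(String id, String address) {
            this.id = id;
            this.address = address;
        }
    }

    static Set<String> nodeIdsToPing(List<Node> seedNodes,
                                     List<Node> nodesFromResponses,
                                     Map<String, Node> clusterStateByAddress) {
        // dedup by address; on a clash the first node (a seed node) is kept
        Map<String, Node> uniqueByAddress =
            Stream.concat(seedNodes.stream(), nodesFromResponses.stream())
                .collect(Collectors.toMap(n -> n.address, Function.identity(), (n1, n2) -> n1));
        // prefer the node known to the latest cluster state, if any
        return uniqueByAddress.values().stream()
            .map(n -> clusterStateByAddress.getOrDefault(n.address, n))
            .map(n -> n.id)
            .collect(Collectors.toSet());
    }
}
```

The three-argument form of Collectors.toMap matters here: without the merge function, two nodes sharing an address would raise an IllegalStateException instead of being deduplicated.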
transportService.sendRequest(connection, ACTION_NAME, pingRequest,
    TransportRequestOptions.builder().withTimeout((long) (timeout.millis() * 1.25)).build(),
    getPingResponseHandler(pingingRound, node));
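sendPingRequestToNode() runs asynchronously and eventually sends the ping request to the discovery/zen/unicast action on the target node. Through getPingResponseHandler() it also installs a responseHandler for this pingingRound, which processes the target node's reply to the request.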
@Override
public void handleResponse(UnicastPingResponse response) {
    logger.trace("[{}] received response from {}: {}", pingingRound.id(), node, Arrays.toString(response.pingResponses));
    if (pingingRound.isClosed()) {
        if (logger.isTraceEnabled()) {
            logger.trace("[{}] skipping received response from {}. already closed", pingingRound.id(), node);
        }
    } else {
        Stream.of(response.pingResponses).forEach(pingingRound::addPingResponseToCollection);
    }
}
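When a response is handled, the handler first checks whether the pingingRound has already been closed. If it is still open, the election-related data of each node in the response is stored in the round's Map, keyed by node, with that node's election choice as the value.
A pingingRound lasts 3 seconds, so once the three ping tasks of the round have been scheduled, another task is scheduled to fire 3 seconds later and end the round's collection of pingResponses.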
@Override
public void close() {
    List<Connection> toClose = null;
    synchronized (this) {
        if (closed.compareAndSet(false, true)) {
            activePingingRounds.remove(id);
            toClose = new ArrayList<>(tempConnections.values());
            tempConnections.clear();
        }
    }
    if (toClose != null) {
        // we actually closed
        try {
            pingListener.accept(pingCollection);
        } finally {
            IOUtils.closeWhileHandlingException(toClose);
        }
    }
}
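During close(), a CAS inside the synchronized block flips the closed flag from false to true, the round's id is removed from the active pingingRounds, and the temporary connections to the other nodes of the cluster are collected for closing.
After that, the accept method of the consumer passed in at the very beginning is called to hand back the results of the ping, and finally the collected connections are closed.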
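The close-exactly-once pattern used here can be reproduced with a few lines of JDK code. The sketch below is a simplified illustration, not the PingingRound class itself: responses are plain strings, there are no connections to clean up, and all names are hypothetical. Only the caller whose compareAndSet succeeds delivers the collected results to the listener, so calling close() again is harmless.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical, stripped-down round that delivers its results exactly once.
final class CloseOnceSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final List<String> collected = new ArrayList<>();
    private final Consumer<List<String>> listener;

    CloseOnceSketch(Consumer<List<String>> listener) {
        this.listener = listener;
    }

    // Responses arriving after the round is closed are ignored.
    synchronized void addResponse(String response) {
        if (!closed.get()) {
            collected.add(response);
        }
    }

    boolean isClosed() {
        return closed.get();
    }

    void close() {
        List<String> snapshot = null;
        synchronized (this) {
            if (closed.compareAndSet(false, true)) {
                snapshot = new ArrayList<>(collected);
            }
        }
        if (snapshot != null) {
            // we actually closed: notify the listener outside the lock
            listener.accept(snapshot);
        }
    }
}
```

Calling the listener outside the synchronized block keeps a slow consumer from blocking threads that are still trying to add responses.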
That covers the flow on the node that initiates the ping.
transportService.registerRequestHandler(ACTION_NAME, UnicastPingRequest::new, ThreadPool.Names.SAME,
    new UnicastPingRequestHandler());
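In the UnicastZenPing constructor, every node registers a requestHandler on discovery/zen/unicast to process incoming ping requests.
When UnicastPingRequestHandler receives a ping request from another node, it first checks whether the sender belongs to the same cluster. If it does, the handler sends back the pingResponse produced by handlePingRequest().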
private UnicastPingResponse handlePingRequest(final UnicastPingRequest request) {
    assert clusterName.equals(request.pingResponse.clusterName()) :
        "got a ping request from a different cluster. expected " + clusterName + " got " + request.pingResponse.clusterName();
    temporalResponses.add(request.pingResponse);
    // add to any ongoing pinging
    activePingingRounds.values().forEach(p -> p.addPingResponseToCollection(request.pingResponse));
    threadPool.schedule(TimeValue.timeValueMillis(request.timeout.millis() * 2), ThreadPool.Names.SAME,
        () -> temporalResponses.remove(request.pingResponse));

    List<PingResponse> pingResponses = CollectionUtils.iterableAsArrayList(temporalResponses);
    pingResponses.add(createPingResponse(contextProvider.clusterState()));

    UnicastPingResponse unicastPingResponse = new UnicastPingResponse();
    unicastPingResponse.id = request.id;
    unicastPingResponse.pingResponses = pingResponses.toArray(new PingResponse[pingResponses.size()]);

    return unicastPingResponse;
}
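temporalResponses holds the node data (PingResponse) carried in the pingRequests sent by other nodes. The handler first adds the pingResponse of the current request to it, then adds the same pingResponse to every currently active pingingRound so that it becomes part of their results.
Next it schedules a task, delayed by twice the request's timeout, that removes this pingResponse from temporalResponses again, which keeps the data fresh.
Finally it builds a similar response that returns the current node's election data, together with the pingResponses received from other nodes, to the node that sent the request.
As described above, the sending node parses this data in its responseHandler.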
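The expiry of temporalResponses can be sketched with a ScheduledExecutorService. This is only an illustration under simplifying assumptions: responses are strings, the scheduler is a plain JDK executor rather than the Elasticsearch threadPool, and the class and method names are invented. Each remembered response is removed again after twice the timeout that came with it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for recently received ping data that expires on its own.
final class TemporalResponsesSketch {
    private final Set<String> temporalResponses = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Remember the response now and forget it after 2 * timeoutMillis.
    void remember(String response, long timeoutMillis) {
        temporalResponses.add(response);
        scheduler.schedule(() -> {
            temporalResponses.remove(response);
        }, timeoutMillis * 2, TimeUnit.MILLISECONDS);
    }

    int size() {
        return temporalResponses.size();
    }

    // Stops accepting new work and waits for the pending removals to run.
    boolean shutdownAndAwait(long waitMillis) {
        scheduler.shutdown();
        try {
            return scheduler.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Tying the lifetime to the sender's timeout means an entry outlives the round that produced it by a bounded amount, so a node that stops pinging drops out of the responses shortly afterwards.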
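One more note: the id of a PingResponse also serves as a version number for a node's data. When the results returned by pings conflict for the same node, the response with the larger id is kept (a response with an equal id also replaces the existing one).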
public synchronized boolean addPing(PingResponse ping) {
    PingResponse existingResponse = pings.get(ping.node());
    if (existingResponse == null || existingResponse.id() <= ping.id()) {
        pings.put(ping.node(), ping);
        return true;
    }
    return false;
}
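The effect of the comparison is easiest to see on a tiny example. The sketch below reduces a ping to a node name and an id (both hypothetical simplifications; the class name is made up as well) and applies the same rule: a ping is accepted when no entry exists for the node or when its id is not smaller than the stored one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reduction of the ping collection to (node name -> latest ping id).
final class PingCollectionSketch {
    private final Map<String, Long> latestIdByNode = new HashMap<>();

    // Returns true if the ping was stored, false if an entry with a larger id already exists.
    synchronized boolean addPing(String node, long id) {
        Long existing = latestIdByNode.get(node);
        if (existing == null || existing <= id) {
            latestIdByNode.put(node, id);
            return true;
        }
        return false;
    }

    synchronized Long idFor(String node) {
        return latestIdByNode.get(node);
    }
}
```

For example, after addPing("n1", 5) a later addPing("n1", 3) is rejected, while addPing("n1", 7) replaces the stored entry.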