RPC communication is an important topic for any distributed system, and Spark is no exception. As of Spark 2.3.3 the RPC framework is implemented on top of Netty, replacing the earlier Akka-based implementation. This article first introduces the components of Spark RPC, then follows SparkContext initialization to analyze how those RPC components are initialized and how they are used.
Spark defines TransportClient and TransportServer to wrap Netty's Channel and handlers, and message "box" structures (the Inbox and Outbox) to buffer the messages carried over ByteBuf, so Spark RPC has two important phases: initialization and communication.
1. RpcEnv Initialization
All Spark Core initialization starts from SparkContext and, as the figure below shows, the TransportClient is also created during SparkContext initialization.
RpcEnv here is an abstract class whose concrete implementation is NettyRpcEnv. NettyRpcEnv contains all of the RPC operations, including the parameters needed to build a TransportClient and the createClient method itself.
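Before diving into its fields, it helps to see where a NettyRpcEnv comes from. The sketch below is a simplified paraphrase of the call chain SparkContext -> SparkEnv.create -> RpcEnv.create -> NettyRpcEnvFactory; the argument values are placeholders, and RpcEnv.create and SecurityManager are private[spark], so this is illustrative rather than something application code can call directly.

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.RpcEnv

// Simplified, paraphrased sketch of how the driver-side RpcEnv is obtained.
val conf = new SparkConf().setAppName("rpc-demo").setMaster("local[*]")
val securityManager = new SecurityManager(conf)

val rpcEnv: RpcEnv = RpcEnv.create(
  "sparkDriver",        // system name, becomes part of the RpcAddress
  "localhost",          // bindAddress: host the TransportServer binds to
  "localhost",          // advertiseAddress: host other nodes use to reach this env
  0,                    // port; 0 lets the OS pick a free port
  conf,
  securityManager,
  0,                    // numUsableCores; 0 means derive from the available cores
  clientMode = false)   // false: start a TransportServer for inbound connections
// Internally this ends up as new NettyRpcEnvFactory().create(config), which constructs
// the NettyRpcEnv shown below and, in server mode, starts its TransportServer.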
private[netty] class NettyRpcEnv(
    val conf: SparkConf,
    javaSerializerInstance: JavaSerializerInstance,
    host: String,
    securityManager: SecurityManager,
    numUsableCores: Int) extends RpcEnv(conf) with Logging {

  private[netty] val transportConf = SparkTransportConf.fromSparkConf(
    conf.clone.set("spark.rpc.io.numConnectionsPerPeer", "1"),
    "rpc",
    conf.getInt("spark.rpc.io.threads", 0))

  private val dispatcher: Dispatcher = new Dispatcher(this, numUsableCores)

  private val streamManager = new NettyStreamManager(this)

  private val transportContext = new TransportContext(transportConf,
    new NettyRpcHandler(dispatcher, this, streamManager))

  private def createClientBootstraps(): java.util.List[TransportClientBootstrap] = {
    if (securityManager.isAuthenticationEnabled()) {
      java.util.Arrays.asList(new AuthClientBootstrap(transportConf,
        securityManager.getSaslUser(), securityManager))
    } else {
      java.util.Collections.emptyList[TransportClientBootstrap]
    }
  }

  private val clientFactory = transportContext.createClientFactory(createClientBootstraps())

  @volatile private var fileDownloadFactory: TransportClientFactory = _

  val timeoutScheduler = ThreadUtils.newDaemonSingleThreadScheduledExecutor("netty-rpc-env-timeout")

  @volatile private var server: TransportServer = _

  private val stopped = new AtomicBoolean(false)

  private val outboxes = new ConcurrentHashMap[RpcAddress, Outbox]()
The code above is the (abridged) field initialization of NettyRpcEnv. Each of these fields plays an important role:
- transportConf: the transport configuration parameters (see the configuration sketch after this list)
- streamManager: the stream manager
- transportContext: the transport context
- dispatcher: the component that actually does the high-throughput asynchronous message processing behind the Netty handler. As the code shows, the dispatcher is wrapped inside NettyRpcHandler; its role is covered in detail below.
- clientFactory: the factory that creates TransportClients. The createClientBootstraps method passed to it supplies extra bootstrap steps applied when a client is created, for example authentication.
- outboxes: the per-address outboxes holding messages waiting to be sent
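For reference, the two RPC configuration keys read by the code above can be set from application code. A minimal sketch; the values are arbitrary examples, and their exact effect is whatever the source above does with them (the IO-thread hint feeds SparkTransportConf, the dispatcher key sizes the MessageLoop pool described in section 2):

import org.apache.spark.SparkConf

// spark.rpc.io.threads: thread hint passed to SparkTransportConf (0 = derive from cores)
// spark.rpc.netty.dispatcher.numThreads: size of the Dispatcher's MessageLoop thread pool
val conf = new SparkConf()
  .set("spark.rpc.io.threads", "8")
  .set("spark.rpc.netty.dispatcher.numThreads", "8")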
2. Dispatcher Initialization
The Dispatcher is created as soon as the NettyRpcEnv object is constructed. Inside the Dispatcher there is an inner class, EndpointData, which bundles an RpcEndpoint, its NettyRpcEndpointRef, and an Inbox. An RpcEndpoint is the server-side message-processing endpoint, while an RpcEndpointRef is the client-side reference to that endpoint. Messages are stored in the Inbox and Outbox, each holding a queue of all received and outgoing messages respectively.
private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {

  private class EndpointData(
      val name: String,
      val endpoint: RpcEndpoint,
      val ref: NettyRpcEndpointRef) {
    val inbox = new Inbox(ref, endpoint)
  }

  private val endpoints: ConcurrentMap[String, EndpointData] =
    new ConcurrentHashMap[String, EndpointData]
  private val endpointRefs: ConcurrentMap[RpcEndpoint, RpcEndpointRef] =
    new ConcurrentHashMap[RpcEndpoint, RpcEndpointRef]

  // Track the receivers whose inboxes may contain messages.
  private val receivers = new LinkedBlockingQueue[EndpointData]
As the middle layer, the Dispatcher records the relationship between endpoints and endpoint references so that messages are consumed by the correct endpoint. The endpoints and endpointRefs maps record this mapping for every endpoint registered with the Dispatcher at server startup, so that when communication happens later the right endpoint can be found to process the message. The Dispatcher is created inside the NettyRpcEnv, and a given RpcEnv has exactly one Dispatcher through which every message is further routed, so its efficiency is critical. How, then, does it receive messages and dispatch them efficiently?
def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
  val addr = RpcEndpointAddress(nettyEnv.address, name)
  val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
  synchronized {
    if (stopped) {
      throw new IllegalStateException("RpcEnv has been stopped")
    }
    if (endpoints.putIfAbsent(name, new EndpointData(name, endpoint, endpointRef)) != null) {
      throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
    }
    val data = endpoints.get(name)
    endpointRefs.put(data.endpoint, data.ref)
    receivers.offer(data)  // for the OnStart message
  }
  endpointRef
}
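In practice user code never calls registerRpcEndpoint directly; an endpoint is registered through RpcEnv.setupEndpoint, which delegates to the Dispatcher. The following is a minimal, hedged sketch: EchoEndpoint, Echo, and the endpoint name "echo" are made up for illustration, and RpcEndpoint/RpcEnv are private[spark], so such code can only live inside Spark's own packages.

import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

// Hypothetical endpoint used only to illustrate registration and ask/reply.
case class Echo(text: String)

class EchoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  // Delivered through the Inbox via the OnStart message queued by registerRpcEndpoint.
  override def onStart(): Unit = println("EchoEndpoint started")

  // Two-way (ask) messages are routed here when the Inbox processes them.
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case Echo(text) => context.reply(s"echo: $text")
  }
}

// Registration returns the NettyRpcEndpointRef that the Dispatcher stored in endpointRefs:
// val echoRef = rpcEnv.setupEndpoint("echo", new EchoEndpoint(rpcEnv))
// val answer  = echoRef.askSync[String](Echo("hello"))  // blocks until the endpoint replies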
Within the Dispatcher's initialization, arguably the most important piece is the threadpool. It not only starts a thread pool but also sets each thread watching the receivers blocking queue; that job is given to the MessageLoop task. The consumption flow works as follows: the pool is started with a thread count taken from "spark.rpc.netty.dispatcher.numThreads" (defaulting to the number of usable cores, with a minimum of 2), and a MessageLoop runs on each thread. Each loop takes the next entry from receivers; since take() is blocking, the threads simply wait while no new message arrives. As soon as something lands in the queue, the loop calls the inbox's process method to consume its messages. Once initialized, this machinery dispatches messages automatically, which is both convenient and efficient.
/** Thread pool used for dispatching messages. */
private val threadpool: ThreadPoolExecutor = {
  val availableCores =
    if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
  val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
    math.max(2, availableCores))
  val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
  for (i <- 0 until numThreads) {
    pool.execute(new MessageLoop)
  }
  pool
}

/** Message loop used for dispatching messages. */
private class MessageLoop extends Runnable {
  override def run(): Unit = {
    try {
      while (true) {
        try {
          val data = receivers.take()
          if (data == PoisonPill) {
            // Put PoisonPill back so that other MessageLoops can see it.
            receivers.offer(PoisonPill)
            return
          }
          data.inbox.process(Dispatcher.this)
        } catch {
          case NonFatal(e) => logError(e.getMessage, e)
        }
      }
    } catch {
      case ie: InterruptedException => // exit
    }
  }
}
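Since this take-and-process loop is the heart of the Dispatcher, here is the same pattern in isolation: a minimal, self-contained sketch (plain Scala, not Spark code) of a fixed thread pool draining a LinkedBlockingQueue and shutting down via a poison pill.

import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

// Toy illustration of the consumption pattern: workers block on a shared queue,
// and the sentinel is re-offered so every worker eventually sees it and exits.
object MessageLoopSketch {
  sealed trait Work
  final case class Task(payload: String) extends Work
  case object PoisonPill extends Work

  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[Work]()
    val numThreads = 2
    val pool = Executors.newFixedThreadPool(numThreads)

    for (_ <- 0 until numThreads) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          var running = true
          while (running) {
            queue.take() match {          // blocks until a message arrives
              case PoisonPill =>
                queue.offer(PoisonPill)   // put it back so the other loops see it too
                running = false
              case Task(payload) =>
                println(s"${Thread.currentThread().getName} processed $payload")
            }
          }
        }
      })
    }

    (1 to 5).foreach(i => queue.offer(Task(s"message-$i")))
    queue.offer(PoisonPill)               // ask all workers to drain and stop
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}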
3. TransportClient Initialization
With the Dispatcher covered, let's look at the code that creates a TransportClient. This method is the entry point on NettyRpcEnv; it only needs the remote address.
private[netty] def createClient(address: RpcAddress): TransportClient = {
  clientFactory.createClient(address.host, address.port)
}
Next we enter the factory's create method. Before looking at createClient, it helps to know an inner class of the factory, ClientPool. Because connections are point-to-point, multiple TransportClients are unavoidable, so the factory keeps one ClientPool per remote address to cache clients for reuse. When a ClientPool is constructed it only allocates the client slots; a slot is filled only when a client is actually needed, at which point the connection to the server is established and the resulting channel and handlers are wrapped into the TransportClient so it can send and process messages. This is essentially lazy initialization. The code also shows an Object[] locks array of the same length as clients, with a one-to-one index correspondence: by giving every client slot its own lock, contention between threads under concurrency is reduced, blocking goes down, and throughput goes up. This pooling-with-striped-locks idea is broadly useful; whenever you manage connections in your own code, it is a pattern worth borrowing for performance (see the sketch after the ClientPool code below).
private static class ClientPool {
  TransportClient[] clients;
  Object[] locks;

  ClientPool(int size) {
    clients = new TransportClient[size];
    locks = new Object[size];
    for (int i = 0; i < size; i++) {
      locks[i] = new Object();
    }
  }
}
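The striped-lock idea is easy to lift out of Spark. Below is a minimal, generic sketch (plain Scala, not Spark code; all names are made up) of a pool that keeps N lazily created connections per key and guards each slot with its own lock so concurrent callers rarely contend on the same monitor.

import java.util.concurrent.ConcurrentHashMap
import scala.reflect.ClassTag
import scala.util.Random

// Toy striped connection pool: N slots per key, one lock object per slot.
class StripedPool[K, C <: AnyRef : ClassTag](slotsPerKey: Int)(create: K => C) {

  private class Slots {
    val conns = new Array[C](slotsPerKey)            // lazily filled; null until first use
    val locks = Array.fill(slotsPerKey)(new Object)  // one monitor per slot
  }

  private val pools = new ConcurrentHashMap[K, Slots]()

  def get(key: K): C = {
    var slots = pools.get(key)
    if (slots == null) {                             // same putIfAbsent dance as connectionPool above
      pools.putIfAbsent(key, new Slots)
      slots = pools.get(key)
    }
    val i = Random.nextInt(slotsPerKey)              // spread callers across the slots
    slots.locks(i).synchronized {                    // lock only this slot, not the whole pool
      if (slots.conns(i) == null) slots.conns(i) = create(key)
      slots.conns(i)
    }
  }
}

// Usage sketch: at most 2 "connections" per host, created on first access.
// val pool = new StripedPool[String, String](2)(host => s"connection-to-$host")
// val c    = pool.get("host-a")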
The createClient method below is not the one that actually constructs a TransportClient; rather it fetches an already-created client from the ClientPool for the given address. Briefly: it first builds an unresolved InetSocketAddress from the given host and port, then looks up the ClientPool for that address in the connectionPool cache; if none exists, it creates a new one and picks a random slot. In a freshly created pool the TransportClient slot is naturally empty and inactive. It is called inactive because a TransportClient is a wrapper around a Netty channel, with extra processing (request handlers and so on) applied when data is written to that channel, and at this point no connection to the server has been established and no channel has been placed into the client. The code therefore falls through to the second createClient method at the bottom, which really initializes the TransportClient. If a ClientPool already exists and holds an active client, the cached TransportClient is returned directly.
private final ConcurrentHashMap<SocketAddress, ClientPool> connectionPool;

/** Fetch a TransportClient for the given address, reusing a cached one when possible. */
public TransportClient createClient(String remoteHost, int remotePort)
    throws IOException, InterruptedException {
  // Get connection from the connection pool first.
  // If it is not found or not active, create a new one.
  // Use unresolved address here to avoid DNS resolution each time we creates a client.
  final InetSocketAddress unresolvedAddress =
    InetSocketAddress.createUnresolved(remoteHost, remotePort);

  // Create the ClientPool if we don't have it yet.
  ClientPool clientPool = connectionPool.get(unresolvedAddress);
  if (clientPool == null) {
    connectionPool.putIfAbsent(unresolvedAddress, new ClientPool(numConnectionsPerPeer));
    clientPool = connectionPool.get(unresolvedAddress);
  }

  int clientIndex = rand.nextInt(numConnectionsPerPeer);
  TransportClient cachedClient = clientPool.clients[clientIndex];

  if (cachedClient != null && cachedClient.isActive()) {
    // Make sure that the channel will not timeout by updating the last use time of the
    // handler. Then check that the client is still alive, in case it timed out before
    // this code was able to update things.
    TransportChannelHandler handler = cachedClient.getChannel().pipeline()
      .get(TransportChannelHandler.class);
    synchronized (handler) {
      handler.getResponseHandler().updateTimeOfLastRequest();
    }

    if (cachedClient.isActive()) {
      logger.trace("Returning cached connection to {}: {}",
        cachedClient.getSocketAddress(), cachedClient);
      return cachedClient;
    }
  }

  // If we reach here, we don't have an existing connection open. Let's create a new one.
  // Multiple threads might race here to create new connections. Keep only one of them active.
  final long preResolveHost = System.nanoTime();
  final InetSocketAddress resolvedAddress = new InetSocketAddress(remoteHost, remotePort);
  final long hostResolveTimeMs = (System.nanoTime() - preResolveHost) / 1000000;
  if (hostResolveTimeMs > 2000) {
    logger.warn("DNS resolution for {} took {} ms", resolvedAddress, hostResolveTimeMs);
  } else {
    logger.trace("DNS resolution for {} took {} ms", resolvedAddress, hostResolveTimeMs);
  }

  synchronized (clientPool.locks[clientIndex]) {
    cachedClient = clientPool.clients[clientIndex];

    if (cachedClient != null) {
      if (cachedClient.isActive()) {
        logger.trace("Returning cached connection to {}: {}", resolvedAddress, cachedClient);
        return cachedClient;
      } else {
        logger.info("Found inactive connection to {}, creating a new one.", resolvedAddress);
      }
    }
    clientPool.clients[clientIndex] = createClient(resolvedAddress);
    return clientPool.clients[clientIndex];
  }
}
Now for the createClient method that really builds a TransportClient. It is an overload of the cache-lookup method above; the parameters differ mainly so that the cached path can avoid DNS resolution, but they carry the same meaning. What follows is the familiar Netty client bootstrap code; Netty itself is not explained here, only the places where Spark does something special. As everyone knows, the most important concept in Netty is the handler pipeline, and Spark RPC does a great deal of custom work on the handlers in that pipeline, involving the MessageHandlers, TransportChannelHandler, the Dispatcher, and the TransportClient. Let's look directly at the key call, context.initializePipeline(ch).
private TransportClient createClient(InetSocketAddress address)
    throws IOException, InterruptedException {
  logger.debug("Creating new connection to {}", address);

  Bootstrap bootstrap = new Bootstrap();
  bootstrap.group(workerGroup)
    .channel(socketChannelClass)
    // Disable Nagle's Algorithm since we don't want packets to wait
    .option(ChannelOption.TCP_NODELAY, true)
    .option(ChannelOption.SO_KEEPALIVE, true)
    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, conf.connectionTimeoutMs())
    .option(ChannelOption.ALLOCATOR, pooledAllocator);

  if (conf.receiveBuf() > 0) {
    bootstrap.option(ChannelOption.SO_RCVBUF, conf.receiveBuf());
  }

  if (conf.sendBuf() > 0) {
    bootstrap.option(ChannelOption.SO_SNDBUF, conf.sendBuf());
  }

  final AtomicReference<TransportClient> clientRef = new AtomicReference<>();
  final AtomicReference<Channel> channelRef = new AtomicReference<>();

  bootstrap.handler(new ChannelInitializer<SocketChannel>() {
    @Override
    public void initChannel(SocketChannel ch) {
      TransportChannelHandler clientHandler = context.initializePipeline(ch);
      clientRef.set(clientHandler.getClient());
      channelRef.set(ch);
    }
  });

  // Connect to the remote server
  long preConnect = System.nanoTime();
  ChannelFuture cf = bootstrap.connect(address);
  if (!cf.await(conf.connectionTimeoutMs())) {
    throw new IOException(
      String.format("Connecting to %s timed out (%s ms)", address, conf.connectionTimeoutMs()));
  } else if (cf.cause() != null) {
    throw new IOException(String.format("Failed to connect to %s", address), cf.cause());
  }

  TransportClient client = clientRef.get();
  Channel channel = channelRef.get();
  assert client != null : "Channel future completed successfully with null client";

  // Execute any client bootstraps synchronously before marking the Client as successful.
  long preBootstrap = System.nanoTime();
  logger.debug("Connection to {} successful, running bootstraps...", address);
  try {
    for (TransportClientBootstrap clientBootstrap : clientBootstraps) {
      clientBootstrap.doBootstrap(client, channel);
    }
  } catch (Exception e) { // catch non-RuntimeExceptions too as bootstrap may be written in Scala
    long bootstrapTimeMs = (System.nanoTime() - preBootstrap) / 1000000;
    logger.error("Exception while bootstrapping client after " + bootstrapTimeMs + " ms", e);
    client.close();
    throw Throwables.propagate(e);
  }
  long postBootstrap = System.nanoTime();

  logger.info("Successfully created connection to {} after {} ms ({} ms spent in bootstraps)",
    address, (postBootstrap - preConnect) / 1000000, (postBootstrap - preBootstrap) / 1000000);

  return client;
}
Inside this method, the NettyRpcHandler initialized earlier in NettyRpcEnv (which wraps the dispatcher) is passed in as a parameter. Two things happen: a shared channel handler is created (the same handler class serves both client and server), and the pipeline is assembled, i.e. handlers are added to the pipeline to form the chain of responsibility. The first few handlers are ordinary encoders/decoders and an idle-state timer. The one to focus on is TransportChannelHandler, the message handler through which every message type is dispatched.
public TransportChannelHandler initializePipeline(SocketChannel channel) {
  return initializePipeline(channel, rpcHandler);
}

public TransportChannelHandler initializePipeline(
    SocketChannel channel,
    RpcHandler channelRpcHandler) {
  try {
    TransportChannelHandler channelHandler = createChannelHandler(channel, channelRpcHandler);
    channel.pipeline()
      .addLast("encoder", ENCODER)
      .addLast(TransportFrameDecoder.HANDLER_NAME, NettyUtils.createFrameDecoder())
      .addLast("decoder", DECODER)
      .addLast("idleStateHandler", new IdleStateHandler(0, 0, conf.connectionTimeoutMs() / 1000))
      // NOTE: Chunks are currently guaranteed to be returned in the order of request, but this
      // would require more logic to guarantee if this were not part of the same event loop.
      .addLast("handler", channelHandler);
    return channelHandler;
  } catch (RuntimeException e) {
    logger.error("Error while initializing Netty pipeline", e);
    throw e;
  }
}
Inside createChannelHandler, a TransportResponseHandler and a TransportRequestHandler are created in turn; both extend MessageHandler. MessageHandler defines the handle, channelActive, exceptionCaught, and channelInactive methods, with handle doing the actual message processing. MessageHandler is a generic class parameterized by a Message subtype.
TransportResponseHandler handles the responses that come back after the client sends a request, while TransportRequestHandler handles requests received on the server side. A TransportClient's main job is to send requests to the server over the channel and process the results; after sending a request it registers its callback with the TransportResponseHandler via addRpcRequest, so its two main fields are the channel and the responseHandler. TransportRequestHandler, by contrast, belongs to the TransportServer side: it processes requests and returns responses to the client, so it needs the channel, the client, and the rpcHandler. TransportChannelHandler ties the three together so that client and server traffic is handled uniformly: it inspects the message type to decide whether it is a request or a response and calls the corresponding handle method.
private TransportChannelHandler createChannelHandler(Channel channel, RpcHandler rpcHandler) {
  TransportResponseHandler responseHandler = new TransportResponseHandler(channel);
  TransportClient client = new TransportClient(channel, responseHandler);
  TransportRequestHandler requestHandler = new TransportRequestHandler(channel, client,
    rpcHandler, conf.maxChunksBeingTransferred());
  return new TransportChannelHandler(client, responseHandler, requestHandler,
    conf.connectionTimeoutMs(), closeIdleConnections);
}
The last step inside createClient is to run the TransportClientBootstraps to do some bootstrapping work on the client (for example authentication). At this point the TransportClient is essentially fully initialized. TransportServer initialization is very similar; a rough sketch follows.
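The server side is wired up in NettyRpcEnv.startServer. The sketch below is a paraphrase from memory of the 2.3.x source (details may differ slightly); it reuses the NettyRpcEnv fields shown in section 1, builds the TransportServer from the same TransportContext, and registers a verifier endpoint so remote refs can be checked when they are set up.

// Paraphrased sketch of NettyRpcEnv.startServer (not a verbatim copy of the source).
def startServer(bindAddress: String, port: Int): Unit = {
  // Optional server-side auth bootstrap, mirroring createClientBootstraps() above.
  val bootstraps: java.util.List[TransportServerBootstrap] =
    if (securityManager.isAuthenticationEnabled()) {
      java.util.Arrays.asList(new AuthServerBootstrap(transportConf, securityManager))
    } else {
      java.util.Collections.emptyList()
    }
  // Same TransportContext (and thus the same NettyRpcHandler/dispatcher) as the client side.
  server = transportContext.createServer(bindAddress, port, bootstraps)
  // Built-in endpoint used when setting up remote refs to verify that a named endpoint exists.
  dispatcher.registerRpcEndpoint(
    RpcEndpointVerifier.NAME, new RpcEndpointVerifier(this, dispatcher))
}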
4. The Spark RPC Communication Flow
The figure below is a UML diagram of a call through the Spark RPC framework; it helps in understanding how the components cooperate. It shows quite directly the role the Dispatcher plays, steering each received message to the designated endpoint for consumption.
Note that when a request is sent through a NettyRpcEndpointRef, the implementation checks the target address to decide whether this is local delivery or remote communication: a local message is handed straight to the Dispatcher's post methods, while a remote one is sent out through a TransportClient.
private[netty] def send(message: RequestMessage): Unit = {
  val remoteAddr = message.receiver.address
  if (remoteAddr == address) {
    // Message to a local RPC endpoint.
    try {
      dispatcher.postOneWayMessage(message)
    } catch {
      case e: RpcEnvStoppedException => logDebug(e.getMessage)
    }
  } else {
    // Message to a remote RPC endpoint.
    postToOutbox(message.receiver, OneWayOutboxMessage(message.serialize(this)))
  }
}
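To close the loop from the caller's perspective, the sketch below is illustrative only: these APIs are private[spark], the host and port are hypothetical, and Echo reuses the toy message from section 2. send follows the one-way OneWayOutboxMessage path above, while askSync goes through an RpcOutboxMessage whose callback is registered in the TransportResponseHandler.

import org.apache.spark.rpc.RpcAddress

// Hypothetical remote endpoint named "echo" living in another RpcEnv at worker-host:7078.
// val remoteRef = rpcEnv.setupEndpointRef(RpcAddress("worker-host", 7078), "echo")
// remoteRef.send(Echo("fire and forget"))             // one-way message via the Outbox
// val reply = remoteRef.askSync[String](Echo("hi"))   // two-way message, blocks for the reply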