Spark Source Code Analysis: The Data Communication Module

Spark is a distributed computing framework: when we submit a job, it is split into many tasks that are distributed to the nodes of the cluster for execution. This raises a question: how does Spark pass messages, dispatch tasks to the individual nodes, and collect the computed results back together?

Internally, Spark uses Akka for message passing and heartbeat reporting, and uses Netty-based RPC services for uploading and downloading block data. (Flink's internal communication is built on a similar combination.)

The block manager, BlockManager, is the core component of Spark's storage system. A BlockManager is created on the Driver as well as on every Executor. During its construction it initializes a number of components; the relevant source code is as follows:

private[spark] class BlockManager(
    executorId: String,
    actorSystem: ActorSystem,
    val master: BlockManagerMaster,
    defaultSerializer: Serializer,
    maxMemory: Long,
    val conf: SparkConf,
    mapOutputTracker: MapOutputTracker,
    shuffleManager: ShuffleManager,
    blockTransferService: BlockTransferService,
    securityManager: SecurityManager,
    numUsableCores: Int)
  extends BlockDataManager with Logging {
  // Disk block manager
  val diskBlockManager = new DiskBlockManager(this, conf)

  private val blockInfo = new TimeStampedHashMap[BlockId, BlockInfo]

  // Actual storage of where blocks are kept
  private var tachyonInitialized = false
  // In-memory store
  private[spark] val memoryStore = new MemoryStore(this, maxMemory)
  // On-disk store
  private[spark] val diskStore = new DiskStore(this, diskBlockManager)
  //TachyonStore
  private[spark] lazy val tachyonStore: TachyonStore = {
    val storeDir = conf.get("spark.tachyonStore.baseDir", "/tmp_spark_tachyon")
    val appFolderName = conf.get("spark.tachyonStore.folderName")
    val tachyonStorePath = s"$storeDir/$appFolderName/${this.executorId}"
    val tachyonMaster = conf.get("spark.tachyonStore.url",  "tachyon://localhost:19998")
    val tachyonBlockManager =
      new TachyonBlockManager(this, tachyonStorePath, tachyonMaster)
    tachyonInitialized = true
    new TachyonStore(this, tachyonBlockManager)
  }

  private[spark]
  val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)

  // Port used by the external shuffle service. In Yarn mode, this may already be
  // set through the Hadoop configuration as the server is launched in the Yarn NM.
  private val externalShuffleServicePort =
    Utils.getSparkOrYarnConfig(conf, "spark.shuffle.service.port", "7337").toInt

  // Check that we're not using external shuffle service with consolidated shuffle files.
  if (externalShuffleServiceEnabled
      && conf.getBoolean("spark.shuffle.consolidateFiles", false)
      && shuffleManager.isInstanceOf[HashShuffleManager]) {
    throw new UnsupportedOperationException("Cannot use external shuffle service with consolidated"
      + " shuffle files in hash-based shuffle. Please disable spark.shuffle.consolidateFiles or "
      + " switch to sort-based shuffle.")
  }

  var blockManagerId: BlockManagerId = _

  // Address of the server that serves this executor's shuffle files. This is either an external
  // service, or just our own Executor's BlockManager.
  private[spark] var shuffleServerId: BlockManagerId = _

  // Client to read other executors' shuffle files. This is either an external service, or just the
  // standard BlockTransferService to directly connect to other Executors.
  // Create the shuffle client: if the external shuffle service is enabled, create an ExternalShuffleClient; otherwise use blockTransferService directly
  private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
    val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores)
    new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled())
  } else {
    blockTransferService
  }

  // Whether to compress broadcast variables that are stored
  private val compressBroadcast = conf.getBoolean("spark.broadcast.compress", true)
  // Whether to compress shuffle output that are stored
  private val compressShuffle = conf.getBoolean("spark.shuffle.compress", true)
  // Whether to compress RDD partitions that are stored serialized
  private val compressRdds = conf.getBoolean("spark.rdd.compress", false)
  // Whether to compress shuffle output temporarily spilled to disk
  private val compressShuffleSpill = conf.getBoolean("spark.shuffle.spill.compress", true)

  private val slaveActor = actorSystem.actorOf(
    Props(new BlockManagerSlaveActor(this, mapOutputTracker)),
    name = "BlockManagerActor" + BlockManager.ID_GENERATOR.next)

  // Pending re-registration action being executed asynchronously or null if none is pending.
  // Accesses should synchronize on asyncReregisterLock.
  private var asyncReregisterTask: Future[Unit] = null
  private val asyncReregisterLock = new Object

  // Cleaner for non-broadcast blocks
  private val metadataCleaner = new MetadataCleaner(
    MetadataCleanerType.BLOCK_MANAGER, this.dropOldNonBroadcastBlocks, conf)
  // Cleaner for broadcast blocks
  private val broadcastCleaner = new MetadataCleaner(
    MetadataCleanerType.BROADCAST_VARS, this.dropOldBroadcastBlocks, conf)

  // Field related to peer block managers that are necessary for block replication
  @volatile private var cachedPeers: Seq[BlockManagerId] = _
  private val peerFetchLock = new Object
  private var lastPeerFetchTime = 0L

  /* The compression codec to use. Note that the "lazy" val is necessary because we want to delay
   * the initialization of the compression codec until it is first used. The reason is that a Spark
   * program could be using a user-defined codec in a third party jar, which is loaded in
   * Executor.updateDependencies. When the BlockManager is initialized, user level jars hasn't been
   * loaded yet. */
  // Compression codec
  private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf)

The constructor mainly performs the following initialization:

  • The shuffle client shuffleClient
  • BlockManagerMaster, which centrally manages the BlockManagers of all Executors
  • The disk block manager DiskBlockManager
  • The memory store MemoryStore
  • The disk store DiskStore
  • The Tachyon store TachyonStore
  • The cleaner for non-broadcast blocks, metadataCleaner, and the cleaner for broadcast blocks, broadcastCleaner
  • The compression codec compressionCodec

When shuffleClient is initialized, the code checks whether the external shuffle service is enabled: if it is, an ExternalShuffleClient is created; otherwise the existing blockTransferService is used directly as the shuffle client.
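The choice is driven entirely by configuration. As a minimal illustration, the keys involved (all of which appear in the constructor code above) could be set as follows; the values are examples only:

import org.apache.spark.SparkConf

// Illustrative settings only; the keys are the ones read in the BlockManager constructor above.
val conf = new SparkConf()
  // Enable the external shuffle service, so shuffleClient becomes an ExternalShuffleClient
  .set("spark.shuffle.service.enabled", "true")
  // Port of the external shuffle service (7337 is the default shown above)
  .set("spark.shuffle.service.port", "7337")
  // Must remain false with hash-based shuffle, otherwise the constructor throws
  // the UnsupportedOperationException shown above
  .set("spark.shuffle.consolidateFiles", "false")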

BlockTransferService is an abstract class whose default implementation is NettyBlockTransferService. Its init() method is as follows:

 override def init(blockDataManager: BlockDataManager): Unit = {
    val (rpcHandler: RpcHandler, bootstrap: Option[TransportClientBootstrap]) = {
      // Create the RPC handler (the block RPC server)
      val nettyRpcHandler = new NettyBlockRpcServer(serializer, blockDataManager)
      if (!authEnabled) {
        (nettyRpcHandler, None)
      } else {
        (new SaslRpcHandler(nettyRpcHandler, securityManager),
          Some(new SaslClientBootstrap(transportConf, conf.getAppId, securityManager)))
      }
    }
    // Build the TransportContext
    transportContext = new TransportContext(transportConf, rpcHandler)
    // Create the RPC client factory TransportClientFactory
    clientFactory = transportContext.createClientFactory(bootstrap.toList)
    // Create the Netty server TransportServer
    server = transportContext.createServer(conf.getInt("spark.blockManager.port", 0))
    appId = conf.getAppId
    logInfo("Server created on " + server.getPort)
  }

Its initialization mainly accomplishes the following tasks:

  • Create the RPC server NettyBlockRpcServer
  • Build the TransportContext
  • Build the RPC client factory TransportClientFactory
  • Create the Netty server TransportServer

The Block RPC Service

When the map and reduce tasks of a shuffle run on different nodes, the reduce side has to download the map tasks' intermediate output from the remote node, so NettyBlockRpcServer is opened to serve block downloads. For fault tolerance, block data sometimes also needs to be replicated to other nodes, so it serves uploads as well. The source is as follows:

class NettyBlockRpcServer(
    serializer: Serializer,
    blockManager: BlockDataManager)
  extends RpcHandler with Logging {

  private val streamManager = new OneForOneStreamManager()

  override def receive(
      client: TransportClient,
      messageBytes: Array[Byte],
      responseContext: RpcResponseCallback): Unit = {
    val message = BlockTransferMessage.Decoder.fromByteArray(messageBytes)
    logTrace(s"Received request: $message")

    message match {
      case openBlocks: OpenBlocks =>
        val blocks: Seq[ManagedBuffer] =
          openBlocks.blockIds.map(BlockId.apply).map(blockManager.getBlockData)
        val streamId = streamManager.registerStream(blocks.iterator)
        logTrace(s"Registered streamId $streamId with ${blocks.size} buffers")
        responseContext.onSuccess(new StreamHandle(streamId, blocks.size).toByteArray)

      case uploadBlock: UploadBlock =>
        // StorageLevel is serialized as bytes using our JavaSerializer.
        val level: StorageLevel =
          serializer.newInstance().deserialize(ByteBuffer.wrap(uploadBlock.metadata))
        val data = new NioManagedBuffer(ByteBuffer.wrap(uploadBlock.blockData))
        blockManager.putBlockData(BlockId(uploadBlock.blockId), data, level)
        responseContext.onSuccess(new Array[Byte](0))
    }
  }

  override def getStreamManager(): StreamManager = streamManager
}
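For context, the client-side counterpart of the OpenBlocks branch above sends an OpenBlocks message over sendRpc and decodes the returned StreamHandle before fetching the registered chunks (in Spark this is the job of OneForOneBlockFetcher). The sketch below is a simplified, hypothetical illustration of that exchange; it assumes the OpenBlocks constructor takes (appId, execId, blockIds) and that a connected TransportClient is already available:

import org.apache.spark.network.client.{RpcResponseCallback, TransportClient}
import org.apache.spark.network.shuffle.protocol.{BlockTransferMessage, OpenBlocks, StreamHandle}

def openBlocks(client: TransportClient, appId: String, execId: String,
               blockIds: Array[String]): Unit = {
  // Ask the server to register a stream over the requested blocks
  val openMessage = new OpenBlocks(appId, execId, blockIds)
  client.sendRpc(openMessage.toByteArray, new RpcResponseCallback {
    override def onSuccess(response: Array[Byte]): Unit = {
      // The server replies with a StreamHandle (see the onSuccess call in the server code above)
      val handle = BlockTransferMessage.Decoder.fromByteArray(response).asInstanceOf[StreamHandle]
      // Each registered buffer can now be fetched as one chunk of handle.streamId
      println(s"Stream ${handle.streamId} exposes ${handle.numChunks} chunks")
    }
    override def onFailure(e: Throwable): Unit = e.printStackTrace()
  })
}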

The Transport Context: TransportContext

TransportContext maintains the transport context: it can create both the Netty server and the Netty client (through the client factory). Its fields and constructor are as follows:

public class TransportContext {
  private final Logger logger = LoggerFactory.getLogger(TransportContext.class);

  // Configuration, mainly controlling the number of client and server threads Netty uses for shuffle IO
  private final TransportConf conf;
  // Handles RPC requests received by the shuffle IO server, serving requests to open or upload blocks
  private final RpcHandler rpcHandler;
  // Encodes outgoing messages into the wire format so the peer can parse them without framing or parsing errors
  private final MessageEncoder encoder;
  // Decodes bytes received from the peer back into messages, guarding against dropped frames and parsing errors
  private final MessageDecoder decoder;

  public TransportContext(TransportConf conf, RpcHandler rpcHandler) {
    this.conf = conf;
    this.rpcHandler = rpcHandler;
    this.encoder = new MessageEncoder();
    this.decoder = new MessageDecoder();
  }

The RPC Client Factory: TransportClientFactory

TransportClientFactory is the factory for Netty clients. The clients it creates connect to the TransportServer inside a remote Executor's BlockManager and use its RPC service to download or upload blocks. Its constructor is as follows:

public TransportClientFactory(
      TransportContext context,
      // Bootstraps to run on every newly created client
      List<TransportClientBootstrap> clientBootstraps) {
    this.context = Preconditions.checkNotNull(context);
    this.conf = context.getConf();
    this.clientBootstraps = Lists.newArrayList(Preconditions.checkNotNull(clientBootstraps));
    // Cache of client connections, keyed by remote socket address
    this.connectionPool = new ConcurrentHashMap<SocketAddress, ClientPool>();
    // Number of connections between a pair of nodes, configurable via spark.shuffle.io.numConnectionsPerPeer (default 1)
    this.numConnectionsPerPeer = conf.numConnectionsPerPeer();
    this.rand = new Random();

    IOMode ioMode = IOMode.valueOf(conf.ioMode());
    // Channel class used when client channels are created; selected via spark.shuffle.io.mode (the default, NIO, maps to NioSocketChannel)
    this.socketChannelClass = NettyUtils.getClientChannelClass(ioMode);
    // TODO: Make thread pool name configurable.
    this.workerGroup = NettyUtils.createEventLoop(ioMode, conf.clientThreads(), "shuffle-client");
    // Pooled ByteBuf allocator with thread-local caching disabled
    this.pooledAllocator = NettyUtils.createPooledByteBufAllocator(
      conf.preferDirectBufs(), false /* allowCache */, conf.clientThreads());
  }

The Netty Server: TransportServer

TransportServer is the Netty server through which remote nodes access the RPC service of the local Executor's BlockManager in order to download or upload blocks. Its constructor is as follows:

  /** Creates a TransportServer that binds to the given port, or to any available if 0. */
  public TransportServer(TransportContext context, int portToBind) {
    this.context = context;
    this.conf = context.getConf();

    init(portToBind);
  }

As the code shows, the constructor delegates to init(), which is essentially standard Netty server bootstrapping, so the article does not go into its details.
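For readers unfamiliar with Netty, the snippet below is a minimal, generic Netty 4 server bootstrap sketch; it is not Spark's actual TransportServer.init code, only an illustration of the pattern it follows (event-loop groups, a channel class, a pipeline initializer, and a bind on the requested port, where 0 means any free port):

import io.netty.bootstrap.ServerBootstrap
import io.netty.channel.ChannelInitializer
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.SocketChannel
import io.netty.channel.socket.nio.NioServerSocketChannel

def startServer(portToBind: Int): Unit = {
  val bossGroup = new NioEventLoopGroup(1)   // accepts incoming connections
  val workerGroup = new NioEventLoopGroup()  // handles established channels
  val bootstrap = new ServerBootstrap()
    .group(bossGroup, workerGroup)
    .channel(classOf[NioServerSocketChannel])
    .childHandler(new ChannelInitializer[SocketChannel] {
      override def initChannel(ch: SocketChannel): Unit = {
        // In Spark, TransportContext installs the message encoder, decoder and
        // request handler into the pipeline here; this sketch leaves it empty.
      }
    })
  // Binding to port 0 asks the OS for any available port, matching the
  // "or to any available if 0" comment on the constructor above.
  val future = bootstrap.bind(portToBind).sync()
  println("Bound to " + future.channel().localAddress())
}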

Fetching Shuffle Files from Remote Nodes

The fetchBlocks method of NettyBlockTransferService fetches shuffle files from a remote node:

override def fetchBlocks(
      host: String,
      port: Int,
      execId: String,
      blockIds: Array[String],
      listener: BlockFetchingListener): Unit = {
    logTrace(s"Fetch blocks from $host:$port (executor id $execId)")
    try {
      val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
        override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
          val client = clientFactory.createClient(host, port)
          new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
        }
      }

      val maxRetries = transportConf.maxIORetries()
      if (maxRetries > 0) {
        // Note this Fetcher will correctly handle maxRetries == 0; we avoid it just in case there's
        // a bug in this code. We should remove the if statement once we're sure of the stability.
        new RetryingBlockFetcher(transportConf, blockFetchStarter, blockIds, listener).start()
      } else {
        blockFetchStarter.createAndStart(blockIds, listener)
      }
    } catch {
      case e: Exception =>
        logError("Exception while beginning fetchBlocks", e)
        blockIds.foreach(listener.onBlockFetchFailure(_, e))
    }
  }
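A hedged sketch of how a caller (in Spark, ShuffleBlockFetcherIterator plays this role) might invoke fetchBlocks is shown below. The host, port, executor id and block id are placeholders; only the BlockFetchingListener callbacks come from the real interface:

import org.apache.spark.network.BlockTransferService
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.shuffle.BlockFetchingListener

def fetchExample(transfer: BlockTransferService): Unit = {
  // Placeholder values; in practice they come from MapOutputTracker and BlockManagerId
  val blockIds = Array("shuffle_0_1_2")
  transfer.fetchBlocks("remote-host", 7337, "executor-42", blockIds,
    new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
        // Each successfully fetched block arrives as a ManagedBuffer
        println(s"Fetched $blockId, ${data.size()} bytes")
      }
      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        println(s"Failed to fetch $blockId: ${e.getMessage}")
      }
    })
}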

Uploading Shuffle Files

The uploadBlock method of NettyBlockTransferService uploads a block to a remote executor. It first creates a client connected to the chosen BlockManager via the given hostname and port, serializes the storage level and the block data into byte arrays, and then sends them as an UploadBlock RPC message. The source is as follows:

 override def uploadBlock(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel): Future[Unit] = {
    val result = Promise[Unit]()
    val client = clientFactory.createClient(hostname, port)

    // StorageLevel is serialized as bytes using our JavaSerializer. Everything else is encoded
    // using our binary protocol.
    val levelBytes = serializer.newInstance().serialize(level).array()

    // Convert or copy nio buffer into array in order to serialize it.
    val nioBuffer = blockData.nioByteBuffer()
    val array = if (nioBuffer.hasArray) {
      nioBuffer.array()
    } else {
      val data = new Array[Byte](nioBuffer.remaining())
      nioBuffer.get(data)
      data
    }

    client.sendRpc(new UploadBlock(appId, execId, blockId.toString, levelBytes, array).toByteArray,
      new RpcResponseCallback {
        override def onSuccess(response: Array[Byte]): Unit = {
          logTrace(s"Successfully uploaded block $blockId")
          result.success()
        }
        override def onFailure(e: Throwable): Unit = {
          logError(s"Error while uploading block $blockId", e)
          result.failure(e)
        }
      })

    result.future
  }
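Because uploadBlock returns a Future[Unit], a caller that needs the upload to finish before continuing (for example, when replicating a block to a peer) can simply block on that future. A minimal, hypothetical caller sketch with placeholder host, port and executor id:

import scala.concurrent.Await
import scala.concurrent.duration._

import org.apache.spark.network.BlockTransferService
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.storage.{BlockId, StorageLevel}

def uploadAndWait(transfer: BlockTransferService, blockId: BlockId, data: ManagedBuffer): Unit = {
  // Placeholder peer address and executor id; in Spark they come from a BlockManagerId
  val future = transfer.uploadBlock("peer-host", 7337, "executor-7",
    blockId, data, StorageLevel.MEMORY_AND_DISK_SER)
  // Wait until the remote BlockManager has stored the block (timeout is illustrative)
  Await.result(future, 30.seconds)
}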
