HPC Techniques: An Analysis of MPICH Implementation Principles


With the rise of the supercomputing industry, MPI has become the dominant parallel technology for CAE solver development. MPICH is one of the most important implementations of the MPI standard; its development has proceeded in step with the drafting of the MPI specification, so it best reflects how MPI changes and evolves. It is therefore well worth studying the implementation principles of MPI, with MPICH as the representative implementation.

 Refs. from MPICH User’s Guide

MPICH is a high-performance and widely portable implementation of the MPI Standard, designed to implement all of MPI-1, MPI-2, and MPI-3 (including dynamic process management, one-sided operations, parallel I/O, and other extensions).

In addition, although quite a few books at home and abroad cover MPI programming, most of them are limited to the MPI programming model; few analyze in depth how MPI (especially the mainstream MPI implementations) is actually implemented. That is another reason for writing this article.

This article aims to deepen the understanding of MPI concepts, principles, and usage by analyzing the MPICH source code.

Note 1: Given the limits of the author's expertise, errors are inevitable; criticism and corrections are welcome.

Note 2: This article will be updated from time to time.


0. Scope

MPI implementation: MPICH v3.0

PM (process manager): smpd


1. Overview

1.1 Architecture

1.2 Communication Devices

Refs. from MPICH Installer’s Guide

MPICH is designed to be build with many different communication devices, allowing an implementation to be tuned for different communication fabrics. A simple communication device, known as “ch3” (for the third version of the “channel” interface) is provided with MPICH and is the default choice.

The ch3 device itself supports a variety of communication methods. These are specified by providing the name of the method after a colon in the --with-device configure option. For example, --with-device=ch3:sock selects the (older) socket-base communication method. Methods supported by the MPICH group include:

  • ch3:nemesis This method is our new, high performance method. It has been made the default communication channel starting the 1.1 release of MPICH. It uses shared-memory to send messages between processes on the same node and the network for processes between nodes. Currently sockets and Myrinet-MX are supported networks. It supports MPI_THREAD_MULTIPLE and other levels of thread safety.
  • ch3:sock This method uses sockets for all communications between processes. It supports MPI_THREAD_MULTIPLE and other levels of thread safety.
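
For example, to select the default nemesis channel explicitly at configure time (following the quoted guide):

./configure --with-device=ch3:nemesis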

1.3 PM and PMI

Refs. from MPICH FAQ

Process managers are basically external (typically distributed) agents that spawn and manage parallel jobs. These process managers communicate with MPICH processes using a predefined interface called as PMI (process management interface). Since the interface is (informally) standardized within MPICH and its derivatives, you can use any process manager from MPICH or its derivatives with any MPI application built with MPICH or any of its derivatives, as long as they follow the same wire protocol. There are three known implementations of the PMI wire protocol: "simple", "smpd" and "slurm". By default, MPICH and all its derivatives use the "simple" PMI wire protocol, but MPICH can be configured to use "smpd" or "slurm" as well.

For example, MPICH provides several different process managers such as Hydra, MPD, Gforker and Remshell which follow the "simple" PMI wire protocol. MVAPICH2 provides a different process manager called "mpirun" that also follows the same wire protocol. OSC mpiexec follows the same wire protocol as well. You can mix and match an application built with any MPICH derivative with any process manager. For example, an application built with Intel MPI can run with OSC mpiexec or MVAPICH2's mpirun or MPICH's Gforker.

MPD has been the traditional default process manager for MPICH till the 1.2.x release series. Starting the 1.3.x series, Hydra is the default process manager.

SMPD is another process manager distributed with MPICH that uses the "smpd" PMI wire protocol. This is mainly used for running MPICH on Windows or a combination of UNIX and Windows machines. This will be deprecated in the future releases of MPICH in favour of Hydra. MPICH can be configured with SMPD using:

./configure --with-pm=smpd --with-pmi=smpd

SLURM is an external process manager that uses MPICH's PMI interface as well.

Note that the default build of MPICH will work fine in SLURM environments. No extra steps are needed.

However, if you want to use the srun tool to launch jobs instead of the default mpiexec, you can configure MPICH as follows:

./configure --with-pm=none --with-pmi=slurm

Once configured with slurm, no internal process manager is built for MPICH; the user is expected to use SLURM's launch models (such as srun).

Refs. from Process Manager Interface

MPI needs the services of a process manager. Many MPI implementations include their own process managers. For greatest flexibility and usability, an MPI program should work with any process manager, including third-party process managers (e.g., PBS). MPICH uses a "Process Manager Interface" (PMI) to separate the process management functions from the MPI implementation. PMI should be scalable in design and functional in specification. MPICH executables can be handled by different process managers without recompiling or relinking. MPICH includes multiple implementations.  

 Refs. from MPICH Installer’s Guide

MPICH has been designed to work with multiple process managers; that is, although you can start MPICH jobs with mpiexec, there are different mechanisms by which your processes are started. An interface (called PMI) isolates the MPICH library code from the process manager.

  • hydra This is the default process manager that natively uses the existing daemons on the system such as ssh, slurm, pbs.
  • gforker This is a simple process manager that creates all processes on a single machine. It is useful both for debugging and for running on shared memory multiprocessors.

 Refs. from MPICH User’s Guide

The remshell “process manager” provides a very simple version of mpiexec that makes use of the secure shell command (ssh) to start processes on a collection of machines. As this is intended primarily as an illustration of how to build a version of mpiexec that works with other process managers, it does not implement all of the features of the other mpiexec programs described in this document.

There are multiple ways of using MPICH with Slurm or PBS. Hydra provides native support for both Slurm and PBS, and is likely the easiest way to use MPICH on these systems (see the Hydra documentation above for more details).

Alternatively, Slurm also provides compatibility with MPICH’s internal process management interface. To use this, you need to configure MPICH with Slurm support, and then use the srun job launching utility provided by Slurm.

Slurm itself provides MPI support; the relevant code is found in the Slurm source tree in slurm/pmi.h, contribs/pmi, contribs/pmi2, and related directories.
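
To make the PMI interaction more concrete, the following is a minimal sketch (not taken from MPICH) of how an MPI library typically talks to the process manager through the PMI-1 client API declared in pmi.h: each process initializes PMI, publishes its connection information into the job's key-value space (KVS, the subject of 1.4), synchronizes with a barrier, and then reads its peers' entries. Error handling is omitted, and the fixed buffer sizes are an assumption; real code queries PMI for the maximum lengths. The exact link flags depend on which PMI-1 library is used (e.g. MPICH's simple PMI or Slurm's contribs/pmi).

/* pmi_kvs_sketch.c - illustrative use of the PMI-1 client API (simplified). */
#include <stdio.h>
#include "pmi.h"

int main(void)
{
    int spawned, rank, size, i;
    char kvsname[256], key[64], value[256];

    PMI_Init(&spawned);                              /* connect to the process manager */
    PMI_Get_rank(&rank);                             /* my rank in the job             */
    PMI_Get_size(&size);                             /* number of processes in the job */
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));   /* name of this job's KVS         */

    /* publish this process's "business card" (here just a dummy string) */
    snprintf(key, sizeof(key), "bc-%d", rank);
    snprintf(value, sizeof(value), "contact-info-of-rank-%d", rank);
    PMI_KVS_Put(kvsname, key, value);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();                                   /* wait until everyone has published */

    /* read every peer's card back from the KVS */
    for (i = 0; i < size; i++) {
        snprintf(key, sizeof(key), "bc-%d", i);
        PMI_KVS_Get(kvsname, key, value, sizeof(value));
        printf("rank %d sees %s = %s\n", rank, key, value);
    }

    PMI_Finalize();
    return 0;
}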

1.4 KVS (key-value space)

2. Building and Installation

2.1 Installing MPICH on Linux

Download MPICH from GitHub:

git clone --recurse-submodules https://github.com/pmodels/mpich.git
git checkout -b v3.0 v3.0

(Note: since the latest version, v4.1a1, no longer maintains the smpd process manager, this article uses v3.0.)

Enter the mpich directory and run the following commands to build and install:

./autogen.sh
./configure --prefix=/home/youquan/mpich-install
make -j8
make install
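
After installation, a minimal MPI program can serve as a quick smoke test (assuming the install prefix's bin directory is on PATH):

/* hello_mpi.c - minimal smoke test for the MPICH installation.
 * Compile and run:
 *   mpicc hello_mpi.c -o hello_mpi
 *   mpiexec -n 4 ./hello_mpi
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes  */
    MPI_Get_processor_name(name, &name_len);  /* host this rank runs on     */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}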

2.2 Installing MPICH on Windows

The MPICH website stopped providing Windows installer packages after MPICH2 1.4.1p1, and the MPICH sources on GitHub dropped support for the smpd process manager after 3.1rc2. The MPICH wiki recommends using MS-MPI instead.

 Refs. from MPICH FAQ

Why can't I build MPICH on Windows anymore?

Unfortunately, due to the lack of developer resources, MPICH is not supported on Windows anymore including Cygwin. The last version of MPICH, which was supported on Windows, was MPICH2 1.4.1p1. There is minimal support left for this version, but you can find it on the downloads page:

Downloads | MPICH

Alternatively, Microsoft maintains a derivative of MPICH which should provide the features you need. You also find a link to that on the downloads page above. That version is much more likely to work on your system and will continue to be updated in the future. We recommend all Windows users migrate to using MS-MPI.

2.3 Verifying the Installation

You can run "mpiexec -help2" in a console to check whether MPICH was built and installed successfully; the console prints the mpiexec options and their usage.

mpiexec -help2

All options to mpiexec:

-n x
-np x
  launch x processes
-localonly x
-n x -localonly
  launch x processes on the local machine
-machinefile filename
  use a file to list the names of machines to launch on
-host hostname
-hosts n host1 host2 ... hostn
-hosts n host1 m1 host2 m2 ... hostn mn
  launch on the specified hosts
  In the second version the number of processes = m1 + m2 + ... + mn
-binding proc_binding_scheme
  Set the proc binding for each of the launched processes to a single core.
  Currently "auto" and "user" are supported as the proc_binding_schemes
-map drive:\\host\share
  map a drive on all the nodes
  this mapping will be removed when the processes exit
-mapall
  map all of the current network drives
  this mapping will be removed when the processes exit
  (Available currently only on windows)
-dir drive:\my\working\directory
-wdir /my/working/directory
  launch processes in the specified directory
-env var val
  set environment variable before launching the processes
-logon
  prompt for user account and password
-pwdfile filename
  read the account and password from the file specified
  put the account on the first line and the password on the second
-nompi
  launch processes without the mpi startup mechanism
-nopopup_debug
  disable the system popup dialog if the process crashes
-exitcodes
  print the process exit codes when each process exits.
-noprompt
  prevent mpiexec from prompting for user credentials.
-priority class[:level]
  set the process startup priority class and optionally level.
  class = 0,1,2,3,4   = idle, below, normal, above, high
  level = 0,1,2,3,4,5 = idle, lowest, below, normal, above, highest
  the default is -priority 1:3
-localroot
  launch the root process directly from mpiexec if the host is local.
  (This allows the root process to create windows and be debugged.)
-port port
-p port
  specify the port that smpd is listening on.
-phrase passphrase
  specify the passphrase to authenticate connections to smpd with.
-smpdfile filename
  specify the file where the smpd options are stored including the passphrase.
-path search_path
  search path for executable, ; separated
-register [-user n]
  encrypt a user name and password to the Windows registry.
  optionally specify a user slot index
-remove [-user n]
  delete the encrypted credentials from the Windows registry.
  If no user index is specified then all entries are removed.
-validate [-user n] [-host hostname]
  validate the encrypted credentials for the current or specified host.
  A specific user index can be specified otherwise index 0 is the default.
-user n
  use the registered user credentials from slot n to launch the job.
-timeout seconds
  timeout for the job.
-plaintext
  don't encrypt the data on the wire.
-delegate
  use passwordless delegation to launch processes
-impersonate
  use passwordless authentication to launch processes
-add_job <job_name> <domain\user>
-add_job <job_name> <domain\user> -host <hostname>
  add a job key for the specified domain user on the local or specified host
  requires administrator privileges
-remove_job <name>
-remove_job <name> -host <hostname>
  remove a job key from the local or specified host
  requires administrator privileges
-associate_job <name>
-associate_job <name> -host <hostname>
  associate the current user's token with the specified job on the local or specified host
-job <name>
  launch the processes in the context of the specified job
-whomai
  print the current user name
-l
  prefix output with the process number. (This option is a lowercase L not the number one)

3. mpiexec

mpiexec is used to launch MPI jobs and is the MPI command used most often in practice. For smpd, the mpiexec main function is located in src/pm/smpd/mpiexec.c.

The overall flow of mpiexec is as follows:

  • Parse the mpiexec command-line arguments;
  • Create the socket and the I/O completion port;
  • Initiate a connection: pick the smpd on host number 1, which listens on port 8676 by default;
  • Enter the event-loop state machine.

3.1 PMPI_Init

The implementation of PMPI_Init is in src/util/multichannel/mpi.c. PMPI_Init calls LoadMPILibrary and LoadFunctions to load libraries such as mpich2nemesis.dll, dynamically loading an external PMI and MPI implementation so that a user's MPI program can be run with different builds of MPICH. The comments in that file explain the motivation:

This file implements an mpi binding that calls another dynamically loaded mpi binding. The environment variables MPI_DLL_NAME and MPICH2_CHANNEL control which library should be loaded. The default library is mpich2.dll or mpich2d.dll. A wrapper dll can also be named to replace only the MPI functions using the MPI_WRAP_DLL_NAME environment variable.

The motivation for this binding is to allow compiled mpi applications to be able to use different implementations of mpich2 at run-time without re-linking the application.

This way mpiexec or the user can choose the best channel to use at run-time.
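
The quoted comments describe the standard Win32 dynamic-loading technique. The sketch below is not the actual mpi.c code, only a minimal illustration of the LoadLibrary/GetProcAddress pattern it relies on; the default DLL name and the single resolved symbol are used purely as examples.

/* dynload_sketch.c - illustrative run-time DLL selection on Windows
 * (not MPICH code). */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

typedef int (*MPI_Init_fn)(int *argc, char ***argv);

int main(void)
{
    /* choose the DLL at run time, e.g. via the MPI_DLL_NAME variable that the
     * wrapper described above honours */
    const char *dll = getenv("MPI_DLL_NAME");
    if (dll == NULL)
        dll = "mpich2.dll";

    HMODULE h = LoadLibraryA(dll);                 /* load the chosen implementation */
    if (h == NULL) {
        fprintf(stderr, "failed to load %s\n", dll);
        return 1;
    }

    /* resolve one entry point; the real wrapper resolves the whole MPI API
     * into function pointers and forwards every call through them */
    MPI_Init_fn init = (MPI_Init_fn)GetProcAddress(h, "MPI_Init");
    if (init == NULL) {
        fprintf(stderr, "MPI_Init not found in %s\n", dll);
        FreeLibrary(h);
        return 1;
    }
    printf("resolved MPI_Init at %p from %s\n", (void *)init, dll);

    FreeLibrary(h);
    return 0;
}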

3.2 SMPDU_Sock_init/SMPDU_Sock_finalize

smpd defines its own low-level, socket-based communication interface (src/pm/smpd/sock/include/smpd_util_sock.h) and provides implementations based on network I/O multiplexing mechanisms such as IOCP and poll.

  • IOCP

The code is in src/pm/smpd/sock/iocp/smpd_util_sock.c. SMPDU_Sock_init calls WSAStartup() to initialize the Winsock library, and SMPDU_Sock_finalize() calls WSACleanup() to terminate use of it.

Ref. from Microsoft

The WSAStartup function initiates use of the Winsock DLL by a process.

Syntax

int WSAStartup(
        WORD      wVersionRequired,
  [out] LPWSADATA lpWSAData
);

The WSACleanup function terminates use of the Winsock 2 DLL (Ws2_32.dll).

Syntax

int WSACleanup();

  • poll

The code is in src/pm/smpd/sock/poll/smpd_sock_init.i. Details omitted.
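
As a stand-alone illustration of the Winsock initialization pattern that SMPDU_Sock_init/SMPDU_Sock_finalize wrap (this is not the smpd code itself):

/* winsock_init_sketch.c - the WSAStartup / WSACleanup pattern used by the
 * IOCP sock implementation (illustrative only; link with ws2_32.lib). */
#include <winsock2.h>
#include <stdio.h>

int main(void)
{
    WSADATA wsa_data;

    /* request Winsock 2.2, as a typical IOCP-based implementation would */
    int err = WSAStartup(MAKEWORD(2, 2), &wsa_data);
    if (err != 0) {
        fprintf(stderr, "WSAStartup failed: %d\n", err);
        return 1;
    }
    printf("Winsock %d.%d initialized\n",
           LOBYTE(wsa_data.wVersion), HIBYTE(wsa_data.wVersion));

    /* ... sockets and completion ports would be created and used here ... */

    WSACleanup();   /* terminate use of the Winsock DLL */
    return 0;
}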

3.3 smpd_init_process

The code is in src/pm/smpd/smpd_connect.c; it mainly initializes the global structure object smpd_process.

3.4 mp_parse_command_args

The code is in src/pm/smpd/mp_parse_command_line.c; it parses the command-line arguments passed to mpiexec.

3.5 smpd_get_host_id

The code is in src/pm/smpd/smpd_host_util.c. It returns the ID of a host; if the host is not known yet, it creates an smpd_host_node_t object and adds it to smpd_process.host_list (a list whose nodes can also be linked into a binary tree).

3.6 smpd_get_next_host

The code is in src/pm/smpd/smpd_host_util.c. After mpiexec starts and mp_parse_command_args has parsed the arguments, the participating compute nodes are chosen in round-robin fashion based on the node configuration and the total number of processes. For each process, smpd_get_next_host determines the ID of the compute node it will run on, and an smpd_launch_node_t object is created for it; the smpd_process.launch_list list holds all of the launch nodes (see the excerpt from mp_parse_command_args below).

In particular, if no compute nodes are specified on the mpiexec command line, smpd_get_next_host creates an smpd_launch_node_t object representing the node on which mpiexec itself was invoked. The round-robin selection still takes place, but obviously all MPI processes then run locally.

  /* Excerpt from mp_parse_command_args() in src/pm/smpd/mp_parse_command_line.c */

    smpd_dbg_printf("Creating launch nodes (%d)\n", nproc);
	for (i=0; i<nproc; i++)
	{
	    /* create a launch_node */
	    launch_node = (smpd_launch_node_t*)MPIU_Malloc(sizeof(smpd_launch_node_t));
	    if (launch_node == NULL)
	    {
		smpd_err_printf("unable to allocate a launch node structure.\n");
		smpd_exit_fn(FCNAME);
		return SMPD_FAIL;
	    }
	    launch_node->clique[0] = '\0';
	    smpd_get_next_host(&host_list, launch_node);
        smpd_dbg_printf("Adding host (%s) to launch list \n", launch_node->hostname);
	    launch_node->iproc = cur_rank++;
#ifdef HAVE_WINDOWS_H
        if(smpd_process.affinity_map_sz > 0){
            launch_node->binding_proc =
                smpd_process.affinity_map[affinity_map_index % smpd_process.affinity_map_sz];
            affinity_map_index++;
        }
        else{
            launch_node->binding_proc = -1;
        }
#endif
	    launch_node->appnum = appnum;
	    launch_node->priority_class = n_priority_class;
	    launch_node->priority_thread = n_priority;
	    launch_node->env = launch_node->env_data;
	    strcpy(launch_node->env_data, env_data);
	    if (launch_node->alt_hostname[0] != '\0')
	    {
		if (smpd_append_env_option(launch_node->env_data, SMPD_MAX_ENV_LENGTH, "MPICH_INTERFACE_HOSTNAME", launch_node->alt_hostname) != SMPD_SUCCESS)
		{
		    smpd_err_printf("unable to add the MPICH_INTERFACE_HOSTNAME environment variable to the launch command.\n");
		    smpd_exit_fn(FCNAME);
		    return SMPD_FAIL;
		}
	    }
	    if (wdir[0] != '\0')
	    {
		strcpy(launch_node->dir, wdir);
	    }
	    else
	    {
		if (gwdir[0] != '\0')
		{
		    strcpy(launch_node->dir, gwdir);
		}
		else
		{
		    launch_node->dir[0] = '\0';
		    getcwd(launch_node->dir, SMPD_MAX_DIR_LENGTH);
		}
	    }
	    if (path[0] != '\0')
	    {
		strcpy(launch_node->path, path);
		/* should the gpath be appended to the local path? */
	    }
	    else
	    {
		if (gpath[0] != '\0')
		{
		    strcpy(launch_node->path, gpath);
		}
		else
		{
		    launch_node->path[0] = '\0';
		}
	    }
	    launch_node->map_list = drive_map_list;
	    if (drive_map_list)
	    {
		/* ref count the list so when freeing the launch_node it can be known when to free the list */
		drive_map_list->ref_count++;
	    }
	    strcpy(launch_node->exe, exe);
	    launch_node->args[0] = '\0';
	    if (smpd_process.mpiexec_inorder_launch == SMPD_TRUE)
	    {
		/* insert the node in order */
		launch_node->next = NULL;
		if (smpd_process.launch_list == NULL)
		{
		    smpd_process.launch_list = launch_node;
		    launch_node->prev = NULL;
		}
		else
		{
		    launch_node_iter = smpd_process.launch_list;
		    while (launch_node_iter->next)
			launch_node_iter = launch_node_iter->next;
		    launch_node_iter->next = launch_node;
		    launch_node->prev = launch_node_iter;
		}
	    }
	    else
	    {
		/* insert the node in reverse order */
		launch_node->next = smpd_process.launch_list;
		if (smpd_process.launch_list)
		    smpd_process.launch_list->prev = launch_node;
		smpd_process.launch_list = launch_node;
		launch_node->prev = NULL;
	    }
	}
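
As a simplified illustration of the round-robin placement described above (not the smpd code; smpd also honours per-host process counts from -hosts/-machinefile, which this sketch ignores):

/* round_robin_sketch.c - simplified rank-to-host assignment in round-robin
 * order, ignoring per-host process counts (illustrative only). */
#include <stdio.h>

int main(void)
{
    const char *hosts[] = { "node0", "node1", "node2" };
    const int nhosts = 3;
    const int nproc = 8;
    int rank;

    for (rank = 0; rank < nproc; rank++) {
        /* with no host list, every rank would map to the local host instead */
        const char *host = hosts[rank % nhosts];
        printf("rank %d -> %s\n", rank, host);
    }
    return 0;
}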

3.7 SMPDU_Sock_create_set

Two implementations are provided here as well, one for IOCP and one for poll.

  • IOCP

The code is in src/pm/smpd/sock/iocp/smpd_util_sock.c; SMPDU_Sock_create_set calls CreateIoCompletionPort() to create an I/O completion port.

 Ref. from Microsoft

Creates an input/output (I/O) completion port and associates it with a specified file handle, or creates an I/O completion port that is not yet associated with a file handle, allowing association at a later time.

Associating an instance of an opened file handle with an I/O completion port allows a process to receive notification of the completion of asynchronous I/O operations involving that file handle.

Syntax

HANDLE WINAPI CreateIoCompletionPort(
  _In_     HANDLE    FileHandle,
  _In_opt_ HANDLE    ExistingCompletionPort,
  _In_     ULONG_PTR CompletionKey,
  _In_     DWORD     NumberOfConcurrentThreads
);
  • poll

The code is in src/pm/smpd/sock/poll/smpd_sock_set.i. Details omitted.
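
The following stand-alone sketch shows the CreateIoCompletionPort usage pattern (create a new port, then associate a socket handle with it). It mirrors what the IOCP-based SMPDU_Sock routines do but is not the smpd code:

/* iocp_create_sketch.c - create an I/O completion port and associate a socket
 * with it (illustrative only; link with ws2_32.lib). */
#include <winsock2.h>
#include <stdio.h>

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    /* 1. create a completion port not yet associated with any handle */
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    if (port == NULL) {
        fprintf(stderr, "CreateIoCompletionPort failed: %lu\n", GetLastError());
        return 1;
    }

    /* 2. create an overlapped socket and associate it with the port; the
     *    completion key identifies this socket in later completions */
    SOCKET s = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0,
                         WSA_FLAG_OVERLAPPED);
    if (CreateIoCompletionPort((HANDLE)s, port, (ULONG_PTR)s, 0) == NULL) {
        fprintf(stderr, "association failed: %lu\n", GetLastError());
        return 1;
    }

    printf("socket associated with completion port\n");
    closesocket(s);
    CloseHandle(port);
    WSACleanup();
    return 0;
}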

3.8 smpd_create_context

The code is in src/pm/smpd/smpd_command.c; it creates and initializes an smpd_context_t object.

3.9 SMPDU_Sock_post_connect

SMPDU_Sock_post_connect connects to the local smpd process; again there are two implementations, one for IOCP and one for poll.

  • IOCP

The code is in src/pm/smpd/sock/iocp/smpd_util_sock.c; it creates the communication socket and calls CreateIoCompletionPort to associate it with the completion port.

  • poll

The corresponding poll implementation is under src/pm/smpd/sock/poll/. Details omitted.

3.10 smpd_enter_at_state

The code is in src/pm/smpd/smpd_state_machine.c; it runs the finite state machine, sending and receiving data and invoking the appropriate message-handling functions.

3.11 SMPDU_Sock_wait

The code is in src/pm/smpd/sock/iocp/smpd_util_sock.c; it calls GetQueuedCompletionStatus to dequeue completion packets for the sockets.

Ref. from Microsoft

Attempts to dequeue an I/O completion packet from the specified I/O completion port. If there is no completion packet queued, the function waits for a pending I/O operation associated with the completion port to complete.

Syntax

BOOL GetQueuedCompletionStatus(
  [in]  HANDLE       CompletionPort,
        LPDWORD      lpNumberOfBytesTransferred,
  [out] PULONG_PTR   lpCompletionKey,
  [out] LPOVERLAPPED *lpOverlapped,
  [in]  DWORD        dwMilliseconds
);
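
A minimal sketch of the completion-wait loop that SMPDU_Sock_wait builds around GetQueuedCompletionStatus. It is illustrative only, not the smpd code; a completion is hand-posted with PostQueuedCompletionStatus so the program is self-contained, whereas in smpd the completions come from overlapped socket operations.

/* iocp_wait_sketch.c - minimal GetQueuedCompletionStatus loop (illustrative only). */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

    /* hand-post one packet so the wait below has something to dequeue */
    PostQueuedCompletionStatus(port, 42, (ULONG_PTR)0x1234, NULL);

    for (;;) {
        DWORD nbytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED *ov = NULL;

        /* block up to 1000 ms for the next completion packet */
        BOOL ok = GetQueuedCompletionStatus(port, &nbytes, &key, &ov, 1000);
        if (!ok && ov == NULL)
            break;   /* timeout (end of this demo), or the port was closed */

        /* 'key' is the completion key given at association/post time and 'ov'
         * identifies the overlapped operation; a real event loop maps these
         * back to a per-connection context and dispatches the matching
         * read/write/accept/connect handler. */
        printf("completion: key=%p bytes=%lu\n", (void *)key, nbytes);
    }

    CloseHandle(port);
    return 0;
}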

3.12 mpiexec_rsh

Besides launching MPI processes through the smpd process manager, smpd can also spawn MPI processes via rsh; by default the command "ssh -x" is used (the rsh command can be changed through the MPIEXEC_RSH environment variable). This code is in src/pm/smpd/mpiexec_rsh.c.

4. smpd

MPICH provides several process managers, including hydra and smpd. smpd is a comparatively simple implementation (and the default process manager for the Windows build of MPICH); its main function is in src/pm/smpd/smpd.c.

The overall flow of smpd is as follows:

  • Initialize PMI and parse the command-line arguments;
  • Create the socket and the I/O completion port;
  • Listen for connection requests;
  • Enter the event-loop state machine.

4.1 smpd_entry_point

smpd_entry_point() is the main body of smpd.c; it creates the listening socket and then calls smpd_enter_at_state to enter the finite-state-machine loop.

5. The Finite State Machine

As the code analysis shows, for both the client side (mpiexec) and the server side (smpd), message processing is carried out by the smpd_enter_at_state function.

The smpd_enter_at_state code is in src/pm/smpd/smpd_state_machine.c, which implements message sending, receiving, and handling as a finite state machine (FSM).
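
smpd's state machine is large, so the fragment below is only a generic, self-contained illustration of an event-driven FSM with per-state handler functions. The states, events, and handlers here are invented for the example; smpd_enter_at_state follows roughly the same shape (a dispatcher that, after each SMPDU_Sock_wait completion, invokes the handling code selected by the context's current state), but its states and handlers are of course different.

/* fsm_sketch.c - generic event-driven finite state machine with per-state
 * handler functions (illustrative only). */
#include <stdio.h>

typedef enum { ST_IDLE, ST_CONNECTED, ST_DONE, ST_NSTATES } state_t;
typedef enum { EV_CONNECT, EV_MESSAGE, EV_CLOSE } event_t;

typedef state_t (*handler_t)(event_t ev);

static state_t on_idle(event_t ev)
{
    return (ev == EV_CONNECT) ? ST_CONNECTED : ST_IDLE;
}

static state_t on_connected(event_t ev)
{
    if (ev == EV_MESSAGE) { printf("handling message\n");   return ST_CONNECTED; }
    if (ev == EV_CLOSE)   { printf("closing connection\n"); return ST_DONE; }
    return ST_CONNECTED;
}

static state_t on_done(event_t ev)
{
    (void)ev;                      /* terminal state: ignore further events */
    return ST_DONE;
}

/* one handler per state, indexed by the current state */
static const handler_t handlers[ST_NSTATES] = { on_idle, on_connected, on_done };

int main(void)
{
    /* a scripted event sequence stands in for the completions that the real
     * event loop receives from SMPDU_Sock_wait */
    const event_t script[] = { EV_CONNECT, EV_MESSAGE, EV_MESSAGE, EV_CLOSE };
    state_t state = ST_IDLE;
    size_t i;

    for (i = 0; i < sizeof(script) / sizeof(script[0]); i++)
        state = handlers[state](script[i]);

    printf("final state: %d\n", (int)state);
    return 0;
}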

Online Resources

MPICH https://www.mpich.org/

MPICH Developer Documentation Wiki https://github.com/pmodels/mpich/blob/main/doc/wiki/Index.md

MVAPICH2 http://mvapich.cse.ohio-state.edu/

DeinoMPI http://mpi.deino.net/

slurm https://slurm.schedmd.com/

GitHub Slurm https://github.com/SchedMD/slurm
