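The previous article covered how to install and use mpi4py. Here we work through a few simple examples of parallel programming with mpi4py so that readers can get started quickly. The examples come from the mpi4py documentation, some with minor modifications.
Point-to-point communication
Passing generic Python objects (blocking)
This approach is simple to use and works for any Python object that can be serialized with pickle. However, the pickle and unpickle steps on the sending and receiving sides are not efficient, especially when transferring large amounts of data. In addition, blocking communication holds up the calling process while the message is being passed.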
# p2p_blocking.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    print('process %d sends %s' % (rank, data))
    # blocking send of a pickled Python object to rank 1
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    # blocking receive; returns the unpickled object
    data = comm.recv(source=0, tag=11)
    print('process %d receives %s' % (rank, data))
Running it gives:
$ mpiexec -n 2 python p2p_blocking.py
process 0 sends {'a': 7, 'b': 3.14}
process 1 receives {'a': 7, 'b': 3.14}
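The lowercase send()/recv() pair works on arbitrary objects because the object is pickled before it is sent and unpickled on arrival. The following sketch uses only the standard pickle module, with no MPI involved, to show what that round trip implies: the receiver ends up with an equal but separate copy, built from a byte string that had to be produced and parsed. The helper name `pickle_roundtrip` is made up for this illustration and is not part of mpi4py.

```python
import pickle


def pickle_roundtrip(obj):
    """Serialize obj to bytes and rebuild it, as a stand-in for send + recv."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(payload), len(payload)


data = {'a': 7, 'b': 3.14}
copy, nbytes = pickle_roundtrip(data)

print(copy == data)   # True: the contents survive the round trip
print(copy is data)   # False: the receiver holds an independent copy
print(nbytes > 0)     # True: some bytes had to be produced and parsed
```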
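Passing generic Python objects (non-blocking)
Like the blocking version, this approach is simple to use and works for any picklable Python object, with the same pickle and unpickle overhead, which is most noticeable for large amounts of data. Non-blocking communication lets communication overlap with computation, which can improve performance considerably.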
# p2p_non_blocking.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'a': 7, 'b': 3.14}
    print('process %d sends %s' % (rank, data))
    # isend() returns a request object immediately
    req = comm.isend(data, dest=1, tag=11)
    req.wait()
elif rank == 1:
    # irecv() returns a request object; wait() returns the received object
    req = comm.irecv(source=0, tag=11)
    data = req.wait()
    print('process %d receives %s' % (rank, data))
Running it gives:
$ mpiexec -n 2 python p2p_non_blocking.py
process 0 sends {'a': 7, 'b': 3.14}
process 1 receives {'a': 7, 'b': 3.14}
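The script above calls wait() immediately after isend()/irecv(), so it shows the API but not the overlap. A minimal sketch of the overlap pattern, assuming the same two ranks, is to start the transfer, do work that does not depend on the message, and only then wait. The names `busy_work` and `exchange_with_overlap` are hypothetical; only Get_rank(), isend(), irecv(), and wait() come from mpi4py.

```python
def busy_work(n):
    """Stand-in for computation that does not need the incoming message."""
    return sum(i * i for i in range(n))


def exchange_with_overlap(comm, payload, n=100000):
    """Rank 0 sends payload to rank 1 while both ranks keep computing.

    comm is an mpi4py communicator such as MPI.COMM_WORLD.
    Returns (local_result, received); received is None except on rank 1.
    """
    rank = comm.Get_rank()
    received = None
    if rank == 0:
        req = comm.isend(payload, dest=1, tag=11)   # returns immediately
        local = busy_work(n)                        # runs while the send is pending
        req.wait()                                  # block until the send completes
    elif rank == 1:
        req = comm.irecv(source=0, tag=11)          # returns immediately
        local = busy_work(n)                        # runs while the receive is pending
        received = req.wait()                       # block until the message arrives
    else:
        local = busy_work(n)                        # extra ranks only compute
    return local, received
```

To try it, append `from mpi4py import MPI` and `print(exchange_with_overlap(MPI.COMM_WORLD, {'a': 7, 'b': 3.14}))` to the file and launch it with mpiexec -n 2 as before. How much overlap actually takes place depends on the MPI implementation.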
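Passing numpy arrays (the fast way)
Array-like data, or more precisely Python objects that expose the single-segment buffer interface, such as numpy arrays and the built-in bytes/string/array types, can be transferred directly in a more efficient way, with no pickle serialization or deserialization. To transfer data this way, use the communicator methods whose names start with an uppercase letter, such as Send(), Recv(), Bcast(), Scatter(), and Gather().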
# p2p_numpy_array.py
import numpy
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# passing MPI datatypes explicitly
if rank == 0:
    data = numpy.arange(10, dtype='i')
    print('process %d sends %s' % (rank, data))
    comm.Send([data, MPI.INT], dest=1, tag=77)
elif rank == 1:
    # the receive buffer must be allocated before calling Recv()
    data = numpy.empty(10, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=77)
    print('process %d receives %s' % (rank, data))
# automatic MPI datatype discovery
if rank == 0:
    data = numpy.arange(10, dtype=numpy.float64)
    print('process %d sends %s' % (rank, data))
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    data = numpy.empty(10, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)
    print('process %d receives %s' % (rank, data))
Running it gives:
$ mpiexec -n 2 python p2p_numpy_array.py
process 0 sends [0 1 2 3 4 5 6 7 8 9]
process 1 receives [0 1 2 3 4 5 6 7 8 9]
process 0 sends [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
process 1 receives [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
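What the uppercase methods need is direct access to one contiguous block of memory. Python exposes such blocks through its buffer protocol, which can be inspected with the built-in memoryview. The sketch below involves no MPI; it only illustrates what "buffer-like" means for the built-in bytes and array types mentioned above. `describe_buffer` is a hypothetical helper, and how mpi4py inspects buffers internally is not shown here.

```python
from array import array


def describe_buffer(obj):
    """Summarize the memory block an object exposes via the buffer protocol."""
    view = memoryview(obj)
    return {
        'format': view.format,          # element type code, e.g. 'i' for C int
        'itemsize': view.itemsize,      # bytes per element
        'nbytes': view.nbytes,          # total size of the block
        'contiguous': view.contiguous,  # True if the block is one segment
    }


ints = array('i', range(10))
info = describe_buffer(ints)
print(info['format'], info['nbytes'] == 10 * info['itemsize'], info['contiguous'])

raw = describe_buffer(b'hello')
print(raw['format'], raw['nbytes'])

# A dict exposes no such memory block, so it can only travel via pickle.
try:
    memoryview({'a': 7})
except TypeError:
    print('dict does not support the buffer protocol')
```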
Collective communication
Broadcast
A broadcast copies data from the root process to every other process in the same group.
Broadcasting a generic Python object
# bcast.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'key1': [7, 2.72, 2+3j],
            'key2': ('abc', 'xyz')}
    print('before broadcasting: process %d has %s' % (rank, data))
else:
    data = None
    print('before broadcasting: process %d has %s' % (rank, data))
# every process calls bcast(); the return value is the root's object
data = comm.bcast(data, root=0)
print('after broadcasting: process %d has %s' % (rank, data))
Running it gives:
$ mpiexec -n 2 python bcast.py
before broadcasting: process 0 has {'key2': ('abc', 'xyz'), 'key1': [7, 2.72, (2+3j)]}
after broadcasting: process 0 has {'key2': ('abc', 'xyz'), 'key1': [7, 2.72, (2+3j)]}
before broadcasting: process 1 has None
after broadcasting: process 1 has {'key2': ('abc', 'xyz'), 'key1': [7, 2.72, (2+3j)]}
Broadcasting a numpy array
# Bcast.py
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(10, dtype='i')
    print('before broadcasting: process %d has %s' % (rank, data))
else:
    # non-root processes allocate a buffer of the same size and dtype
    data = np.zeros(10, dtype='i')
    print('before broadcasting: process %d has %s' % (rank, data))
# Bcast() fills the buffer in place
comm.Bcast(data, root=0)
print('after broadcasting: process %d has %s' % (rank, data))
Running it gives:
$ mpiexec -n 2 python Bcast.py
before broadcasting: process 0 has [0 1 2 3 4 5 6 7 8 9]
after broadcasting: process 0 has [0 1 2 3 4 5 6 7 8 9]
before broadcasting: process 1 has [0 0 0 0 0 0 0 0 0 0]
after broadcasting: process 1 has [0 1 2 3 4 5 6 7 8 9]
Scatter
A scatter sends a different message from the root process to each process in the group, the root included.
Scattering generic Python objects
# scatter.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    # one element per process
    data = [(i + 1)**2 for i in range(size)]
    print('before scattering: process %d has %s' % (rank, data))
else:
    data = None
    print('before scattering: process %d has %s' % (rank, data))
data = comm.scatter(data, root=0)
print('after scattering: process %d has %s' % (rank, data))
Running it gives:
$ mpiexec -n 3 python scatter.py
before scattering: process 0 has [1, 4, 9]
after scattering: process 0 has 1
before scattering: process 1 has None
after scattering: process 1 has 4
before scattering: process 2 has None
after scattering: process 2 has 9
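In scatter.py the root builds a list with exactly one element per process. When the work items do not line up with the process count, one option is to group them into `size` chunks first. The sketch below assumes the chunks should be as even as possible; `split_into_chunks` and `scatter_items` are hypothetical helpers, and only scatter(), Get_rank(), and Get_size() are mpi4py calls.

```python
def split_into_chunks(items, nparts):
    """Split items into nparts consecutive chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), nparts)
    chunks, start = [], 0
    for i in range(nparts):
        stop = start + base + (1 if i < extra else 0)   # first `extra` chunks get one more
        chunks.append(items[start:stop])
        start = stop
    return chunks


def scatter_items(comm, items=None, root=0):
    """Scatter a list of any length; each rank gets one chunk (a list).

    comm is an mpi4py communicator such as MPI.COMM_WORLD.
    Only the root needs to pass items.
    """
    chunks = None
    if comm.Get_rank() == root:
        chunks = split_into_chunks(items, comm.Get_size())
    return comm.scatter(chunks, root=root)


print(split_into_chunks(list(range(7)), 3))   # [[0, 1, 2], [3, 4], [5, 6]]
```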
Scattering a numpy array
# Scatter.py
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = None
if rank == 0:
    sendbuf = np.empty([size, 10], dtype='i')
    sendbuf.T[:, :] = range(size)  # row r is filled with the value r
print('before scattering: process %d has %s' % (rank, sendbuf))
# every process, the root included, allocates its own receive buffer
recvbuf = np.empty(10, dtype='i')
comm.Scatter(sendbuf, recvbuf, root=0)
print('after scattering: process %d has %s' % (rank, recvbuf))
Running it gives:
$ mpiexec -n 3 python Scatter.py
before scattering: process 0 has [[0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1]
 [2 2 2 2 2 2 2 2 2 2]]
before scattering: process 1 has None
before scattering: process 2 has None
after scattering: process 0 has [0 0 0 0 0 0 0 0 0 0]
after scattering: process 2 has [2 2 2 2 2 2 2 2 2 2]
after scattering: process 1 has [1 1 1 1 1 1 1 1 1 1]
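The line `sendbuf.T[:, :] = range(size)` in Scatter.py is compact but easy to misread. Assigning through the transposed view fills row r of sendbuf with the value r, and Scatter() then hands row r to rank r, as the output above shows. The numpy-only sketch below (no MPI) checks this against a more explicit construction; `build_sendbuf` is a name chosen for this illustration.

```python
import numpy as np


def build_sendbuf(size, n=10):
    """Build the (size, n) send buffer from Scatter.py: row r holds the value r."""
    sendbuf = np.empty([size, n], dtype='i')
    # Each row of the transpose receives [0, 1, ..., size-1],
    # so column r of the transpose (row r of sendbuf) is all r.
    sendbuf.T[:, :] = range(size)
    return sendbuf


sendbuf = build_sendbuf(3)
explicit = np.repeat(np.arange(3, dtype='i')[:, None], 10, axis=1)

print(np.array_equal(sendbuf, explicit))   # True
print(sendbuf[1])                          # the row that rank 1 receives
print(sendbuf.flags['C_CONTIGUOUS'])       # True: each row is contiguous in memory
```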
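Gather
A gather is the inverse of a scatter: the root process collects a different message from each process in the group, itself included, and stores them in rank order in its receive buffer.
Gathering generic Python objects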
# gather.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
data = (rank + 1)**2
print('before gathering: process %d has %s' % (rank, data))
# the root gets a list ordered by rank; every other process gets None
data = comm.gather(data, root=0)
print('after gathering: process %d has %s' % (rank, data))
Running it gives:
$ mpiexec -n 3 python gather.py
before gathering: process 0 has 1
after gathering: process 0 has [1, 4, 9]
before gathering: process 1 has 4
after gathering: process 1 has None
before gathering: process 2 has 9
after gathering: process 2 has None
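Since gather() returns the collected list only on the root and None everywhere else, code that post-processes the result has to handle both cases. A minimal sketch, assuming the gathered values are numbers to be summed: `combine` and `gather_total` are hypothetical helpers, and only gather() is an mpi4py call.

```python
def combine(values):
    """Sum gathered values on the root; pass None through on other ranks."""
    if values is None:
        return None
    return sum(values)


def gather_total(comm, local_value, root=0):
    """Gather one number per rank and return their sum on the root.

    comm is an mpi4py communicator such as MPI.COMM_WORLD.
    """
    values = comm.gather(local_value, root=root)   # list on root, None elsewhere
    return combine(values)


print(combine([1, 4, 9]))   # 14, what the root in gather.py would compute
print(combine(None))        # None, what ranks 1 and 2 would get
```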
Gathering numpy arrays
# Gather.py
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = np.zeros(10, dtype='i') + rank
print('before gathering: process %d has %s' % (rank, sendbuf))
recvbuf = None
if rank == 0:
    # only the root needs a receive buffer, one row per process
    recvbuf = np.empty([size, 10], dtype='i')
comm.Gather(sendbuf, recvbuf, root=0)
print('after gathering: process %d has %s' % (rank, recvbuf))
Running it gives:
$ mpiexec -n 3 python Gather.py
before gathering: process 0 has [0 0 0 0 0 0 0 0 0 0]
after gathering: process 0 has [[0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1]
 [2 2 2 2 2 2 2 2 2 2]]
before gathering: process 1 has [1 1 1 1 1 1 1 1 1 1]
after gathering: process 1 has None
before gathering: process 2 has [2 2 2 2 2 2 2 2 2 2]
after gathering: process 2 has None
Finally, let us compare the performance of the lowercase send()/recv() methods with the uppercase Send()/Recv() methods when transferring a numpy array.
Comparing send()/recv() with Send()/Recv()
# send_recv_timing.py
import time
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.random.randn(10000).astype(np.float64)
else:
    data = np.empty(10000, dtype=np.float64)
comm.barrier()
# use comm.send() and comm.recv()
t1 = time.time()
if rank == 0:
    comm.send(data, dest=1, tag=1)
elif rank == 1:
    # only rank 1 receives, so any extra processes do not wait forever
    comm.recv(source=0, tag=1)
t2 = time.time()
if rank == 0:
    print('time used by send/recv: %f seconds' % (t2 - t1))
comm.barrier()
# use comm.Send() and comm.Recv()
t1 = time.time()
if rank == 0:
    comm.Send(data, dest=1, tag=2)
elif rank == 1:
    comm.Recv(data, source=0, tag=2)
t2 = time.time()
if rank == 0:
    print('time used by Send/Recv: %f seconds' % (t2 - t1))
Running it gives:
$ mpiexec -n 2 python send_recv_timing.py
time used by send/recv: 0.000412 seconds
time used by Send/Recv: 0.000091 seconds
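With nearly identical code, the uppercase Send()/Recv() methods transfer the numpy array considerably faster in this run (about 0.09 ms versus 0.41 ms). For parallel operations on numpy arrays, prefer the uppercase communication methods where possible.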
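A single measurement with time.time() is noisy at these time scales, so the figures above are best read as indicative. One way to get steadier numbers, sketched here under the assumption that the best of several repetitions is a fair summary, is to repeat each transfer and synchronize before every repetition. `best_of` and `time_transfer` are hypothetical helpers; barrier() is the only mpi4py call.

```python
import time


def best_of(action, repeats=5, before=None):
    """Run action() several times and return the shortest elapsed time in seconds.

    before, if given, is called ahead of each run and is not timed.
    """
    best = float('inf')
    for _ in range(repeats):
        if before is not None:
            before()
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best


def time_transfer(comm, action, repeats=5):
    """Time a rank-specific transfer, synchronizing all ranks before each run.

    comm is an mpi4py communicator; action wraps this rank's send or receive call.
    """
    return best_of(action, repeats=repeats, before=comm.barrier)


# Local check of the timing helper, with no MPI involved.
calls = []
elapsed = best_of(lambda: calls.append(1), repeats=3)
print(len(calls), elapsed >= 0.0)   # 3 True
```

In the timing script, rank 0 could pass `lambda: comm.Send(data, dest=1, tag=2)` as the action and rank 1 `lambda: comm.Recv(data, source=0, tag=2)`, with every rank using the same number of repetitions so that the barriers match up.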
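The simple examples above show how to do parallel programming in Python with mpi4py. mpi4py makes MPI programming in Python easy, and considerably more convenient and flexible than calling the MPI API from C, C++, or Fortran. In particular, its pickle-based mechanism for passing generic Python objects means we do not have to think about the type or length of the data being sent. This flexibility and ease of use costs some performance, but the loss is negligible when the amount of data is small. When large amounts of array data must be transferred, the uppercase communication methods let data move between processes at speeds close to those of C, C++, or Fortran. For numpy arrays this efficiency comes with little or no loss of flexibility or ease of use, because mpi4py can infer the array's datatype and length automatically, so they usually do not need to be specified explicitly. This makes it very convenient to write high-performance parallel code with numpy arrays.
Later articles will describe the methods mpi4py provides and how to use them in detail.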