Parallel Computing: The Parallel Prefix Method

高阶近似

Last modified: 2022-10-31 18:30:06

First published: 2022-10-30 21:06:58

Article link: https://blog.csdn.net/weixin_45937291/article/details/127605085

Contents

The prefix sum problem
Parallel prefix algorithm
Applications
Acknowledgements

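The prefix sum problem

We start with the definition of the prefix sum problem:

Input: a binary associative operator $\otimes$ (for example addition, multiplication, min, or max), and $n$ elements $x_0, x_1, x_2,\cdots,x_{n-1}$

Output: $n$ elements $s_0,s_1, \cdots,s_{n-1}$ such that for every $i\in[0,n-1]$ we have $s_i=x_0 \otimes x_1 \otimes \cdots \otimes x_i$

For example, with addition as the operator and the input $16, 23, 7, 31, 9$: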

Input	16	23	7	31	9
Computation	16	16+23	16+23+7	16+23+7+31	16+23+7+31+9
Output	16	39	46	77	86

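A serial algorithm for this problem is simple and runs in $O(n)$ time.

However, the computation has a serial dependency: $s_i$ depends on $s_{i-1}$. It is therefore worth considering how this kind of problem can be parallelized.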
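As a point of reference, the serial algorithm can be sketched as follows. This is a minimal sketch that assumes addition on `int` values; the function name `serial_prefix_sum` is only illustrative. The loop makes the dependency visible: each output reuses the running value of the previous one.

```c
#include <stddef.h>

/* Serial inclusive prefix sum with addition as the operator:
   s[i] = x[0] + x[1] + ... + x[i]. Runs in O(n) time. */
void serial_prefix_sum(const int *x, int *s, size_t n) {
    int running = 0;                 /* identity element of addition */
    for (size_t i = 0; i < n; ++i) {
        running += x[i];             /* s[i] is built from s[i-1] */
        s[i] = running;
    }
}
```

Called on the input 16, 23, 7, 31, 9 from the table above, it fills `s` with 16, 39, 46, 77, 86.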

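Parallel prefix algorithm

Consider first the following setting: the inputs $x_i$ are $9, 8, 3, 2, 7, 1, 6, 4$, the operator is addition, and there are 8 processes, each owning exactly one input $x_i$. The outputs $s_i$ can then be computed as shown in the figure below.

[Figure: computing the prefix sums of 9, 8, 3, 2, 7, 1, 6, 4 on 8 processes]

Once the computation in this figure is clear, the parallel prefix sum algorithm below is easy to follow; id is the id of the current process and p is the number of processes.

[Figure: pseudocode of PARALLEL_PREFIX_SUM(id, Xid, p)]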
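Since the pseudocode is only available as an image, here is a hedged sketch of one common formulation of the exchange, written as a single-program simulation rather than with real message passing. It follows the same partner pattern as the MPI source at the end of this article, but it is not necessarily identical to the algorithm in the figure. The names `simulate_parallel_prefix_sum` and `MAX_P` are assumptions made for this sketch. Each simulated process keeps a running `total` of its group and a `prefix`; in every round it exchanges totals with the partner whose id differs in one bit, and only a lower-numbered partner contributes to the prefix.

```c
#include <string.h>
#include <stddef.h>

#define MAX_P 64  /* hypothetical upper bound on the simulated process count */

/* Simulates the exchange on p "processes" inside one program.
   x[id] is the value owned by process id; on return prefix[id] holds
   x[0] + ... + x[id]. Requires 1 <= p <= MAX_P. */
void simulate_parallel_prefix_sum(const int *x, int *prefix, int p) {
    int total[MAX_P], next_total[MAX_P];
    for (int id = 0; id < p; ++id) {
        prefix[id] = x[id];
        total[id] = x[id];
    }
    /* One round per bit of the process id: about log2(p) rounds. */
    for (int step = 1; step < p; step <<= 1) {
        for (int id = 0; id < p; ++id) {
            int partner = id ^ step;          /* id differs in one bit */
            next_total[id] = total[id];
            if (partner >= p)
                continue;                     /* partner does not exist: idle */
            next_total[id] = total[id] + total[partner];
            if (partner < id)
                prefix[id] += total[partner]; /* only lower ids contribute */
        }
        /* All exchanges of a round use the totals from before the round. */
        memcpy(total, next_total, sizeof(int) * (size_t)p);
    }
}
```

For the input 9, 8, 3, 2, 7, 1, 6, 4 with `p = 8`, the simulated processes end up with 9, 17, 20, 22, 29, 30, 36, 40.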

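In the example above, the number of processes equals the number of input elements. In practice, however, the following situations may occur:

- $n > p$, where $n$ is the number of input elements and $p$ is the number of processes
- $n$ is not an integer multiple of $p$
  - each process then owns either $\lceil \frac{n}{p} \rceil$ or $\lfloor \frac{n}{p} \rfloor$ input elements
- $p$ is not an integer power of 2
  - $d=\lceil \log_2p \rceil$ communication phases are used
  - in any communication phase, if the id of the process that the current process should communicate with is greater than or equal to $p$ (processes are numbered from 0 to $p - 1$), the current process does nothing in that phase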

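A more general algorithm therefore consists of the following three steps (assuming every process owns exactly $n / p$ input elements; if $n$ is not an integer multiple of $p$, the steps are essentially the same):

1. Each process computes the prefix sums of its own $n / p$ elements.
2. Each process calls the PARALLEL_PREFIX_SUM(id, Xid, p) function above, where id is the id of the process, p is the number of processes, and Xid is the last of the process's own $n / p$ local prefix sums, i.e. the total of its block.
3. Finally, each process adds the combined total of all lower-numbered processes, obtained from the result of PARALLEL_PREFIX_SUM, to each of its local prefix sums. If the returned value also includes the process's own Xid, Xid has to be excluded first; otherwise the local block would be counted twice.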

The following example illustrates the process.

[Figure: worked example of the three-step algorithm]
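The three steps can also be traced in code. The sketch below simulates all `p` processes in one program, assumes that `n` is divisible by `p`, and uses addition; `block_prefix_sum`, `MAX_P`, and the `offset` array are names chosen for this illustration only. Step 2 is written so that `offset[id]` ends up holding the total of all lower-numbered blocks (an exclusive prefix), which is exactly the value step 3 has to add.

```c
#define MAX_P 64  /* hypothetical upper bound on the simulated process count */

/* Block-wise prefix sum of x[0..n-1] into s[0..n-1], simulating p processes
   that own n/p consecutive elements each.
   Assumes n % p == 0 and 1 <= p <= MAX_P. */
void block_prefix_sum(const int *x, int *s, int n, int p) {
    int m = n / p;                       /* elements per process */
    int total[MAX_P], next_total[MAX_P], offset[MAX_P];

    /* Step 1: every process computes its local prefix sums. */
    for (int id = 0; id < p; ++id) {
        int running = 0;
        for (int j = 0; j < m; ++j) {
            running += x[id * m + j];
            s[id * m + j] = running;
        }
        total[id] = running;             /* last local prefix sum (Xid) */
        offset[id] = 0;                  /* sum of all lower blocks so far */
    }

    /* Step 2: exchange block totals in about log2(p) rounds. */
    for (int step = 1; step < p; step <<= 1) {
        for (int id = 0; id < p; ++id) {
            int partner = id ^ step;
            next_total[id] = total[id];
            if (partner < p) {
                next_total[id] += total[partner];
                if (partner < id)
                    offset[id] += total[partner];
            }
        }
        for (int id = 0; id < p; ++id)
            total[id] = next_total[id];
    }

    /* Step 3: shift every local prefix sum by the total of the lower blocks. */
    for (int id = 0; id < p; ++id)
        for (int j = 0; j < m; ++j)
            s[id * m + j] += offset[id];
}
```

With the 12 inputs 1, 2, ..., 12 and `p = 3`, the block totals are 10, 26, 42, the offsets become 0, 10, 36, and the final result is 1, 3, 6, ..., 78.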

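The time complexity of this algorithm is:

Step 1: computation time $O(\frac{n}{p})$, communication time 0
Step 2: computation time $O(\log p)$, communication time $O(\log p)$
Step 3: computation time $O(\frac{n}{p})$, communication time 0

In total, the computation time is $O(\frac{n}{p}+\log p)$ and the communication time is $O(\log p)$.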

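Applications

Polynomial evaluation

Input: a real number $x_0$ and $n$ integer coefficients $\{a_0,a_1,\cdots,a_{n-1}\}$

Output: $P(x_0)=a_0+a_1x_0+a_2x_0^2+\cdots+a_{n-1}x_0^{n-1}$

This problem can be solved with the parallel prefix method:

- Assume the $n$ integer coefficients are distributed over $p$ processes; without loss of generality, process $P_i$ owns $a_{i\frac{n}{p}}$ through $a_{(i+1)\frac{n}{p}-1}$.
- Process $P_i$ is responsible for computing the local sum $\sum_{j=0}^{\frac{n}{p}-1}a_{i\frac{n}{p}+j}\,x_0^{i\frac{n}{p}+j}$.
- The power $x_0^{i\frac{n}{p}}$ needed by process $P_i$ can be obtained with parallel prefix: every process first computes $x_0^{\frac{n}{p}}$, and a parallel prefix with multiplication as the operator then gives each process $P_i$ the power $x_0^{i\frac{n}{p}}$ it needs.
- Summing the local sums of all processes (for example with a reduction) yields $P(x_0)$.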
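The decomposition above can be checked with a small single-program sketch. It assumes that `n` is divisible by `p`, and it computes the block powers $x_0^{i\frac{n}{p}}$ with a plain serial loop that stands in for the parallel prefix over multiplication; the function name `poly_eval_blocks` and the bound `MAX_P` are illustrative assumptions. Each block factors out its starting power, so the local sum becomes $x_0^{i\frac{n}{p}}\sum_j a_{i\frac{n}{p}+j}x_0^{j}$.

```c
#define MAX_P 64  /* hypothetical upper bound on the simulated process count */

/* Evaluates a[0] + a[1]*x0 + ... + a[n-1]*x0^(n-1) block by block,
   as p processes would. Assumes n % p == 0 and 1 <= p <= MAX_P. */
double poly_eval_blocks(const int *a, int n, int p, double x0) {
    int m = n / p;                       /* coefficients per process */

    /* Every process computes x0^(n/p). */
    double xm = 1.0;
    for (int j = 0; j < m; ++j)
        xm *= x0;

    /* base[i] = x0^(i*n/p): prefix products of p copies of xm, shifted by
       one position. A serial loop stands in for the parallel prefix here. */
    double base[MAX_P];
    base[0] = 1.0;
    for (int i = 1; i < p; ++i)
        base[i] = base[i - 1] * xm;

    /* Local sums (Horner's rule inside each block), then a global sum. */
    double result = 0.0;
    for (int i = 0; i < p; ++i) {
        double local = 0.0;
        for (int j = m - 1; j >= 0; --j)
            local = local * x0 + a[i * m + j];
        result += base[i] * local;
    }
    return result;
}
```

For the coefficients 1, 2, ..., 8 with $x_0 = 2$ and `p = 4`, the four block contributions are 5, 44, 272 and 1472, which add up to $P(2) = 1793$.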

Linear recurrences

Input: real numbers $x_0,x_1$ and integer coefficients $a, b$

Output: the sequence $\{x_2,x_3,\cdots,x_n\}$ such that $x_i=ax_{i-1}+bx_{i-2}$

The recurrence can be rewritten as
$\begin{bmatrix} x_i & x_{i-1} \end{bmatrix} =\begin{bmatrix} x_{i-1} & x_{i-2} \end{bmatrix} \begin{bmatrix} a & 1\\ b & 0 \end{bmatrix}$
and therefore
$\begin{bmatrix} x_i & x_{i-1} \end{bmatrix} =\begin{bmatrix} x_{1} & x_{0} \end{bmatrix} \begin{bmatrix} a & 1\\ b & 0 \end{bmatrix}^{i-1}$
Because matrix multiplication is associative, the powers $\begin{bmatrix}a & 1 \\b & 0 \end{bmatrix} ^ {i}$ can be computed with the parallel prefix method.
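To make the matrix form concrete, the sketch below uses 2x2 matrix multiplication as the prefix operator and reads $x_i$ off the first column of $M^{i-1}$. It is only an illustration under stated assumptions: integer inputs are used so that results are exact, the prefix products are formed by a serial loop standing in for the parallel prefix, and `Mat2`, `mat2_mul` and `linear_recurrence` are names invented for this example.

```c
/* 2x2 integer matrix; the associative operator is matrix multiplication. */
typedef struct { long long m[2][2]; } Mat2;

Mat2 mat2_mul(Mat2 x, Mat2 y) {
    Mat2 r;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            r.m[i][j] = x.m[i][0] * y.m[0][j] + x.m[i][1] * y.m[1][j];
    return r;
}

/* Fills out[0..n] with the sequence x_i = a*x_{i-1} + b*x_{i-2}.
   out must have room for n + 1 values. */
void linear_recurrence(long long x0, long long x1, long long a, long long b,
                       int n, long long *out) {
    Mat2 M = {{{a, 1}, {b, 0}}};
    Mat2 power = {{{1, 0}, {0, 1}}};     /* M^0, the identity */
    out[0] = x0;
    if (n >= 1)
        out[1] = x1;
    for (int i = 2; i <= n; ++i) {
        power = mat2_mul(power, M);      /* power is now M^(i-1) */
        /* [x_i, x_{i-1}] = [x_1, x_0] * M^(i-1); take the first entry. */
        out[i] = x1 * power.m[0][0] + x0 * power.m[1][0];
    }
}
```

With $x_0 = 0$, $x_1 = 1$ and $a = b = 1$ this produces the Fibonacci numbers 0, 1, 1, 2, 3, 5, 8, ...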

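Pseudo-random number sequence generator based on a linear congruential generator

Input: integers A and B, and a large prime P

Output: the pseudo-random number sequence $\{x_1,\cdots,x_n\}$, where $x_{i+1}=(Ax_i+B) \bmod P$; without loss of generality, let $x_0=0$

Similarly, with all arithmetic taken modulo $P$, we have
$\begin{bmatrix} x_i & 1 \end{bmatrix} =\begin{bmatrix} x_{0} & 1 \end{bmatrix} \begin{bmatrix} A & 0\\ B & 1 \end{bmatrix}^{i}=\begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} A & 0\\ B & 1 \end{bmatrix}^{i}$
so the parallel prefix method can be used to compute $\begin{bmatrix}A & 0 \\B & 1 \end{bmatrix} ^ {i}$.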
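A useful observation is that products of matrices of this shape keep the same shape: $\begin{bmatrix} a & 0\\ b & 1 \end{bmatrix}\begin{bmatrix} c & 0\\ d & 1 \end{bmatrix}=\begin{bmatrix} ac & 0\\ bc+d & 1 \end{bmatrix}$, so only two entries need to be stored and exchanged. The sketch below checks this against the direct recurrence. The type `Affine` and the functions `affine_mul`, `lcg_direct` and `lcg_prefix` are hypothetical names for this illustration, and the prefix products are formed serially as a stand-in for the parallel prefix.

```c
/* Represents the matrix [a 0; b 1] by its two non-trivial entries. */
typedef struct { long long a; long long b; } Affine;

/* Product [x.a 0; x.b 1] * [y.a 0; y.b 1] modulo P. */
Affine affine_mul(Affine x, Affine y, long long P) {
    Affine r;
    r.a = (x.a * y.a) % P;
    r.b = (x.b * y.a + y.b) % P;
    return r;
}

/* Reference: iterate x_{i+1} = (A*x_i + B) mod P from x_0 = 0.
   Stores x_1..x_n in out[0..n-1]. */
void lcg_direct(long long A, long long B, long long P, int n, long long *out) {
    long long x = 0;
    for (int i = 0; i < n; ++i) {
        x = (A * x + B) % P;
        out[i] = x;
    }
}

/* Same sequence from the prefix products M^1, M^2, ..., M^n:
   with x_0 = 0, x_i is the lower-left entry of M^i. */
void lcg_prefix(long long A, long long B, long long P, int n, long long *out) {
    Affine M = {A % P, B % P};
    Affine running = {1, 0};             /* identity matrix */
    for (int i = 0; i < n; ++i) {
        running = affine_mul(running, M, P);
        out[i] = running.b;
    }
}
```

With the default parameters of the program below ($A=2$, $B=101$, $P=93563$), both functions start with 101, 303, 707, 1515.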

The complete source code of the implementation is given below.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <sys/time.h>

// #define PRINT_SERIES

/*
    return the difference between end and start in microseconds
*/
int timeDiff(struct timeval start, struct timeval end) {
	return (end.tv_sec-start.tv_sec)*1000000 + (end.tv_usec-start.tv_usec);
}

void compute_Mi_mul_M(int *Mi, int A, int B, int P);
void serial_matrix(int *base, int A, int B, int P, int iter_times, int *random_numbers);

int main(int argc, char *argv[]){
    int my_rank, comm_sz, n, A, B, P, *local_series, i, local_count, *total_series;
    struct timeval start_time, end_time;
    int local_runtime, total_runtime;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
	MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

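    // get n (the length of the sequence) from the command line
    if (argc < 2) {
        if (my_rank == 0)
            fprintf(stderr, "Usage: %s n [A [B [P]]]\n", argv[0]);
        MPI_Finalize();
        return 1;
    }
    n = atoi(argv[1]);

    // {A,B,P} are user-defined parameters; each default can be overridden
    A = 2;
    B = 101;
    P = 93563;
    if (argc > 2)
        A = atoi(argv[2]);
    if (argc > 3)
        B = atoi(argv[3]);
    if (argc > 4)
        P = atoi(argv[4]);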

    // Suppose n is divisible by comm_sz
    local_count = n / comm_sz;
    local_series = (int*)malloc(sizeof(int) * local_count);

    gettimeofday(&start_time, NULL);

    int base_M[4] = {1, 0, 0, 1}, send_M[4] = {1, 0, 0, 1}, recv_M[4];
    // calculate M^local_count
    for(i = 0; i < local_count; ++i){
        compute_Mi_mul_M(send_M, A, B, P);
    }
    MPI_Status status;
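    // Use parallel prefix to compute M^(my_rank * local_count):
    // send_M holds the running product of the group, base_M the product of all lower ranks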
    for(i = 1; i < comm_sz; i = i << 1){
        int partner;
        if (my_rank % (i << 1) < i){
            partner = my_rank + i;
            if (partner < comm_sz){
                MPI_Send(send_M, 4, MPI_INT, partner, 0, MPI_COMM_WORLD);
                MPI_Recv(recv_M, 4, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
                send_M[0] = ((long long)send_M[0] * recv_M[0]) % P;
                send_M[2] = ((long long)send_M[2] * recv_M[0] + recv_M[2]) % P;
            }
        }else{
            partner = my_rank - i;
            MPI_Recv(recv_M, 4, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
            MPI_Send(send_M, 4, MPI_INT, partner, 0, MPI_COMM_WORLD);
            send_M[0] = ((long long)send_M[0] * recv_M[0]) % P;
            send_M[2] = ((long long)send_M[2] * recv_M[0] + recv_M[2]) % P;
            base_M[0] = ((long long)base_M[0] * recv_M[0]) % P;
            base_M[2] = ((long long)base_M[2] * recv_M[0] + recv_M[2]) % P;
        }
    }
    // calculate local random numbers
    serial_matrix(base_M, A, B, P, local_count, local_series);

    gettimeofday(&end_time, NULL);

    // Gather all the local_series to rank 0
    // (the receive buffer is only significant on the root, so other ranks pass NULL)
    total_series = NULL;
    if(my_rank == 0){
        total_series = (int*)malloc(sizeof(int) * n);
    }
    MPI_Gather(local_series, local_count, MPI_INT, total_series, local_count, MPI_INT, 0, MPI_COMM_WORLD);
#ifdef PRINT_SERIES
    if(my_rank == 0){
        for(i = 0; i < n; ++i){
            printf("%d\n", total_series[i]);
        }
    }
#endif
    free(total_series);

    local_runtime = timeDiff(start_time, end_time);
    MPI_Reduce(&local_runtime, &total_runtime, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("Total runtime (in microseconds): %d\n", total_runtime);

    free(local_series);
    MPI_Finalize();
    return 0;
}

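// Multiply Mi in place by M = [A 0; B 1] modulo P. Only entries [0] and [2]
// change, because the second column of every such product stays (0, 1).
void compute_Mi_mul_M(int *Mi, int A, int B, int P){
    Mi[0] = (int)(((long long)Mi[0] * A) % P);
    Mi[2] = (int)(((long long)Mi[2] * A + B) % P);
}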

/*
    NOTE: base is a 2*2 matrix stored row-major as {m00, m01, m10, m11}.
    Starting from base, generates iter_times consecutive random numbers;
    with x_0 = 0, each number is entry m10 (base[2]) of the current power of M.
*/
void serial_matrix(int *base, int A, int B, int P, int iter_times, int *random_numbers){
    int i;
    for(i = 0; i < iter_times; ++i){
        compute_Mi_mul_M(base, A, B, P);
        random_numbers[i] = base[2];
    }
}