CUPQ代码解析

最新推荐文章于 2024-05-23 15:50:56 发布

LazyStorm

最新推荐文章于 2024-05-23 15:50:56 发布

阅读量695

点赞数

分类专栏： programming

本文链接：https://blog.csdn.net/JINXER/article/details/104720714

版权

programming 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

cupq 是github上一个cuda实现的优先级队列的库，适用于many-to-many的最短路计算。先上链接: cuda-pq

作者思想是从二叉堆演化而来的。先简单介绍二叉堆。
二叉堆的结构性：堆在逻辑上就是一个完全二叉树（方便计算孩子和父亲在向量中的index），在物理上是一个向量。二叉堆的堆序性是，每个父亲节点要比所有儿子要小。二叉堆主要有insert,pop操作。

pop
每次pop操作，都弹出最顶端的root元素，这是在heap中最小的。之后为了保持树形结构，会把最后一个元素放到root。这时候，堆序性不能保证，因此，需要用下滤操作（propagate down），从不满足堆序性的顶端开始，比较父亲和所有儿子中的最小值(minchild)，如果minchild比父亲小，那么需要交换二者的值。一直到满足堆序性为止。
insert
每次insert一个值，都会把它放到向量的末尾，之后用上滤维持堆序性。与下滤类似，上滤是比较插入元素是否比其父亲要小，如果小的话，需要交换二者位置。可能会进行多次交换，直到堆序性满足。

机智的你可能会发现，每次交换有3次操作，因此如果一开始把插入值备份，如果父亲比插入值大，那么直接把父亲复制到儿子身上，直到父亲比插入值小，这时把插入值放在儿子身上，堆序性满足。

作者没有用经典二叉堆，而是用32叉堆，也就是每个node有32个儿子。每个node自身有32个元素，每个元素是一个pair,<32 bit float, 32 bit integer>, 也就是<点到source 的距离，点在图中的位置>。pair是已经根据distance排序好的，最小的元素在最左边，称为top。使用32是为了方便线程束(warp)的原子操作。

  D_F index_t pop(Array<value_t, Limit> &sTops, int &sPosition,
                  index_t &sTop2) {

    index_t ret = sTop2;
    if (sPosition == 0) {
      mData[mSize - 1].mData = 0x0fffffff; // reset all
      mData[mSize - 1].mData2 = 0;         // reset all
      updateTop(sTops, mSize - 1);
      --mSize;
      setPosition(sPosition, WarpSize);
    }

    if (mSize > 0) { // otherwise do nothing (popping from an empty container)
      if (sPosition == 0)
        setTop(sTops, 0, 0x0fffffff);
      mData[0].replaceSmallestDiscard(mData[mSize - 1].pop(sPosition));
      int pos = sPosition - 1;
      setPosition(sPosition, pos);
      updateTop(sTops, mSize - 1);

      updateTop(sTops, 0);
      if (mSize > 1) {
        auto sTop2 = __shfl_sync(0xffffffff, mData[0].mData2, 0, WarpSize);
        DevicePair<value_t, index_t> tmp =
            propagateDown(make_DevicePair(sTops[0], sTop2), sTops);
        if (threadIdx.x == 0) {
          mData[0].mData = tmp.first;
          mData[0].mData2 = tmp.second;
        }
        updateTop(sTops, 0);
      }
    }
    updateTop2(sTop2);
    return ret;
  }

每次pop，弹出第一个元素，如果此时向量不为空，那么把最后一个元素补在第一个node(而不是元素）处。此处调用replaceSmallestDiscard函数。

  D_F void replaceSmallestDiscard(DevicePair<Type, Type2> value) {
    // instructions in PTX:
    // create bitmask 00000111
    int vote = __ballot_sync(0xffffffff, mData <= value.first);
    // find the position of first "1" in the warp
    int b;
    asm("bfind.u32 %0, %1;" : "=r"(b) : "r"(vote));
    if (b == -1 /*I'm the smallest*/) {
      return;
    }
    if (threadIdx.x <= b) {// top has became the parent, so shift left 1 place
      mData = __shfl_down_sync(vote, mData, 1);
      mData2 = __shfl_down_sync(vote, mData2, 1);
    }
    // insert in the empty spot
    if (threadIdx.x == b) {
      mData = value.first;
      mData2 = value.second;
    }
  }

最后一个元素m放到第一个Node里排序，先用__ballot()来判断第一个Node里32个元素是否比m大，如果大的话就是1,得到一个掩码类似0000011111. 之后，用bfind找到第一个1出现的地方，这里就是5. 然后，对每一个threadId.x<b的线程，使用__shfl_down_sync()，等于把node[0]中1–6的元素都向前移动一格，此时5和6是一样的，那么把m插到5处。
在这里插入图片描述
这个时候，node[0]对了，但他和儿子之间不一定满足堆序性。此时需要进行下滤，propagateDown函数。

  D_F __forceinline__ DevicePair<value_t, index_t>
  propagateDown(DevicePair<value_t, index_t> n, Array<value_t, Limit> &sTops) {
/// ret is the pair of the new root
    if (mSize < 2) // no children
      return n;

    auto mchild = minChild(sTops, 0);

    DevicePair<value_t, index_t> ret = n;

    if (n.first < sTops[mchild])
      // heap property is OK, just return
      return n;
      
    ret = mData[mchild].replaceSmallest(n);
    updateTop(sTops, mchild);

    if (mSize <= Arity + 1) // index is a leaf already (1<=index<=33)
      return ret;

    auto index = mchild;

    index_t sTop2 = mData[index].mData2;

    // broadcast, sTop2  are all same in 1 warp
    sTop2 = __shfl_sync(0xffffffff, sTop2, 0, WarpSize);

    bool leaf = child(index, 0) >= mSize;
    while (!leaf) {
      mchild = minChild(sTops, index);

      if (sTops[index] <= sTops[mchild]) {
        // OK, heap property is fine
        break;
      } else {

        n = make_DevicePair(sTops[index], sTop2);
        // swap the min in the two warps (parent and child):
        // do the swap between the 2 warps (a double replace)
        if (threadIdx.x == 0) {
          mData[index].mData = mData[mchild].mData;
          mData[index].mData2 = mData[mchild].mData2;
        }
        //  mData[index]<-->mData[mchild]
        mData[mchild].replaceSmallestDiscard(
            n); // throw away the returned value (used it already above)
        updateTop(sTops, mchild);
        updateTop(sTops, index);

        // get the next min
        sTop2 = mData[mchild].mData2;
        sTop2 = __shfl_sync(0xffffffff, sTop2, 0, WarpSize);
      }

      index = mchild;
      leaf = child(index, 0) >= mSize;
    }

    return ret;
  }

比较此时node[0].top和儿子们的tops，如果node[0].top比儿子们tops中最小的要大，那么不满足堆序性。怎么看出来儿子们哪个更小？要用minChild()函数

  D_F int minChild(Array<value_t, Limit> &sTops, int index) {
    static_assert(Arity == 32, "if you want to generalize change also the "
                               "lines below (assuming 32-way heap)");

    auto firstchild_ = child(index, 0);
    bool leaf = firstchild_ > mSize; // index is a leaf, no childs
    int nchild = (mSize - 1) % Arity;

    if (leaf && nchild == 1) {
      return firstchild_;
    }

    index_t ret, ret1;
    value_t reg0, reg1;

    ret = threadIdx.x;

    int id = child(index, threadIdx.x);
    if (id >= mSize)
      reg0 =
          0x0fffffff; // avoid access out of bound (and huge shmem allocation)
    else
      reg0 = sTops[id]; // mData[id].top();

    reg1 = __shfl_down_sync(0xffffffff, reg0, 1);

    for (int i = 1; i <= 16; i *= 2) {
      ret1 = __shfl_down_sync(0xffffffff, ret, i);
      if (reg0 > reg1) {
        ret = ret1;
      }
      if (leaf && nchild <= i) {
        ret = __shfl_sync(0xffffffff, ret, 0, WarpSize);
        return child(index, ret);
      }
      if (reg0 > reg1) {
        reg0 = reg1;
      }
      reg1 = __shfl_down_sync(0xffffffff, reg0, i * 2);
    }

    ret = __shfl_sync(0xffffffff, ret, 0);
    return child(index, ret);
  }

具体在做什么，用下图可以一目了然：
在第一轮中，reg0=sTops[id]就是这个线程掌管的儿子的top，reg1就是右边一个线程的值，如果右边比左边的小，那么把右边直接复制到左边。一直迭代到最后，reg0就是最小的top，ret就是reg0所在的线程，将其广播后（__shfl_sync），每个线程返回值都是儿子们中top最小的那个的index.
在这里插入图片描述

得到最小的儿子，用父亲的top和儿子的top比较，不满足堆序性的话二者交换。这里，父亲的top直接用最小儿子的top替换，但是父亲的top在儿子这里不一定是top了，因此要调用replaceSmallestDiscard（）来进行排序，原理和replaceSmallest()类似，只不过没有返回值。直到儿子是leaf，那么就可以结束。
需要注意的是，在更新heap(mData，一个pair组成的数组)的同时，还会更新share memory中的sTop2,sTop,sPosition。一个block里面有4个warp，也就是有4个sTop2,sTop,sPosition。sTop是一个Limit大小的数组，记录了heap里面每一个Node的top值。sTop2记录了root的top。sPosition记录的是整个Heap的大小。以上三个值只能由threadId.x==0的线程进行更新，其他线程只能读取，作为右值，否则会造成bank conflict。比如想知道某个Node的top的时候。如果想更新heap，必须先更新local memory中的mData，再warp中指定一个线程将其写到shmem中。
不同warp计算不同source node开始的最短路问题。每个线程里都有一个heap，但是由于shmem，每个heap在pop和insert时候都是一样的。问题来了，如果同时insert同一个pair，会怎样。。。

一些疑问：看了好几天，发现有问题，里面参数传递是一个pair<32 bit float, 32 bit integer>, 也就是<点到source 的距离，点在图中的位置>。用32-bit表示int3肯定是不行了，是否可以把第二个参数换成指向voxel的指针(64位)？感觉不行，只能把pair结构体换成4个32bit的int，前一个是distance后三个是int3。

现在问题是，每个最短路问题是否需要全图？如果是的话那么内存消耗会很高。
忽然想到，做BFS不需要使用优先级队列，只要有个队列，或者是一个unordered_set就行了。。。

可以试一试， dynamic parallism+ update range 外情况处理。

或者试试这个。。。？添加链接描述