ComputeShaderSort11 Sample: Compute Shader Sorting

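This article describes how to implement the bitonic sort algorithm in DirectX 11 using the Compute Shader 4.0 feature, with an emphasis on optimizing memory access to improve performance. The data set is sorted into alternating ascending and descending sequences, which are then merged and sorted to produce ordered data. The article also walks through the concrete steps of implementing bitonic sort in a compute shader, including loading elements into group shared memory, synchronizing, and comparing them.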

Translated according to my own understanding; my skill is limited.

Sample Overview

This sample demonstrates the basic usage of the DirectX 11 Compute Shader 4.0 feature to implement a bitonic sort algorithm. It also highlights the considerations that must be taken to achieve good performance.

Bitonic Sort

Bitonic sort is a simple algorithm that works by sorting the data set into alternating ascending and descending sorted sequences. These sequences can then be combined and sorted to produce larger sequences. This is repeated until you produce one final ascending sequence for the sorted data.

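(For background on how bitonic sort works, see:

1. http://blog.csdn.net/qiul12345/article/details/7089501

2. http://hi.baidu.com/abcdxyzk/item/d1aa49dc60c49cfacb0c3918

3. Baidu Baike:

In 1968 Batcher proposed two well-known sorting methods, odd-even merge sort and bitonic sort. Methods of this kind have important applications in switching networks, parallel processing systems, and multi-access memory systems.

A bitonic sequence is a sequence made up of a non-strictly increasing sequence X and a non-strictly decreasing sequence Y, for example (23,10,8,3,5,7,11,78).

Definition: a sequence a1,a2,…,an is a bitonic sequence if:

(1) there exists an ak (1≤k≤n) such that a1≥…≥ak≤…≤an holds; or

(2) the sequence can be cyclically shifted so that it satisfies condition (1).

The bitonic merging network is built on Batcher's theorem. The theorem says: split any bitonic sequence A of length 2n into two halves X and Y of equal length, and compare the elements of X with the elements of Y one by one in their original order, that is, compare a[i] with a[i+n] (i<n). Put the larger of each pair into the MAX sequence and the smaller into the MIN sequence. The resulting MAX and MIN sequences are both still bitonic, and every element of MAX is greater than or equal to every element of MIN.

Based on this principle, an n-element bitonic input can first be passed through a shuffle-compare operation to obtain a MAX sequence and a MIN sequence, which are then processed by two bitonic mergers of order n/2 to produce a sorted sequence.

)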
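As a small illustration of the split described by Batcher's theorem, here is a Python sketch applied to the example sequence from the note above. The function name `bitonic_split` is made up for this illustration and does not come from the sample.

```python
def bitonic_split(seq):
    """Compare a[i] with a[i+n] for a bitonic sequence of length 2n.

    Returns (MIN, MAX): the element-wise smaller and larger values.
    """
    half = len(seq) // 2
    assert len(seq) == 2 * half, "sequence length must be even"
    min_seq = [min(seq[i], seq[i + half]) for i in range(half)]
    max_seq = [max(seq[i], seq[i + half]) for i in range(half)]
    return min_seq, max_seq


min_seq, max_seq = bitonic_split([23, 10, 8, 3, 5, 7, 11, 78])
print(min_seq)  # [5, 7, 8, 3]
print(max_seq)  # [23, 10, 11, 78]
```

Both halves are again bitonic, and the largest element of MIN (8) does not exceed the smallest element of MAX (10), so each half can be merged independently.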

This example illustrates how to sort eight integers:

Start: The initial unsorted data

Step 1: Sort every two elements, alternating between ascending order (blue arrows) and descending order (red arrows).

Step 2: Sort every four elements ascending and descending, and then sort every two elements.

Step 3: Sort all eight elements ascending (moving the smaller elements from the back half to the front), then every four, and finally every two. The diagram makes this clear.
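The three steps can be traced on the CPU with a short Python sketch. The function name `bitonic_sort_steps` and the input values are assumptions chosen for illustration; the input is not necessarily the one used in the original figure.

```python
def bitonic_sort_steps(values):
    """Sort a power-of-two-length list, returning the data after each step."""
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    steps = []
    level = 2  # length of the sorted sequences produced by this step
    while level <= n:
        j = level // 2  # distance between the two elements being compared
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks of `level` elements alternate ascending/descending.
                    ascending = (i & level) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        steps.append(list(data))
        level *= 2
    return steps


for number, state in enumerate(bitonic_sort_steps([5, 2, 1, 7, 3, 8, 6, 4]), 1):
    print("Step", number, state)
```

For this input, Step 1 gives `[2, 5, 7, 1, 3, 8, 6, 4]`, Step 2 gives `[1, 2, 5, 7, 8, 6, 4, 3]`, and Step 3 gives the fully sorted list.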

Bitonic Sort with Compute Shader

Now let's look at how to implement the bitonic sort in a compute shader for a single thread group. To achieve good performance when implementing the sorting algorithm, it is important to limit the number of memory accesses where possible. Because this algorithm has very few ALU operations and is limited by its memory accesses, we perform portions of the sort in group shared memory, which is significantly faster. Unfortunately, there are two problems that must be worked around. First, there is a limited amount of group shared memory and a limited number of threads in a group. Second, in CS4.0, the group shared memory supports random access reads but does not support random access writes. Even with these limitations, it is possible to create an efficient implementation using group shared memory.

Step 1: Load the group shared memory. Each thread loads one element.

    shared_data[GI] = Data[DTid.x];

Step 2: Next, the threads must be synchronized to guarantee that all of the elements are loaded, because the next operation will perform a random access read.

    GroupMemoryBarrierWithGroupSync();

Step 3: Now each thread must pick the min or max of the two elements it is comparing. The thread cannot compare and swap both elements because that would require random access writes, which CS4.0 group shared memory does not support.

    unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) == (bool)(g_iLevelMask & DTid.x))? shared_data[GI ^ j] : shared_data[GI];

Step 4: Again, the threads must be synchronized. This prevents any thread from performing the write operation before all threads have completed the read.

    GroupMemoryBarrierWithGroupSync();

Step 5: The min or max is now stored in group shared memory and synchronized. (The algorithm loops back to step 3 and must finish all writes before threads start reading.)

    shared_data[GI] = result;
    GroupMemoryBarrierWithGroupSync();

Step 6: With the memory sorted, the results can be stored back to the buffer.

    Data[DTid.x] = shared_data[GI];
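To see why picking the min or max per thread yields a correct sort, the six steps can be simulated on the CPU. The Python sketch below is not shader code: each list index plays the role of one thread (`GI`), and building a fresh list before replacing `shared` stands in for the two barriers. It assumes a single thread group (so `DTid.x` equals `GI`) and assumes the host runs one pass per level with `g_iLevelMask` equal to that level, with `j` starting at half the level; those host-side details are not shown in the text above. The name `simulate_group_sort` is hypothetical.

```python
def simulate_group_sort(data):
    """CPU simulation of the single-group bitonic sort steps."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    shared = list(data)  # Steps 1-2: every thread loads one element, then sync
    level = 2
    while level <= n:  # assumed host loop: one pass per level
        j = level >> 1
        while j > 0:
            # Step 3: each thread only reads, choosing its own or its
            # partner's value; nothing is written during this phase.
            result = [
                shared[gi ^ j]
                if (shared[gi & ~j] <= shared[gi | j]) == bool(level & gi)
                else shared[gi]
                for gi in range(n)
            ]
            # Steps 4-5: after all reads finish, each thread writes its slot.
            shared = result
            j >>= 1
        level <<= 1
    return shared  # Step 6: store back to the buffer


print(simulate_group_sort([7, 1, 8, 2, 6, 3, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each thread writes only its own slot `GI`, which is why the scheme works without random access writes.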

Sorting More Data

The bitonic sort shader we have created works great when the data set is small enough to run with one thread group. Unfortunately, for CS4.0, this means a maximum of 512 elements, which is the largest power-of-2 number of threads in a group. To solve this, we can add two additional steps to the algorithm. When we need to sort a section that is too large to be processed by a single group of threads, we transpose the entire data set. With the data transposed, larger sort steps can be performed entirely in shared memory without changing the bitonic sort algorithm. Once the large steps are completed, the data can be transposed back to complete the smaller steps of the sort.

groupshared

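Mark a variable for thread-group-shared memory for compute shaders. In D3D10 the maximum total size of all variables with the groupshared storage class is 16 KB; in D3D11 the maximum size is 32 KB.

Translator's note: in CS4.0 the group shared memory of a thread group is limited to 16 kb; reading that figure as kilobits gives 16 × 1024 / 8 bytes / 4 bytes per int = 512 elements. Note that the sample text above instead attributes the 512-element limit to the number of threads in a group.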

For illustration, the same 8 numbers from the first example will be sorted, but this time using a maximum of only 4 threads per group. Steps 1 and 2 can be completed by dispatching two thread groups, each with 4 threads. However, the first portion of Step 3 requires you to compare and swap elements that are spread too far apart in memory to be handled by one thread group.

Step 3A: Perform a 2x4 transpose of the 8 element data set, that is, a matrix transpose:

1 2 5 7

8 6 4 3

=======》

1 8

2 6

5 4

7 3

Step 3B: Now the eight elements can be sorted within shared memory by dispatching two thread groups.

Step 3C: Transpose the data back:

1 8

2 6

4 5

3 7

======》

1 2 4 3

8 6 5 7

Step 3D: Perform the remainder of the sort for four elements and then two elements.
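Steps 3A to 3D can be checked with a CPU simulation. The Python sketch below is one possible arrangement rather than the sample's host code: the function names are hypothetical, the constants passed to each pass (`level // block` and `(level & ~n) // block` for the transposed pass) are assumptions about how a host would drive the shader, and the input is chosen so that it reaches the `1 2 5 7 / 8 6 4 3` state shown before Step 3A. It also assumes the row count `n / block` does not exceed `block`.

```python
def sort_pass(data, block, level, mask):
    """One dispatch: each group of `block` threads sorts in 'shared memory'."""
    out = []
    for base in range(0, len(data), block):
        shared = data[base:base + block]
        j = level >> 1
        while j > 0:
            # base + gi plays the role of DTid.x in the direction test.
            shared = [
                shared[gi ^ j]
                if (shared[gi & ~j] <= shared[gi | j]) == bool(mask & (base + gi))
                else shared[gi]
                for gi in range(block)
            ]
            j >>= 1
        out.extend(shared)
    return out


def transpose(data, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list."""
    return [data[r * cols + c] for c in range(cols) for r in range(rows)]


def sort_large(data, block):
    n = len(data)
    rows = n // block
    assert rows * block == n and rows <= block
    level = 2
    while level <= block:  # small levels fit inside one group
        data = sort_pass(data, block, level, level)
        level <<= 1
    while level <= n:  # large levels: transpose, sort, transpose back, sort
        data = transpose(data, rows, block)                                # 3A
        data = sort_pass(data, block, level // block, (level & ~n) // block)  # 3B
        data = transpose(data, block, rows)                                # 3C
        data = sort_pass(data, block, block, level)                        # 3D
        level <<= 1
    return data


print(sort_large([5, 2, 1, 7, 3, 8, 6, 4], block=4))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

After the transpose, elements that were `block` positions apart sit next to each other, so the same min/max kernel can handle the large comparison distances within one group.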

Transpose

Implementing a transpose in a compute shader is simple, but making it efficient requires a little bit of care. For best memory performance, it is preferable to access memory in a linear, consecutive pattern. Reading a row of data from the source with multiple threads is naturally a linear memory access. However, when that row is written to the destination as a column, the writes are no longer consecutive in memory. To achieve the best performance, a square block of data is first read into group shared memory as multiple contiguous memory reads. Then the shared memory is accessed as column data so that it can be written back as multiple contiguous memory writes. This allows us to shift the burden of the nonlinear access pattern to the high-performance group shared memory.
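The access pattern can be sketched on the CPU as follows. This Python code only models the order of reads and writes, not GPU performance; `tiled_transpose` and its parameters are hypothetical names, and the tile size is assumed to divide both matrix dimensions.

```python
def tiled_transpose(src, width, height, tile):
    """Transpose a row-major height x width matrix one square tile at a time."""
    assert width % tile == 0 and height % tile == 0
    dst = [None] * (width * height)
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Read the tile row by row: each read covers contiguous source memory.
            shared = [
                src[(ty + ly) * width + tx:(ty + ly) * width + tx + tile]
                for ly in range(tile)
            ]
            # (A barrier would go here on the GPU.)
            # Walk the tile by columns in shared memory so that each
            # destination row segment is written contiguously.
            for ly in range(tile):
                start = (tx + ly) * height + ty
                dst[start:start + tile] = [shared[lx][ly] for lx in range(tile)]
    return dst


print(tiled_transpose([1, 2, 5, 7, 8, 6, 4, 3], width=4, height=2, tile=2))
# [1, 8, 2, 6, 5, 4, 7, 3]
```

Only the indexing into `shared` is strided; both the source reads and the destination writes stay contiguous.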