This sample demonstrates the basic usage of the DirectX 11 Compute Shader 4.0 feature by implementing a bitonic sort algorithm. It also highlights the considerations that must be taken into account to achieve good performance.


Bitonic Sort

Bitonic sort is a simple algorithm that works by sorting the data set into alternating ascending and descending sorted sequences. These sequences can then be combined and sorted to produce larger sequences. This is repeated until you produce one final ascending sequence for the sorted data.





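A bitonic sequence is a sequence formed by joining a non-strictly increasing sequence X and a non-strictly decreasing sequence Y, in either order. An example is (23, 10, 8, 3, 5, 7, 11, 78), which falls to 3 and then rises.
Definition: a sequence a1, a2, …, an is a bitonic sequence if:
(1) there exists an ak (1 ≤ k ≤ n) such that a1 ≥ … ≥ ak ≤ … ≤ an holds; or
(2) the sequence can be cyclically shifted so that condition (1) holds.
The bitonic merging network is built on Batcher's theorem. The theorem states: split any bitonic sequence A of length 2n into two halves X and Y of equal length, and compare the elements of X with the elements of Y one by one in their original order, that is, compare a[i] with a[i+n] (i < n). Put the larger element of each pair into the sequence MAX and the smaller one into the sequence MIN. The resulting MAX and MIN sequences are both still bitonic, and every element of MAX is no smaller than every element of MIN.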
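The splitting step of Batcher's theorem can be sketched on the CPU as follows. This is only an illustration in C++; the function name bitonic_split is hypothetical and does not come from the sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits a bitonic sequence of length 2n into (MIN, MAX) by comparing
// a[i] with a[i + n]. Hypothetical helper for illustration only.
std::pair<std::vector<int>, std::vector<int>> bitonic_split(const std::vector<int>& a)
{
    assert(a.size() % 2 == 0);  // the theorem is stated for even lengths
    const std::size_t n = a.size() / 2;
    std::vector<int> min_seq(n), max_seq(n);
    for (std::size_t i = 0; i < n; ++i) {
        min_seq[i] = std::min(a[i], a[i + n]);  // smaller element of the pair
        max_seq[i] = std::max(a[i], a[i + n]);  // larger element of the pair
    }
    return {min_seq, max_seq};
}
```

Applied to the example sequence (23, 10, 8, 3, 5, 7, 11, 78), this gives MIN = (5, 7, 8, 3) and MAX = (23, 10, 11, 78). Both are bitonic, and the largest element of MIN (8) does not exceed the smallest element of MAX (10).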

This example illustrates how to sort eight integers:

Start: The initial unsorted data

Step 1: Sort every two elements ascending and descending. Adjacent pairs are sorted alternately in ascending order (blue arrows) and descending order (red arrows).

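Step 2: Sort every four elements ascending and descending, and then sort every two elements. Each group of four adjacent elements is ordered alternately ascending and descending first, and the adjacent pairs are sorted afterwards.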

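Step 3: Sort all eight elements ascending, then every four, and finally every two. The eight-element pass moves the smaller elements of the back half to the front, after which the four-element and two-element passes finish the sort.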
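The three steps can be reproduced with a small CPU sketch of the sorting network. The helper name bitonic_steps and the input values used below are hypothetical, since the figure with the original eight integers is not part of this text.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Runs the bitonic sorting network on the CPU and returns a snapshot of the
// data after each numbered step (sorted run length 2, 4, 8, ...).
// Hypothetical helper for illustration only.
std::vector<std::vector<int>> bitonic_steps(std::vector<int> data)
{
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // the network needs a power-of-two length
    std::vector<std::vector<int>> snapshots;
    for (std::size_t k = 2; k <= n; k <<= 1) {          // run length produced by this step
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // distance between compared elements
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner < i) continue;              // handle each pair once
                const bool ascending = (i & k) == 0;    // direction alternates every k elements
                if ((data[i] > data[partner]) == ascending)
                    std::swap(data[i], data[partner]);
            }
        }
        snapshots.push_back(data);
    }
    return snapshots;
}
```

For the assumed input (3, 7, 4, 8, 6, 2, 1, 5), the snapshots are (3, 7, 8, 4, 2, 6, 5, 1) after Step 1, (3, 4, 7, 8, 6, 5, 2, 1) after Step 2, and (1, 2, 3, 4, 5, 6, 7, 8) after Step 3.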

Bitonic Sort with Compute Shader

Now let's look at how to implement the bitonic sort in a compute shader for a single thread group. To achieve good performance when implementing the sorting algorithm, it is important to limit the number of memory accesses where possible. Because this algorithm has very few ALU operations and is limited by its memory accesses, we perform portions of the sort in group shared memory, which is significantly faster. Unfortunately, there are two problems that must be worked around. First, there is a limited amount of group shared memory and a limited number of threads in a group. Second, in CS4.0, the group shared memory supports random access reads but does not support random access writes. Even with these limitations, it is possible to create an efficient implementation using group shared memory.


Step 1: Load the group shared memory. Each thread loads one element.

    shared_data[GI] = Data[DTid.x];

Step 2: Next, the threads must be synchronized to guarantee that all of the elements are loaded, because the next operation will perform a random access read.

    GroupMemoryBarrierWithGroupSync();

Step 3: Now each thread must pick the min or max of the two elements it is comparing. The thread cannot compare and swap both elements because that would require random access writes.


    unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) == (bool)(g_iLevelMask & DTid.x))? shared_data[GI ^ j] : shared_data[GI];

Step 4: Again, the threads must be synchronized. This prevents any thread from performing the write operation before all threads have completed the read.

    GroupMemoryBarrierWithGroupSync();

Step 5: The min or max is now stored in group shared memory and synchronized. (The algorithm loops back to step 3 and must finish all writes before threads start reading.)


    shared_data[GI] = result;

Step 6: With the memory sorted, the results can be stored back to the buffer.

    Data[DTid.x] = shared_data[GI];
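To see how Steps 1 to 6 fit together, the shader can be emulated on the CPU. The sketch below is an assumption-laden illustration in C++, not the sample's code: the function names are hypothetical, the barriers are modeled by finishing one loop over all threads before starting the next, and the level and mask values passed per dispatch are my reading of how g_iLevelMask is used in the expression from Step 3.

```cpp
#include <cassert>
#include <vector>

// Emulates one dispatch of the sort shader. Each group of `group_size`
// threads works on its own slice of `data`. Hypothetical helper.
void dispatch_bitonic(std::vector<unsigned>& data, unsigned group_size,
                      unsigned level, unsigned level_mask)
{
    assert(group_size != 0 && data.size() % group_size == 0);
    assert(level <= group_size);  // larger levels do not fit in one group
    std::vector<unsigned> shared_data(group_size), result(group_size);
    for (unsigned base = 0; base < data.size(); base += group_size) {
        // Step 1: each thread loads one element (GI = index within the group).
        for (unsigned GI = 0; GI < group_size; ++GI)
            shared_data[GI] = data[base + GI];
        // Step 2: barrier, modeled by the load loop having completed.
        for (unsigned j = level >> 1; j > 0; j >>= 1) {
            // Step 3: every thread picks the min or max of its pair.
            for (unsigned GI = 0; GI < group_size; ++GI) {
                const unsigned DTid = base + GI;  // global thread index
                const bool in_order = shared_data[GI & ~j] <= shared_data[GI | j];
                const bool descending = (level_mask & DTid) != 0;
                result[GI] = (in_order == descending) ? shared_data[GI ^ j]
                                                      : shared_data[GI];
            }
            // Step 4: barrier, then Step 5: each thread writes only its own slot.
            for (unsigned GI = 0; GI < group_size; ++GI)
                shared_data[GI] = result[GI];
        }
        // Step 6: store the slice back to the buffer.
        for (unsigned GI = 0; GI < group_size; ++GI)
            data[base + GI] = shared_data[GI];
    }
}

// Runs every level that fits inside one thread group (assumed host-side loop).
void sort_within_groups(std::vector<unsigned>& data, unsigned group_size)
{
    for (unsigned level = 2; level <= group_size; level <<= 1)
        dispatch_bitonic(data, group_size, level, level);
}
```

With a group size equal to the data size, sort_within_groups sorts the whole buffer. With a smaller group size, it leaves runs of group_size elements that alternate between ascending and descending order.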

Sorting More Data

The bitonic sort shader we have created works great when the data set is small enough to run with one thread group. Unfortunately, for CS4.0, this means a maximum of 512 elements, which is the largest power-of-two number of threads in a group. To solve this, we can add two additional steps to the algorithm. When we need to sort a section that is too large to be processed by a single group of threads, we transpose the entire data set. With the data transposed, larger sort steps can be performed entirely in shared memory without changing the bitonic sort algorithm. Once the large steps are completed, the data can be transposed back to complete the smaller steps of the sort.



The groupshared storage class marks a variable as thread-group-shared memory for compute shaders. In D3D10 the maximum total size of all variables with the groupshared storage class is 16 KB; in D3D11 the maximum size is 32 KB.

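In CS4.0, the group shared memory of a thread group is limited to 16 KB, which is enough for 16 × 1024 / 4 = 4096 32-bit integers. Shared memory size is therefore not what caps the sort at 512 elements; that cap comes from the number of threads in a group.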


For illustration, the same 8 numbers from the first example will be sorted, but this time using only a maximum of 4 threads per group. Steps 1 and 2 can be completed by dispatching two thread groups, each with 4 threads. However, the first portion of Step 3 requires you to compare and swap elements that are spread too far apart in memory to be handled by one thread group.


Step 3A: Perform a 2x4 transpose of the 8-element data set. The data is treated as a 2x4 matrix and transposed into a 4x2 matrix, as shown below.

Before the transpose (2x4):

    1  2  5  7
    8  6  4  3

After the transpose (4x2):

    1  8
    2  6
    5  4
    7  3

Step 3B: Now the eight elements can be sorted within shared memory by dispatching two thread groups.

Step 3C: Transpose the data back.

After transposing back (2x4):

    1  2  4  3
    8  6  5  7

The sorted 4x2 data from Step 3B that this was transposed from:

    1  8
    2  6
    4  5
    3  7

Step 3D: Perform the remainder of the sort for four elements and then two elements.
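The whole scheme, including Steps 3A to 3D, can be sketched as a CPU emulation in C++. Everything named here (Buffer, sort_dispatch, transpose, bitonic_sort_large) is hypothetical, and the level and mask values passed to each dispatch are one assumed way to drive the shader expression shown earlier. The sketch also assumes that the data size is a power of two and is at most block × block elements.

```cpp
#include <cassert>
#include <vector>

using Buffer = std::vector<unsigned>;

// One emulated dispatch: each group of `block` threads runs the compare
// distances level/2 ... 1 on its own slice (see the single-group steps).
void sort_dispatch(Buffer& data, unsigned block, unsigned level, unsigned level_mask)
{
    Buffer shared(block), result(block);
    for (unsigned base = 0; base < data.size(); base += block) {
        for (unsigned gi = 0; gi < block; ++gi) shared[gi] = data[base + gi];
        for (unsigned j = level >> 1; j > 0; j >>= 1) {
            for (unsigned gi = 0; gi < block; ++gi) {
                const bool in_order = shared[gi & ~j] <= shared[gi | j];
                const bool descending = (level_mask & (base + gi)) != 0;
                result[gi] = (in_order == descending) ? shared[gi ^ j] : shared[gi];
            }
            shared = result;  // write phase, after all reads have finished
        }
        for (unsigned gi = 0; gi < block; ++gi) data[base + gi] = shared[gi];
    }
}

// Transposes a row-major matrix with `width` columns and `height` rows.
Buffer transpose(const Buffer& in, unsigned width, unsigned height)
{
    Buffer out(in.size());
    for (unsigned y = 0; y < height; ++y)
        for (unsigned x = 0; x < width; ++x)
            out[x * height + y] = in[y * width + x];
    return out;
}

void bitonic_sort_large(Buffer& data, unsigned block)
{
    const unsigned n = static_cast<unsigned>(data.size());
    assert(n != 0 && (n & (n - 1)) == 0 && n % block == 0);
    const unsigned width = block, height = n / block;
    assert(height <= block);  // the transposed rows must fit in one group
    // Levels that fit in one group need no transpose (Steps 1 and 2).
    for (unsigned level = 2; level <= block; level <<= 1)
        sort_dispatch(data, block, level, level);
    // Larger levels: transpose, sort the far-apart pairs, transpose back,
    // then finish the level inside each group (Steps 3A to 3D).
    for (unsigned level = block << 1; level <= n; level <<= 1) {
        data = transpose(data, width, height);
        sort_dispatch(data, block, level / block, (level & ~n) / block);
        data = transpose(data, height, width);
        sort_dispatch(data, block, block, level);
    }
}
```

Calling bitonic_sort_large on eight values with block set to 4 follows the walkthrough above: two in-group levels, one transpose, a pairwise sort, a transpose back, and a final in-group pass.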


Implementing a transpose in Compute Shader is simple, but making it efficient requires a little bit of care. For best memory performance, it is preferable to access memory in a nice linear and consecutive pattern. Reading a row of data from the source with multiple threads is naturally a linear memory access. However, when that row is written to the destination as a column, the writes are no longer consecutive in memory. To achieve the best performance, a square block of data is first read into group shared memory as multiple contiguous memory reads. Then the shared memory is accessed as column data so that it can be written back as multiple contiguous memory writes. This allows us to shift the burden of the nonlinear access pattern to the high-performance group shared memory.
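The tiling idea can be illustrated with a CPU sketch in C++. The local array stands in for group shared memory, the tile edge kTile is a hypothetical value chosen for the example, and the function name tiled_transpose is not taken from the sample.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tile edge; the text does not state the block size used.
constexpr unsigned kTile = 4;

// Transposes a row-major matrix (`width` columns, `height` rows) one square
// tile at a time, so that both the reads from `in` and the writes to `out`
// walk along rows.
std::vector<unsigned> tiled_transpose(const std::vector<unsigned>& in,
                                      unsigned width, unsigned height)
{
    assert(in.size() == static_cast<std::size_t>(width) * height);
    assert(width % kTile == 0 && height % kTile == 0);
    std::vector<unsigned> out(in.size());
    unsigned tile[kTile * kTile];  // stands in for group shared memory
    for (unsigned ty = 0; ty < height; ty += kTile) {
        for (unsigned tx = 0; tx < width; tx += kTile) {
            // Read the tile row by row: consecutive source addresses.
            for (unsigned y = 0; y < kTile; ++y)
                for (unsigned x = 0; x < kTile; ++x)
                    tile[y * kTile + x] = in[(ty + y) * width + (tx + x)];
            // (A group barrier would go here in a shader.)
            // Write destination rows consecutively; the strided, column-wise
            // access now hits only the tile.
            for (unsigned y = 0; y < kTile; ++y)
                for (unsigned x = 0; x < kTile; ++x)
                    out[(tx + y) * height + (ty + x)] = tile[x * kTile + y];
        }
    }
    return out;
}
```

The output matrix has `height` columns and `width` rows, so transposing twice with the dimensions swapped returns the original data.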






