Sieve Sort (SieveSort) - 7

铸人

Last modified: 2024-09-26 00:09:40

Tags: algorithms, data structures

First published: 2024-09-26 00:05:47

Article link: https://blog.csdn.net/u014161757/article/details/142535801

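The idea mentioned earlier has now been implemented.

That is, memcpy is gone entirely, and memory is no longer shuffled back and forth.

First, we can call this algorithm "ping-pong merge".

So how does it work?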

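Given an array (which may be very large), first allocate a second array of the same size (result). Then partition the original array (a) following the merge scheme described earlier, recursively handing the pieces to the sort function and subdividing until a piece is down to the base unit (for example, 256 elements). Each base unit is sorted, and the sorted output is written to the opposite array (result). When the recursion returns to the level above, every region partitioned at that level has already been processed, so the elements within each region are sorted. A 16-way merge is then performed over those regions in the result array, and the merged output is written back to the original array (a). The recursion then returns one more level up. At that level the 16 regions, now in a, are again sorted, so they are merged and the output goes into result, and so on all the way to the top.

There are two possible outcomes at the end: 1. the final result lands in a, in which case result is freed and a is returned; 2. the final result lands in result, in which case a is freed, the pointer that referred to a is redirected to result, and that is returned. To allow this, the array can be passed as a reference to a pointer, or the input parameter can be declared as a pointer to the array pointer.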
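The back-and-forth between the two buffers can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the actual sieve sort: the function name pingpong_sort is hypothetical, std::sort stands in for the 256-element base sort (and sorts in place rather than into the opposite buffer), plain 2-way std::merge stands in for the 16-way merge, and everything runs on a single thread. What it does keep is the key point: every merge pass reads one buffer and writes the other, and at the end the caller's pointer is redirected to whichever buffer holds the result.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical illustration of ping-pong merging, not the author's code.
// *pa must point to an array allocated with new[]; on return *pa points to
// the sorted data, which may live in a different buffer than the one passed in.
bool pingpong_sort(std::uint32_t** pa, std::size_t n, std::size_t base = 256) {
    if (pa == nullptr || *pa == nullptr || base == 0) return false;
    if (n < 2) return true;

    std::uint32_t* src = *pa;
    std::uint32_t* dst = new (std::nothrow) std::uint32_t[n];
    if (dst == nullptr) return false;

    // Step 1: sort every base block (simplification: in place, with std::sort).
    for (std::size_t lo = 0; lo < n; lo += base) {
        std::sort(src + lo, src + std::min(lo + base, n));
    }

    // Step 2: merge neighbouring sorted runs from src into dst, then swap
    // the roles of the two buffers. No pass ever copies data back.
    for (std::size_t width = base; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            std::size_t mid = std::min(lo + width, n);
            std::size_t hi = std::min(lo + 2 * width, n);
            std::merge(src + lo, src + mid, src + mid, src + hi, dst + lo);
        }
        std::swap(src, dst);  // the buffer just written becomes the next source
    }

    // After the final swap, src holds the sorted data and dst is scratch.
    *pa = src;
    delete[] dst;
    return true;
}
```

Whether the result ends up in the original buffer or the new one depends only on the number of merge passes, which is why the function takes a pointer to the array pointer.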

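Merging back and forth like this avoids repeatedly copying data back into the same array. Every data operation happens on the fly, and performance improves substantially as a result. Replacing the sieve-sort routine used for the 256-element base case with std::sort improves performance slightly further.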

The measured results are as follows:


i=8,t=256
==================================
samples:256
repeats:1
omp: 16 threads
sieve sort speed:50.2513K/s
std sort speed:  147.059K/s
t1(seive):1.99e-05 s
t2(std::):6.8e-06 s
ratio:34.1709%

i=12,t=4096
==================================
samples:4096
repeats:1
omp: 16 threads
sieve sort speed:10.8225K/s
std sort speed:  6.89655K/s
t1(seive):9.24e-05 s
t2(std::):0.000145 s
ratio:156.926%

i=16,t=65536
==================================
samples:65536
repeats:1
omp: 16 threads
sieve sort speed:0.621195K/s
std sort speed:  0.304813K/s
t1(seive):0.0016098 s
t2(std::):0.0032807 s
ratio:203.796%

i=20,t=1048576
==================================
samples:1048576
repeats:1
omp: 16 threads
sieve sort speed:0.0284363K/s
std sort speed:  0.0143527K/s
t1(seive):0.0351663 s
t2(std::):0.0696735 s
ratio:198.126%

i=24,t=16777216
==================================
samples:16777216
repeats:1
omp: 16 threads
sieve sort speed:0.00224595K/s
std sort speed:  0.000867732K/s
t1(seive):0.445247 s
t2(std::):1.15243 s
ratio:258.829%

i=28,t=268435456
==================================
samples:268435456
repeats:1
omp: 16 threads
sieve sort speed:0.000144218K/s
std sort speed:  4.5778e-05K/s
t1(seive):6.93396 s
t2(std::):21.8446 s
ratio:315.037%

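As the results show, from 4096 elements upward sieve sort runs at roughly 1.5 to 3.2 times the speed of std::sort (the ratio line is t2/t1), and the gap generally widens as the input grows. Only the 256-element case is slower, at about 34% of the speed of std::sort.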
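The ratio values in the log are consistent with dividing the std::sort time by the sieve sort time (for example, 21.8446 / 6.93396 gives about 315%). A minimal sketch of how such a comparison could be timed is shown here. The helper names time_seconds and ratio_percent are hypothetical and the use of std::chrono::steady_clock is an assumption; the actual benchmark harness is not shown in this post.

```cpp
#include <chrono>

// Hypothetical helper: wall-clock seconds taken by a single call of f().
template <class F>
double time_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Ratio in the same sense as the log: baseline time over candidate time,
// in percent. Values above 100 mean the candidate is faster.
inline double ratio_percent(double t_candidate, double t_baseline) {
    return t_baseline / t_candidate * 100.0;
}
```

With a single repeat per size, as in the log above, small inputs are dominated by timer resolution and thread start-up cost, so the figures for the larger sizes are the more meaningful ones.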

The revised core code is as follows:

bool sieve_sort_core(uint32_t* a, size_t n, uint32_t* result, int depth, int omp_depth);
bool sieve_sort_omp(uint32_t* a, size_t n, uint32_t* result, int depth, int omp_depth) {
	size_t loops = 0, stride = 0, reminder = 0;
	__mmask16 mask = 0;
	if (!get_config(n, loops, stride, reminder, mask)) return false;
	if (omp_depth > 0 && depth >= 2) {
#pragma omp parallel for
		for (int i = 0; i < loops; i++) {
			sieve_sort_core(a + i * stride,
				(i == loops - 1 && reminder > 0) ? reminder : stride,
				result + i * stride,
				depth - 1, omp_depth - 1);
		}
	}
	else {
		for (int i = 0; i < loops; i++) {
			sieve_sort_core(a + i * stride,
				(i == loops - 1 && reminder > 0) ? reminder : stride,
				result + i * stride,
				depth - 1, omp_depth - 1);
		}
	}
	if (depth >= 4 && ((depth - 3) & 1) == 1) {
		std::swap(result, a);
	}
	return sieve_collect(n, loops, stride, reminder, mask, result, a);
}
bool sieve_sort_core(uint32_t* a, size_t n, uint32_t* result, int depth, int omp_depth) {
	return (n <= _256)
		? sieve_sort_256(a, n, result)
		: sieve_sort_omp(a, n, result, depth, omp_depth)
		;
}

bool sieve_sort(uint32_t** pa, size_t n, int omp_depth = 32)
{
	bool done = false;
	//max(n)==256P (2^60)
	if (pa == nullptr || *pa == nullptr || n > _256P)
		return false;
	else if (n == 0)
		return true;
	else if (n == 1) {
		return true;
	}
	else if (n == 2) {
		// Dereference pa first: *pa[1] would parse as *(pa[1]).
		uint32_t a0 = (*pa)[0], a1 = (*pa)[1];
		(*pa)[0] = std::min(a0, a1);
		(*pa)[1] = std::max(a0, a1);
		return true;
	}
	else {
		// Nothrow new, so that the nullptr check below is meaningful (requires <new>).
		uint32_t* result = new (std::nothrow) uint32_t[n];
		if (result != nullptr) {
			int max_depth = get_depth(n);
			done = sieve_sort_core(*pa, n, result, max_depth, omp_depth);
			if (max_depth >= 4 && ((max_depth & 1) == 0)) {
				std::swap(*pa, result);
			}
			delete[] result;
		}
	}
	return done;
}
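Because sieve_sort takes a uint32_t** and may swap *pa with its internal buffer before calling delete[], the caller's array has to come from new[], and the caller's pointer may refer to a different buffer afterwards. The sketch below illustrates that calling convention with a hypothetical stand-in, sort_may_swap, which always swaps buffers. It is not the sieve sort itself; it only shows what the caller needs to be prepared for.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in with the same calling convention as sieve_sort:
// it always leaves the sorted data in a newly allocated buffer.
bool sort_may_swap(std::uint32_t** pa, std::size_t n) {
    if (pa == nullptr || *pa == nullptr) return false;
    if (n == 0) return true;
    std::uint32_t* result = new std::uint32_t[n];
    std::copy(*pa, *pa + n, result);
    std::sort(result, result + n);
    std::swap(*pa, result);  // *pa now owns the sorted buffer
    delete[] result;         // frees the caller's original array
    return true;
}

// Caller side: allocate with new[], pass the address of the pointer, and
// release whatever the pointer refers to afterwards.
bool demo_caller() {
    std::uint32_t* a = new std::uint32_t[4]{3, 1, 4, 1};
    bool ok = sort_may_swap(&a, 4);  // a may now point to a different buffer
    ok = ok && std::is_sorted(a, a + 4);
    delete[] a;
    return ok;
}
```

Any other pointer or iterator into the original array should be treated as invalid after the call, and an array owned by a std::vector or living on the stack must not be passed to a function with this contract.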

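P.S. Why think about sorting at all?

Sorting looks like a very basic problem, but once the amount of data becomes extremely large, it turns into a very hard one. A person can make a mistake or miss an item; a machine cannot be allowed to. The relationships among all the elements together determine the order of the whole array, so the array is in effect one huge whole, and sorting such a whole is obviously slow. With multiple cores, multiple threads, or SIMD available, we naturally want to break the work into pieces that can be handled coarsely and independently, so that all of that parallelism can actually be put to use. How to split the data, how to sort within a small range, and how to get from small-range order to large-range order is therefore the most valuable part of the problem.

Enough said. For the concrete implementation, please see the source code on GitHub.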