Bitmap Sort and Its Extended Applications: Reading Notes on Programming Pearls

huagong_adu

Article link: https://blog.csdn.net/huagong_adu/article/details/7627978

1. Basic Bitmap Sort

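Problem 1: the input is a file containing n = 1 million positive integers, each less than N = 10 million, with no duplicates. Sort the numbers in this file and save the result to a file, using as little memory as possible and running as fast as possible.

Analysis: if each positive integer is stored in an int (4 bytes), 1 million numbers take 4 million bytes, about 4 MB. Sorting them with quicksort takes O(n log n) time.

Given the special nature of the problem (all numbers are positive integers and none is repeated), a bitmap fits well. Each number corresponds to one bit of the bitmap: the bit is set to 1 if the number appears and stays 0 otherwise. One 4-byte int holds the flags for 32 numbers. Since every number is below 10 million, a bitmap of 10 million bits can record the 1 million numbers; scanning the bitmap from the start and writing out every number whose bit is 1 yields the sorted result. The bitmap takes about 1.25 MB (10,000,000 bits / 8), and the running time is O(N) (more precisely O(n + N): one pass over the input plus one scan of the bitmap), so for this input it beats quicksort in both space and time.

The pseudocode is as follows: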

/* phase 1: initialize set to empty */
for i = [0, N)
        bit[i] = 0
/* phase 2: insert present elements into the set */
for each i in the input file
        bit[i] = 1
/* phase 3: write the sorted output */
for i = [0, N)
        if bit[i] = 1
                write i on the output file
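
The three phases of the pseudocode map directly onto C++. The sketch below is only an illustration and is not the program developed in this post: the function name bitmap_sort and the use of std::vector<bool> as the bit set are assumptions made for the example, it needs a C++11 compiler, and it works on in-memory vectors rather than files.

```cpp
#include <cstddef>
#include <vector>

// Sort distinct non-negative integers that are all smaller than limit
// (limit itself must be non-negative).
// Values outside [0, limit) are skipped instead of corrupting memory.
std::vector<int> bitmap_sort(const std::vector<int> &input, int limit)
{
	// phase 1: an empty set, one flag per possible value
	std::vector<bool> bits(static_cast<std::size_t>(limit), false);

	// phase 2: mark every value that appears in the input
	for (int x : input)
		if (x >= 0 && x < limit)
			bits[x] = true;

	// phase 3: scan the flags in increasing order
	std::vector<int> sorted;
	sorted.reserve(input.size());
	for (int i = 0; i < limit; ++i)
		if (bits[i])
			sorted.push_back(i);
	return sorted;
}
```

Calling bitmap_sort on the values 5, 3, 9, 0, 7 with limit 10 returns them in increasing order.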

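Implementation: first we need a file of 1 million distinct positive integers, each less than 10 million. For how to generate one, see my earlier post "Sampling Problems: Reading Notes on Programming Pearls". I use Floyd's method. The sampled numbers come out in sorted order, so they have to be shuffled; see my post on the shuffle program for how. The program that generates the distinct random numbers is as follows: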

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <set>
#include <vector>
#include <fstream>

using namespace std;

// generate a random number between i and j,
// both i and j are inclusive
// note: rand() returns at most RAND_MAX, so j - i must not exceed RAND_MAX
int randint(int i, int j)
{
	if (j < i)
	{ int t = i; i = j; j = t; }
	int ret = i + rand() % (j - i + 1);
	return ret;
}
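// Floyd's sampling: insert m distinct random numbers
// from the range [0, n) into S (requires 0 <= m <= n)
void floyd_f2(int n, int m, set<int> &S)
{
	for (int i = n - m; i < n; ++i)
	{
		int j = randint(0, i);
		// if j was already chosen, take i instead;
		// i cannot be in S yet because all earlier picks are < i
		if (!S.insert(j).second)
			S.insert(i);
	}
}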
// shuffle the data set V (Knuth / Fisher-Yates shuffle)
void knuth_shuffle(vector<int> &V)
{
	int n = V.size();
	// i > 0 (rather than i != 0) also terminates when V is empty
	for (int i = n - 1; i > 0; --i)
	{
		int j = randint(0, i);
		int t = V[i]; V[i] = V[j]; V[j] = t;
	}
}

// write the elements of [beg, end) to file, one per line
template<typename T>
void output_file(T beg, T end, const char *file)
{
	ofstream outfile(file);
	if (!outfile)
	{
		cout << "cannot open file \"" << file << "\" for writing" << endl;
		return;
	}
	while (beg != end)
	{
		outfile << *beg << '\n'; // '\n' avoids flushing after every line
		++beg;
	}
	outfile.close();
}

void help()
{
	cout << "usage:" << endl;
	cout << "./Floyd_F2 n m output_file_name" << endl;
}

int main(int argc, char* argv[])
{
	if (argc != 4)
	{
		help();
		return 1;
	}
	srand(time(NULL));
	int n = atoi(argv[1]);
	int m = atoi(argv[2]);
	// Floyd's method needs 0 <= m <= n
	if (m < 0 || m > n)
	{
		help();
		return 1;
	}
	set<int> S;
	// sample m distinct numbers from [0, n)
	floyd_f2(n, m, S);
	// shuffle, because the set iterates in sorted order
	vector<int> V(S.begin(), S.end());
	knuth_shuffle(V);
	// output
	output_file(V.begin(), V.end(), argv[3]);

	return 0;
}
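
One portability caveat about randint: rand() never returns more than RAND_MAX, which the C standard only guarantees to be at least 32767, so on some platforms a call such as randint(0, 9999999) cannot reach most of the range. A possible alternative, assuming a C++11 compiler, draws from std::uniform_int_distribution instead. The name randint_uniform and the function-local static engine are illustrative choices and are not part of the program above.

```cpp
#include <random>
#include <utility>

// Return a uniformly distributed integer in [i, j], both ends inclusive.
// Unlike i + rand() % k, the result is not limited by RAND_MAX.
int randint_uniform(int i, int j)
{
	if (j < i)
		std::swap(i, j);
	// one engine for the whole program, seeded once from the system source
	static std::mt19937 engine(std::random_device{}());
	std::uniform_int_distribution<int> dist(i, j);
	return dist(engine);
}
```

If this version is used, floyd_f2 and knuth_shuffle would call randint_uniform in place of randint, and the srand call in main is no longer needed.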

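With the data in place, we sort it with the bitmap algorithm. The bitmap is an array of int: a bitmap of 10 million bits needs an array of (10,000,000 / 32 + 1) ints (the +1 is there because the division may leave a remainder, and the leftover bits need one more int).

Given a number i, we first have to find where it goes in the bitmap. Call the array array. Since one int covers 32 numbers, i belongs to element array[i/32], and the bit within that element is i%32. Once the position is known, the number can be put into the bitmap, and we can set, test, and clear its bit. The C++ code for these operations uses bitwise operators: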

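#define BITWORD 	32          // bits per array element
#define SHIFT 		5           // i >> SHIFT == i / 32
#define MARK 		0x1F        // i & MARK == i % 32
#define N 			10000000
#define COUNT 		((N) / (BITWORD))

// unsigned, so that setting bit 31 does not overflow a signed int
unsigned int ary[COUNT + 1];

void set(int i)
{
	ary[i >> SHIFT] |= (1u << (i & MARK));
}

bool test(int i)
{
	return (ary[i >> SHIFT] & (1u << (i & MARK))) != 0;
}

void clr(int i)
{
	ary[i >> SHIFT] &= ~(1u << (i & MARK));
}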
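
To make the index arithmetic concrete, the small sketch below computes where a number lands in the array. The helper names word_index and bit_offset are hypothetical and exist only for this illustration; for non-negative i they give the same results as i / 32 and i % 32.

```cpp
// Index of the 32-bit array element that holds the bit for number i (i >= 0).
int word_index(int i)
{
	return i >> 5;   // same as i / 32 for non-negative i
}

// Position of that bit inside the element, in the range 0..31.
int bit_offset(int i)
{
	return i & 0x1F; // same as i % 32 for non-negative i
}
```

For example, 100 lands in element 3 at bit 4, because 100 = 3 * 32 + 4. The largest possible input, 9,999,999, lands in element 312499 at bit 31, since 10,000,000 is an exact multiple of 32.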

The complete C++ implementation of the bitmap sort is as follows:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <sstream>
#include <ctime>

using namespace std;

#define BITWORD 	32
#define SHIFT 		5
#define MARK 		0x1F
#define N 			10000000
#define COUNT 		((N) / (BITWORD))

// unsigned, so that setting bit 31 does not overflow a signed int
unsigned int ary[COUNT + 1];

void set(int i)
{
	ary[i >> SHIFT] |= (1u << (i & MARK));
}

bool test(int i)
{
	return (ary[i >> SHIFT] & (1u << (i & MARK))) != 0;
}

void clr(int i)
{
	ary[i >> SHIFT] &= ~(1u << (i & MARK));
}

void help()
{
	cout << "usage:" << endl;
	cout << "./BitSort inputfile outputfile" << endl;
}

int main(int argc, char *argv[])
{
	if (argc != 3)
	{
		help();
		return 1;
	}
	ifstream infile(argv[1]);
	if (!infile)
	{
		cout << "file \"" << argv[1] << "\" not exists" << endl;
		return 1;
	}

	time_t t_start, t_end;
	t_start = time(NULL);

	// read the data and set the data in the bit map
	string line;
	istringstream iss;
	int num = 0;
	while (getline(infile, line))
	{
		iss.clear(); // reset the eof/fail flags left by the previous line
		iss.str(line);
		// skip lines that hold no number or a number outside [0, N)
		if ((iss >> num) && num >= 0 && num < N)
			set(num); // set the number
	}
	infile.close();

	ofstream outfile(argv[2]);
	if (!outfile)
	{
		cout << "create output file \"" << argv[2] << "\" failed" << endl;
		return 1;
	}
	// read the bit map and write to the file
	for (int i = 0; i < N; ++i) // every input number is less than N
	{
		if (test(i))
			outfile << i << '\n';
	}
	outfile.close();

	t_end = time(NULL);
	cout << "time elapsed: " << difftime(t_end, t_start) << " s" << endl;
	cout << "need " << ((double)N / (8 * 1000000)) << " MB of memory" << endl;
	return 0;
}

2. Extending Bitmap Sort

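Problem 2: what if the input may contain duplicates, with each number appearing at most 10 times? How do we sort the 1 million numbers then?

Analysis: the algorithm for Problem 1 only handles input without duplicates; once numbers repeat, that bitmap no longer works. But the problem bounds the repetition: each number appears at most 10 times. The original bitmap uses one bit per number, and a bit has only two states, 1 and 0, meaning the number is present or absent. With a small change, using several bits per number so that those bits store how many times the number has appeared, a bitmap can still do the sorting. Since a number appears at most 10 times, 4 bits per number are enough (4 bits can count up to 15). The space is 4 times that of the basic bitmap sort, about 5 MB, and the time complexity is still O(N).

Implementation: locating a number in the array works much as before. One int holds 32/4 = 8 counters, so for a positive integer i we first find the array index, i/8, and then the start bit within that element, 4*(i%8).

Set: each time i appears, add 1 at its start bit.

Test (how many times i appears): each number occupies 4 bits, so shift 0x0F left to i's position, AND it with the array element, and shift the result back down to the low bits to get the count of i.

Clear: the opposite of test. Shift 0x0F left to i's position, invert it, and AND the array element with the result, so that only i's 4 bits become 0.

The C++ implementation of these operations is as follows: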

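#define BITWORD 	8           // counters per array element
#define SHIFT 		3           // i >> SHIFT == i / 8
#define MARK 		0x07        // i & MARK == i % 8
#define TEST 		0x0Fu       // mask for one 4-bit counter
#define POS 		((i & MARK) << 2) // start bit of i's counter; expects a variable named i
#define N 			10000000
#define COUNT 		((N) / (BITWORD))

// unsigned, so that a counter in the top 4 bits shifts back down correctly
unsigned int ary[COUNT + 1];

// count one more occurrence of i; a counter holds at most 15
void set(int i)
{
	ary[i >> SHIFT] += 1u << POS;
}

// return the presence count of number i, used for output
int test(int i)
{
	return (ary[i >> SHIFT] & (TEST << POS)) >> POS;
}

void clr(int i)
{
	ary[i >> SHIFT] &= ~(TEST << POS);
}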

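The rest of the implementation is much the same as the original bitmap sort; the only difference is that the output step writes each number as many times as it appeared: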

	// read the bit map and write to the file
	for (int i = 0; i < N; ++i) // every input number is less than N
	{
		int count = test(i); // get the count of number i's presence
		for (int j = 0; j < count; ++j)
			outfile << i << '\n';
	}
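
Putting the 4-bit counters and the repeated output together, the whole idea can be sketched on in-memory data. This is an illustration under stated assumptions and not the file-based program of this post: the name nibble_count_sort is hypothetical, unsigned int is assumed to be at least 32 bits wide, a C++11 compiler is assumed, and the choice to stop counting at 15 (rather than let a counter overflow into its neighbor) is an addition of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Sort non-negative integers smaller than limit, where each value occurs
// at most 15 times (the largest count a 4-bit counter can hold).
std::vector<int> nibble_count_sort(const std::vector<int> &input, int limit)
{
	// eight 4-bit counters per 32-bit element, plus one spare element
	std::vector<unsigned int> words(static_cast<std::size_t>(limit) / 8 + 1, 0u);

	for (int x : input)
	{
		if (x < 0 || x >= limit)
			continue;                     // ignore out-of-range values
		unsigned int pos = (x & 7) << 2;  // start bit of x's counter
		unsigned int count = (words[x >> 3] >> pos) & 0x0Fu;
		if (count < 15u)                  // saturate instead of spilling over
			words[x >> 3] += 1u << pos;
	}

	std::vector<int> sorted;
	sorted.reserve(input.size());
	for (int i = 0; i < limit; ++i)
	{
		unsigned int count = (words[i >> 3] >> ((i & 7) << 2)) & 0x0Fu;
		for (unsigned int j = 0; j < count; ++j)
			sorted.push_back(i);          // emit i once per occurrence
	}
	return sorted;
}
```

Using an unsigned element type matters here: the counter for a number with i % 8 == 7 sits in the top four bits, and shifting it back down is only well behaved for unsigned values.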

3. Other Applications of Bitmaps

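Bitmaps have two advantages. The first is space: an int normally represents a single number, whereas in a bitmap it covers many. The second is speed: the position of any number can be indexed directly. Besides sorting, a bitmap can also be used for:

Finding duplicate numbers: run test on each number before setting it; a nonzero result means the number has already appeared.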
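
A minimal sketch of the duplicate check, assuming in-memory input and a C++11 compiler. The name find_duplicates and the second bit set, reported (used so that each duplicate is listed only once), are assumptions of this example and do not come from the text above.

```cpp
#include <cstddef>
#include <vector>

// Return the values that occur more than once in input, each listed once,
// in the order in which their first repeat is seen.
// Only values in [0, limit) are considered.
std::vector<int> find_duplicates(const std::vector<int> &input, int limit)
{
	std::vector<bool> seen(static_cast<std::size_t>(limit), false);
	std::vector<bool> reported(static_cast<std::size_t>(limit), false);
	std::vector<int> dups;
	for (int x : input)
	{
		if (x < 0 || x >= limit)
			continue;               // skip out-of-range values
		if (!seen[x])
			seen[x] = true;         // first occurrence: just record it
		else if (!reported[x])
		{
			reported[x] = true;     // a repeat: list it only once
			dups.push_back(x);
		}
	}
	return dups;
}
```

The two bit sets together cost two bits per possible value, which for values below 10 million is about 2.5 MB.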