Sorting the data in a disk file that holds a massive amount of data

Juery_Lee

Category: C++基础技术 (C++ fundamentals)

Article link: https://blog.csdn.net/Juery_Lee/article/details/8363094

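Programming Pearls says:

There are two possible approaches: 1. merging, 2. a bitmap.

But how exactly is each one implemented?

Let's start with merging:

(Problem assumption): a disk file containing 10^7 distinct integers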

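The source file is about 40 MB (presumably counting 4 bytes per integer for 10^7 integers), so split it into 40 parts, sort each part in memory with quicksort, and then merge the 40 sorted files.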
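The chunk-and-sort step can be sketched as follows. This is a minimal illustration, not the author's code: the function name `make_sorted_runs` and the `run_size` parameter are hypothetical, `std::sort` stands in for a hand-written quicksort, and the runs are kept in memory for brevity, whereas a real external sort would write each run to its own temporary file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers from `in` and
// returns them as sorted runs of at most `run_size` values each.
// A real external sort would write each run to a temporary file instead
// of keeping all of them in memory.
std::vector<std::vector<int>> make_sorted_runs(std::istream& in, std::size_t run_size) {
    assert(run_size > 0);
    std::vector<std::vector<int>> runs;
    std::vector<int> buffer;
    buffer.reserve(run_size);

    int value;
    while (in >> value) {
        buffer.push_back(value);
        if (buffer.size() == run_size) {
            // The buffer is full: sort it and emit it as one run.
            std::sort(buffer.begin(), buffer.end());
            runs.push_back(buffer);
            buffer.clear();
        }
    }
    if (!buffer.empty()) {
        // Emit the final, possibly shorter, run.
        std::sort(buffer.begin(), buffer.end());
        runs.push_back(buffer);
    }
    return runs;
}
```

With 10^7 integers and a run size of 250,000, this would produce the 40 sorted runs described above.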

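Declare a temporary array of 40 elements and load the first number of each file into it. Build a min-heap over these numbers, write the number at the top of the heap to the output file, then read the next number from the file that the written number came from. Repeat until every file has been read to the end.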
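The merge step described above might look like the sketch below. It is an illustration under assumptions rather than the author's implementation: the name `merge_sorted_streams` is hypothetical, the runs are passed as generic `std::istream` pointers (which could be `std::ifstream` objects opened on the 40 run files), and `std::priority_queue` with `std::greater` serves as the min-heap.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <queue>
#include <sstream>
#include <utility>
#include <vector>

// Hypothetical helper: merges already-sorted integer streams into `out`.
// Each heap entry is (value, index of the run the value was read from).
void merge_sorted_streams(const std::vector<std::istream*>& runs, std::ostream& out) {
    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Seed the heap with the first number of every run.
    for (std::size_t i = 0; i < runs.size(); ++i) {
        assert(runs[i] != nullptr);
        int value;
        if (*runs[i] >> value) heap.push({value, i});
    }

    while (!heap.empty()) {
        // The heap top is the smallest number not yet written.
        Entry smallest = heap.top();
        heap.pop();
        out << smallest.first << ' ';

        // Refill from the run that supplied the number just written.
        int next;
        if (*runs[smallest.second] >> next) heap.push({next, smallest.second});
    }
}
```

The heap never holds more than one entry per run, so memory use stays proportional to the number of runs rather than to the size of the data.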

This takes about 20 s; most of the time goes to disk reads and writes.

Now for the bitmap:

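Analysis shows that a single pass would need more than 1 MB of memory (10^7 bits is about 1.25 MB), whereas splitting the work into two passes needs only about 0.625 MB per pass (5×10^6 bits).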
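The memory figures follow from using one bit per possible value. The arithmetic below assumes the values lie in the range 0 to 9,999,999 (the range the reference code handles) and takes 1 MB as 10^6 bytes:

```latex
\begin{aligned}
\text{one pass:}\quad   & \frac{10^{7}\ \text{bits}}{8\ \text{bits/byte}} = 1{,}250{,}000\ \text{bytes} = 1.25\ \text{MB} > 1\ \text{MB} \\
\text{two passes:}\quad & \frac{5 \times 10^{6}\ \text{bits}}{8\ \text{bits/byte}} = 625{,}000\ \text{bytes} = 0.625\ \text{MB per pass}
\end{aligned}
```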

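Scan the file a first time: every number less than 5000000 is recorded in the bitmap, and the set bits are then written to the output file in ascending order.

Scan the file a second time: every number greater than or equal to 5000000 is recorded in the bitmap (offset by 5000000) and written to the output file in the same way.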
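The two scans are a special case of covering the value range one window at a time. The sketch below generalizes that idea to any window size. It is only an illustration: the name `bitmap_sort_multipass` and its parameters `limit` and `window` are hypothetical, it works on any seekable `std::istream`, and it uses `std::vector<bool>` so the bitmap size can be chosen at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical helper: sorts distinct integers in [0, limit) read from a
// seekable stream. The bitmap covers only `window` values at a time, and
// the input is rescanned once per window.
void bitmap_sort_multipass(std::istream& in, std::ostream& out, int limit, int window) {
    assert(limit > 0 && window > 0);
    std::vector<bool> bits(window);

    for (long long base = 0; base < limit; base += window) {
        bits.assign(window, false);  // clear the bitmap for this pass
        in.clear();                  // drop the EOF state left by the previous pass
        in.seekg(0);                 // rewind to the start of the input

        int value;
        while (in >> value) {
            // Record only the values that fall inside the current window.
            if (value >= base && value - base < window && value < limit)
                bits[static_cast<std::size_t>(value - base)] = true;
        }

        // Set bits appear in ascending order, so the output is sorted.
        for (int i = 0; i < window; ++i)
            if (bits[i]) out << base + i << ' ';
    }
}
```

Calling it with `limit = 10000000` and `window = 5000000` corresponds to the two-scan scheme described above; a smaller window trades more scans for less memory.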

This takes 6 s.

Reference code:

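// Bitmap approach to sorting a file that contains 10^7 numbers
// If the data contains duplicates, each value is output only once; the extra copies are dropped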
#include <cstdio>   // FILE, fopen, fscanf, fprintf, fseek, fclose
#include <iostream>
#include <bitset>
#include <assert.h>
#include <time.h>
using namespace std;

const int max_each_scan = 5000000;

int main()
{
    clock_t begin = clock();
    static bitset<max_each_scan> bit_map; // static storage keeps the ~625 KB bitmap off the stack
    bit_map.reset();

    // open the file with the unsorted data
    FILE *fp_unsort_file = fopen("data.txt", "r");
    assert(fp_unsort_file);
    int num;

    // the first time scan to sort the data between 0 - 4999999
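    // stop on EOF or on malformed input (fscanf returns 1 only when a number was read)
    while (fscanf(fp_unsort_file, "%d ", &num) == 1)
    {
        // skip negative values: bitset::set throws on an out-of-range index
        if (num >= 0 && num < max_each_scan)
            bit_map.set(num, 1);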
    }

    FILE *fp_sort_file = fopen("sort.txt", "w");
    assert(fp_sort_file);
    int i;

    // write the sorted data into file
    for (i = 0; i < max_each_scan; i++)
    {
        if (bit_map[i] == 1)
            fprintf(fp_sort_file, "%d ", i);
    }

    // the second time scan to sort the data between 5000000 - 9999999
    int result = fseek(fp_unsort_file, 0, SEEK_SET);
    if (result)
        cout << "fseek failed!" << endl;
    else
    {
        bit_map.reset();
        while (fscanf(fp_unsort_file, "%d ", &num) == 1)
        {
            if (num >= max_each_scan && num < 10000000)
            {
                num -= max_each_scan;
                bit_map.set(num, 1);
            }
        }
        for (i = 0; i < max_each_scan; i++)
        {
            if (bit_map[i] == 1)
                fprintf(fp_sort_file, "%d ", i + max_each_scan);
        }
    }

    clock_t end = clock();
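    cout << "Bitmap method, elapsed time:" << endl;
    // CLOCKS_PER_SEC is the standard macro (CLK_TCK is obsolete); cast to avoid integer division
    cout << static_cast<double>(end - begin) / CLOCKS_PER_SEC << "s" << endl;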
    fclose(fp_sort_file);
    fclose(fp_unsort_file);
    return 0;
}