External Merge Sort for Large Text Files (Python Implementation)

嘎嘎能睡..

Tags: algorithms, sorting algorithms, data structures, Python, database development

Article link: https://blog.csdn.net/2401_85258252/article/details/141871050

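The code below implements an external sorting algorithm. It is based on merge sort and uses a heap (priority queue) to drive the merge phase. The core idea is to split a large file into many small files (called "chunks"), sort each chunk, and finally merge the sorted chunks into one large sorted file.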

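The process has three main steps:

1. Splitting: divide the large file into multiple small files, each holding a fixed number of records (for example, one record per line).

2. Local sorting: sort each small file. Because each small file is of moderate size, it can be loaded entirely into memory and sorted there, typically with quicksort, merge sort, or another efficient internal sorting algorithm (the code below uses Python's built-in list.sort()).

3. Merging: combine all sorted small files into one large sorted file. This is done with a min-heap in which each element holds the next record to process (initially the first line of a small file) together with the index of the file it came from. Each iteration pops the smallest record from the heap, writes it to the output file, then reads the next record from the same small file and pushes it onto the heap. This repeats until every small file has been read completely.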
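Before turning to files, the heap-based merge in step 3 can be illustrated on small in-memory lists. This is only a minimal sketch: the names k_way_merge and sorted_runs are made up for illustration and do not appear in the article's code, and each heap entry also carries a position so the next element of the same list can be fetched.

```python
import heapq

def k_way_merge(sorted_runs):
    """Merge already-sorted lists into one sorted list using a min-heap."""
    heap = []
    for run_index, run in enumerate(sorted_runs):
        if run:
            # Each heap entry: (value, index of its run, position inside that run)
            heap.append((run[0], run_index, 0))
    heapq.heapify(heap)

    merged = []
    while heap:
        value, run_index, pos = heapq.heappop(heap)  # smallest pending value
        merged.append(value)
        next_pos = pos + 1
        if next_pos < len(sorted_runs[run_index]):
            # Refill the heap from the run the popped value came from
            heapq.heappush(heap, (sorted_runs[run_index][next_pos], run_index, next_pos))
    return merged

runs = [["apple", "melon"], ["banana", "zebra"], ["cherry"]]
print(k_way_merge(runs))  # ['apple', 'banana', 'cherry', 'melon', 'zebra']
```

The heap never holds more than one entry per run, which is why the file-based version only needs one line from each chunk in memory at a time.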

① First, generate a large .txt file made up of random lowercase words.

import random
import string

def generate_random_word(length=5):
    letters = string.ascii_lowercase  # all lowercase letters
    return ''.join(random.choice(letters) for _ in range(length))

with open('random_words.txt', 'w') as file:
    for _ in range(1000000):
        random_word = generate_random_word(5)  # each word has length 5
        if random.random() < 0.05:  # insert a blank line with 5% probability
            file.write('\n')
        file.write(random_word + '\n')

The contents of the text file are shown in the figure (here the word length is set to 5 and the blank-line insertion rate to 5%):

② Import the required packages

import heapq
import os
import argparse

③ Step 1: Sort a chunk of data and save it to an intermediate file

def sort_and_save_chunk(chunk, chunk_num):
    # Sort the chunk in memory
    chunk.sort()
    # Write the sorted chunk to an intermediate file named chunk_{chunk_num}.txt
    with open(f'chunk_{chunk_num}.txt', 'w') as chunk_file:
        for word in chunk:
            chunk_file.write(f'{word}\n')

Step 2: Split the original file into multiple small files and drop blank lines

def split_file(input_file, chunk_size):
    chunk = []  # holds the current chunk
    chunk_num = 1  # chunk number
    with open(input_file, 'r') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                chunk.append(line)  # store the word read from the original file in the chunk
                if len(chunk) == chunk_size:  # once the chunk reaches chunk_size, sort and save it
                    sort_and_save_chunk(chunk, chunk_num)
                    chunk = []  # reset the chunk
                    chunk_num += 1
        if chunk:  # handle the final, partially filled chunk
            sort_and_save_chunk(chunk, chunk_num)
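The chunking logic can also be written as a generator, which makes it easy to try on a small list before pointing it at a real file. The sketch below is one possible variant, not the article's implementation: iter_sorted_chunks is a hypothetical name, and it uses itertools.islice to pull at most chunk_size non-empty lines at a time.

```python
from itertools import islice

def iter_sorted_chunks(lines, chunk_size):
    """Yield sorted lists of at most chunk_size non-empty, stripped lines."""
    # Strip every line lazily and drop the ones that end up empty
    non_empty = (s for s in (line.strip() for line in lines) if s)
    while True:
        chunk = list(islice(non_empty, chunk_size))  # take the next chunk_size items
        if not chunk:
            break  # input exhausted
        chunk.sort()
        yield chunk

sample = ["pear\n", "\n", "apple\n", "fig\n", "\n", "kiwi\n", "date\n"]
for chunk in iter_sorted_chunks(sample, 2):
    print(chunk)
# ['apple', 'pear']
# ['fig', 'kiwi']
# ['date']
```

Because an open text file is itself an iterable of lines, the same function accepts a file object in place of the sample list.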

Step 3: Merge the sorted chunks

def merge_sorted_files(output_file, chunk_files):
    with open(output_file, 'w') as output:
        file_handles = []  # handles of all intermediate files
        data = []  # heap contents; each element is a (word, file index) tuple

        # Open every intermediate file and seed the heap with its first word
        for i, chunk_file in enumerate(chunk_files):
            handle = open(chunk_file, 'r')  # open the file
            file_handles.append(handle)   # keep the handle for later reads
            line = handle.readline()  # read the first word
            if line:
                data.append((line.strip(), i))  # if a word was read, add it to the list

        heapq.heapify(data)  # turn the list into a min-heap

        while data:
            smallest_word, file_index = heapq.heappop(data)  # pop the smallest word and its file index
            output.write(f'{smallest_word}\n')  # write the smallest word to the output file

            next_word = file_handles[file_index].readline()  # read the next word from the same file
            if next_word:
                heapq.heappush(data, (next_word.strip(), file_index))  # push the new word onto the heap

        for handle in file_handles:
            handle.close()  # close all intermediate file handles
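As an alternative to maintaining the heap by hand, the standard library's heapq.merge performs the same lazy k-way merge over any number of sorted iterables. The sketch below is a possible substitute for the function above, not the article's method; merge_with_heapq_merge is a hypothetical name. It compares raw lines including their trailing newline, which gives the same order as the stripped comparison for the lowercase words used here, but that is an assumption to recheck for other data.

```python
import heapq

def merge_with_heapq_merge(output_file, chunk_files):
    """Merge sorted chunk files into output_file using heapq.merge."""
    handles = [open(name, 'r') for name in chunk_files]
    try:
        with open(output_file, 'w') as output:
            # heapq.merge reads from each file lazily, one line at a time
            output.writelines(heapq.merge(*handles))
    finally:
        # Close the chunk files even if the merge fails part-way
        for handle in handles:
            handle.close()
```

It takes the same arguments as merge_sorted_files, so it could be called in its place.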

④ Define the main function and build the command-line interface

def main():
    parser = argparse.ArgumentParser(description="Sort large files containing words using external sorting and remove empty lines.")
    parser.add_argument("input_file", type=str, help="Input file containing words to be sorted.")
    parser.add_argument("output_file", type=str, help="Output file for the sorted words.")
    parser.add_argument("--chunk_size", type=int, default=10000, help="Size of each chunk to sort and save.")
    args = parser.parse_args()

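    # Split the file into sorted chunks
    split_file(args.input_file, args.chunk_size)

    # Collect all intermediate files
    chunk_files = [f for f in os.listdir() if f.startswith('chunk_') and f.endswith('.txt')]

    # Merge the sorted chunks
    merge_sorted_files(args.output_file, chunk_files)

    # Delete the intermediate files so that a later run does not merge stale chunks
    for chunk_file in chunk_files:
        os.remove(chunk_file)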

if __name__ == "__main__":
    main()
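After a run, it can be worth checking that the output really is sorted. The helper below is a small sketch for that purpose; is_sorted_file is a hypothetical name that is not part of the original script, and it reads the file line by line so it also works on files too large for memory.

```python
def is_sorted_file(path):
    """Return True if the lines of the file are in non-decreasing order."""
    previous = None
    with open(path, 'r') as f:
        for line in f:
            current = line.rstrip('\n')
            if previous is not None and current < previous:
                return False  # found a line that sorts before its predecessor
            previous = current
    return True
```

For example, if the script wrote its result to a file named sorted_words.txt (an assumed name), calling is_sorted_file('sorted_words.txt') should return True.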