Implementing Split-and-Sort for a Large Data File

番茄超蛋

Article link: https://blog.csdn.net/hotthought/article/details/78833836

You are given a file named bigdata, 4663 MB in size, containing 500 million numbers in random order, one integer per line, like this:

6196302

3557681

6121580

2039345

2095006

1746773

7934312

2016371

7123302

8790171

2966901

...

7005375

This file now needs to be sorted. How should that be done? (Problem source: https://mp.weixin.qq.com/s/K94xtyTA50vU6UGG_ho23Q)

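One solution given in that article is to first split the large file into a number of small files, then sort the data in each small file, and finally merge the sorted data into the final result. The merge works by comparing the current minimum of each small file: min = min(min(s1), min(s2), min(s3), ...). This is also a fairly classic interview question. Let's implement it.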
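
The merge rule can be tried out on small in-memory lists before any files are involved: repeatedly take the smallest of the current head elements of the sorted runs. The following is only a minimal sketch; the class name `MergeHeadsDemo` and the method `mergeSorted` are made up for illustration and do not appear in the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeHeadsDemo {
    // Merge already-sorted runs by repeatedly taking the smallest head element.
    static List<Integer> mergeSorted(List<List<Integer>> runs) {
        int[] pos = new int[runs.size()]; // read position inside each run
        List<Integer> out = new ArrayList<>();
        while (true) {
            int best = -1; // index of the run whose head is currently smallest
            for (int i = 0; i < runs.size(); i++) {
                if (pos[i] >= runs.get(i).size()) {
                    continue; // this run is exhausted
                }
                if (best < 0 || runs.get(i).get(pos[i]) < runs.get(best).get(pos[best])) {
                    best = i;
                }
            }
            if (best < 0) {
                return out; // every run is exhausted
            }
            out.add(runs.get(best).get(pos[best]));
            pos[best]++;
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 5, 9),
                Arrays.asList(2, 3, 10),
                Arrays.asList(4));
        System.out.println(mergeSorted(runs)); // prints [1, 2, 3, 4, 5, 9, 10]
    }
}
```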

Step 1: generate some data. Since this runs on a local machine, I generated only a small amount: 10,000 lines.

private static void createData(String fileName, int lineNum) {
        System.out.println("=============generate new data=====start============");
        int batchSize = 200;
        Random rd = new Random();
        StringBuilder sb = new StringBuilder();
        for (int i=1; i<=lineNum; i++ ){
            int num = rd.nextInt(100000000);
            sb.append(num).append("\n");
            // a full batch has been collected: write it to the file
            if(i%batchSize == 0){
                FileUtils.write2File(fileName,sb.toString());
                sb = new StringBuilder();
            }
        }

        // write whatever is left over from the last partial batch
        if(sb.length() > 0){
            FileUtils.write2File(fileName,sb.toString());
        }
        System.out.println("=============generate data=====end============");
    }
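
The `FileUtils` helper is not shown in this post. Since `createData` calls `write2File` once per batch on the same file, it presumably appends to the file. Below is a minimal sketch under that assumption, together with a matching `readFileData` (the merge step calls a method of that name). The class name `FileUtilsSketch` and both method bodies are my own guesses, not the original helper.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

class FileUtilsSketch {
    // Assumed behavior: append the text, creating the file if it does not exist yet.
    static void write2File(String fileName, String content) {
        try {
            Files.write(Paths.get(fileName), content.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed behavior: return all lines of the file, in order.
    static List<String> readFileData(String fileName) {
        try {
            return Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("write2file_demo", ".txt");
        write2File(tmp.toString(), "1\n2\n");
        write2File(tmp.toString(), "3\n"); // second call appends instead of overwriting
        System.out.println(readFileData(tmp.toString())); // prints [1, 2, 3]
        Files.delete(tmp);
    }
}
```

Because the helper appends, a leftover file from an earlier run should be deleted before generating data again, otherwise the new numbers are added to the old ones.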

Step 2: split the data, writing every 1,000 lines to a separate small file.

// Split the large file into smaller files of at most `size` lines each
    private static List<String> partData(String fileName, int size) {
        System.out.println("========split into small files====start====size=="+size);
        List<String> partFileNameList = new ArrayList<String>();
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            String line = null;
            int lineNum = 0;
            StringBuilder sb = new StringBuilder();
            String partFileName = "";
            int batchNum = 0;
            while ((line = br.readLine()) != null){
                lineNum++;
                sb.append(line).append("\n");
                // a full chunk has been collected: write it to its own file
                if (lineNum % size == 0){
                    batchNum++;
                    // partFileNamePrefix is a field of the enclosing class (not shown in this post)
                    partFileName = partFileNamePrefix+batchNum+".txt";
                    FileUtils.write2File(partFileName, sb.toString());
                    partFileNameList.add(partFileName);
                    sb = new StringBuilder();
                }
            }

            // write the remaining lines of the last partial chunk
            if(sb.length() > 0){
                batchNum++;
                partFileName = partFileNamePrefix+batchNum+".txt";
                FileUtils.write2File(partFileName, sb.toString());
                partFileNameList.add(partFileName);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("========split into small files====end======");
        return partFileNameList;
    }

Step 3: sort each small file in memory and save the result as a new file. The data volume is small, so bubble sort is used.

private static String sortPartFileData(String partFile) {
        System.out.println("======sort one small file======start==========");
        String sortedFileName = null;
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(partFile))) {
            // read the numbers into a List, skipping lines that are not plain digits
            List<Integer> list = new ArrayList<Integer>();
            String line = null;
            while((line = br.readLine()) != null){
                if(NumberUtils.isDigits(line)){
                    list.add(NumberUtils.toInt(line));
                }
            }
            // sort the List
            list = SortUtils.bubbleSort(list);
            // write the sorted result to a new file and return the new file name
            sortedFileName = partFile.replace("aa_part_", "aa_part_sorted_");
            StringBuilder sb = new StringBuilder();
            for (Integer num : list){
                sb.append(num).append("\n");
            }
            // write the new file
            FileUtils.write2File(sortedFileName, sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("======sort one small file======end==========");
        return sortedFileName;
    }
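
`SortUtils.bubbleSort` is not shown in this post either. A minimal sketch of what such a helper could look like follows, assuming it returns a new list sorted in ascending order; the class name `SortUtilsSketch` and the decision to copy the input rather than sort in place are my own assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortUtilsSketch {
    // Bubble sort: returns an ascending sorted copy and leaves the input untouched.
    static List<Integer> bubbleSort(List<Integer> input) {
        List<Integer> list = new ArrayList<>(input);
        for (int end = list.size() - 1; end > 0; end--) {
            boolean swapped = false;
            // after this pass the largest remaining value sits at index `end`
            for (int i = 0; i < end; i++) {
                if (list.get(i) > list.get(i + 1)) {
                    Integer tmp = list.get(i);
                    list.set(i, list.get(i + 1));
                    list.set(i + 1, tmp);
                    swapped = true;
                }
            }
            if (!swapped) {
                break; // no swaps in a full pass means the list is already sorted
            }
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(bubbleSort(Arrays.asList(5, 2, 8, 1, 3))); // prints [1, 2, 3, 5, 8]
    }
}
```

Bubble sort takes quadratic time, which is acceptable for 1,000 lines per file. For larger chunks, `Collections.sort(list)` from the standard library would likely be the more practical choice.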

Step 4: merge the results by repeatedly taking the smallest value across the sorted sub-files. For this demo, all the sorted sub-files are read into memory first.

// Repeatedly take the smallest value across the sorted sub-files to build the merged result
    private static String sortDataFromSortedPartFile(List<String> sortedPartFileList) {
        String resultFileName = "D:\\data\\a\\aa_sorted_result.txt";
        // current read position within each sub-file; -1 marks a sub-file as exhausted
        Map<String, Integer> minIndex = new HashMap<String, Integer>();
        for (String fn : sortedPartFileList){
            minIndex.put(fn,0);
        }
        // Trade-off between IO cost and memory cost: either open and read the files each time,
        // or keep the sub-file contents in memory.
        // The demo data is small, so for simplicity everything is read into memory here.
        Map<String, List<String>> partData = new HashMap<String, List<String>>();
        for (String fn : sortedPartFileList){
            List<String> dataList = FileUtils.readFileData(fn);
            partData.put(fn, dataList);
        }

        StringBuilder sb = new StringBuilder();
        int counter = 0;
        int batchSize = 10000;
        while(true){
            int min = Integer.MAX_VALUE;
            String useFn = "";
            for (String fn : sortedPartFileList){
                int index = minIndex.get(fn);
                // skip sub-files already marked as exhausted
                if(index < 0){
                  continue;
                }
                // no value left at this position: mark the sub-file as exhausted with -1
                List<String> dataList = partData.get(fn);
                if (index >= dataList.size()){
                    minIndex.put(fn,-1);
                    continue;
                }

                String dataStr = dataList.get(index);

                // an unparsable line falls back to the current min, so it never wins the comparison
                int data = NumberUtils.toInt(dataStr, min);
                if(data < min){
                    min = data;
                    useFn = fn;
                }
            }

            if (isPartFileDataOver(minIndex)){
                // write whatever is left over from the last partial batch
                if(sb.length() > 0){
                    FileUtils.write2File(resultFileName, sb.toString());
                }
                System.out.println("===========sorting finished===========");
                break;
            }

            System.out.println(useFn+"=======min=========="+min);
            // the loop above has found the smallest remaining value
            sb.append(min).append("\n");
            // advance the read position of the sub-file that supplied it
            minIndex.put(useFn,minIndex.get(useFn)+1);
            counter++;
            if(counter % batchSize == 0){
                FileUtils.write2File(resultFileName, sb.toString());
                sb = new StringBuilder();
            }
        }

        return resultFileName;
    }
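
The merge loop relies on `isPartFileDataOver`, which is not shown in this post. Judging from how `minIndex` is used (a position of -1 marks an exhausted sub-file), it presumably checks that every entry is negative. A minimal sketch under that assumption, with a made-up class name `MergeStateSketch`:

```java
import java.util.HashMap;
import java.util.Map;

class MergeStateSketch {
    // True only when every sub-file has been marked as exhausted (position < 0).
    static boolean isPartFileDataOver(Map<String, Integer> minIndex) {
        for (int index : minIndex.values()) {
            if (index >= 0) {
                return false; // at least one sub-file may still have data
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> minIndex = new HashMap<>();
        minIndex.put("part_1", -1);
        minIndex.put("part_2", 3);
        System.out.println(isPartFileDataOver(minIndex)); // prints false
        minIndex.put("part_2", -1);
        System.out.println(isPartFileDataOver(minIndex)); // prints true
    }
}
```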

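One point I struggled with: I read all the sub-files into memory, which is effectively the same as loading the whole large file, so a very large file would need a lot of memory. But if the data is not all read in at once, every minimum comparison has to read the sub-files again, and the IO cost would be even higher. I have not come up with a better solution here. Suggestions from experienced readers are welcome in the comments.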
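
One commonly used way around this trade-off, offered here as a suggestion and not as part of the solution above, is to keep one `BufferedReader` open per sorted sub-file and hold only the current head value of each file in a `PriorityQueue`. Each line is then read from disk exactly once, and memory use grows with the number of sub-files instead of the total number of values. The sketch below assumes every line of every sub-file is a valid integer; the names `StreamingMergeSketch`, `Head`, `next` and `merge` are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class StreamingMergeSketch {
    // One heap entry: the smallest unread value of a sub-file plus that file's open reader.
    private static final class Head {
        final int value;
        final BufferedReader reader;
        Head(int value, BufferedReader reader) { this.value = value; this.reader = reader; }
    }

    // Read the next value from a reader; null means the sub-file is exhausted.
    // Assumption: every line is a valid integer (no blank or malformed lines).
    private static Head next(BufferedReader reader) throws IOException {
        String line = reader.readLine();
        return line == null ? null : new Head(Integer.parseInt(line.trim()), reader);
    }

    static void merge(List<String> sortedFiles, String resultFile) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(resultFile), StandardCharsets.UTF_8)) {
            // seed the heap with the first value of every sub-file
            for (String fn : sortedFiles) {
                BufferedReader r = Files.newBufferedReader(Paths.get(fn), StandardCharsets.UTF_8);
                readers.add(r);
                Head h = next(r);
                if (h != null) heap.add(h);
            }
            // pop the global minimum, then refill from the file it came from
            while (!heap.isEmpty()) {
                Head min = heap.poll();
                out.write(Integer.toString(min.value));
                out.newLine();
                Head h = next(min.reader);
                if (h != null) heap.add(h);
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge_demo");
        Path a = dir.resolve("a.txt"), b = dir.resolve("b.txt"), result = dir.resolve("result.txt");
        Files.write(a, Arrays.asList("1", "5", "9"));
        Files.write(b, Arrays.asList("2", "3", "10"));
        merge(Arrays.asList(a.toString(), b.toString()), result.toString());
        System.out.println(Files.readAllLines(result)); // prints [1, 2, 3, 5, 9, 10]
    }
}
```

With k sub-files, each output value costs one heap operation of O(log k) instead of a scan over all k files. The number of simultaneously open files is bounded by the operating system, so with very many sub-files the merge may have to be done in several passes.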