Reading and Sorting Large Files in Java

Requirement:
A CSV file contains tens of millions of rows and needs to be sorted by one of its columns.
For example:
1,royzhou1985@163.com,13752468532,123,1
1,royzhou1985@183.com,13752465532,123,1
1,royzhou1985@173.com,13752463532,123,1

It must be possible to sort by any one of the columns, such as the email address or the mobile phone number.
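
To make "sort by one column" concrete, here is a minimal sketch of pulling the key out of a line and sorting a small in-memory list by it. The class name CsvKey and both method names are made up for this illustration, and it assumes plain comma-separated values with no quoted fields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the original code.
final class CsvKey {
    private CsvKey() {}

    // Returns the column at keyIndex (0-based), or "" if the line is null or too short.
    // Assumption: fields never contain quoted commas.
    static String keyOf(String line, int keyIndex) {
        if (line == null) {
            return "";
        }
        String[] columns = line.split(",", -1); // -1 keeps trailing empty columns
        return keyIndex < columns.length ? columns[keyIndex] : "";
    }

    // Returns a copy of the lines sorted by the chosen column (plain string order).
    static List<String> sortByColumn(List<String> lines, int keyIndex) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparing((String line) -> keyOf(line, keyIndex)));
        return sorted;
    }
}
```

With the three sample rows above, sortByColumn(lines, 1) orders them by email address and sortByColumn(lines, 2) by phone number. Keys are compared as strings, which is fine for phone numbers of equal length but would not order numbers of different lengths numerically.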

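Implementation:
To avoid running out of memory, read a fixed number of records at a time, for example 100,000 rows. Sort them with the Java API Collections.sort() and write them to a temporary file. This splits the large file into many small files, each of them already sorted.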
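
The splitting step on its own can be sketched as follows. This is a simplified, single-threaded version and not the code from this post: the class name ChunkSplitter is invented, the file is assumed to be UTF-8, and the chunk naming (input name plus ".tmp" plus an index) simply mirrors the convention used further down.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the split phase; names are illustrative only.
final class ChunkSplitter {
    private ChunkSplitter() {}

    // Splits input into sorted chunk files of at most batchSize lines each
    // and returns the chunk paths in creation order. batchSize must be positive.
    static List<Path> split(Path input, int batchSize, Comparator<String> order) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    chunks.add(writeSorted(batch, order, input, chunks.size()));
                    batch.clear(); // only one batch is ever held in memory
                }
            }
        }
        // Write whatever is left over after the last full batch.
        if (!batch.isEmpty()) {
            chunks.add(writeSorted(batch, order, input, chunks.size()));
        }
        return chunks;
    }

    // Sorts one batch in memory and writes it next to the input file.
    private static Path writeSorted(List<String> batch, Comparator<String> order,
                                    Path input, int index) throws IOException {
        batch.sort(order);
        Path chunk = input.resolveSibling(input.getFileName() + ".tmp" + index);
        try (BufferedWriter writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
            for (String line : batch) {
                writer.write(line);
                writer.newLine();
            }
        }
        return chunk;
    }
}
```

The comparator decides which column is used; passing a comparator that compares the extracted key of each line gives the per-column ordering described in the requirement.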

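Then open all of these small files and read one line from each. Take the smallest value among them and write it out, read the next line from the file that value came from, and repeat until every file has been read to the end.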

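Problem:
Performance is poor: merging the small files back into one large file requires a great many comparisons.

Improvement:
Splitting the file can use multiple threads: each batch of 100,000 rows that has been read is handed to a worker thread to be sorted.
Merging the files? This part is still an open question.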
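
One common answer to the merge question is to keep the current line of every small file in a min-heap instead of re-sorting or scanning all of them for each output line. With 10 million rows in batches of 100,000 there are about 100 small files, so a scan touches on the order of 100 entries per output line, while a binary heap needs on the order of log2(100), roughly 7, steps per insert or removal. The sketch below uses java.util.PriorityQueue; the class name HeapMerger and the nested Head type are invented for this example, and UTF-8 is assumed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a heap-based k-way merge; not the code from this post.
final class HeapMerger {
    private HeapMerger() {}

    // One pending line together with the reader it came from.
    private static final class Head {
        final String line;
        final BufferedReader source;

        Head(String line, BufferedReader source) {
            this.line = line;
            this.source = source;
        }
    }

    // Merges already-sorted chunk files into output and returns the number of lines written.
    static int merge(List<Path> chunks, Path output, Comparator<String> order) throws IOException {
        Comparator<Head> byLine = (a, b) -> order.compare(a.line, b.line);
        PriorityQueue<Head> heap = new PriorityQueue<>(Math.max(1, chunks.size()), byLine);
        List<BufferedReader> readers = new ArrayList<>();
        int total = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            // Seed the heap with the first line of every non-empty chunk.
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new Head(first, reader));
                }
            }
            // Repeatedly emit the smallest line and refill from the same chunk.
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                writer.write(smallest.line);
                writer.newLine();
                total++;
                String next = smallest.source.readLine();
                if (next != null) {
                    heap.add(new Head(next, smallest.source));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
        return total;
    }
}
```

An exhausted chunk simply stops contributing entries, so the loop ends when the heap is empty and no sentinel value such as a null key is needed. The comparator passed in must be the same ordering that was used to sort the chunks.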

Code:

package com.royzhou.sort;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import com.roy.zhou.util.SysConfig;

public class FileSorter {

    /**
     * Sorts the file in two phases: split it into sorted chunks, then merge the chunks.
     * Returns the number of rows written to the sorted output.
     */
    public int sort(String fileName, LineProcessor lineProcessor) throws InterruptedException, IOException {
        int index = split(fileName, lineProcessor);
        return merge(fileName, index, lineProcessor);
    }

    /**
     * Reads the input in batches of SysConfig.BATCH_ROW_COUNT rows and hands each batch
     * to the thread pool. Returns the number of batches dispatched.
     */
    public int split(String fileName, LineProcessor lineProcessor) throws InterruptedException, IOException {
        int fileIndex = 0;
        BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(SysConfig.THREAD_NUMBER, SysConfig.THREAD_NUMBER, 600, TimeUnit.SECONDS, workQueue);
        int row = 0;
        String sLine = null;
        String sKey = null;
        List<SortedData> sList = new ArrayList<SortedData>();
        //LineProcessor lineProcessor = new CSVLineProcessor(SysConfig.KEY_INDEX);
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
        try {
            while ((sLine = br.readLine()) != null) {
                sKey = lineProcessor.process(sLine);
                sList.add(new SortedData(sKey, sLine, br));
                row++;

                // A full batch has been read: dispatch it and start a new one.
                if (row % SysConfig.BATCH_ROW_COUNT == 0) {
                    new FileSplitController(pool, workQueue, sList, fileName, fileIndex).dispatchTask();
                    sList = new ArrayList<SortedData>();
                    fileIndex++;
                }
            }
            /*
             * Dispatch whatever records are left over after the last full batch.
             */
            if (sList.size() > 0) {
                new FileSplitController(pool, workQueue, sList, fileName, fileIndex).dispatchTask();
                fileIndex++;
            }
        } finally {
            // Release the input file, and stop accepting new tasks even if reading failed.
            br.close();
            pool.shutdown();
        }
        /*
         * shutdown() still runs every task that is already queued,
         * so keep waiting in 5-second steps until all of them have finished.
         */
        while (!pool.isTerminated()) {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return fileIndex;
    }

    /**
     * Merges the sorted chunk files fileName.tmp0 .. fileName.tmp(index-1) into fileName.sorted.
     * Returns the number of rows written.
     */
    public int merge(String fileName, int index, LineProcessor lineProcessor) throws IOException {
        int totalCount = 0;
        BufferedReader[] fileReaders = new BufferedReader[index];
        File[] tempFiles = new File[index];
        // Current head record of every chunk that still has unread rows.
        List<SortedData> sortedDatas = new ArrayList<SortedData>(index);
        String outputFile = fileName + ".sorted";
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile)));
        try {
            // Open every chunk and load its first row.
            for (int i = 0; i < index; i++) {
                String iFilePath = fileName + ".tmp" + i;
                tempFiles[i] = new File(iFilePath);
                fileReaders[i] = new BufferedReader(new InputStreamReader(new FileInputStream(iFilePath)));
                String sLine = fileReaders[i].readLine();
                if (sLine != null) {
                    sortedDatas.add(new SortedData(lineProcessor.process(sLine), sLine, fileReaders[i]));
                }
            }

            // Stop when every chunk is exhausted, rather than on a null or empty key,
            // so that rows whose key column is empty are still written.
            while (!sortedDatas.isEmpty()) {
                // Re-sort so that the smallest head record is at index 0.
                Collections.sort(sortedDatas);
                SortedData smallestData = sortedDatas.get(0);
                BufferedReader tempReader = smallestData.getFileReader();
                bw.write(smallestData.getContent() + "\n");
                totalCount++;
                String tempContent = tempReader.readLine();
                if (tempContent == null) {
                    // This chunk is finished: drop it from the candidates.
                    sortedDatas.remove(0);
                } else {
                    sortedDatas.set(0, new SortedData(lineProcessor.process(tempContent), tempContent, tempReader));
                }
            }
            bw.flush();
        } finally {
            bw.close();
            for (int i = 0; i < index; i++) {
                if (fileReaders[i] != null) {
                    fileReaders[i].close();
                }
            }
        }

        // Remove the temporary chunk files once the merge has succeeded.
        for (int i = 0; i < index; i++) {
            tempFiles[i].delete();
        }
        System.out.println(totalCount);
        return totalCount;
    }
}
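
FileSorter depends on LineProcessor and SortedData, which are not shown in this post. Purely as a guess based on how they are called above (a three-argument constructor, getKey(), getContent(), getFileReader(), and use with Collections.sort()), they might look roughly like this. Everything in the block is a hypothetical reconstruction, including the choice to sort null keys last.

```java
import java.io.BufferedReader;

// Hypothetical reconstruction: extracts the sort key from one line of the file.
interface LineProcessor {
    String process(String line);
}

// Hypothetical reconstruction: one row plus its sort key and the reader it came from.
final class SortedData implements Comparable<SortedData> {
    private final String key;
    private final String content;
    private final BufferedReader fileReader;

    SortedData(String key, String content, BufferedReader fileReader) {
        this.key = key;
        this.content = content;
        this.fileReader = fileReader;
    }

    String getKey() {
        return key;
    }

    String getContent() {
        return content;
    }

    BufferedReader getFileReader() {
        return fileReader;
    }

    // Orders by key using plain string comparison.
    // Assumption: records with a null key sort after every real key.
    @Override
    public int compareTo(SortedData other) {
        if (key == null) {
            return other.key == null ? 0 : 1;
        }
        if (other.key == null) {
            return -1;
        }
        return key.compareTo(other.key);
    }
}
```

A LineProcessor for the CSV sample could be as small as a lambda that splits the line on commas and returns the column at the configured index.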