1. Background
If the file were small, we could simply read the whole thing into memory and do the counting there. But our file is large, and reading it in all at once makes the program fail with an out-of-memory error.
Example code:
// Naive approach: accumulate every line of the file into one StringBuilder
FileInputStream in = new FileInputStream("word");
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(in));
StringBuilder sb = new StringBuilder();
String line;
while ((line = bufferedReader.readLine()) != null) {
    sb.append(line);
}
word is a 4 GB file, so running this throws a java.lang.OutOfMemoryError: there is not enough heap to hold the whole file. We therefore cannot read the file into memory in one go.
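For reference, the JVM's heap ceiling can be inspected at runtime; if the file is bigger than that, a whole-file read is doomed from the start. A quick sketch (the printed value is whatever -Xmx allows on your machine):

// Maximum heap the JVM may use; a 4 GB file cannot fit unless the heap
// is comfortably larger than 4 GB
System.out.println("max heap: " + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");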
2. Solutions
2.1 Single-threaded version
Read one buffer-full, count the words in it, then refill the buffer, and repeat until the input stream is exhausted. The program:
public void compare_with_single() throws IOException {
    // Always read through a BufferedInputStream rather than the raw stream;
    // buffered reads are much faster
    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("word"))) {
        byte[] buf = new byte[4 * 1024];
        int len;
        Map<String, Integer> total = new HashMap<>();
        long start = System.currentTimeMillis();
        while ((len = in.read(buf)) != -1) { // read the next chunk into buf
            // Copy out only the bytes actually read: the last read may not fill the
            // buffer, and stale bytes from the previous read would corrupt the tail
            byte[] bytes = Arrays.copyOfRange(buf, 0, len);
            String str = new String(bytes);
            // StringTokenizer splits on whitespace by default, so no separate array
            // is needed; iterate with hasMoreTokens()/nextToken()
            StringTokenizer stringTokenizer = new StringTokenizer(str);
            while (stringTokenizer.hasMoreTokens()) {
                String strTemp = stringTokenizer.nextToken();
                total.put(strTemp, total.getOrDefault(strTemp, 0) + 1);
            }
        }
        System.out.println(total.get("aabaa"));
        System.out.println("time: " + (System.currentTimeMillis() - start) + "ms");
    }
}
Always read the file through a buffer; reading the raw stream directly is much slower.
Time taken: 81740ms
Occurrences of "aabaa": 319783
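The buffer matters because of system calls: each read() on a bare FileInputStream goes to the OS, while BufferedInputStream serves most reads from an in-memory block. Below is a minimal sketch to measure the gap yourself; compareReads is a hypothetical helper, not part of the original program, and byte-at-a-time unbuffered reads are extremely slow, so try it on a smaller file first:

// Hypothetical micro-benchmark: single-byte reads, unbuffered vs buffered
public void compareReads(String fileName) throws IOException {
    long t0 = System.currentTimeMillis();
    try (InputStream raw = new FileInputStream(fileName)) {
        while (raw.read() != -1) { } // one system call per byte
    }
    System.out.println("raw:      " + (System.currentTimeMillis() - t0) + "ms");

    long t1 = System.currentTimeMillis();
    try (InputStream in = new BufferedInputStream(new FileInputStream(fileName))) {
        while (in.read() != -1) { } // most reads hit the internal 8 KB buffer
    }
    System.out.println("buffered: " + (System.currentTimeMillis() - t1) + "ms");
}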
This version runs on a single thread. Would multiple threads give us a speedup?
2.2 Multi-threaded version
// Multi-threaded version
// First, a thread pool
final ForkJoinPool pool = ForkJoinPool.commonPool();

// Worker task: each instance counts the words in one slice of the file
class CountTask implements Callable<HashMap<String, Integer>> {
    // Each task handles one segment, so it needs start and end offsets
    private final long start;
    private final long end;
    private final String fileName;

    public CountTask(String fileName, long start, long end) {
        this.start = start;
        this.end = end;
        this.fileName = fileName;
    }

    @Override
    public HashMap<String, Integer> call() throws Exception {
        HashMap<String, Integer> map = new HashMap<>();
        // "r" is enough here: the mapping below is read-only
        FileChannel channel = new RandomAccessFile(this.fileName, "r").getChannel();
        // Map [start, end) of the file straight into memory.
        // A normal read copies Device -> kernel space -> user-space buffer -> thread;
        // mmap maps the kernel's page cache into user space, skipping one copy.
        MappedByteBuffer mbuf = channel.map(
                FileChannel.MapMode.READ_ONLY,
                this.start,
                this.end - this.start
        );
        // Decode as US-ASCII (assumes the file is plain ASCII text)
        String str = StandardCharsets.US_ASCII.decode(mbuf).toString();
        StringTokenizer stringTokenizer = new StringTokenizer(str);
        while (stringTokenizer.hasMoreTokens()) {
            String strTemp = stringTokenizer.nextToken();
            map.put(strTemp, map.getOrDefault(strTemp, 0) + 1);
        }
        return map;
    }
}
// Takes the file name and the slice size
public void run(String fileName, long chunkSize) throws ExecutionException, InterruptedException {
    File file = new File(fileName);
    long fileSize = file.length();
    long position = 0;
    long start = System.currentTimeMillis();
    ArrayList<Future<HashMap<String, Integer>>> tasks = new ArrayList<>();
    while (position < fileSize) {
        // Cut the file into chunkSize-byte slices
        long next = Math.min(position + chunkSize, fileSize);
        CountTask task = new CountTask(fileName, position, next);
        position = next;
        // Submit the slice; a pool thread starts counting it
        ForkJoinTask<HashMap<String, Integer>> future = pool.submit(task);
        tasks.add(future);
    }
    // Merge each task's result into one map
    HashMap<String, Integer> totalMap = new HashMap<>();
    for (Future<HashMap<String, Integer>> task : tasks) {
        HashMap<String, Integer> map = task.get();
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            if (totalMap.containsKey(entry.getKey())) {
                totalMap.put(entry.getKey(), totalMap.get(entry.getKey()) + entry.getValue());
            } else {
                totalMap.put(entry.getKey(), entry.getValue());
            }
        }
    }
    System.out.println("time: " + (System.currentTimeMillis() - start) + "ms");
    System.out.println(totalMap.get("aabaa"));
}
@Test
public void count() throws ExecutionException, InterruptedException {
    WordCount counter = new WordCount();
    counter.run("word", 1024 * 1024);
}
Results:
Time taken: 115909ms
Occurrences of "aabaa": 320106
Why is it even slower? The main cost is merging the per-task maps into one: for every entry we first check whether the key already exists, and with 1 MB slices a 4 GB file yields roughly 4096 maps to merge. If we make each slice longer there are fewer maps to merge, and the total time drops. Let's make the slice 20 times larger:
@Test
public void count() throws ExecutionException, InterruptedException {
    WordCount counter = new WordCount();
    counter.run("word", 1024 * 1024 * 20);
}
Time taken: 26671ms
Occurrences of "aabaa": 320107
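Besides enlarging the slices, the merge loop itself can be tightened. Since Java 8, HashMap.merge folds the containsKey check and the put into a single call; a sketch of how the inner loop of run could look (same result, one lookup fewer per entry):

for (Map.Entry<String, Integer> entry : map.entrySet()) {
    // inserts the value if the key is absent, otherwise adds the two counts
    totalMap.merge(entry.getKey(), entry.getValue(), Integer::sum);
}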
If you're interested, try tuning the slice size yourself and see how it changes the numbers.
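One final caveat, which is also the likely reason the runs above report slightly different totals (319783 vs 320106 vs 320107): a word that happens to straddle a 4 KB buffer or a slice boundary gets cut in two and counted as two fragments. One way to fix the multi-threaded version is to push each slice boundary forward to the next whitespace byte before creating the task. A sketch, using a hypothetical helper and assuming a single-byte encoding:

// Hypothetical helper: extend `end` until it sits on a whitespace byte (or EOF),
// so that no word is split across two slices
private static long alignToWhitespace(RandomAccessFile file, long end) throws IOException {
    long fileSize = file.length();
    file.seek(end);
    int b;
    while (end < fileSize && (b = file.read()) != -1 && !Character.isWhitespace(b)) {
        end++;
    }
    return end;
}

In run, each slice's end would become alignToWhitespace(raf, Math.min(position + chunkSize, fileSize)), and the same aligned offset is used as the next slice's start, so every word lands wholly inside exactly one slice.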