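1 Sequential scan
When asked to compute the total size of a directory, the first approach that comes to mind is to walk every file in order and accumulate the sizes. The following example computes the directory size sequentially.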
import java.io.File;
import java.util.Scanner;

public class TotalFileSizeSequential {
    private long getTotalSizeOfFileInDir(final File file) {
        if (file.isFile()) {
            return file.length();
        }
        long total = 0;
        // listFiles() returns null if the path is not a directory or cannot be read
        final File[] children = file.listFiles();
        if (children != null) {
            for (final File child : children) {
                total += getTotalSizeOfFileInDir(child); // recurse into each child
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        String str = scanner.next(); // directory to scan
        final long start = System.nanoTime();
        final long total = new TotalFileSizeSequential().getTotalSizeOfFileInDir(new File(str));
        final long end = System.nanoTime();
        System.out.println("Total Size: " + total);
        System.out.println("Time taken: " + (end - start) / 1.0e9);
    }
}
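For comparison, the same sequential walk can also be written with the java.nio.file API. This is a minimal sketch of a swapped-in technique, not the approach used in this article: the class name WalkTotalFileSize and the method totalSize are hypothetical, and Files.walk does not follow symbolic links by default, so its result can differ from the File-based version on trees that contain links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class WalkTotalFileSize {
    // Sums the sizes of all regular files below root, visiting them one by one.
    static long totalSize(Path root) throws IOException {
        // Files.walk holds directory handles open, so the stream must be closed
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            // lambdas cannot throw checked exceptions
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        }
    }
}
```

The traversal is still strictly sequential, so it has the same performance characteristics as the recursive version above.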
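2 Thread-unsafe scan
As the example above shows, no multithreading is used: the sizes are accumulated in plain execution order. The result is correct, but the approach is not efficient. The next example therefore uses multiple threads, but it runs into a problem: thread deadlock. Let us look at the code first and then analyze it in detail: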
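import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NaivelyConcurrentTotalFileSize {
    private long getTotalSizeOfFileInDir(final ExecutorService service, final File file)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (file.isFile()) {
            return file.length();
        }
        long total = 0;
        final File[] children = file.listFiles();
        if (children != null) {
            final List<Future<Long>> partialTotalFutures = new ArrayList<Future<Long>>();
            for (final File child : children) {
                // hand each child to the pool as a task of its own
                partialTotalFutures.add(service.submit(new Callable<Long>() {
                    public Long call()
                            throws InterruptedException, ExecutionException, TimeoutException {
                        return getTotalSizeOfFileInDir(service, child);
                    }
                }));
            }
            for (final Future<Long> partialTotalFuture : partialTotalFutures) {
                // blocks this thread until the child's result arrives or the timeout expires
                total += partialTotalFuture.get(100, TimeUnit.SECONDS);
            }
        }
        return total;
    }

    private long getTotalSizeOfFile(final String fileName)
            throws InterruptedException, ExecutionException, TimeoutException {
        final ExecutorService service = Executors.newFixedThreadPool(100);
        try {
            return getTotalSizeOfFileInDir(service, new File(fileName));
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args)
            throws InterruptedException, ExecutionException, TimeoutException {
        Scanner scanner = new Scanner(System.in);
        String str = scanner.next(); // directory to scan
        final long start = System.nanoTime();
        final long total = new NaivelyConcurrentTotalFileSize().getTotalSizeOfFile(str);
        final long end = System.nanoTime();
        System.out.println("Total Size: " + total);
        System.out.println("Time taken: " + (end - start) / 1.0e9);
    }
}
This example creates a thread pool of size 100 and puts a timeout on each Future.get() call; if a result does not arrive in time, a TimeoutException is thrown. When the directory contains few files there is no problem. When the directory tree is large, however, every child encountered is submitted through service.submit, that is, scheduled onto another thread. Because each task then blocks waiting for the results of its children, the pool can run out of threads before the scan reaches the deepest directories: every pool thread is waiting for a response from some task, while those tasks sit in the ExecutorService queue waiting for a chance to run. The calls are recursive, from the outermost directory inward, so the outer tasks wait for the innermost results, but no thread is free to run the innermost tasks, and the program deadlocks. Only the timeout keeps the program from hanging forever.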
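The starvation described in this section can be reproduced without any file system at all by shrinking the pool. The following is a minimal sketch under stated assumptions: the class PoolStarvationDemo, the method outerTaskStarves, the 500 ms timeout, and the value 42 are all hypothetical and chosen only for illustration. An outer task submits an inner task to the same pool and then waits for it, just as a directory task waits for its subdirectory tasks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PoolStarvationDemo {
    // Returns true if the outer task timed out while waiting for its inner task.
    static boolean outerTaskStarves(final int poolSize) throws InterruptedException {
        final ExecutorService service = Executors.newFixedThreadPool(poolSize);
        try {
            Future<Integer> outer = service.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    // The inner task is queued on the same pool...
                    Future<Integer> inner = service.submit(new Callable<Integer>() {
                        public Integer call() {
                            return 42;
                        }
                    });
                    // ...while the outer task keeps occupying a pool thread as it waits.
                    return inner.get(500, TimeUnit.MILLISECONDS);
                }
            });
            try {
                outer.get();
                return false; // the inner task found a free thread
            } catch (ExecutionException e) {
                // the TimeoutException thrown inside the outer task arrives wrapped
                return e.getCause() instanceof TimeoutException;
            }
        } finally {
            service.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("pool of 1 starves: " + outerTaskStarves(1));
        System.out.println("pool of 2 starves: " + outerTaskStarves(2));
    }
}
```

With a single pool thread the inner task can never start, because the only thread is held by the task waiting for it. With two threads the inner task runs immediately. A pool of 100 threads only postpones the same effect until the directory tree is wide and deep enough.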
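3 Thread-safe but verbose scan
The analysis above shows that the deadlock arises because the tasks for the many subdirectories hold pool threads while they wait. One improvement is for each task, when it scans a directory, to return that directory's list of subdirectories and the total size of its files to the main thread instead of waiting on child tasks. As before, the example comes first and the explanation follows: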
package gibbon.thread;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ConcurrentTotalFileSize {
    // Result of scanning one directory: the size of its direct files plus its subdirectories.
    static class SubDirectoriesAndSize {
        public final long size;
        public final List<File> subDirectories;

        public SubDirectoriesAndSize(final long totalSize, final List<File> theSubDirs) {
            size = totalSize;
            subDirectories = theSubDirs;
        }
    }

    private SubDirectoriesAndSize getTotalAndSubDirs(final File file) {
        long total = 0;
        final List<File> subDirectories = new ArrayList<File>();
        // only a directory has children to inspect
        if (file.isDirectory()) {
            final File[] children = file.listFiles();
            if (children != null) {
                for (final File child : children) {
                    if (child.isFile()) {
                        total += child.length();
                    } else {
                        subDirectories.add(child);
                    }
                }
            }
        }
        return new SubDirectoriesAndSize(total, subDirectories);
    }

    private long getTotalSizeOfFileInDir(final File file)
            throws InterruptedException, ExecutionException, TimeoutException {
        long total = 0;
        final ExecutorService service = Executors.newFixedThreadPool(100);
        try {
            final List<File> directories = new ArrayList<File>();
            directories.add(file);
            while (!directories.isEmpty()) {
                final List<Future<SubDirectoriesAndSize>> partialResults =
                        new ArrayList<Future<SubDirectoriesAndSize>>();
                // submit one task per directory on the current level
                for (final File directory : directories) {
                    partialResults.add(service.submit(new Callable<SubDirectoriesAndSize>() {
                        public SubDirectoriesAndSize call() {
                            return getTotalAndSubDirs(directory);
                        }
                    }));
                }
                directories.clear();
                // collect the results; the subdirectories found form the next level
                for (final Future<SubDirectoriesAndSize> partialResultFuture : partialResults) {
                    // how long this wait needs to be depends on the size of the thread pool
                    final SubDirectoriesAndSize subDirectoriesAndSize =
                            partialResultFuture.get(100, TimeUnit.SECONDS);
                    directories.addAll(subDirectoriesAndSize.subDirectories);
                    total += subDirectoriesAndSize.size;
                }
            }
        } finally {
            // always release the pool threads; otherwise the JVM cannot exit
            service.shutdown();
        }
        return total;
    }

    public static void main(String[] args)
            throws InterruptedException, ExecutionException, TimeoutException {
        Scanner scanner = new Scanner(System.in);
        String str = scanner.next(); // directory to scan
        final long start = System.nanoTime();
        final long total = new ConcurrentTotalFileSize().getTotalSizeOfFileInDir(new File(str));
        final long end = System.nanoTime();
        System.out.println("Total Size: " + total);
        System.out.println("Time taken: " + (end - start) / 1.0e9);
    }
}
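In the example above, starting from the top-level directory, as long as there are directories waiting to be scanned, getTotalAndSubDirs() is invoked in a separate thread for each of them. Once all the tasks have responded, the partial file sizes are added to the running total and the returned subdirectories are placed in the list of directories to scan next. When every subdirectory has been scanned, the total is the size of the whole directory tree. Since no task ever waits on another task, the timeout on the Future is no longer strictly necessary.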
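4 Thread-safe and concise scan
The previous example is concurrent, but it is not concise. The next one uses a BlockingQueue for the scan. A BlockingQueue behaves as follows: if the queue has no free space, an insert blocks; if the queue has no data available, a remove blocks. The example: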
package gibbon.thread;

import java.io.File;
import java.util.Scanner;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentTotalFileSizeWQueue {
    private ExecutorService service;
    private final BlockingQueue<Long> fileSizes = new ArrayBlockingQueue<Long>(500);
    private final AtomicLong pendingFileVisits = new AtomicLong();

    private void startExploreDir(final File file) {
        pendingFileVisits.incrementAndGet(); // one more visit is outstanding
        service.execute(new Runnable() {
            @Override
            public void run() {
                exploreDir(file);
            }
        });
    }

    private void exploreDir(final File file) {
        long fileSize = 0;
        if (file.isFile()) {
            fileSize = file.length();
        } else {
            final File[] children = file.listFiles();
            if (children != null) {
                for (final File child : children) {
                    if (child.isFile()) {
                        fileSize += child.length();
                    } else {
                        startExploreDir(child); // schedule the subdirectory as its own task
                    }
                }
            }
        }
        try {
            // This is where the blocking queue comes in: a task that reaches this point is
            // about to finish, so it hands its result over through the queue, which takes
            // care of both the data exchange and the synchronization between threads.
            // That is why this version is concise.
            fileSizes.put(fileSize);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        pendingFileVisits.decrementAndGet(); // this visit is done
    }

    private long getTotalSizeOfFile(final String fileName) throws InterruptedException {
        long total = 0;
        service = Executors.newFixedThreadPool(100);
        try {
            startExploreDir(new File(fileName));
            while (pendingFileVisits.get() > 0 || fileSizes.size() > 0) {
                final Long size = fileSizes.poll(10, TimeUnit.SECONDS);
                if (size != null) { // poll returns null if nothing arrived within the timeout
                    total += size;
                }
            }
        } finally {
            service.shutdown();
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        Scanner scanner = new Scanner(System.in);
        String str = scanner.next(); // directory to scan
        long start = System.nanoTime();
        long total = new ConcurrentTotalFileSizeWQueue().getTotalSizeOfFile(str);
        long end = System.nanoTime();
        System.out.println("Total Size: " + total);
        System.out.println("Time taken: " + (end - start) / 1.0e9);
    }
}
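As the example shows, using a blocking queue to compute the file sizes concurrently avoids the deadlock. Each task inserts the partial size it computed into a queue, and the main thread drains the queue and adds up the partial results. When the queue is full, further inserts block until the main thread has removed some entries, and only then can the waiting tasks continue.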
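The blocking behavior of the queue can be observed in isolation. The sketch below is only illustrative: the class BlockingQueueDemo and its three methods are hypothetical names, and the tiny capacities and 100 ms timeouts are chosen so that the effects show up quickly. It uses the timed variants offer and poll to show the full and empty cases without hanging, and then a producer thread whose put calls block until the consumer makes room.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingQueueDemo {
    // A timed insert into a full queue gives up and returns false.
    static boolean offerToFullQueue() throws InterruptedException {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(1);
        queue.put(1L); // the queue is now at capacity
        return queue.offer(2L, 100, TimeUnit.MILLISECONDS);
    }

    // A timed remove from an empty queue gives up and returns null.
    static Long pollFromEmptyQueue() throws InterruptedException {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(1);
        return queue.poll(100, TimeUnit.MILLISECONDS);
    }

    // The producer blocks in put() whenever the queue is full and resumes
    // as soon as the consumer takes an element.
    static long producerConsumerSum(final int count) throws InterruptedException {
        final BlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(2);
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (long i = 1; i <= count; i++) {
                        queue.put(i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();
        long total = 0;
        for (int i = 0; i < count; i++) {
            total += queue.take(); // blocks while the queue is empty
        }
        producer.join();
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("offer to full queue: " + offerToFullQueue());
        System.out.println("poll from empty queue: " + pollFromEmptyQueue());
        System.out.println("sum of 1..10: " + producerConsumerSum(10));
    }
}
```

Because a timed poll returns null when nothing arrives in time, code that adds the polled value to a long needs to check for null before unboxing it.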
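5 Thread-safe, concise, and efficient scan
Finally, we use ForkJoinTask to compute the directory size. As the name suggests, fork-join is about cooperation: a task forks subtasks and later joins their results, and the threads help one another to run them. The fork-join API is intended for tasks of reasonable size: if the overhead borne by each task is small, the API achieves good throughput. It also expects tasks to have no side effects (such as modifying shared variables) and to avoid synchronization or locks. The example:
package gibbon.thread;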

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

public class FileSize {
    private static final ForkJoinPool forkJoinPool = new ForkJoinPool();

    private static class FileSizeFinder extends RecursiveTask<Long> {
        private final File file;

        public FileSizeFinder(final File file) {
            this.file = file;
        }

        @Override
        protected Long compute() {
            long size = 0;
            if (file.isFile()) {
                return file.length();
            } else {
                final File[] children = file.listFiles();
                if (children != null) {
                    final List<ForkJoinTask<Long>> tasks = new ArrayList<ForkJoinTask<Long>>();
                    for (final File child : children) {
                        if (child.isFile()) {
                            size += child.length();
                        } else {
                            tasks.add(new FileSizeFinder(child)); // one subtask per subdirectory
                        }
                    }
                    // invokeAll returns only after all subtasks have completed; while a
                    // task is waiting, its thread can help run other pending tasks
                    for (final ForkJoinTask<Long> task : invokeAll(tasks)) {
                        size += task.join();
                    }
                }
            }
            return size;
        }
    }

    public static void main(final String[] args) {
        Scanner scanner = new Scanner(System.in);
        String str = scanner.next(); // directory to scan
        final long start = System.nanoTime();
        final long total = forkJoinPool.invoke(new FileSizeFinder(new File(str)));
        final long end = System.nanoTime();
        System.out.println("Total Size: " + total);
        System.out.println("Time taken: " + (end - start) / 1.0e9);
    }
}
The fork-join API is well suited to problems that can be decomposed recursively until the pieces are small enough to run sequentially.
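The idea of decomposing until a piece is small enough to run sequentially can be made explicit with a threshold. The following is a minimal sketch on a different, simpler problem (summing an array); the class ArraySumTask, the helper sum, and the THRESHOLD value of 1000 are hypothetical and not tuned for any particular machine.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ArraySumTask extends RecursiveTask<Long> {
    // Below this many elements the range is summed sequentially (assumed cutoff).
    private static final int THRESHOLD = 1000;

    private final long[] values;
    private final int from; // inclusive
    private final int to;   // exclusive

    ArraySumTask(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // small enough: run sequentially
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += values[i];
            }
            return sum;
        }
        // otherwise split the range in two halves
        int mid = from + (to - from) / 2;
        ArraySumTask left = new ArraySumTask(values, from, mid);
        ArraySumTask right = new ArraySumTask(values, mid, to);
        left.fork();                     // run the left half asynchronously
        long rightSum = right.compute(); // compute the right half in this thread
        return left.join() + rightSum;   // wait for the left half and combine
    }

    static long sum(long[] values) {
        return ForkJoinPool.commonPool().invoke(new ArraySumTask(values, 0, values.length));
    }

    public static void main(String[] args) {
        long[] values = new long[10000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i + 1;
        }
        System.out.println(sum(values)); // sum of 1..10000 = 50005000
    }
}
```

The tasks share only read-only data and return their results instead of updating a shared variable, which matches the expectation that fork-join tasks have no side effects and need no locks.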