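A recent course assignment asked us to implement MapReduce ourselves, use it to find the words Shakespeare used most often, and write the result to a txt file.
Writing the file-handling methods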
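import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
public class FileOperator {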
public static String readFile(String path){
StringBuilder res = new StringBuilder();
try(InputStreamReader read = new InputStreamReader(new FileInputStream(path));
BufferedReader reader = new BufferedReader(read)){
while(true){
String line = reader.readLine();
if(line==null)
break;
res.append(line).append(" ");
}
} catch (IOException e) {
e.printStackTrace();
}
return res.toString();
}
public static String outputResultToFile(List<Map.Entry<String,Integer>> sortedReduce){
File writeName = new File("./output.txt");
try {
if(writeName.exists()){
if(!writeName.delete())
throw new IOException("File already exist and can't be deleted");
}
if(!writeName.createNewFile())
throw new IOException("Failed in creating file");
} catch (IOException e) {
e.printStackTrace();
}
try(BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(writeName)))) {
for(Map.Entry<String,Integer> entry:sortedReduce){
out.write(entry.getKey()+","+entry.getValue());
out.newLine();
}
out.flush();
} catch (IOException e) {
e.printStackTrace();
}
return "Succeed in computing";
}
public static List<Map.Entry<String,Integer>> sort(Map<String,Integer> reduce){
List<Map.Entry<String, Integer>> sortedReduce = new ArrayList<>(reduce.entrySet());
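//sort by count, descending; Integer.compare avoids the overflow risk of subtracting the values
sortedReduce.sort((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()));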
return sortedReduce;
}
}
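The readFile() method turns the input txt file into a single string that is handed to the MapFunc class, and the outputResultToFile() method writes the result produced by the reducer class to a file. Both the reader and the writer are the buffered variants, BufferedReader and BufferedWriter, which speeds up reading and writing the files.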
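As a small illustration of the buffered reading described above, here is a minimal sketch that reads a file line by line through a BufferedReader and joins the lines with spaces, the way readFile() does. It uses Files.newBufferedReader with an explicit UTF-8 charset, which is my assumption; the original code relies on the platform default encoding. The class name BufferedReadSketch and the temporary file are hypothetical and exist only for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BufferedReadSketch {
    // Reads the whole file and joins its lines with single spaces.
    static String readAll(Path path) {
        StringBuilder res = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                res.append(line).append(' ');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return res.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical two-line input file, created only for this demo
        Path tmp = Files.createTempFile("sketch", ".txt");
        Files.write(tmp, Arrays.asList("to be", "or not to be"), StandardCharsets.UTF_8);
        System.out.println(readAll(tmp));
        Files.delete(tmp);
    }
}
```

Rethrowing the IOException instead of only printing the stack trace is a design choice of this sketch: the caller then cannot mistake a failed read for an empty file.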
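Writing the Map class
My design connects Map and Reduce with the producer-consumer pattern, so the threads need a blocking queue as a buffer to pass data between them. First, write a Transfer class: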
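import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
public class Transfer {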
public static final BlockingDeque<Map<String, List<Integer>>> buffer = new LinkedBlockingDeque<>();
public static final BlockingDeque[] pipeline = new LinkedBlockingDeque[8];
static {
for (int i = 0; i < pipeline.length ; i++) {
pipeline[i] = new LinkedBlockingDeque<Map<String,List<Integer>>>();
}
}
}
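The two class variables are the buffers. The first, buffer, is a single blocking queue; the second, pipeline, is an array of blocking queues. I wanted to test which performs better: one blocking queue shared by all threads, or a separate blocking queue between each pair of threads (one map thread and its reduce thread).
Next comes the Map class. stopWords is a Set holding the words that are excluded from the count. Its initialization logic lives in a static block, so it runs once during class initialization rather than once per instance.
The input string is converted into a String array in which each element is a word; see the code for the exact rules.
The map() method uses the single blocking queue, and mapUsingPipeline() uses one blocking queue per pair of threads. A thread pool manages the tasks, which lets threads be reused instead of creating a new thread for every task and discarding it afterwards. A CountDownLatch makes the main thread wait: it shuts the pool down only after all worker tasks have finished. I use 8 tasks here, splitting the array into 8 slices, one per thread.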
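To make the producer-consumer hand-off concrete, here is a minimal sketch with one producer and one consumer sharing a blocking queue. It is not the Transfer class itself: the class name HandOffSketch and the toy data are hypothetical, and I use the plain BlockingQueue interface because only put() and take() are needed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class HandOffSketch {
    // Runs one producer and one consumer that share a blocking queue
    // and returns the total the consumer computed.
    static int run() throws InterruptedException {
        BlockingQueue<Map<String, Integer>> buffer = new LinkedBlockingQueue<>();
        int[] total = new int[1];

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    Map<String, Integer> partial = new HashMap<>();
                    partial.put("word", i + 1);
                    buffer.put(partial); // hand one partial result to the consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    // take() blocks until the producer has put something
                    total[0] += buffer.take().get("word");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        consumer.start(); // started first on purpose: it simply waits in take()
        producer.start();
        producer.join();
        consumer.join(); // join() makes the consumer's write to total visible here
        return total[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints 6 (1 + 2 + 3)
    }
}
```

The consumer never needs to poll or sleep; the queue does the waiting, which is what lets the reduce side start before the map side has produced anything.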
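The slicing and waiting described above can be sketched in isolation as follows. This is a simplified illustration under my own assumptions, not the MapFunc code: the class name SliceSketch is hypothetical, it uses newFixedThreadPool instead of a cached pool, and it only counts non-empty elements. Because integer division leaves a remainder when the array length is not a multiple of the task count, the sketch lets the last slice absorb the leftover elements.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class SliceSketch {
    // Counts the non-empty elements of words using `tasks` worker tasks.
    static int countNonEmpty(String[] words, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger total = new AtomicInteger();
        int sliceSize = words.length / tasks;
        for (int i = 0; i < tasks; i++) {
            int start = i * sliceSize;
            // the last slice runs to the end of the array and so absorbs the remainder
            int end = (i == tasks - 1) ? words.length : start + sliceSize;
            pool.execute(() -> {
                try {
                    int local = 0;
                    for (int j = start; j < end; j++) {
                        if (!words[j].isEmpty()) {
                            local++;
                        }
                    }
                    total.addAndGet(local);
                } finally {
                    latch.countDown(); // always count down, even if the task throws
                }
            });
        }
        latch.await();   // the caller waits until every slice is done
        pool.shutdown(); // only then is the pool shut down
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        String[] words = {"a", "", "b", "c", "d", "", "e", "f", "g", "h"};
        System.out.println(countNonEmpty(words, 3)); // prints 8
    }
}
```

Counting down in a finally block is a defensive choice: if a task fails, the waiting thread is still released instead of blocking forever.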
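import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class MapFunc {
private static final Set<String> stopWords;
//blocking queue (buffer): each task puts its result here for reduce to process
private static final BlockingDeque<Map<String,List<Integer>>> blockingDeque =Transfer.buffer;
//pipeline: one queue per map/reduce thread pair
private static final BlockingDeque[] pipeline = Transfer.pipeline;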
static {
stopWords = new HashSet<>();
String path1 = "./stopwords1.txt";
String path2 = "./stopwords2.txt";
getStopWords(path1);
getStopWords(path2);
}
private static void getStopWords(String path) {
try(InputStreamReader read = new InputStreamReader(new FileInputStream(path));
BufferedReader reader = new BufferedReader(read)){
while(true){
String line = reader.readLine();
if(line==null)
break;
stopWords.add(line.toLowerCase());
}
} catch (IOException e) {
e.printStackTrace();
}
}
public void map(String string){
String[] temp = generateUnfilteredWords(string);
//TODO: multi-threaded processing
ExecutorService executorService = Executors.newCachedThreadPool();
final int tasks = 8;//number of map tasks
final CountDownLatch countDownLatch = new CountDownLatch(tasks);//the pool is shut down only after every task has finished
for (int i = 0; i < tasks ; i++) {
int finalI = i;
int purSize = temp.length/tasks;
executorService.execute(()->{
Map<String,List<Integer>> keyValues = new ConcurrentHashMap<>();
int index = finalI*purSize;
//the last task also takes the remainder, so the trailing words are not dropped
int end = (finalI == tasks - 1) ? temp.length : index + purSize;
for (int j = index; j < end; j++) {
String element = temp[j].toLowerCase();
if (!element.equals("")&&!stopWords.contains(element)) {
if(keyValues.get(element)==null){
//the list should be thread-safe
List<Integer> list = new CopyOnWriteArrayList<>();
list.add(1);
keyValues.put(element,list);
continue;
}
keyValues.get(element).add(1);
}
}
try {
blockingDeque.put(keyValues);
} catch (InterruptedException e) {
e.printStackTrace();
}
countDownLatch.countDown();
});
}
try {
countDownLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
executorService.shutdown();
}
public void mapUsingPipeline(String string){
String[] temp = generateUnfilteredWords(string);
//TODO: multi-threaded processing
ExecutorService executorService = Executors.newCachedThreadPool();
final int tasks = 8;//number of map tasks
final CountDownLatch countDownLatch = new CountDownLatch(tasks);//the pool is shut down only after every task has finished
for (int i = 0; i < tasks ; i++) {
int finalI = i;
int purSize = temp.length/tasks;
executorService.execute(()->{
Map<String,List<Integer>> keyValues = new ConcurrentHashMap<>();
int index = finalI*purSize;
//the last task also takes the remainder, so the trailing words are not dropped
int end = (finalI == tasks - 1) ? temp.length : index + purSize;
for (int j = index; j < end; j++) {
String element = temp[j].toLowerCase();
if (!element.equals("")&&!stopWords.contains(element)) {
if(keyValues.get(element)==null){
//the list should be thread-safe
List<Integer> list = new CopyOnWriteArrayList<>();
list.add(1);
keyValues.put(element,list);
continue;
}
keyValues.get(element).add(1);
}
}
try {
pipeline[finalI].put(keyValues);
} catch (InterruptedException e) {
e.printStackTrace();
}
countDownLatch.countDown();
});
}
try {
countDownLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
executorService.shutdown();
}
public static String[] generateUnfilteredWords(String string){
char[] chars = string.toCharArray();
for (int i = 0; i <chars.length ; i++) {
char c = chars[i];
if(!isCharacter(c,chars,i)){
chars[i] = ' ';
}
}
return String.valueOf(chars).split(" ");
}
private static boolean isCharacter(char c,char[] chars,int index){
return Character.isUpperCase(c) || Character.isLowerCase(c) ||
(index > 0 && index < chars.length - 1
&& c == '\'' && chars[index + 1] != ' ' && chars[index - 1] != ' ');
}
}
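To see what generateUnfilteredWords() produces, here is a self-contained copy of the same rule applied to a short sentence. The class name TokenizeSketch and the helper name keep are hypothetical; the logic mirrors the two methods above: letters are kept, an apostrophe is kept only when it sits between two non-space characters, and everything else becomes a space before splitting.

```java
import java.util.Arrays;

class TokenizeSketch {
    // Replaces every character that is not part of a word with a space, then splits on spaces.
    static String[] tokenize(String text) {
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (!keep(chars, i)) {
                chars[i] = ' ';
            }
        }
        return String.valueOf(chars).split(" ");
    }

    // A character is kept if it is a letter, or an apostrophe inside a word (as in "don't").
    static boolean keep(char[] chars, int i) {
        char c = chars[i];
        boolean letter = Character.isUpperCase(c) || Character.isLowerCase(c);
        boolean innerApostrophe = c == '\'' && i > 0 && i < chars.length - 1
                && chars[i - 1] != ' ' && chars[i + 1] != ' ';
        return letter || innerApostrophe;
    }

    public static void main(String[] args) {
        // prints [Tis, true, , I, don't, know]
        System.out.println(Arrays.toString(tokenize("Tis true, I don't know.")));
    }
}
```

The empty string in the output comes from the comma and the following space turning into two consecutive spaces. That is why the map tasks check for empty elements before counting a word.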
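Writing the reducer
Each thread in the reducer class reads the result of the corresponding map thread from the blocking queue and writes its own partial result into a ConcurrentHashMap held as a class field.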
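import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class ReduceFunc {
public static final Map<String,Integer> reduceResult = new ConcurrentHashMap<>();
//buffer that receives the results of finished map tasks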
private static final BlockingDeque<Map<String,List<Integer>>> blockingDeque = Transfer.buffer;
private static final BlockingDeque[] pipeline = Transfer.pipeline;
public Map<String,Integer> reduce(){
ExecutorService executorService = Executors.newCachedThreadPool();
int tasks = 8;
CountDownLatch countDownLatch = new CountDownLatch(tasks);
for (int i = 0; i <tasks ; i++) {
//reduce task
executorService.execute(()->{
try {
//take a map task's result from the buffer; block if none is available yet
Map<String,List<Integer>> map1 = blockingDeque.take();
for(Map.Entry<String,List<Integer>> entry:map1.entrySet()){
int sum = 0;
try {
for(Integer integer:entry.getValue()){
sum+=integer;
}
//add this task's result to the final result; merge() is atomic, whereas getOrDefault followed by put can lose updates
reduceResult.merge(entry.getKey(),sum,Integer::sum);
}
catch (NullPointerException e){
System.out.println(entry);
e.printStackTrace();
}
}
} catch (InterruptedException e) {
e.printStackTrace();
}
countDownLatch.countDown();
});
}
try {
countDownLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
executorService.shutdown();
return reduceResult;
}
public Map<String,Integer> reduceUsingPipeline(){
ExecutorService executorService = Executors.newCachedThreadPool();
int tasks = 8;
CountDownLatch countDownLatch = new CountDownLatch(tasks);
for (int i = 0; i <tasks ; i++) {
int finalI = i;
//reduce task
executorService.execute(()->{
try {
//take a map task's result from the buffer; block if none is available yet
Map<String,List<Integer>> map1 = (Map<String,List<Integer>>)pipeline[finalI].take();
for(Map.Entry<String,List<Integer>> entry:map1.entrySet()){
int sum = 0;
try {
for(Integer integer:entry.getValue()){
sum+=integer;
}
//add this task's result to the final result; merge() is atomic, whereas getOrDefault followed by put can lose updates
reduceResult.merge(entry.getKey(),sum,Integer::sum);
}
catch (NullPointerException e){
System.out.println(entry);
e.printStackTrace();
}
}
} catch (InterruptedException e) {
e.printStackTrace();
}
countDownLatch.countDown();
});
}
try {
countDownLatch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
executorService.shutdown();
return reduceResult;
}
}
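When several reducer threads add to the same key of a shared ConcurrentHashMap, the update has to be one atomic operation. A separate read followed by a put can lose counts if two threads interleave between the two calls. The sketch below shows ConcurrentHashMap.merge() doing the read-add-write as a single step. The class name MergeSketch and the fixed key "word" are hypothetical and only serve the demonstration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MergeSketch {
    // Every thread adds 1 to the same key incrementsPerThread times; returns the final count.
    static int countConcurrently(int threads, int incrementsPerThread) throws InterruptedException {
        Map<String, Integer> result = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch latch = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < incrementsPerThread; i++) {
                        // atomic on ConcurrentHashMap: insert 1 if absent, else old value + 1
                        result.merge("word", 1, Integer::sum);
                    }
                } finally {
                    latch.countDown();
                }
            });
        }
        latch.await();
        pool.shutdown();
        return result.getOrDefault("word", 0);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countConcurrently(8, 10000)); // prints 80000
    }
}
```

With merge() the total is always threads times incrementsPerThread. A getOrDefault-then-put version of the same loop may come out lower from run to run, which would be hard to notice in a word count.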
Writing the test method
import java.util.Map;
public class TestMain {
public static void main(String[] args) {
String res = FileOperator.readFile("./shakespeare.txt");
MapFunc mapper = new MapFunc();
ReduceFunc reducer = new ReduceFunc();
long begin = System.currentTimeMillis();
mapper.map(res);
Map<String,Integer> reduce = reducer.reduce();
System.out.println("processing takes "+String.valueOf(((double) System.currentTimeMillis()-begin)/1000)+"s");
System.out.println(FileOperator.outputResultToFile(FileOperator.sort(reduce)));
}
}
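Run the test method and output.txt appears in the working directory (the path ./output.txt is relative to it), so the test succeeds. In my tests, using a single blocking queue for all threads and using one blocking queue per pair of threads performed about the same.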