MapReduce编程最基础的范例应该就是Wordcount了,然后大部分就是要做一遍最大值最小值的计算。课上老师用的课本是《MapReduce编程与设计模式》,里面第一章就介绍了Wordcount ,接下来就是最大值最小值平均值标准差,其数据来源于Stack Overflow网站上的评论内容,包括评论时间、评论用户ID,评论文本。并且是以.xml文件形式做输入文件。因此读入到mapper时需要先将xml转化为map的键值对形式。transformXmlToMap(value.toString());
以下是输入文件的形式,随便造的几组数据,只改动了评论时间与用户ID,评论文本内容是直接粘的。
<row Id="1" PostId="35314" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2018-09-06T08:07:10.730" UserId="1" />
<row Id="1" PostId="35315" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2007-09-06T08:05:33.730" UserId="1" />
<row Id="1" PostId="35316" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:07:10.730" UserId="1" />
<row Id="1" PostId="35317" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-08-06T08:07:26.730" UserId="1" />
<row Id="2" PostId="35318" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-05-06T08:11:10.730" UserId="1" />
<row Id="2" PostId="35319" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:12:10.730" UserId="1" />
<row Id="2" PostId="35320" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-06-06T08:03:10.730" UserId="1" />
<row Id="2" PostId="35321" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:07:10.880" UserId="1" />
<row Id="2" PostId="35322" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2016-09-06T08:07:39.730" UserId="1" />
<row Id="2" PostId="35323" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-03-06T08:07:10.730" UserId="1" />
<row Id="3" PostId="35324" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2007-09-06T08:00:22.730" UserId="1" />
这在课本上是没有看到这个函数的内部实现的,但是仍是一个基本的工具类,可以自己实现,目的就是将文本抠出来转换成map形式存储。
public static final String[] REDIS_INSTANCES = {
"p0", "p1", "p2", "p3",
"p4", "p6" };
// This helper function parses the stackoverflow into a Map for us.
public static Map<String, String> transformXmlToMap(String xml) {
Map<String, String> map = new HashMap<String, String>();
try {
String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
.split("\"");
for (int i = 0; i < tokens.length - 1; i += 2) {
String key = tokens[i].trim();
String val = tokens[i + 1];
map.put(key.substring(0, key.length() - 1), val);
}
} catch (StringIndexOutOfBoundsException e) {
System.err.println(xml);
}
return map;
}
然后接下来就是一个最大最小值类:
该代码完全取自课本原文,但接下来的平均值和标准差的计算,课本限于篇幅就没有打出完整代码,只写出了核心的mapper部分与reducer部分。但是仍可根据最大最小值的范例写出模板化的主类部分。
首先可以看到计算最大最小值类中存在三个域值,可以知道该代码处理的是一个用户多个评论中的最早评论时间【最小值】与最晚评论时间【最大值】,并且还包括一个count计算用户总量。
并且值得注意的是其中重载了toString函数,作为输出格式。
其次,MapReduce本身在运行过程中自带排序,无论用户目的是否需要排序,其都会在mapper过程中自动排序,因为排序对大多数处理都是有利的,这也是MapReduce本身的机制。
package mapreduce_2019;
import java.io.DataInput;
import java.io.DataOutput;
import java.io