MapReduce Programming: Maximum, Minimum, Average, Count, Median, and Standard Deviation

Latest recommended article published on 2023-04-06 20:23:01

kuronekonano

Category: MapReduce Programming

Article link: https://blog.csdn.net/kuronekonano/article/details/89284316

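This post shows how to use MapReduce for statistical analysis such as maximum, minimum, average, and count, specifically on Stack Overflow comment data. The transformXmlToMap function parses the XML input, and a Mapper and Reducer perform the computation, including the core logic of the max/min class. MapReduce's built-in sorting is also discussed. The post also mentions how to compute the average and standard deviation, and a naive implementation of the median.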

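The most basic MapReduce example is probably WordCount, and the usual next exercise is computing maximum and minimum values. The textbook used in my class is 《MapReduce编程与设计模式》. Its first chapter introduces WordCount and then moves on to maximum, minimum, average, and standard deviation. The data comes from comments on the Stack Overflow site and includes the comment time, the commenting user's ID, and the comment text. The input is an .xml file, so each record read into the mapper must first be converted from XML into a map of key-value pairs: transformXmlToMap(value.toString());
Below is what the input file looks like. These are a few rows of made-up data: only the comment time and user ID were changed, and the comment text was pasted in as is.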

<row Id="1" PostId="35314" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2018-09-06T08:07:10.730" UserId="1" /> 
<row Id="1" PostId="35315" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2007-09-06T08:05:33.730" UserId="1" />
<row Id="1" PostId="35316" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:07:10.730" UserId="1" /> 
<row Id="1" PostId="35317" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-08-06T08:07:26.730" UserId="1" /> 
<row Id="2" PostId="35318" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-05-06T08:11:10.730" UserId="1" /> 
<row Id="2" PostId="35319" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:12:10.730" UserId="1" /> 
<row Id="2" PostId="35320" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-06-06T08:03:10.730" UserId="1" /> 
<row Id="2" PostId="35321" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:07:10.880" UserId="1" /> 
<row Id="2" PostId="35322" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2016-09-06T08:07:39.730" UserId="1" /> 
<row Id="2" PostId="35323" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-03-06T08:07:10.730" UserId="1" /> 
<row Id="3" PostId="35324" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2007-09-06T08:00:22.730" UserId="1" />

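The textbook does not show the internal implementation of this function, but it is a basic utility that you can implement yourself. Its purpose is to pull the attributes out of the text and store them in a map.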

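import java.util.HashMap;
import java.util.Map;

// The members below belong inside a utility class.

public static final String[] REDIS_INSTANCES = { "p0", "p1", "p2", "p3",
		"p4", "p6" };

// This helper function parses a Stack Overflow XML row into a Map for us.
public static Map<String, String> transformXmlToMap(String xml) {
	Map<String, String> map = new HashMap<String, String>();
	try {
		// Strip the leading "<row " (5 characters) and the trailing " />"
		// (3 characters), then split on double quotes so that attribute
		// names and values alternate in the token array.
		String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
				.split("\"");

		for (int i = 0; i < tokens.length - 1; i += 2) {
			String key = tokens[i].trim();
			String val = tokens[i + 1];

			// Drop the trailing '=' from the attribute name.
			map.put(key.substring(0, key.length() - 1), val);
		}
	} catch (StringIndexOutOfBoundsException e) {
		// A row too short to strip is printed to stderr; the map stays empty.
		System.err.println(xml);
	}

	return map;
}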
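
To make the parsing step concrete, here is a minimal sketch that runs the same split-on-quotes approach against one row. The class name TransformXmlToMapDemo and the shortened Text value are assumptions made for illustration; they are not from the textbook.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only the parsing logic mirrors the helper above.
public class TransformXmlToMapDemo {

    // Strip "<row " and " />", then split on double quotes so that
    // attribute names (with a trailing '=') and values alternate.
    public static Map<String, String> transformXmlToMap(String xml) {
        Map<String, String> map = new HashMap<>();
        String trimmed = xml.trim();
        String[] tokens = trimmed.substring(5, trimmed.length() - 3).split("\"");
        for (int i = 0; i < tokens.length - 1; i += 2) {
            String key = tokens[i].trim();
            map.put(key.substring(0, key.length() - 1), tokens[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        // Sample row with a shortened Text attribute (illustrative only).
        String row = "<row Id=\"3\" PostId=\"35324\" Score=\"39\" Text=\"hello\" "
                + "CreationDate=\"2007-09-06T08:00:22.730\" UserId=\"1\" />";
        Map<String, String> parsed = transformXmlToMap(row);
        System.out.println(parsed.get("UserId"));
        System.out.println(parsed.get("CreationDate"));
    }
}
```

In a mapper, the fields of interest would then be read by attribute name, for example parsed.get("CreationDate") and parsed.get("UserId").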

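Next comes a max/min class:
This code is taken entirely from the textbook. For the average and standard deviation calculations that follow, however, the textbook does not print complete code for reasons of space; it gives only the core mapper and reducer parts. The main (driver) class can still be written as a template by following the max/min example.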
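
As a rough illustration of the arithmetic a reducer for these statistics has to perform, here is a plain-Java sketch with no Hadoop dependency. The class name SummaryStatsSketch, the choice of the sample standard deviation (dividing by n - 1), and the in-memory median are assumptions for illustration, not the textbook's code.

```java
import java.util.Arrays;

// Hypothetical helper class; it holds all values in memory, which is the
// naive approach and only suitable for small groups.
public class SummaryStatsSketch {

    public static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Naive median: sort a copy, then take the middle element
    // (or the average of the two middle elements for an even count).
    public static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Sample standard deviation (assumption); needs at least two values.
    public static double sampleStdDev(double[] values) {
        double m = mean(values);
        double sumOfSquares = 0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 4, 5};
        System.out.println(mean(values));
        System.out.println(median(values));
        System.out.println(sampleStdDev(values));
    }
}
```

One design point worth keeping in mind: an average of partial averages is generally not the overall average, so partial results need to carry a sum and a count if they are to be merged.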

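First, the max/min class has three fields. The code processes the many comments of a single user to find the earliest comment time (the minimum) and the latest comment time (the maximum), and it also keeps a count of that user's comments.
Also note that it overrides the toString function, which defines the output format.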
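
The three-field idea can be sketched in plain Java as follows. This is not the textbook class: it omits the Hadoop Writable interface, uses java.time.LocalDateTime for timestamps, and the names MinMaxCountSketch, MinMaxCount, of, merge, and reduce are all assumptions made for illustration.

```java
import java.time.LocalDateTime;

// Hypothetical stand-in for a min/max/count tuple (no Hadoop dependency).
public class MinMaxCountSketch {

    public static class MinMaxCount {
        public final LocalDateTime min;
        public final LocalDateTime max;
        public final long count;

        public MinMaxCount(LocalDateTime min, LocalDateTime max, long count) {
            this.min = min;
            this.max = max;
            this.count = count;
        }

        // Mapper side: a single comment is both the earliest and the latest.
        public static MinMaxCount of(String creationDate) {
            LocalDateTime t = LocalDateTime.parse(creationDate);
            return new MinMaxCount(t, t, 1);
        }

        // Reducer side: keep the earlier min, the later max, and add counts.
        public MinMaxCount merge(MinMaxCount other) {
            return new MinMaxCount(
                    min.isBefore(other.min) ? min : other.min,
                    max.isAfter(other.max) ? max : other.max,
                    count + other.count);
        }

        // Overriding toString fixes the output format: tab-separated fields.
        @Override
        public String toString() {
            return min + "\t" + max + "\t" + count;
        }
    }

    // Folds all timestamps of one user into one tuple; null if none given.
    public static MinMaxCount reduce(String... creationDates) {
        MinMaxCount result = null;
        for (String date : creationDates) {
            MinMaxCount next = MinMaxCount.of(date);
            result = (result == null) ? next : result.merge(next);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduce(
                "2018-09-06T08:07:10.730",
                "2007-09-06T08:05:33.730",
                "2008-09-06T08:07:10.730"));
    }
}
```

Because merge is associative and commutative, the same operation could in principle serve as a combiner as well as a reducer.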

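Second, MapReduce sorts as part of its normal execution. Whether or not the job needs sorted data, the framework sorts the map output by key before it reaches the reducers, because sorted input benefits most kinds of processing. This is built into MapReduce itself.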
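
The effect of that built-in sort can be imitated in plain Java with a TreeMap. The sketch below is only a simulation of what a reducer observes (keys in sorted order, each with its grouped values); it is not how Hadoop implements the shuffle, and the class name ShuffleSortSketch and the sample keys are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical simulation of grouping and sorting map output by key.
public class ShuffleSortSketch {

    // Each element of mapOutput is a {key, value} pair. Keys are compared
    // as strings, so "10" would sort before "2".
    public static SortedMap<String, List<String>> shuffle(List<String[]> mapOutput) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"3", "2007-09-06T08:00:22.730"});
        mapOutput.add(new String[] {"1", "2018-09-06T08:07:10.730"});
        mapOutput.add(new String[] {"2", "2008-05-06T08:11:10.730"});
        mapOutput.add(new String[] {"1", "2007-09-06T08:05:33.730"});
        // Keys come out in sorted order regardless of emission order.
        shuffle(mapOutput).forEach((key, values) ->
                System.out.println(key + " -> " + values));
    }
}
```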

package mapreduce_2019;
import java.io.DataInput;
import java.io.DataOutput;
import java.io
