Text documents
Three files: index.html, hadoop.html, spark.html
Each file contains some words
index.html : hadoop hadoop hadoop hadoop index bigdata
hadoop.html : hadoop hadoop is nice nice best
spark.html : spark is best best best
Result set
Sort the words; after each word, list the pages it appears in, ordered by occurrence count (most frequent first)
best : spark.html:3;hadoop.html:1;
hadoop : index.html:4;hadoop.html:2;
...
Implementation approach:
- First the input data is split; in the map phase the file name is obtained for use in the key, and the value is set to "1" for counting
- Split each line on spaces, and join each word with its file name to form a key of the form "index.html_hadoop"
- Define a custom combine class that splits that key apart, emitting the word (e.g. hadoop) as the key and a value of the form "index.html:3"
- In the reduce phase, define a TreeMap and implement the Comparator interface so that entries are ordered by their counts; finally concatenate the results into a line of the form "hadoop : index.html:4;hadoop.html:2;"
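The reduce-side sorting step above can be sketched in plain Java, outside the Hadoop API. This is a minimal sketch assuming each value the reducer receives has the form "file:count"; the class and method names here are illustrative, not taken from the original code. Since a TreeMap orders by key, the sketch keys the map by the count itself (descending), which achieves the sort-by-count behavior the notes describe:

```java
import java.util.*;

public class ReduceSortSketch {
    // Given the values a reducer receives for one word (each "file:count"),
    // order the pages by descending count and build the "word : file:count;..." line.
    static String formatEntry(String word, List<String> values) {
        // TreeMap keyed by count in descending order; a count may repeat
        // across files, so each count maps to the list of files with that count.
        TreeMap<Integer, List<String>> byCount =
                new TreeMap<>(Comparator.reverseOrder());
        for (String v : values) {
            int sep = v.lastIndexOf(':');
            String file = v.substring(0, sep);
            int count = Integer.parseInt(v.substring(sep + 1));
            byCount.computeIfAbsent(count, c -> new ArrayList<>()).add(file);
        }
        StringBuilder sb = new StringBuilder(word).append(" : ");
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            for (String file : e.getValue()) {
                sb.append(file).append(':').append(e.getKey()).append(';');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reproduces the "hadoop" line from the result set above
        System.out.println(formatEntry("hadoop",
                Arrays.asList("hadoop.html:2", "index.html:4")));
        // prints: hadoop : index.html:4;hadoop.html:2;
    }
}
```

Inside the real reducer, the same logic would run over the `Iterable<Text>` values for each key before writing the concatenated string to the context.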
SortRev.java
package MR;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.*;
public class SortRev {
    // Custom Mapper class
    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        public Text k = new Text();
        public Text v = new Text("1");

        /**
         * The map function (core business logic of the map phase)
         */
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Get the name of the file this split comes from
            InputSplit is = context.getInputSplit();
            String filename = ((FileSplit) is).getPath().getName();