Study Notes: Hadoop MapReduce Word Count

Author: 不要跟我说对不起

Category: hadoop. Tags: mapreduce, big data, hadoop

Article link: https://blog.csdn.net/u012365780/article/details/105853479

1. Hadoop MapReduce Word Count: Mapper

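Summary (from the Hadoop Javadoc): Maps input key/value pairs to a set of intermediate key/value pairs.

In other words, a Mapper transforms each input key/value pair into a set of intermediate key/value pairs.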

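Meaning of the Mapper's generic type parameters
- KEYIN: the key type of the data read by the map task. It is the offset at which each line starts in the file. The type is LongWritable, Hadoop's serializable counterpart of Java's Long, not Long itself.
- VALUEIN: the value type of the data read by the map task, i.e. one line of text at a time. The type is Text, Hadoop's counterpart of Java's String, not String itself.
- KEYOUT: the key type emitted by your map method. For word count this is Text (note: not String).
- VALUEOUT: the value type emitted by your map method. For word count this is IntWritable (note: not Integer).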
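
To make the KEYIN/VALUEIN pair more concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that imitates what a line-oriented input format hands to the mapper: the starting byte offset of each line as the key and the line text as the value. The class `LineOffsetDemo` and its method `readRecords` are hypothetical names used only for illustration, and the sketch assumes UTF-8 text with `\n` line endings; Hadoop's real input handling also deals with input splits and other line terminators.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsetDemo {

    // Returns (byte offset of the line start -> line text), in file order.
    // Hypothetical helper: it only imitates the (offset, line) records a mapper receives.
    public static Map<Long, String> readRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance past the bytes of this line plus the one-byte '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "hello\tworld\nhello\twelcome\n";
        // Print each record the way the mapper would see it: key = offset, value = line
        readRecords(content).forEach((offset, line) ->
                System.out.println("key=" + offset + " value=" + line));
    }
}
```

For the two-line sample in `main`, the first line starts at offset 0 and the second at offset 12, because the first line occupies 11 bytes plus one newline byte.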

Custom word-count Mapper: WordCountMapper

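```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @ClassName WordCountMapper
 * @Description Word-count mapper
 * @Author eastern
 * @Date 2020/4/29 2:15 PM
 * @Version 1.0
 *
 * For word count the output is conceptually (word, 1): KEYOUT plays the role
 * of String and VALUEOUT the role of Integer. Hadoop uses its own types:
 * LongWritable corresponds to Long
 * Text corresponds to String
 * IntWritable corresponds to Integer
 **/
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Split the line into words; the input is assumed to be tab-separated
		String[] words = value.toString().split("\t");
		// Iterate over the words
		for (String word : words) {
			// Emit (word, 1) to the context
			context.write(new Text(word), new IntWritable(1));
		}
	}
}
```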

2. Hadoop MapReduce Word Count: Reducer

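Summary (from the Hadoop Javadoc): Reduces a set of intermediate values which share a key to a smaller set of values.

In other words, a Reducer takes the intermediate values that share a key and merges them into a smaller set of values.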

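For example:

 # (word, 1) pairs produced from the words read from the file
 (hello,1) (world,1)
 (hello,1) (world,1)
 (hello,1) (world,1)
 (welcome,1)
 # On the way to the reduce side, map output is distributed by key, so all records with the same key are processed by the same reduce
 reduce1:	(hello,1) (hello,1) (hello,1) ===> (hello, <1,1,1>)
 reduce2:	(world,1) (world,1) (world,1) ===> (world, <1,1,1>)
 reduce3:	(welcome,1) ===> (welcome, <1>)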
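
The grouping shown above can be imitated in plain Java without any Hadoop dependency. The sketch below is only an illustration under simplifying assumptions: everything runs in one process, and the class `ShuffleDemo` with its methods `shuffle` and `reduce` are hypothetical names, not Hadoop APIs. It groups the `1` values by word, as the shuffle does, and then sums each group, as a word-count reduce step does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {

    // Imitates the shuffle: every word stands for one (word, 1) pair from the map phase.
    // The values are grouped by key; a TreeMap keeps the keys sorted.
    public static Map<String, List<Integer>> shuffle(List<String> mapOutputKeys) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapOutputKeys) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Imitates the reduce step for one key: sum all values of the group.
    public static int reduce(List<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "hello", "world", "hello", "world", "hello", "world", "welcome");
        // For each key print the grouped values and the reduced count
        shuffle(words).forEach((word, ones) ->
                System.out.println(word + " " + ones + " => " + reduce(ones)));
    }
}
```

With the sample input, `hello` and `world` each reduce to 3 and `welcome` reduces to 1.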


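Meaning of the Reducer's generic type parameters
- KEYIN: the key type of the Map output (it must match the Mapper's KEYOUT).
- VALUEIN: the value type of the Map output (it must match the Mapper's VALUEOUT).
- KEYOUT: the key type emitted by your reduce method. For word count this is Text (note: not String).
- VALUEOUT: the value type emitted by your reduce method. For word count this is IntWritable (note: not Integer).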

Custom word-count Reducer: WordCountReducer

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @ClassName WordCountReducer
 * @Description Word-count Reducer
 * @Author eastern
 * @Date 2020/4/29 3:18 PM
 * @Version 1.0
 **/
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
			InterruptedException {
		int count = 0;
		// values holds every count emitted for this key, e.g. <1,1,1>
		for (IntWritable value : values) {
			count += value.get();
		}
		// Emit (word, total count)
		context.write(key, new IntWritable(count));
	}
}
```

3. Hadoop MapReduce Word Count: Driver

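```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @ClassName WordCountApp
 * @Description Driver: configures the Mapper and Reducer settings and submits the job from the local machine
 * @Author eastern
 * @Date 2020/4/29 4:35 PM
 * @Version 1.0
 **/
public class WordCountApp {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		// Access HDFS as user "root"
		System.setProperty("HADOOP_USER_NAME", "root");
		// Configuration for connecting to HDFS
		Configuration configuration = new Configuration();
		configuration.set("fs.defaultFS", "hdfs://139.129.240.xxx:8020");
		// Connect to DataNodes by hostname rather than by IP address
		configuration.set("dfs.client.use.datanode.hostname", "true");
		configuration.set("dfs.replication", "1");

		// Create a job
		Job job = Job.getInstance(configuration);

		// Job parameter: the main class
		job.setJarByClass(WordCountApp.class);

		// Job parameters: the custom Mapper and Reducer classes
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		// Job parameters: key and value types of the Mapper output
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// Job parameters: key and value types of the Reducer output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Job parameters: input and output paths
		FileInputFormat.setInputPaths(job, new Path("/hdfsapi/test/second/words.txt"));
		FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

		// Submit the job, wait for it to finish, and report the result in the exit code
		boolean success = job.waitForCompletion(true);
		System.exit(success ? 0 : 1);
	}
}
```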

4. Hadoop MapReduce Word Count: Local Testing

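To test locally, remove the HDFS connection configuration and set the job's input and output paths to paths on the local file system. The output directory must not already exist, otherwise the job fails when it is submitted.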

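```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountLocalFileApp {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		// Create a job with the default configuration (no HDFS settings)
		Job job = Job.getInstance();

		// Job parameter: the main class
		job.setJarByClass(WordCountLocalFileApp.class);

		// Job parameters: the custom Mapper and Reducer classes
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		// Job parameters: key and value types of the Mapper output
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// Job parameters: key and value types of the Reducer output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Job parameters: input and output paths on the local file system
		FileInputFormat.setInputPaths(job, new Path("/Users/xxxx/IdeaProjects/bigdata/hadoop-mapreduce/src/main/resources/words.txt"));
		FileOutputFormat.setOutputPath(job, new Path("/Users/xxxx/IdeaProjects/bigdata/hadoop-mapreduce/src/main/resources/output"));

		// Submit the job, wait for it to finish, and report the result in the exit code
		boolean success = job.waitForCompletion(true);
		System.exit(success ? 0 : 1);
	}
}
```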

5. Hadoop MapReduce Word Count: Combiner


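A combiner is an aggregation step that runs on the map side, before the map output is sent to the reducers.

Advantages and limitations of a combiner
- It reduces I/O, because less intermediate data has to be transferred to the reducers, and so improves execution efficiency.
- It is not suitable for computations that involve division, such as an average, because the average of partial averages is generally not the overall average.

Adapting the example: sum the output of each map task first, then send the partial sums to the reducer.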
```
// Set the combiner; the Reducer class can be reused because summing counts is associative and commutative
job.setCombinerClass(WordCountReducer.class);
```
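
The division limitation mentioned above can be checked with a few numbers. The following plain-Java sketch (no Hadoop dependency; the class `CombinerDemo`, its methods, and the sample values are hypothetical and chosen only for illustration) treats two lists as the outputs of two map tasks. Combining partial sums gives the same total as summing everything at once, while averaging the partial averages gives a different number from the true average.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerDemo {

    // Sum of all values; safe to apply to partial results and then again to the partial sums.
    public static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Arithmetic mean; NOT safe to apply to partial results and then again to the partial means.
    public static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Pretend these are the values seen by two different map tasks
        List<Double> split1 = Arrays.asList(1.0, 2.0, 3.0);
        List<Double> split2 = Arrays.asList(10.0);
        List<Double> all = new ArrayList<>(split1);
        all.addAll(split2);

        // Sum: combining per-split sums matches the sum over all values
        System.out.println("sum of partial sums   = " + sum(Arrays.asList(sum(split1), sum(split2))));
        System.out.println("sum of all values     = " + sum(all));

        // Mean: combining per-split means does not match the mean over all values
        System.out.println("mean of partial means = " + mean(Arrays.asList(mean(split1), mean(split2))));
        System.out.println("mean of all values    = " + mean(all));
    }
}
```

Here both sums are 16.0, but the mean of the partial means is 6.0 while the true mean is 4.0. A common workaround, not covered in this article, is to have the map side emit a sum together with a count and to divide only in the reducer.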
