MapReduce: WordCount

Latest recommended article published on 2024-09-20 15:32:13

弗瑞得姆

Tags: java hadoop mapreduce

Article link: https://blog.csdn.net/aiyin9511/article/details/104320744

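(1) A user-written program has three parts: the Mapper, the Reducer, and the Driver (the client that submits and runs the MR program).
(2) The Mapper's input is a set of key-value (KV) pairs; the KV types can be customized.
(3) The Mapper's output is also a set of KV pairs; the KV types can be customized.
(4) The Mapper's business logic goes in the map() method.
(5) The map() method (running in the MapTask process) is called once for each <K,V> pair.
(6) The Reducer's input types correspond to the Mapper's output types, and are also KV pairs.
(7) The Reducer's business logic goes in the reduce() method.
(8) The ReduceTask process calls reduce() once for each group of <k,v> pairs that share the same key k.
(9) User-defined Mapper and Reducer classes must each extend the corresponding parent class.
(10) The whole program needs a Driver to submit it; what gets submitted is a Job object that describes all the necessary information.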
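
The calling pattern in points (5) and (8) can be sketched without Hadoop at all. The following is a minimal plain-Java simulation, not Hadoop code: the class `WordCountFlowSketch` and its methods are hypothetical names invented for illustration, and the in-memory grouping step only stands in for what the framework's shuffle does between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the MapReduce flow; no Hadoop classes are used.
class WordCountFlowSketch {

    // "Map" phase: called once per input record, emits <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // "Reduce" phase: called once per key, with all values emitted for that key.
    static int reduce(String key, List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    // "Driver": wires the phases together; grouping by key imitates the shuffle.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            // Simplification: assumes one byte per character and '\n' line endings.
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(List.of("hello world", "hello hadoop")));
    }
}
```

Here `map` runs twice (once per line) and `reduce` runs three times (once per distinct word), which mirrors the per-record and per-group calls described in the list.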

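Define a Mapper class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The four generic type parameters must be specified first:
// KEYIN: LongWritable, VALUEIN: Text
// KEYOUT: Text, VALUEOUT: IntWritable
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	// Lifecycle of map(): the framework calls it once for each line of input
	// key:   byte offset of the start of this line within the file
	// value: the content of this line
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Convert the line to a String
		String line = value.toString();
		// Split the line into words on runs of whitespace
		String[] words = line.split("\\s+");
		// Iterate over the array and emit <word, 1>
		for (String word : words) {
			// Skip the empty token produced by leading whitespace or a blank line
			if (word.isEmpty()) {
				continue;
			}
			context.write(new Text(word), new IntWritable(1));
		}
	}
}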
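
The delimiter used in the word-splitting step affects the counts. As a hedged illustration in plain Java (the class `TokenizeSketch` and its method names are hypothetical, not part of Hadoop), the sketch below compares splitting on exactly one space with splitting on runs of whitespace and discarding empty tokens. With the single-space variant, two consecutive spaces produce an empty string that would be emitted and counted as a word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizeSketch {

    // Splits on exactly one space; repeated spaces yield empty tokens,
    // and tabs are not treated as separators.
    static List<String> splitOnSingleSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Splits on runs of whitespace and drops empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello  world\thadoop";
        // Three tokens: "hello", "" and "world\thadoop"
        System.out.println(splitOnSingleSpace(line).size());
        // prints [hello, world, hadoop]
        System.out.println(splitOnWhitespace(line));
    }
}
```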

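Define a Reducer class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	// Lifecycle: the framework calls reduce() once for each KV group it passes in
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Define a counter
		int count = 0;
		// Iterate over all values in this KV group and add them to count
		for (IntWritable value : values) {
			count += value.get();
		}
		// Emit <word, total count>
		context.write(key, new IntWritable(count));
	}
}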

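Define a main class that describes the job and submits it


```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountRunner {
	// Describe the business-logic information (which class is the mapper, which is the
	// reducer, where the input data is, where the output goes, ...) as a Job object,
	// then submit that Job to the cluster to run
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job wcjob = Job.getInstance(conf);
		// Specify the jar that contains this job
		// wcjob.setJar("/home/hadoop/wordcount.jar");
		wcjob.setJarByClass(WordCountRunner.class);
		wcjob.setMapperClass(WordCountMapper.class);
		wcjob.setReducerClass(WordCountReducer.class);
		// Set the output key and value types of our Mapper class
		wcjob.setMapOutputKeyClass(Text.class);
		wcjob.setMapOutputValueClass(IntWritable.class);
		// Set the output key and value types of our Reducer class
		wcjob.setOutputKeyClass(Text.class);
		wcjob.setOutputValueClass(IntWritable.class);
		// Specify where the input data is located
		FileInputFormat.setInputPaths(wcjob, "hdfs://hdp-server01:9000/wordcount/data/big.txt");
		// Specify where the results are saved once processing completes
		FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://hdp-server01:9000/wordcount/output/"));
		// Submit the job to the YARN cluster and wait for it to finish
		boolean res = wcjob.waitForCompletion(true);
		System.exit(res ? 0 : 1);
	}
}
```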

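MapReduce program run modes

Local run mode

(1) The MapReduce program is submitted to LocalJobRunner and runs locally as a single process.
(2) The input data and the output results can live on the local file system or on HDFS.
(3) How do you run locally? Write the program without bringing in the cluster's configuration files.
In essence, what matters is whether the program's conf has mapreduce.framework.name=local, together with the
yarn.resourcemanager.hostname parameter.
(4) Local mode makes it very convenient to debug business logic: just set breakpoints in Eclipse.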
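
The decision described in point (3) can be sketched in plain Java. This is only an illustration under stated assumptions: `RunModeSketch` and `resolveRunMode` are hypothetical names, a `Map` stands in for Hadoop's `Configuration`, and the ResourceManager host value is assumed for the example. The one Hadoop fact relied on is that the shipped default for `mapreduce.framework.name` is `local`, so a program that loads no cluster configuration runs locally.

```java
import java.util.HashMap;
import java.util.Map;

class RunModeSketch {

    // Hypothetical helper: returns the effective framework name for a set of properties.
    // Falls back to "local", which is Hadoop's default for mapreduce.framework.name.
    static String resolveRunMode(Map<String, String> conf) {
        return conf.getOrDefault("mapreduce.framework.name", "local");
    }

    public static void main(String[] args) {
        // No cluster configuration files loaded: nothing overrides the default.
        Map<String, String> noClusterConf = new HashMap<>();

        // Cluster configuration present; the host name here is an assumed example value.
        Map<String, String> clusterConf = new HashMap<>();
        clusterConf.put("mapreduce.framework.name", "yarn");
        clusterConf.put("yarn.resourcemanager.hostname", "hdp-server01");

        System.out.println(resolveRunMode(noClusterConf)); // prints local
        System.out.println(resolveRunMode(clusterConf));   // prints yarn
    }
}
```

In a real driver the same property could be set explicitly on the `Configuration` object with `conf.set("mapreduce.framework.name", "local")` before creating the Job, which forces local mode even if cluster configuration files are on the classpath.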

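Cluster run mode

(1) The MapReduce program is submitted to the YARN cluster and distributed to many nodes for concurrent execution.
(2) The input data and the output results should be on HDFS.
(3) Steps to submit to the cluster: package the program as a JAR, then launch it with the hadoop command on any node of the cluster:
hadoop jar wordcount.jar cn.itcast.bigdata.mrsimple.WordCountDriver args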