[Big Data] A Beginner's Hadoop (Before YARN) Study Notes (1): WordCount

Felix_Ou

Category: Big Data

Article link: https://blog.csdn.net/Felix_Ou/article/details/53931790


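The configuration notes will be added later; I am skipping them for now. I spent quite a lot of time on configuration, but carelessly forgot to write it down, which I really regret.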

1. Create a new project

OK. Finish.

2. Project Structure and Coding

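As you can see, the MapReduce project automatically gathers the JARs and other required resource files in one place.

The project includes a sample WordCount.

It comes from the Hadoop source examples. Versions may differ, but the implementation principle is the same.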


package com.wordCount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

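public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    // TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>.
    // LongWritable, IntWritable and Text wrap plain data types so that they can be serialized and exchanged in a distributed environment; think of them as long, int and String.
    // In MapReduce, a Mapper reads data from one input split, and the Shuffle and Sort phase then distributes its output to the Reducers.
    // The map function takes a <key,value> pair as input and emits intermediate <key,value> pairs. The reduce function takes input of the form <key,(list of values)>, processes that list of values, and typically emits zero or one output pair per call; reduce output is also in <key,value> form.
    // In the old API (org.apache.hadoop.mapred), the Map class extends MapReduceBase and implements the Mapper interface. In the new API used here (org.apache.hadoop.mapreduce), Mapper is a class that you extend. Either way it takes four type parameters: the map input key type, input value type, output key type and output value type. This example uses TextInputFormat, which produces LongWritable keys and Text values, so the map input is <LongWritable,Text>; declaring the key as Object still accepts it. The map must emit pairs of the form <word,1>, so the output key type is Text and the output value type is IntWritable.
    // In short, choose the types to fit the job:
    // Object      input key type
    // Text        input value type
    // Text        output key type
    // IntWritable output value type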
    // Reusable Writable instances: the constant count 1 and the current word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // emit <word, 1>
      }
    }
  }
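  // The input <key,value> pair arrives as the key and value arguments (the framework reads it through the Context), and, as the map method shows, the output pairs are also written through the Context. Earlier examples based on the old API used void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) instead.
  // public StringTokenizer(String str, String delim, boolean returnDelims)
  // The first argument is the String to split, the second is the set of delimiter characters, and the third says whether the delimiters themselves are returned as tokens. If no delimiters are given, the default set is " \t\n\r\f" (space, tab, newline, carriage return, form feed).
  // boolean hasMoreTokens(): returns whether any more tokens are available (tokens, not delimiters).
  // String nextToken(): returns the next token, that is, the text from the current position up to the next delimiter.
  // context.write(key, value);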
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reusable IntWritable that holds the result.
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
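  // IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>.
  // The type parameters read the same way: input Text/IntWritable, output Text/IntWritable.
  // reduce is overridden to produce the result: it loops over all the values emitted by map and aggregates the word ==> one key/value pairs.
  // After the shuffle, each reduce call receives <key,values>, where key is a single word and values is the list of counts for that word. The map output is the reduce input, so reduce only needs to iterate over values and sum them to get the total count for the word.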
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    // Hadoop 1.x constructor; Hadoop 2.x deprecates it in favor of Job.getInstance(conf, "word count").
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class); // specify the main class (used to locate the job JAR)
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as the combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class); // output key class
    job.setOutputValueClass(IntWritable.class); // output value class
    // The calls above register the classes the job uses.
    FileInputFormat.addInputPath(job, new Path(otherArgs[0])); // input path
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
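// The main method above runs the job. It starts with Configuration conf = new Configuration(); which initializes the configuration object.
// GenericOptionsParser is the basic class for parsing command-line arguments in the Hadoop framework. It recognizes several standard command-line arguments, so an application can easily specify the namenode, the jobtracker and additional configuration resources.
// getRemainingArgs returns the arguments left over once the generic options have been parsed; main then prints the usage message and exits unless exactly 2 of them remain.
// public String[] getRemainingArgs()
//   Returns an array of Strings containing only application-specific arguments.
//   Returns:
//   array of Strings containing the un-parsed arguments or empty array if commandLine was not defined.
// The job's input and output paths are taken from the command-line arguments.
// waitForCompletion submits the job and waits for it to finish.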
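
The StringTokenizer behavior described in the comments above can be tried without a Hadoop installation. The following is a minimal sketch using only the JDK; the class name TokenizerDemo and the sample strings are made up for illustration and are not part of the Hadoop example. It shows that the default delimiter set splits on whitespace only, so punctuation stays attached to a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, for illustration only.
final class TokenizerDemo {

    // Splits a line the way TokenizerMapper.map does: default delimiters " \t\n\r\f".
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    // Same loop with an explicit delimiter set and the returnDelims flag.
    static List<String> tokenize(String line, String delims, boolean returnDelims) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line, delims, returnDelims);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Runs of whitespace are skipped; no empty tokens are produced.
        System.out.println(tokenize("Hello World\tBye  World")); // [Hello, World, Bye, World]
        // The comma is not a default delimiter, so it stays attached to "Hello".
        System.out.println(tokenize("Hello, World"));            // [Hello,, World]
        // With returnDelims = true, each delimiter comes back as its own token.
        System.out.println(tokenize("a,b", ",", true));          // [a, ,, b]
    }
}
```

Because the mapper uses the default delimiters, "World" and "World," would be counted as two different words; a real job would probably want to normalize punctuation and case first.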

3. Illustrated Walkthrough

The following is excerpted from Xiapi Studio (虾皮工作室):

http://www.cnblogs.com/xia520pi/archive/2012/05/16/2504205.html

Many thanks to the author for the explanation.

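1) The input files are divided into splits. Because the test files are small, each file forms a single split, and each split is broken into lines to form <key,value> pairs, as shown in Figure 4-1. The MapReduce framework does this automatically. The key is the offset of the line, and it counts the characters taken up by the line ending (so it differs between Windows and Linux).

Figure 4-1: Splitting process

2) The resulting <key,value> pairs are passed to the user-defined map method, which produces new <key,value> pairs, as shown in Figure 4-2.

Figure 4-2: Running the map method

3) Once the map method has emitted its <key,value> pairs, the Mapper sorts them by key and runs the Combine step, which sums the values that share the same key to give the Mapper's final output, as shown in Figure 4-3.

Figure 4-3: Map-side sort and Combine

4) The Reducer first sorts the data it receives from the Mappers, then hands it to the user-defined reduce method, which produces new <key,value> pairs that become the output of WordCount, as shown in Figure 4-4.

Figure 4-4: Reduce-side sort and output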
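
The four steps above can be imitated in plain Java to see the intermediate data at each stage. This is only a sketch under stated assumptions, not how Hadoop implements them: the class WordCountSimulation, its method names and the two sample inputs are hypothetical, each string stands for one small file (one split), lines end with "\n", and the text is ASCII so one character is one byte.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the WordCount data flow; no Hadoop classes involved.
final class WordCountSimulation {

    // Step 1: turn a split into <offset, line> pairs; the offset counts the "\n" of earlier lines.
    static Map<Long, String> toRecords(String split) {
        Map<Long, String> records = new LinkedHashMap<Long, String>();
        long offset = 0;
        for (String line : split.split("\n")) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the newline character
        }
        return records;
    }

    // Step 2: the map method emits <word, 1> for every token of every line.
    static List<Map.Entry<String, Integer>> map(Map<Long, String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : records.values()) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
            }
        }
        return pairs;
    }

    // Step 3: sort by key (TreeMap) and combine, summing the values that share a key.
    static TreeMap<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> combined = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Step 4: merge the sorted output of every mapper and sum the values per key.
    static TreeMap<String, Integer> reduce(List<TreeMap<String, Integer>> mapperOutputs) {
        TreeMap<String, Integer> result = new TreeMap<String, Integer>();
        for (TreeMap<String, Integer> output : mapperOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                result.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return result;
    }

    // Runs all four steps, treating each argument as one split handled by one mapper.
    static TreeMap<String, Integer> run(String... splits) {
        List<TreeMap<String, Integer>> mapperOutputs = new ArrayList<TreeMap<String, Integer>>();
        for (String split : splits) {
            mapperOutputs.add(combine(map(toRecords(split))));
        }
        return reduce(mapperOutputs);
    }

    public static void main(String[] args) {
        String file1 = "Hello World\nBye World";   // sample input, made up for this sketch
        String file2 = "Hello Hadoop\nBye Hadoop"; // sample input, made up for this sketch
        System.out.println(toRecords(file1));               // {0=Hello World, 12=Bye World}
        System.out.println(combine(map(toRecords(file1)))); // {Bye=1, Hello=1, World=2}
        System.out.println(run(file1, file2));              // {Bye=2, Hadoop=2, Hello=2, World=2}
    }
}
```

Steps 3 and 4 perform the same summation, which is why the WordCount job can register IntSumReducer as both its combiner and its reducer.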

4. Run Results

Open Run Configurations.

Enter the Arguments.

Run on Hadoop

This shows that the job ran successfully.
