MapReduce Execution Steps
Map task processing
1.1 Read the input file and parse it into key/value pairs: each line of the input becomes one pair, and the map function is called once per pair.
1.2 In the map function, apply your own logic to the input key/value and emit new key/value pairs.
1.3 Partition the emitted key/value pairs.
1.4 Within each partition, sort the data by key and group it, collecting all values for the same key into one collection.
1.5 (Optional) Locally reduce (combine) the grouped data; see the combiner sketch right after this list.
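In Hadoop, this optional local reduction of step 1.5 is the Combiner. For WordCount the values are summed, and summing is associative, so the reducer class shown later in this article (WordCReducer) can double as the combiner; a sketch of the single optional driver line, not part of the base example:

job.setCombinerClass(WordCReducer.class);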
Reduce task processing
2.1 Copy the outputs of the map tasks over the network to the reduce nodes, one partition per reduce task (which partition a key belongs to is decided by the partitioner, sketched below).
2.2 Merge and sort the outputs of the map tasks, then apply your own reduce logic to each key and its grouped values, emitting new key/value pairs.
2.3 Write the reduce output to a file.
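Step 1.3 decides which reduce task receives each key. By default Hadoop uses HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Below is a hypothetical custom partitioner, just a sketch for illustration (the class AlphaPartitioner and its a-m split are our own, not part of WordCount), routing words to one of two partitions by first letter:

package hadoop.mr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphaPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // with a single partition there is nothing to choose
        if (numPartitions < 2 || key.getLength() == 0) {
            return 0;
        }
        char c = Character.toLowerCase(key.toString().charAt(0));
        // words starting with a-m go to partition 0, all others to partition 1
        return (c >= 'a' && c <= 'm') ? 0 : 1;
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphaPartitioner.class), and only takes effect together with more than one reduce task, since the default is a single reducer.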
WordCount Example
Map
package hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 */
public class WordCMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    /**
     * Called once per input record.
     * @param k1 byte offset of this line within the input file
     * @param v1 the content of one line
     */
    @Override
    protected void map(LongWritable k1, Text v1, Context context)
            throws IOException, InterruptedException {
        // one line of input per call
        String line = v1.toString();
        // split the line into words on single spaces
        String[] words = line.split(" ");
        // emit (word, 1) for every word on the line
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
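The mapper above allocates a new Text and LongWritable for every output record. That is correct, but a common Hadoop idiom reuses writable instances across calls, since context.write() serializes their contents immediately. A sketch of that optional variant (the class name WordCMapperReuse is ours):

package hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCMapperReuse
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    // reused across map() calls; safe because write() copies the bytes out
    private final Text word = new Text();
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable k1, Text v1, Context context)
            throws IOException, InterruptedException {
        for (String w : v1.toString().split(" ")) {
            word.set(w);
            context.write(word, one);
        }
    }
}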
Reduce
package hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 */
public class WordCReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    /**
     * Called once per key, with all values grouped for that key.
     * @param k2  one word
     * @param v2s all counts emitted for that word
     */
    @Override
    protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context)
            throws IOException, InterruptedException {
        // sum the counts for this word
        long counter = 0;
        for (LongWritable l : v2s) {
            counter += l.get();
        }
        // emit (word, total)
        context.write(k2, new LongWritable(counter));
    }
}
Submit Job
package hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * 1. Analyze the business logic and decide the input/output data types.
 *
 * 2. Define a class extending Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and override map().
 *
 * 3. Define a class extending Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and override reduce().
 *
 * 4. Wire the custom mapper and reducer together through a Job object.
 */
public class WordCount {

    public static void main(String[] args) throws Exception {
        new WordCount().init(args);
    }

    public void init(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // required so the cluster can locate the jar containing this class
        job.setJarByClass(WordCount.class);

        // set the mapper's properties
        job.setMapperClass(WordCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // set the reducer's properties
        job.setReducerClass(WordCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // submit and block until the job finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
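Two optional Job settings connect the driver back to the earlier steps. A sketch of possible additions (not part of the example above), assuming the hypothetical AlphaPartitioner from the sketch after the step list:

// step 1.3: two reduce tasks means two output partitions
job.setNumReduceTasks(2);
job.setPartitionerClass(AlphaPartitioner.class);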
Running
Export the JAR: select the project -> Export -> JAR file (a Main Class may be specified during export).
Copy: place the exported jar anywhere on the Linux machine (here, under /home/runJar).
Run: hadoop jar /home/runJar/wc.jar /test.c /run
(/test.c is the HDFS file the map phase reads; /run is the HDFS location where the reduce output is stored.)
If no Main Class was specified when exporting the JAR, pass its fully qualified name at run time:
hadoop jar /home/runJar/wc.jar hadoop.mr.WordCount /test.c /run
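Inspect the result (assuming a single reduce task; with more reducers there is one part-r-NNNNN file per partition):
hadoop fs -cat /run/part-r-00000
Note: the output directory (/run here) must not already exist, or FileOutputFormat will reject the job at submission.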