MapReduce Job
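- Each MapReduce task is initialized as a Job
- Each Job has two phases, Map and Reduce, implemented by the map function and the reduce function respectively
Key-value pairs are passed between the two phases.
MapReduce workflow: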
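The flow of key-value pairs through the two phases can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class MapReduceFlowSketch and its methods are hypothetical names, the input is split on whitespace, and a TreeMap stands in for the framework's shuffle and sort step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a MapReduce job; not a Hadoop API.
final class MapReduceFlowSketch {

    // Map phase: one input pair (offset, line) -> zero or more (word, 1) pairs.
    // The offset key is unused here, as in a typical word count.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group all values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: one key plus all of its values -> one total for that key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every line, then shuffle, then reduce once per key.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            intermediate.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(intermediate).forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

In a real job the map calls run in parallel on different input splits and the shuffle moves data between machines; the sketch keeps only the shape of the data at each step.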
Mapper
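A mapper extends
org.apache.hadoop.mapreduce.Mapper
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Key classes must implement WritableComparable; value classes must implement Writable.
void setup(Context context)
Runs once before the map task processes any records; use it to open a database connection or preprocess data.
void cleanup(Context context)
Runs once as the last step of the map task (close() in the old org.apache.hadoop.mapred API); use it for all teardown work, such as closing database connections and files.
void map(KEYIN key, VALUEIN value, Context context)
Performs the map operation on one input pair (key1, value1).
void run(Context context)
Drives the whole task (setup, map for each record, cleanup); override it for more complex control, such as multithreaded mapping.
The map method
To write a MapReduce program, you mainly need to override the default map method with your own implementation.
Each call to map processes a single key-value pair.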
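The order in which these hooks are called can be sketched as follows. This is a simplified plain-Java stand-in, not Hadoop's Mapper: SketchMapper and UpperCaseMapper are hypothetical names, and a List replaces the Context. It only mirrors the shape of run(): setup once, map once per record, then the end-of-task hook (cleanup in the org.apache.hadoop.mapreduce API) once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MapperLifecycleSketch {

    // Hypothetical stand-in for a mapper base class; records every call in a log.
    static class SketchMapper {
        final List<String> log = new ArrayList<>();

        void setup() {
            log.add("setup");
        }

        // Default behaviour: pass the record through unchanged.
        void map(long key, String value) {
            log.add(value);
        }

        void cleanup() {
            log.add("cleanup");
        }

        // Template method: setup once, map per record, cleanup once (even on failure).
        void run(List<String> records) {
            setup();
            try {
                long key = 0;
                for (String record : records) {
                    map(key++, record);
                }
            } finally {
                cleanup();
            }
        }
    }

    // A typical job overrides only map() and inherits the rest.
    static class UpperCaseMapper extends SketchMapper {
        @Override
        void map(long key, String value) {
            log.add(value.toUpperCase());
        }
    }

    public static void main(String[] args) {
        SketchMapper mapper = new UpperCaseMapper();
        mapper.run(Arrays.asList("foo", "bar"));
        System.out.println(mapper.log); // [setup, FOO, BAR, cleanup]
    }
}
```

Because run() owns the loop, overriding it is the place to change how records are dispatched, while overriding map() changes only what happens to each record.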
Reducer
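A reducer task receives the output of all mappers. The framework merges the values that share a key (shuffle) and sorts the data by key (sort), then calls the reduce function once per key; reduce iterates over the values associated with that key and produces a list of <K3, V3> pairs.
A reducer extends org.apache.hadoop.mapreduce.Reducer and overrides
protected void reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException, InterruptedException
For comparison, the corresponding Mapper method is
protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException
- Processes one given key-value pair (K1, V1) and produces a list of key-value pairs (K2, V2).
- context.write(key, value): emits the result of the map computation
- Context also lets the Mapper record additional information and report task progress.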
Example: #WordCount.java#
package ex6;
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Teardown hook: called once after the last map() call.
        // Release resources here, e.g. close database connections or files.
        @Override
        protected void cleanup(Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // TODO auto-generated method stub
            super.cleanup(context);
        }

        // run() drives the whole task (setup, map per record, cleanup).
        // Override it for advanced control such as multithreaded mapping.
        @Override
        public void run(Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // TODO auto-generated method stub
            super.run(context);
        }

        // Setup hook: called once before the first map() call; do configuration here.
        @Override
        protected void setup(Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // TODO auto-generated method stub
            super.setup(context);
        }

        // Splits one input line into tokens and emits (token, 1) for each token.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), "\t\n\r\f,.:;?![]' ");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Sums all counts received for one word and emits (word, total).
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Every argument except the last is an input path; the last is the output path
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
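The delimiter string passed to StringTokenizer in TokenizerMapper.map decides what counts as a word, so it is worth trying on a sample line outside Hadoop. The sketch below reuses the same delimiter set; the class name TokenizerSketch and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenizerSketch {

    // Same delimiter set as in TokenizerMapper.map.
    static final String DELIMITERS = "\t\n\r\f,.:;?![]' ";

    // Returns the tokens that the mapper would emit as keys for this line.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Hello, world, Isn, t, it, hello-world]
        System.out.println(tokens("Hello, world! Isn't it hello-world?"));
    }
}
```

Two consequences are visible here: the apostrophe is a delimiter, so "Isn't" becomes two tokens, and tokens are case-sensitive and keep hyphens, so "Hello" and "hello-world" are counted as unrelated words.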
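Predefined classes:
Mapper implementations predefined by Hadoop
Mapper (the base class itself; IdentityMapper<K,V> in MR1)
Implements Mapper<K,V,K,V>; maps the input directly to the output
InverseMapper<K,V>
Implements Mapper<K,V,V,K>; inverts each key-value pair by swapping key and value
RegexMapper
Implements Mapper<K, Text, Text, LongWritable>; emits a (match, 1) pair for each match of a regular expression.
TokenCounterMapper (TokenCountMapper in MR1)
Implements Mapper<Object, Text, Text, IntWritable>; splits the input value into tokens and emits a (token, 1) pair for each token. The MR1 TokenCountMapper implements Mapper<K, Text, Text, LongWritable> instead.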
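What the identity, inverse, and regex mappers do to a single record can be sketched in plain Java. These are hypothetical stand-ins that return pairs instead of writing to a Context; they are not the Hadoop classes, and the class name PredefinedMapperSketch is made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PredefinedMapperSketch {

    // Identity mapping: (k, v) -> (k, v)
    static <K, V> Map.Entry<K, V> identity(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Inverse mapping: (k, v) -> (v, k)
    static <K, V> Map.Entry<V, K> inverse(K key, V value) {
        return new SimpleEntry<>(value, key);
    }

    // Regex mapping: emit (match, 1) for every match of the pattern in the value.
    static List<Map.Entry<String, Long>> regex(Pattern pattern, String value) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher m = pattern.matcher(value);
        while (m.find()) {
            out.add(new SimpleEntry<>(m.group(), 1L));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(identity(0L, "hello"));                         // 0=hello
        System.out.println(inverse(0L, "hello"));                          // hello=0
        System.out.println(regex(Pattern.compile("[a-z]+"), "ab 12 cd")); // [ab=1, cd=1]
    }
}
```

With the default text input, the key is the byte offset of the line and the value is the line itself, which is why inverting the pair produces a Text key and a LongWritable value.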
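Reducer implementations predefined by Hadoop:
Reducer (the base class itself; IdentityReducer<K, V> in MR1)
Implements Reducer<K, V, K, V>; maps the input directly to the output
IntSumReducer, LongSumReducer
IntSumReducer implements Reducer<K, IntWritable, K, IntWritable>; computes the sum of all values for a given key
LongSumReducer implements Reducer<K, LongWritable, K, LongWritable>; computes the sum of all values for a given key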
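One practical reason to choose between the two sum reducers is the range of the total. The plain-Java sketch below (hypothetical helper names, no Hadoop types) shows that a sum accumulated in an int wraps around past Integer.MAX_VALUE, while the same sum accumulated in a long does not.

```java
import java.util.Arrays;

final class SumReducerSketch {

    // Accumulates in an int, as a reducer over IntWritable values would.
    static int intSum(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Accumulates in a long, as a reducer over LongWritable values would.
    static long longSum(Iterable<Long> values) {
        long sum = 0L;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(intSum(Arrays.asList(Integer.MAX_VALUE, 1)));  // -2147483648
        System.out.println(longSum(Arrays.asList(2147483647L, 1L)));      // 2147483648
    }
}
```

For counts that may exceed about 2.1 billion per key, the long-based variant is the safer choice.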
Example: #MRPreDefined.java#
package ex6;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
public class MRPreDefined {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(MRPreDefined.class);
        FileInputFormat.setInputPaths(job, new Path("testdata/input3"));
        FileOutputFormat.setOutputPath(job, new Path("testdata/output3-3"));

        // test1: pass-through output (identity mapper and reducer)
        // job.setOutputKeyClass(LongWritable.class);   // output key type
        // job.setOutputValueClass(Text.class);         // output value type
        // job.setMapperClass(Mapper.class);            // predefined
        // job.setReducerClass(Reducer.class);          // predefined

        // test2: inverted output (key and value swapped)
        // job.setOutputKeyClass(Text.class);           // output key type
        // job.setOutputValueClass(LongWritable.class); // output value type
        // job.setMapperClass(InverseMapper.class);     // predefined
        // job.setReducerClass(Reducer.class);          // predefined

        // test3: summed output (word count)
        job.setOutputKeyClass(Text.class);              // output key type
        job.setOutputValueClass(IntWritable.class);     // output value type
        job.setMapperClass(TokenCounterMapper.class);   // predefined
        job.setReducerClass(IntSumReducer.class);       // predefined

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}