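The output types of the map function must match the input types of the reduce function. K1, V1, K2, V2, and so on below are abstract types.
map: (K1, V1) -> list(K2, V2)
combiner: (K2, list(V2)) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
The combiner usually has the same form as the reduce function (it is often the same implementation), in which case K3 = K2 and V3 = V2.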
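The type flow above can be traced without Hadoop. The following is a minimal plain-Java sketch, not Hadoop code: the class `TypeFlowSketch` and its methods are hypothetical names, and the concrete types are assumptions chosen for illustration (K1 = Long for a line offset, V1 = String for a line, K2 = K3 = String for a word, V2 = V3 = Integer for a count).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TypeFlowSketch {
    // map: (K1, V1) -> list(K2, V2), here (Long, String) -> list(String, Integer)
    public static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> list(K3, V3), here with K3 = K2 and V3 = V2
    public static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Map.entry(word, sum);
    }

    // Stand-in for the shuffle: group the map output by key, sorted by key
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // assumes single-byte characters and '\n' line endings
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(e.getKey(), e.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

Because `reduce` here returns the same key and value types it receives, the same function could also serve as the combiner, which is the K3 = K2, V3 = V2 case.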
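The partition function operates on the intermediate (K2, V2) pairs and returns a partition index. In practice the partition is determined by the key alone (the value is ignored), so every record with a given key goes to the same partition.
partition: (K2, V2) -> integer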
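A partition function of this shape can be sketched in a few lines. The version here is modeled on the hash-and-modulo formula that Hadoop's HashPartitioner uses, but it is a plain-Java stand-in: `HashPartitionSketch` and `getPartition` as written are illustrative names, not the Hadoop class.

```java
public class HashPartitionSketch {
    // partition: (K2, V2) -> integer; the value is accepted but never consulted
    public static <K, V> int getPartition(K key, V value, int numPartitions) {
        // Mask off the sign bit so a negative hashCode cannot produce a negative index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // The same key lands in the same partition regardless of the value
        System.out.println(
                getPartition("hadoop", 1, numPartitions) == getPartition("hadoop", 99, numPartitions)); // true
        // A key with a negative hashCode still gets an index in [0, numPartitions)
        System.out.println(getPartition(-1, "x", numPartitions)); // 3
    }
}
```

Since the index depends only on `key.hashCode()`, how evenly records spread across partitions depends on the quality of the key's hash function.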
Type parameters
The Context object is parameterized by the input and output types.
KEYIN, VALUEIN, KEYOUT, and VALUEOUT are the type parameters.
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public abstract class Context
            implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    }
}
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public abstract class Context
            implements ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    }
}
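To see what the nested Context gains from the outer type parameters, here is a simplified stand-in in plain Java. `MiniMapper` is a hypothetical class, not the Hadoop `Mapper`: it keeps only a `write` method and an identity `map`, and it collects output in a list instead of handing it to the framework.

```java
import java.util.ArrayList;
import java.util.List;

public class MiniMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // Inner class: it sees the outer type parameters, so write() is typed by KEYOUT and VALUEOUT
    public class Context {
        private final List<String> records = new ArrayList<>();

        public void write(KEYOUT key, VALUEOUT value) {
            records.add(key + "\t" + value);
        }

        public List<String> getRecords() {
            return records;
        }
    }

    // Identity map: the unchecked casts are only safe when KEYIN = KEYOUT and VALUEIN = VALUEOUT
    @SuppressWarnings("unchecked")
    public void map(KEYIN key, VALUEIN value, Context context) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }

    public static void main(String[] args) {
        MiniMapper<Long, String, Long, String> mapper = new MiniMapper<>();
        MiniMapper<Long, String, Long, String>.Context context = mapper.new Context();
        mapper.map(0L, "first line", context);
        System.out.println(context.getRecords());
    }
}
```

With the type arguments fixed at `<Long, String, Long, String>`, a call such as `context.write("oops", 1)` would be rejected by the compiler, which is the point of parameterizing the Context.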
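Type matching
1) Java generics are limited: because of type erasure, type information is not always available at run time, so Hadoop requires the data types to be set explicitly.
2) An MR configuration can also contain incompatible types, because the configuration cannot be checked at compile time; type conflicts are detected only while the job is running.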
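Both points can be illustrated in plain Java. In this sketch (the class `ErasureSketch` and its methods are hypothetical names), the first method shows that two differently parameterized lists share one runtime class, and the second shows why passing an explicit `Class` object, as the `job.set...Class(...)` calls do, gives a program something it can check at run time.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureSketch {
    // After erasure, List<String> and List<Integer> have the same runtime class
    public static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        return strings.getClass() == ints.getClass();
    }

    // An explicit Class token survives at run time and can be checked against a value
    public static <T> boolean matches(Class<T> declared, Object actual) {
        return declared.isInstance(actual);
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());        // true
        System.out.println(matches(Long.class, 42L));  // true
        System.out.println(matches(Long.class, "42")); // false: a mismatch found only at run time
    }
}
```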
The default MR job
package thisisnobody.defaultmapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
 * Shows the settings of a default Job.
 * Default input records: file offset + line
 * Default output records: file offset + line
 *
 * @author ZLP
 */
public class MinimalMapReduceWithDefaults extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
        if (job == null)
            return -1;
        /*
         * Mapper defaults
         * Input format: TextInputFormat (key is LongWritable, value is Text)
         * Mapper class: Mapper, the identity mapper
         * Map output key type: LongWritable; map output value type: Text
         */
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        /*
         * Reducer defaults
         * Number of reduce tasks: 1
         * Reducer class: Reducer, the identity reducer
         * Output format: TextOutputFormat, which writes the key and value separated by a tab
         * Output key type: LongWritable; output value type: Text
         */
        job.setNumReduceTasks(1);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        /*
         * Partitioner default
         * HashPartitioner hashes each record's key to decide which partition the record
         * belongs to. Each partition is handled by one reduce task, so the number of
         * partitions equals the number of reduce tasks.
         * With more than one reduce task the HashPartitioner matters, because it
         * determines how evenly records are spread across the reducers.
         */
        job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
class JobBuilder {
    public static Job parseInputAndOutput(Tool tool, Configuration conf, String[] args) throws IOException {
        // Input and output paths are hard-coded here; args is not consulted
        Path in = new Path("c:/users/zlp/desktop/defaults.txt");
        Path out = new Path("c:/users/zlp/desktop/defaultmapreduce");
        Job job = Job.getInstance(conf);
        job.setJarByClass(tool.getClass());
        // Delete any previous output directory, since a job fails if its output path already exists
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void printUsage(Tool tool, String extraArgsUsage) {
        // Use an explicit format string; the class name must not be used as the format itself
        System.err.printf("Usage: %s [genericOptions] %s%n%n",
                tool.getClass().getSimpleName(), extraArgsUsage);
        GenericOptionsParser.printGenericCommandUsage(System.err);
    }
}
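The net effect of these defaults is that every input line comes out prefixed with its byte offset and a tab. The following plain-Java sketch imitates that behavior without Hadoop. `DefaultJobSketch` is a hypothetical class, and it assumes UTF-8 input with '\n' line endings and a single reduce task, so the whole output is sorted by offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DefaultJobSketch {
    public static List<String> run(String fileContents) {
        // TextInputFormat stand-in: key = byte offset of the line start, value = the line
        Map<Long, String> records = new TreeMap<>(); // TreeMap stands in for the sort by key
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.put(offset, line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the '\n'
        }
        // The identity map and identity reduce leave every record unchanged.
        // TextOutputFormat stand-in: key, tab, value
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : records.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints three lines whose keys are the offsets 0, 12, and 16
        run("hello world\nfoo\nbar baz\n").forEach(System.out::println);
    }
}
```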
The default Streaming job