Hadoop 2.X MR Job Flow
Scenario: As a high-level layer built on top of HDFS, MR is designed for offline computation over a large distributed file system, and is best suited to workloads with loose latency requirements. MR can play different roles at different stages of an ETL pipeline, and in some scenarios a chain of MR jobs can implement the entire ETL flow.
MR overview: Hadoop MR (MapReduce) is a software framework for processing large batches of offline data on big clusters, with built-in hardware reliability and fault tolerance. Its core idea is to turn a job's input into many parallel map tasks and then reduce their outputs into the final result. A job's input and output both live on the file system; the MR framework handles job scheduling, monitoring, and retrying failed tasks.
| Following the principle that moving computation is cheaper than moving data, MR compute nodes and data nodes are usually the same nodes.
| The MR framework consists of a single ResourceManager master, one NodeManager on each cluster node, and one MRAppMaster per application.
| An MR program should at least specify its input/output locations and supply map and reduce functions, typically by implementing the appropriate interfaces or abstract classes, plus any other job parameters, including the job's configuration.
| A Hadoop job is submitted by a job client. The job client passes the configuration to the ResourceManager, which allocates resources and cluster nodes, then schedules and monitors the tasks, providing status and diagnostic information back to the job client.
MR job flow (image from the Internet)
Input and output:
| An MR job operates on <key, value> pairs: the framework takes a set of <key, value> pairs as input and produces a set of <key, value> pairs as output, and the data types may change along the way.
| For distributed operation, MR requires keys and values to be serializable by implementing the Writable interface; keys must additionally implement the WritableComparable interface so that the framework can sort records by key.
| The input/output of an MR job evolves as follows:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
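As a plain-Java sketch of the key contract (the class, field names, and serialization layout here are hypothetical, and Hadoop's actual interface from org.apache.hadoop.io is omitted so the snippet compiles standalone; a real key would additionally declare `implements WritableComparable<PlateKey>`, whose `write`/`readFields` methods use exactly these `java.io.DataOutput`/`DataInput` interfaces):

```java
import java.io.*;

// Hypothetical composite key illustrating the Writable/WritableComparable
// contract: fields are serialized in a fixed order, and keys have a total
// order (compareTo) that the framework's sort relies on.
class PlateKey implements Comparable<PlateKey> {
    private String plate; // license plate
    private int hour;     // hour of observation

    PlateKey() {}
    PlateKey(String plate, int hour) { this.plate = plate; this.hour = hour; }

    // Writable contract: serialize all fields, in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(plate);
        out.writeInt(hour);
    }

    // Writable contract: deserialize the fields in the same order
    public void readFields(DataInput in) throws IOException {
        plate = in.readUTF();
        hour = in.readInt();
    }

    // WritableComparable contract: the total order used for sorting by key
    public int compareTo(PlateKey o) {
        int c = plate.compareTo(o.plate);
        return c != 0 ? c : Integer.compare(hour, o.hour);
    }
}
```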
MR interfaces
| The Mapper maps the job's input to intermediate key/value pairs: it consumes input key/value pairs and emits intermediate key/value records, and a given input pair may map to zero or more output pairs. The Mapper class is declared as follows:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    public void run(Context context) throws IOException, InterruptedException {
    }
}
| Shuffle: the framework shuffles the output of the Mapper stage, sorting the data along the way.
| Sort: by default the framework groups the Reducer's input by key, so the Reducer processes the Mapper output after it has been grouped by key.
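What the framework does between map and reduce can be pictured in plain Java (a sketch only; the real shuffle is distributed and streams sorted spill files rather than building an in-memory map):

```java
import java.util.*;

// Group a list of intermediate (key, value) pairs by key, with keys sorted --
// the shape each Reducer sees its input in after the shuffle.
class ShuffleSketch {
    static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            // All values for the same key end up in one list, keys in sorted order
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }
}
```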
| Secondary Sort: a custom key ordering can be installed by implementing a comparator and registering it with Job.setSortComparatorClass(Class); Job.setGroupingComparatorClass(Class) specifies how keys are grouped for the reduce call.
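As an illustration (hedged: DescTextComparator is a hypothetical comparator, not part of any example in this article; WritableComparator is the real org.apache.hadoop.io helper class), a descending key order could be registered like this:

```java
// Hypothetical comparator that reverses the default Text ordering.
// Requires: org.apache.hadoop.io.Text, org.apache.hadoop.io.WritableComparable,
// org.apache.hadoop.io.WritableComparator.
class DescTextComparator extends WritableComparator {
    protected DescTextComparator() {
        super(Text.class, true); // true => instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // invert the natural order
    }
}

// Wiring in the driver:
// job.setSortComparatorClass(DescTextComparator.class);
// a separate comparator over part of the key would go to
// job.setGroupingComparatorClass(...) for a full secondary sort.
```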
| The Reducer merges the intermediate results of multiple Mapper tasks; an MR job may run several Reducers concurrently. Reduce usually involves three phases: shuffle, sort, and reduce. The Reducer class is declared as follows:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    ...
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
    }
}
| Reducer NONE: the framework allows a job with only a Mapper and no Reducer; in that case the Mapper output is written directly to the file system, without being sorted first.
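A map-only job is configured by setting the Reducer count to zero (a configuration fragment; `job` is assumed to be an existing org.apache.hadoop.mapreduce.Job, and the Text/IntWritable types are placeholders for whatever the Mapper actually emits):

```java
// Zero Reducers: map output goes straight to the output format, unsorted.
job.setNumReduceTasks(0);
// With no reduce phase, the map output types are the job's final output types:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
```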
| Partitioner: the Partitioner splits the Mapper's intermediate output by key. The number of partitions equals the number of Reducers, and the Partitioner controls which of the Reducers each intermediate key is sent to.
| Counter: a counting facility provided by MR for collecting job statistics.
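For example (a hedged fragment: the enum name is hypothetical, `context` is a task Context, and `job` is the driver's Job), a job can count malformed records and read the total back after completion:

```java
// Hypothetical counter group for record-quality statistics.
enum Quality { BAD_RECORDS }

// Inside map(): bump the counter instead of crashing on a bad line.
// context.getCounter(Quality.BAD_RECORDS).increment(1);

// In the driver, after job.waitForCompletion(true):
// long bad = job.getCounters().findCounter(Quality.BAD_RECORDS).getValue();
```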
Appendix:
| A default MR program with the default settings spelled out explicitly; this example is complete and can be run directly:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class T {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "TEST");
        // Default input configuration
        {
            /*
             * key: LongWritable; value: Text
             */
            job.setInputFormatClass(TextInputFormat.class);
        }
        // Default Mapper configuration (the identity Mapper passes the
        // TextInputFormat pairs through unchanged, so the map output types
        // must match the input types)
        {
            job.setMapperClass(Mapper.class);
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
        }
        // Default map routing algorithm
        {
            /* Each record's key is hashed to decide which partition the record
             * belongs to; each partition feeds one Reduce task ==> the number
             * of partitions equals the job's number of Reducers */
            job.setPartitionerClass(HashPartitioner.class);
        }
        // Default Reduce configuration
        {
            /* Data is sorted by key before it reaches the Reducer */
            job.setReducerClass(Reducer.class);
            job.setNumReduceTasks(1);
        }
        // Default output configuration
        {
            /* Keys/values are converted to strings, separated by a tab, and
             * written out one record per line */
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
        }
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}
| A WordCount implementation; this example is complete:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMaper.class);
        job.setReducerClass(WordReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        CombineTextInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}

class WordMaper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable ikey, Text ivalue, Context context) throws IOException, InterruptedException {
        Text word = null;
        final IntWritable one = new IntWritable(1);
        StringTokenizer st = new StringTokenizer(ivalue.toString());
        while (st.hasMoreTokens()) {
            word = new Text(st.nextToken());
            context.write(word, one);
        }
    }
}

class WordReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable intw : values) {
            sum += intw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
| A slightly more complex example: a TopN implementation
import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CarTopN {
    public static final String TOPN = "TOPN";

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Integer N = Integer.parseInt(args[2]);
        Configuration conf = new Configuration();
        // define the N
        conf.setInt(CarTopN.TOPN, N);
        Job job = Job.getInstance(conf, "CAR_Top10_BY_TGSID");
        job.setJarByClass(CarTopN.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setMapperClass(CarTopNMapper.class);
        // not used
        // job.setCombinerClass(CarTopNCombine.class);
        job.setReducerClass(CarTopNReduce.class);
        job.setNumReduceTasks(1);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setPartitionerClass(ITGSParition.class);
        FileSystem fs = FileSystem.get(conf);
        // Preprocess the input: only read files whose writer has finished
        // (marked by a ".writed" companion suffix) and whose size is
        // greater than 0
        {
            FileStatus childs[] = fs.globStatus(input, new PathFilter() {
                public boolean accept(Path path) {
                    return path.toString().endsWith(".writed");
                }
            });
            Path temp = null;
            for (FileStatus file : childs) {
                temp = new Path(file.getPath().toString().replace(".writed", ""));
                if (fs.listStatus(temp)[0].getLen() > 0) {
                    FileInputFormat.addInputPath(job, temp);
                }
            }
        }
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);
        // Force-clean the output directory
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true))
            return;
    }
}
class ITGSParition extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (Math.abs(key.hashCode())) % numPartitions;
    }
}
class CarTopNMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String temp = value.toString();
        if (temp.length() > 13) {
            temp = temp.substring(12);
            String[] items = temp.split(",");
            if (items.length > 10) {
                // TGSID as key
                try {
                    String tgsid = items[14].substring(6);
                    Integer.parseInt(tgsid);
                    context.write(new Text(tgsid), new IntWritable(1));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
class CarTopNCombine extends Reducer<Text, IntWritable, Text, IntWritable> {
    // A sorted map that always holds at most the top N entries, nothing more
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        Integer weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        // Keep only the top N
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Emit the final entries to the next stage
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}
// Core TopN computation
// Avoid storing Hadoop data types in Java collections; their reused, mutable
// instances can lead to strange results
class CarTopNReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        Integer weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}
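The TreeMap trick used by the reducer above (insert, then evict `firstKey()` whenever the size exceeds N) can be exercised in plain Java. One limitation worth noting, which this sketch deliberately inherits: because the map is keyed by weight, two keys with equal counts collide and one of them is lost.

```java
import java.util.*;

// Keep only the N largest (weight -> name) entries, as the reducer does.
class TopNSketch {
    static TreeMap<Integer, String> topN(Map<String, Integer> counts, int n) {
        TreeMap<Integer, String> tm = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tm.put(e.getValue(), e.getKey());
            if (tm.size() > n) {
                tm.remove(tm.firstKey()); // evict the current smallest weight
            }
        }
        return tm;
    }
}
```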