hadoop2.2 MapReduce and yarn (2): MapReduce in MR v2 API

zhuyu4839

Category: Learning hadoop2.2 Step by Step. Tags: hadoop

Article link: https://blog.csdn.net/zhuyu4839/article/details/22465803

MapReduce

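1. First, understand what MapReduce is for. A distributed system (Distributed System) handles computations over large amounts of data: when a job is too large for a single machine, the whole computation is split into many small blocks, the Master dispatches those blocks to the worker nodes of the cluster, and each node returns its result to the Master when it finishes, repeating as needed. In the Hadoop model, MapReduce is the implementation of this scheme.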
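The split, dispatch, and collect cycle described above can be sketched in plain Java without Hadoop. This is only an illustration of the idea: the class name `SplitAndCombineSketch`, the method `sumInChunks`, and the use of a thread pool in place of cluster nodes are all assumptions made for this example, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitAndCombineSketch {

    // "Master" side: split the input into chunks, hand each chunk to a worker,
    // then combine the partial results. Threads stand in for cluster nodes.
    static long sumInChunks(int[] data, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, data.length);
                // "Worker" side: compute the result for one chunk only.
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += data[i];
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // wait for this worker's result
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        // Ten chunks of 100 numbers, processed by four workers.
        System.out.println(sumInChunks(data, 100, 4));
    }
}
```

Hadoop adds what this sketch leaves out: moving the chunks to other machines, rescheduling failed tasks, and grouping intermediate results by key before they are combined.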

Execution flow of MapReduce in MR v2:

2. MapReduce in MR v2

First, look at the following code (it splits a text file into tokens on whitespace and counts how many times each token occurs):

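import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapClass extends Mapper<Object, Text, Text, IntWritable> {

    // Reused output key, to avoid allocating a new Text per token.
    private Text record = new Text();
    // Every token is emitted with a count of 1.
    private static final IntWritable recbytes = new IntWritable(1);

    /**
     * Constructor of this class.
     */
    public MapClass() {
        System.out.println("mapper instance....");
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // No InputFormat/RecordReader is configured, so the default line-based reader is used:
        // key is the byte offset of the line in the file (not a line number), value is the line content.
        if (line == null || line.isEmpty())
            return;
        String[] words = line.split("\\s+");

        for (String word : words) {
            // A line that starts with whitespace yields an empty first token; skip it.
            if (word.isEmpty())
                continue;
            record.clear();
            record.set(word);
            context.write(record, recbytes);
        }
    }
}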
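One detail of whitespace tokenization is easy to miss: `String.split("\\s+")` keeps an empty leading element when the line starts with whitespace, so a mapper that emits every array element would also count an empty token. The following is a small stand-alone sketch of that behaviour; the class `TokenizeSketch` and its `tokens` helper are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {

    // Split a line on runs of whitespace and drop empty tokens.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        if (line == null) {
            return out;
        }
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Raw split: the leading space produces an empty first element.
        System.out.println(" a b".split("\\s+").length);
        // Filtered: only the real words remain.
        System.out.println(tokens(" a b"));
    }
}
```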

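import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value holding the total for the current key.
    private IntWritable result = new IntWritable();

    /**
     * Constructor of this class.
     */
    public ReduceClass() {
        System.out.println("reducer instance....");
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Add up all counts received for this token.
        int tmp = 0;
        for (IntWritable val : values) {
            tmp = tmp + val.get();
        }
        result.set(tmp);
        context.write(key, result); // write the final total for this token
    }

}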

import java.io.File;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalysiser {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {

        if (args == null || args.length < 4) {
            System.out.println("usage: LogAnalysiser <inputpath> <outputpath> <inputfile> <outputfile>");
            System.exit(1);
        }

        // input directory
        String inputpath = args[0];
        // output directory
        String outputpath = args[1];
        // input file
        String shortin = args[2];
        // output file
        String shortout = args[3];

        // Keep only the bare file names, without any directory part.
        if (shortin.indexOf(File.separator) >= 0)
            shortin = shortin.substring(shortin.lastIndexOf(File.separator) + 1);
        if (shortout.indexOf(File.separator) >= 0)
            shortout = shortout.substring(shortout.lastIndexOf(File.separator) + 1);

        // Append a timestamp so that each run writes to a fresh output path.
        SimpleDateFormat formater = new SimpleDateFormat("yyyy.MM.dd.HH.mm");
        shortout = shortout + "-" + formater.format(new Date());

        // Join directory and name through Path so a separator is always inserted.
        Path in = new Path(inputpath, shortin);
        Path out = new Path(outputpath, shortout);

        // Note: java.io.File only sees the local file system, not HDFS.
        File inputdir = new File(inputpath);
        File outputdir = new File(outputpath);

        if (!inputdir.exists() || !inputdir.isDirectory()) {
            System.out.println("inputpath not exist or isn't dir!");
            System.exit(1);
        }

        if (!outputdir.exists()) {
            outputdir.mkdirs();
        }

        Job job = Job.getInstance(new Configuration(), LogAnalysiser.class.toString());

        job.setJarByClass(LogAnalysiser.class);
        job.setJobName("analysisjob");

        job.setOutputKeyClass(Text.class); // output key type, checked by the OutputFormat
        job.setOutputValueClass(IntWritable.class); // output value type, checked by the OutputFormat

        job.setMapperClass(MapClass.class);
        job.setReducerClass(ReduceClass.class);
        job.setCombinerClass(ReduceClass.class);

        job.setNumReduceTasks(2); // force two reduce tasks; the partitioner divides the tokens between them
        FileInputFormat.setInputPaths(job, in); // input path in HDFS
        FileOutputFormat.setOutputPath(job, out); // output path in HDFS

        Date startTime = new Date();
        System.out.println("Job started: " + startTime);
        boolean success = job.waitForCompletion(true);
        Date end_time = new Date();
        System.out.println("Job ended: " + end_time);
        System.out.println("The job took "
                + (end_time.getTime() - startTime.getTime()) / 1000
                + " seconds.");
        // Copy the result to the local disk and delete the temporary input and output files (disabled)
        // org.apache.hadoop.fs.FileSystem.get(new Configuration()).copyToLocalFile(new Path(shortout), new Path(outputpath + "/out"));
        // fileSys.delete(new Path(shortin),true);
        // fileSys.delete(new Path(shortout),true);
        // Report failure to the caller through the exit code.
        System.exit(success ? 0 : 1);
    }
}
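The driver registers ReduceClass as the combiner as well as the reducer. That works here because the reduce step is a plain sum: summing partial sums gives the same total as summing every value at once. A small Hadoop-free sketch of that property follows; the class `CombinerSketch` and the sample counts are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {

    // The same logic as the reduce step: add up all counts for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word as emitted by two hypothetical map tasks.
        List<Integer> fromMapper1 = Arrays.asList(1, 1, 1);
        List<Integer> fromMapper2 = Arrays.asList(1, 1);

        // Without a combiner the reducer receives all five values.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));
        // With a combiner each map side pre-sums, and the reducer adds the partial sums.
        int combined = sum(Arrays.asList(sum(fromMapper1), sum(fromMapper2)));

        System.out.println(direct + " " + combined);
    }
}
```

A reduce function that is not associative and commutative in this way (an average, for example) could not be reused as a combiner without changes.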

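MapClass handles the input data. Hadoop MapReduce then uses the partitioner (PartitionerClass) to distribute the map output among the ReduceClass tasks, which process their shares of the data concurrently.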
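How the partitioner divides keys between two reduce tasks can be sketched in plain Java. The formula below is the one used by Hadoop's default HashPartitioner, but `String.hashCode()` stands in for `Text.hashCode()` here, so the actual assignment of a given word in a real job may differ. The class `PartitionSketch` and its methods are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Hash partitioning: clear the sign bit, then take the remainder.
    // Assumption: String.hashCode() replaces Text.hashCode() for illustration.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Route each word to its reducer and sum the counts per word there.
    static List<Map<String, Integer>> run(String[] words, int numReduceTasks) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String word : words) {
            int target = partitionFor(word, numReduceTasks);
            reducers.get(target).merge(word, 1, Integer::sum);
        }
        return reducers;
    }

    public static void main(String[] args) {
        String[] words = "to be or not to be".split("\\s+");
        List<Map<String, Integer>> reducers = run(words, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a word reaches the same reducer, so each reducer can produce the complete count for its own words.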
