_00003 Hadoop MapReduce Architecture

Author: 妳那伊抹微笑
Signature: The greatest distance in the world is neither the ends of the earth nor the corners of the sea; it is that I stand in front of you, yet you cannot feel my presence
Focus: Flume+Kafka+Storm+Redis/HBase+Hadoop+Hive+Mahout+Spark ... cloud computing technologies
Reprint notice: Reprinting is permitted, but you must credit the original source, the author, and this copyright notice with a hyperlink. Thank you!
QQ group: 214293307 云计算之嫣然伊笑 (looking forward to learning and improving together with you)


# Introduction to MapReduce

# MapReduce is Hadoop's distributed computing framework. A job consists of two phases, map and reduce. For programmers it is very simple to use: just override the map method of the map phase's Mapper and the reduce method of the reduce phase's Reducer.

# The key-value pair forms of the map and reduce phase parameters
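
A minimal sketch of those forms, with hypothetical names (KeyValueForms, LineMapper, and LineReducer are mine, not from this post): the new-API Mapper and Reducer each take four generic type parameters, <KEYIN, VALUEIN, KEYOUT, VALUEOUT>, and with the default TextInputFormat the map input key is the line's byte offset (LongWritable) and the value is the line itself (Text).

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KeyValueForms {

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: with TextInputFormat, the
    // input key is the byte offset of the line and the value is the line.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new IntWritable(1)); // emit <line, 1>
        }
    }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: the reducer's input types
    // must match the mapper's output types declared above.
    public static class LineReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // values for the same key arrive grouped together
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```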
# The MapReduce execution flow

Bottleneck: disk I/O. Intermediate map output is written to local disk and read back during the shuffle, so disk I/O is usually what limits job throughput.

# How a MapReduce job executes

1.0 Map task processing

1.1 Read the input file and parse it into key-value pairs: each line of the input becomes one key-value pair, and the map function is called once per pair.

1.2 Write your own logic that processes an input key-value pair and converts it into new key-value output.

1.3 Partition the output key-value pairs (a custom partitioner sketch follows this list).

1.4 Within each partition, sort and group the data by key, collecting the values of identical keys into a single collection.

1.5 (Optional) Locally reduce the grouped data (the Combine step).

2.0 Reduce task processing

2.1 Copy the outputs of the map tasks, partition by partition, over the network to the corresponding reduce nodes.

2.2 Merge and sort the map task outputs, then write your own reduce logic that processes the input key and values and converts them into new key-value output.

2.3 Save the reduce output to a file.
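
To make step 1.3 concrete, here is a minimal partitioner sketch; the class name and the routing rule are illustrative assumptions, not part of the original post. The default, used when you configure nothing, is HashPartitioner.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule: keys starting with a..m go to partition 0, all
// other keys to partition 1. Wire it in with
// job.setPartitionerClass(AlphabetPartitioner.class).
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2 || key.getLength() == 0) {
            return 0; // with a single reducer everything lands in one partition
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}
```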

Example: implementing WordCountApp

# The first word-count Java program (the example source code shipped with Hadoop)

```java
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

@SuppressWarnings("all")
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

# The run command and its output

```
[hadoop@master hadoop-1.1.2]$ hadoop jar hadoop-yting-wordcounter.jar org.apache.hadoop.examples.WordCount /user/hadoop/20140303/test.txt /user/hadoop/20140303/output001
14/03/03 10:43:51 INFO input.FileInputFormat: Total input paths to process : 1
14/03/03 10:43:52 INFO mapred.JobClient: Running job: job_201403020905_0001
14/03/03 10:43:53 INFO mapred.JobClient:  map 0% reduce 0%
14/03/03 10:44:12 INFO mapred.JobClient:  map 100% reduce 0%
14/03/03 10:44:25 INFO mapred.JobClient:  map 100% reduce 100%
14/03/03 10:44:29 INFO mapred.JobClient: Job complete: job_201403020905_0001
14/03/03 10:44:29 INFO mapred.JobClient: Counters: 29
14/03/03 10:44:29 INFO mapred.JobClient:   Job Counters
14/03/03 10:44:29 INFO mapred.JobClient:     Launched reduce tasks=1
14/03/03 10:44:29 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=19773
14/03/03 10:44:29 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/03/03 10:44:29 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/03/03 10:44:29 INFO mapred.JobClient:     Launched map tasks=1
14/03/03 10:44:29 INFO mapred.JobClient:     Data-local map tasks=1
14/03/03 10:44:29 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13148
14/03/03 10:44:29 INFO mapred.JobClient:   File Output Format Counters
14/03/03 10:44:29 INFO mapred.JobClient:     Bytes Written=188
14/03/03 10:44:29 INFO mapred.JobClient:   FileSystemCounters
14/03/03 10:44:29 INFO mapred.JobClient:     FILE_BYTES_READ=171
14/03/03 10:44:29 INFO mapred.JobClient:     HDFS_BYTES_READ=310
14/03/03 10:44:29 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=101391
14/03/03 10:44:29 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=188
14/03/03 10:44:29 INFO mapred.JobClient:   File Input Format Counters
14/03/03 10:44:29 INFO mapred.JobClient:     Bytes Read=197
14/03/03 10:44:29 INFO mapred.JobClient:   Map-Reduce Framework
14/03/03 10:44:29 INFO mapred.JobClient:     Map output materialized bytes=163
14/03/03 10:44:29 INFO mapred.JobClient:     Map input records=8
14/03/03 10:44:29 INFO mapred.JobClient:     Reduce shuffle bytes=163
14/03/03 10:44:29 INFO mapred.JobClient:     Spilled Records=56
14/03/03 10:44:29 INFO mapred.JobClient:     Map output bytes=376
14/03/03 10:44:29 INFO mapred.JobClient:     CPU time spent (ms)=4940
14/03/03 10:44:29 INFO mapred.JobClient:     Total committed heap usage (bytes)=63926272
14/03/03 10:44:29 INFO mapred.JobClient:     Combine input records=45
14/03/03 10:44:29 INFO mapred.JobClient:     SPLIT_RAW_BYTES=113
14/03/03 10:44:29 INFO mapred.JobClient:     Reduce input records=28
14/03/03 10:44:29 INFO mapred.JobClient:     Reduce input groups=28
14/03/03 10:44:29 INFO mapred.JobClient:     Combine output records=28
14/03/03 10:44:29 INFO mapred.JobClient:     Physical memory (bytes) snapshot=111722496
14/03/03 10:44:29 INFO mapred.JobClient:     Reduce output records=28
14/03/03 10:44:29 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=468000768
14/03/03 10:44:29 INFO mapred.JobClient:     Map output records=45
[hadoop@master hadoop-1.1.2]$ hadoop fs -ls /user/hadoop/20140303/output001
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2014-03-03 10:44 /user/hadoop/20140303/output001/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2014-03-03 10:43 /user/hadoop/20140303/output001/_logs
-rw-r--r--   1 hadoop supergroup        188 2014-03-03 10:44 /user/hadoop/20140303/output001/part-r-00000
[hadoop@master hadoop-1.1.2]$ hadoop fs -text /user/hadoop/20140303/output001/part-t-00000
text: File does not exist: /user/hadoop/20140303/output001/part-t-00000
[hadoop@master hadoop-1.1.2]$ hadoop fs -text /user/hadoop/20140303/output001/part-r-00000
a        1
again    1
and      1
changce  1
easy     1
forever  1
give     1
hand     1
heart    2
hold     1
i        1
is       1
it       1
love     1
me       6
meimei   1
miss     1
see      1
show     1
smile    1
so       1
soul     1
take     3
the      2
to       4
until    1
what     1
you      6
```

# The minimal MapReduce job (the default settings made explicit)

```java
// Every setting below restates a framework default (hence "minimal").
Configuration configuration = new Configuration();
Job job = new Job(configuration, "HelloWorld");
job.setInputFormatClass(TextInputFormat.class);   // default input format
job.setMapperClass(Mapper.class);                 // identity mapper (the new-API default)
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setPartitionerClass(HashPartitioner.class);   // default partitioner
job.setNumReduceTasks(1);                         // default reduce task count
job.setReducerClass(Reducer.class);               // identity reducer (the new-API default)
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class); // default output format
job.waitForCompletion(true);
```
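
Every line above restates a framework default: TextInputFormat and TextOutputFormat, the identity Mapper and Reducer, HashPartitioner, one reduce task, and LongWritable/Text output types are what you get when you configure nothing. So apart from the still-missing input and output paths, deleting any of those calls leaves the job's behavior unchanged.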

 

# Serialization

# Writable

# Data streams are one-way

# LongWritable does not support operations such as addition or subtraction (there is no need to: call get() to obtain the primitive long, and Java's primitive types already provide all of that)
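
As a sketch of the Writable contract (the SumCountWritable type is a hypothetical example, not from this post): write() serializes the fields to an output stream and readFields() must read them back in exactly the same order, which is the one-way stream discipline noted above.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A hypothetical pair type showing the Writable contract.
public class SumCountWritable implements Writable {
    private long sum;
    private int count;

    public SumCountWritable() {} // no-arg constructor required for reflection

    public SumCountWritable(long sum, int count) {
        this.sum = sum;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(sum);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readLong(); // must read fields in the order they were written
        count = in.readInt();
    }

    public long getSum() { return sum; }
    public int getCount() { return count; }
}
```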

# JobTracker and TaskTracker

# JobTracker

Receives jobs submitted by users and is responsible for launching and tracking the execution of their tasks.

JobSubmissionProtocol is the interface over which the JobClient communicates with the JobTracker.

InterTrackerProtocol is the interface over which TaskTrackers communicate with the JobTracker.

# TaskTracker

Executes the tasks assigned to it.

# JobClient

The primary interface through which a user job interacts with the JobTracker.

Responsible for submitting jobs, launching and tracking task execution, and accessing task status and logs, as sketched below.
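
A minimal old-API sketch of that interaction, assuming the two command-line arguments name valid HDFS input and output paths (the class name is hypothetical): JobClient.runJob() submits the job to the JobTracker over JobSubmissionProtocol and then polls and prints its progress until the job finishes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitWithJobClient {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitWithJobClient.class);
        conf.setJobName("jobclient-demo");
        // Identity map/reduce by default; only the paths are required.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // runJob() submits the job to the JobTracker and polls its
        // progress, printing status lines until the job completes.
        RunningJob job = JobClient.runJob(conf);
        System.out.println("Succeeded: " + job.isSuccessful());
    }
}
```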

# Execution flow

The diagram of the execution flow would not upload here because it is larger than 2 MB. Frustrating...


[妳那伊抹微笑](http://user.qzone.qq.com/1042658081)

[The you smile until forever 、、、、、、、、、、、、、、、、、、、、、](http://user.qzone.qq.com/1042658081)

