Reading the Hadoop Source Code

weixin_30502965

Tags: big data, java

Original link: http://www.cnblogs.com/wylwyl/p/10250464.html

1. Downloading the Hadoop source code

Download URL: https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/

2. A look at WordCount, an example program that ships with the Hadoop source

 1 /**
 2  * Licensed to the Apache Software Foundation (ASF) under one
 3  * or more contributor license agreements.  See the NOTICE file
 4  * distributed with this work for additional information
 5  * regarding copyright ownership.  The ASF licenses this file
 6  * to you under the Apache License, Version 2.0 (the
 7  * "License"); you may not use this file except in compliance
 8  * with the License.  You may obtain a copy of the License at
 9  *
10  *     http://www.apache.org/licenses/LICENSE-2.0
11  *
12  * Unless required by applicable law or agreed to in writing, software
13  * distributed under the License is distributed on an "AS IS" BASIS,
14  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15  * See the License for the specific language governing permissions and
16  * limitations under the License.
17  */
18 package org.apache.hadoop.examples;
19 
20 import java.io.IOException;
21 import java.util.StringTokenizer;
22 
23 import org.apache.hadoop.conf.Configuration;
24 import org.apache.hadoop.fs.Path;
25 import org.apache.hadoop.io.IntWritable;
26 import org.apache.hadoop.io.Text;
27 import org.apache.hadoop.mapreduce.Job;
28 import org.apache.hadoop.mapreduce.Mapper;
29 import org.apache.hadoop.mapreduce.Reducer;
30 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
31 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
32 import org.apache.hadoop.util.GenericOptionsParser;
33 
34 public class WordCount {
35 
36     // Mapper<Object,Text,Text,IntWritable>
37     // Object is the type of the input key
38     // Text is the type of the input value
39     // Text is the type of the output key
40     // IntWritable is the type of the output value
41   public static class TokenizerMapper 
42        extends Mapper<Object, Text, Text, IntWritable>{
43     
44     private final static IntWritable one = new IntWritable(1);
45     private Text word = new Text();
46       
47     public void map(Object key, Text value, Context context
48                     ) throws IOException, InterruptedException {
49       StringTokenizer itr = new StringTokenizer(value.toString());
50       while (itr.hasMoreTokens()) {
51         word.set(itr.nextToken());
52         context.write(word, one);
53       }
54     }
55   }
56   
57   public static class IntSumReducer 
58        extends Reducer<Text,IntWritable,Text,IntWritable> {
59     private IntWritable result = new IntWritable();
60 
61     public void reduce(Text key, Iterable<IntWritable> values, 
62                        Context context
63                        ) throws IOException, InterruptedException {
64       int sum = 0;
65       for (IntWritable val : values) {
66         sum += val.get();
67       }
68       result.set(sum);
69       context.write(key, result);
70     }
71   }
72 
73   public static void main(String[] args) throws Exception {
74     Configuration conf = new Configuration();
75     String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
76     if (otherArgs.length < 2) {
77       System.err.println("Usage: wordcount <in> [<in>...] <out>");
78       System.exit(2);
79     }
80     Job job = Job.getInstance(conf, "word count");
81     job.setJarByClass(WordCount.class);
82     job.setMapperClass(TokenizerMapper.class);
83     job.setCombinerClass(IntSumReducer.class);
84     job.setReducerClass(IntSumReducer.class);
85     job.setOutputKeyClass(Text.class);
86     job.setOutputValueClass(IntWritable.class);
87     for (int i = 0; i < otherArgs.length - 1; ++i) {
88       FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
89     }
90     FileOutputFormat.setOutputPath(job,
91       new Path(otherArgs[otherArgs.length - 1]));
92     System.exit(job.waitForCompletion(true) ? 0 : 1);
93   }
94 }

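The Map() phase

See lines 41-55 of the listing: a MapReduce program has to extend the class org.apache.hadoop.mapreduce.Mapper and override the map() method in that subclass.

org.apache.hadoop.mapreduce.Mapper takes four type parameters (keyIn, valueIn, keyOut, valueOut), which means that both the input and the output of a map task are < key, value > pairs.

In the source code the parameters mean the following:

Object: the key of the input < key, value > pair, here the byte offset at which the piece of text starts in the input data. With the default text input format this key is a LongWritable, so most programs can declare it as LongWritable directly; the source uses Object to keep it general.
Text: the value of the input < key, value > pair, here a piece of text (one line of the input with the default text input format).
Text: the key of the output < key, value > pair, here a single word.
IntWritable: the value of the output < key, value > pair, here always 1. IntWritable is Hadoop's wrapper around an int that makes it serializable.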
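To make the input types concrete, the following sketch mimics how a line-oriented input format could hand (byte offset, line) pairs to the mapper. It uses only the JDK. The class name LineOffsets and the method split are hypothetical names chosen for illustration, not part of Hadoop, and the sketch assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Hadoop class: shows what (offset, line) records look like.
class LineOffsets {
    // Maps the byte offset at which each line starts to the text of that line.
    static Map<Long, String> split(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // +1 accounts for the '\n' that terminated this line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Each record corresponds to one call of map(key, value, context)
        split("hello world\nhello hadoop\n")
            .forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

In WordCount the key is never read, which is why declaring it as Object costs nothing.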

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

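Two variables are defined here:

one: a Hadoop IntWritable, essentially a serializable int; its value is always 1.
word: a Text object that holds the current word. The map side of WordCount splits the input into words, and each word is emitted as a Text.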
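What "a serializable int" means can be sketched without Hadoop on the classpath. The class MiniIntWritable below is a hypothetical stand-in, not Hadoop's IntWritable; its write and readFields methods only borrow the method names of Hadoop's Writable interface, and the helpers toBytes and fromBytes are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable-style wrapper around an int.
class MiniIntWritable {
    private int value;

    MiniIntWritable(int value) { this.value = value; }

    int get() { return value; }

    void set(int value) { this.value = value; }

    // Serialize the wrapped int as 4 big-endian bytes.
    void write(DataOutputStream out) throws IOException { out.writeInt(value); }

    // Overwrite this object's state from the stream, so one instance can be reused.
    void readFields(DataInputStream in) throws IOException { value = in.readInt(); }

    static byte[] toBytes(MiniIntWritable w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static MiniIntWritable fromBytes(byte[] bytes) {
        MiniIntWritable w = new MiniIntWritable(0);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            w.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return w;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new MiniIntWritable(1));
        System.out.println(bytes.length + " bytes, value = " + fromBytes(bytes).get());
    }
}
```

The mutable set method is what lets the mapper keep a single one and a single word object and reuse them for every record instead of allocating new objects per token.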

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }

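This code is the core of the map side: it defines the logic that each map task executes.
The parameters of map() are Object key, Text value, Context context, where:

key: the offset of the input record within the original data.
value: the actual input data, here a string.
context: the object through which map() emits its output pairs; the framework collects whatever is written to it.

The method first converts the input value to a String and hands it to a StringTokenizer (java.util.StringTokenizer from the JDK, not a Hadoop class), which splits it on whitespace. It then walks through the tokens from the beginning and emits one < key, value > pair per word, for example < hello, 1 >, < world, 1 > …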
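The tokenizing loop can be tried in isolation with plain JDK classes. In this sketch the class MapSketch is a hypothetical name, and a List of Map.Entry objects stands in for the Hadoop Context, which is an assumption made only to keep the example self-contained.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for TokenizerMapper that collects its output in a list.
class MapSketch {
    // Mirrors the body of map(): split a line on whitespace and emit (word, 1) per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map("hello world hello"));
    }
}
```

Note that a repeated word is emitted once per occurrence; nothing is summed at this stage. Because StringTokenizer splits only on whitespace by default, "hello," and "hello" would count as different words.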

The Reduce() phase

public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {}

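The org.apache.hadoop.mapreduce.Reducer class also takes four type parameters (keyIn, valueIn, keyOut, valueOut), so both the input and the output of a reduce task are < key, value > pairs.

In the source code the parameters mean the following:

Text: the key of the input < key, value > pair, here a word.
IntWritable: the value of the input < key, value > pair, here a count for that word.
Text: the key of the output < key, value > pair, here a word.
IntWritable: the value of the output < key, value > pair, here the sum of all counts for the same word, i.e. a single number.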

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }

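The three parameters of reduce():

key: the key of the input pairs, i.e. a word.
values: this one deserves attention. Besides the map() and reduce() methods we write ourselves, a MapReduce job automatically runs combine, shuffle, and sort steps on the output of the map tasks on the way from map to reduce. The input to reduce() is therefore no longer a single < key, value > pair but one key together with all the values that share it, for example < hello, (1, 1, 2, 2, 3, …) >. values is the sequence that follows the word: (1, 1, 2, 2, 3, …). Entries larger than 1 appear when the combiner has already summed some counts on the map side.
context: the object through which reduce() emits its results.

The reduce-side code therefore adds up everything in values, and the result is the number of times the word given by key occurs in the text.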
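The grouping that happens between map and reduce, and the summation itself, can be sketched with JDK collections. The class ReduceSketch and its shuffle method are hypothetical; a TreeMap is used here only as a simple stand-in for the framework's shuffle and sort, not as a description of how Hadoop implements them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle/sort step followed by IntSumReducer.
class ReduceSketch {
    // Groups one count of 1 per word under its key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Mirrors the body of reduce(): sum all values that share one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped =
            shuffle(Arrays.asList("hello", "world", "hello"));
        // reduce() is called once per distinct key
        grouped.forEach((key, values) ->
            System.out.println(key + "\t" + reduce(key, values)));
    }
}
```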

The main() function

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Parse the generic Hadoop options into conf; what remains are our own arguments, i.e. the input and output paths
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // The job needs at least one input path and one output path, so it cannot run with fewer than two arguments
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");  // create the job and name it "word count"
    job.setJarByClass(WordCount.class);  // find the job JAR by locating the one that contains this class
    job.setMapperClass(TokenizerMapper.class);  // class that runs the map phase
    job.setCombinerClass(IntSumReducer.class);  // class that runs the combine phase
    job.setReducerClass(IntSumReducer.class);  // class that runs the reduce phase
    job.setOutputKeyClass(Text.class);  // type of the job's output keys
    job.setOutputValueClass(IntWritable.class);  // type of the job's output values
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }  // every argument except the last is an input path
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));  // the last argument is the output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);  // wait for the job to finish, then exit with 0 on success and 1 on failure
}
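The job registers IntSumReducer as both combiner and reducer. That works because addition is associative and commutative, so summing partial sums gives the same total as summing the raw counts. The sketch below illustrates this with plain JDK code; the class CombinerSketch and the two map-task lists are made-up illustration data, not Hadoop output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demonstration that pre-summing on the map side does not change the result.
class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts for the word "hello" emitted by two hypothetical map tasks
        List<Integer> mapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> mapTask2 = Arrays.asList(1, 1);

        // Without a combiner the reducer receives all five 1s
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner each map task pre-sums locally and the reducer adds the partial sums
        int combined = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));

        System.out.println(direct + " == " + combined);
    }
}
```

The benefit of the combiner is less data moved between map and reduce: in this example two values cross the network instead of five.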

Reposted from: https://www.cnblogs.com/wylwyl/p/10250464.html