http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html
Below is the table of contents for this Hadoop cluster series, which will be published at a rate of one installment per week. I hope you will follow along.
Table of contents:
1) Hadoop Cluster, Part 1: CentOS Installation and Configuration, V1.0
4) Hadoop Cluster, Part 4: Using SecureCRT, V1.0
5) Hadoop Cluster, Part 5: Hadoop Installation and Configuration, V1.1
6) Hadoop Cluster, Part 5 Supplement: JDK and Passwordless SSH Configuration, V1.0
7) Hadoop Cluster, Part 6: Running WordCount in Detail, V1.0
8) Hadoop Cluster, Part 7: Eclipse Development Environment Setup, V1.0
9) Hadoop Cluster, Part 9: Basic MapReduce Examples, V1.0
10) Hadoop Cluster, Part 10: The MySQL Relational Database, V1.0
11) Hadoop Cluster, Part 10 Supplement: Common MySQL Commands, V1.0
12) Hadoop Cluster, Part 11: HBase Introduction and Installation, V1.0
13) Hadoop Cluster, Part 11 Supplement: A Tour of HBase, V1.0
14) Hadoop Cluster, Part 12: HBase Application Development, V1.0
15) Hadoop Cluster, Part 12 Supplement: HBase Performance Tuning, V1.0
16) Hadoop Cluster, Part 13: Hive Introduction and Installation, V1.0
17) Hadoop Cluster, Part 14: Hive Application Development, V1.0
18) Hadoop Cluster, Part 14 Supplement: Hive Performance Tuning, V1.0
19) Hadoop Cluster, Part 15: HBase, Hive, and RDBMS, V1.1
20) Hadoop Cluster, Part 16: ZooKeeper Introduction and Installation, V1.0
21) Hadoop Cluster, Part 17: ZooKeeper Application Development, V1.0
* * * * * *
22) Hadoop Cluster, Part 18: Sqoop, V1.0
23) Hadoop Cluster, Part 19: Advanced MapReduce, V1.0
24) Hadoop Cluster, Part 20: Pig, V1.0
25) Hadoop Cluster, Part 21: Avro, V1.0
26) Hadoop Cluster, Part 22: Mahout, V1.0
27) Hadoop Cluster, Part 23: Chukwa, V1.0
28) Hadoop Cluster, Part 24: Cassandra, V1.0
29) Hadoop Cluster, Part 25: Hadoop Administration, V1.0
Note: red (published), blue (written), orange (in progress), black (not yet written).
--
MapReduce Tutorial
Purpose
This document comprehensively describes all user-facing facets of the Hadoop MapReduce framework and serves as a tutorial.
Prerequisites
Ensure that Hadoop is installed, configured and is running. More details:
- Single Node Setup for first-time users.
- Cluster Setup for large, distributed clusters.
Overview
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
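As an illustration only (not part of the tutorial), a minimal job-submission sketch against the old org.apache.hadoop.mapred API could look like the following; the class name SubmitSketch and the use of the built-in identity mapper and reducer are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitSketch.class);           // the job configuration
        conf.setJobName("submit-sketch");
        conf.setMapperClass(IdentityMapper.class);                // placeholder: passes records through
        conf.setReducerClass(IdentityReducer.class);              // placeholder: passes records through
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input location
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output location
        JobClient.runJob(conf);                                   // submit to the JobTracker and wait
    }
}

The WordCount example later in this document fills in the same skeleton with a real mapper and reducer.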
Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).
Inputs and Outputs
The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
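As a concrete (hypothetical) sketch, a custom key type only needs to serialize its fields and define an ordering; the class name MyKey and its single field below are illustrative, not part of the tutorial:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: serializable via write/readFields, sortable via compareTo.
public class MyKey implements WritableComparable<MyKey> {
    private long id;

    public MyKey() { }                        // Writables need a no-argument constructor
    public MyKey(long id) { this.id = id; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(id);                    // serialize fields in a fixed order
    }

    public void readFields(DataInput in) throws IOException {
        id = in.readLong();                   // deserialize in the same order
    }

    public int compareTo(MyKey other) {       // used by the framework when sorting keys
        return (id < other.id) ? -1 : ((id == other.id) ? 0 : 1);
    }
}

Classes used only as values need just Writable; built-in types such as Text, IntWritable and LongWritable (used by WordCount below) already implement WritableComparable.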
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
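In the WordCount example below, for instance, the same flow reads as follows (the combine step, when a combiner is configured, uses the same types as the reduce step):
(input) <LongWritable offset, Text line> -> map -> <Text word, IntWritable 1> -> reduce -> <Text word, IntWritable count> (output)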
Example: WordCount v1.0
Before we jump into the details, let's walk through an example MapReduce application to get a flavour for how they work.
WordCount is a simple application that counts the number of occurrences of each word in a given input set.
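For instance, given an input containing the single line "Hello World Bye World", WordCount would produce:
Bye 1
Hello 1
World 2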
This works with a local-standalone, pseudo-distributed or fully-distributed Hadoop installation (Single Node Setup).
Source Code
WordCount.java

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
