Getting Started with Hadoop MapReduce

Latest recommended article published on 2024-09-11 13:35:24

lucklilili

Original link: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
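
The flow described above (independent chunks, parallel maps, a framework-side sort, then reduce) can be imitated in a few lines of plain Java. This is only an in-memory sketch under simplifying assumptions: it runs in one JVM, does no scheduling or failure handling, and the class name `MiniMapReduce` and its methods are hypothetical, not Hadoop APIs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; none of these names are Hadoop APIs.
class MiniMapReduce {

  // Map phase: emit a (word, 1) pair for every whitespace-separated token in one chunk.
  static List<Map.Entry<String, Integer>> map(String chunk) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    for (String token : chunk.trim().split("\\s+")) {
      if (!token.isEmpty()) {
        out.add(new AbstractMap.SimpleEntry<>(token, 1));
      }
    }
    return out;
  }

  // Sort/group phase: a TreeMap keeps keys sorted and collects all values per key.
  static TreeMap<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> pairs) {
    TreeMap<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : pairs) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    return grouped;
  }

  // Reduce phase: sum all values that share one key.
  static int reduce(String key, List<Integer> values) {
    int sum = 0;
    for (int value : values) {
      sum += value;
    }
    return sum;
  }

  // Runs the three phases in sequence over all chunks.
  static TreeMap<String, Integer> run(List<String> chunks) {
    List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
    for (String chunk : chunks) {
      // Chunks are independent, so a real framework can map them in parallel.
      mapped.addAll(map(chunk));
    }
    TreeMap<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> group : sortAndGroup(mapped).entrySet()) {
      result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> chunks = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
    System.out.println(run(chunks)); // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
  }
}
```

Because each chunk is mapped without looking at any other chunk, the map calls could run on different machines; only the sort/group step needs to see all pairs for a given key.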

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster-node, and one MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the workers, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
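
To make the serialization and sorting requirements concrete, here is a minimal sketch of what a custom key type might look like. The class `YearMonthKey` is hypothetical. To stay runnable without Hadoop on the classpath it implements only `Comparable`; its `write(DataOutput)` and `readFields(DataInput)` methods have the same signatures as the two methods of the `Writable` interface, so in a real job the class would instead declare `implements WritableComparable<YearMonthKey>`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type. In a real job it would declare
// "implements org.apache.hadoop.io.WritableComparable<YearMonthKey>".
class YearMonthKey implements Comparable<YearMonthKey> {
  private int year;
  private int month;

  // No-arg constructor so an empty instance can be created and then filled by readFields.
  YearMonthKey() {}

  YearMonthKey(int year, int month) {
    this.year = year;
    this.month = month;
  }

  // Same signature as Writable.write: serialize the fields in a fixed order.
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(month);
  }

  // Same signature as Writable.readFields: read the fields back in the same order.
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    month = in.readInt();
  }

  // Defines the sort order of keys: by year, then by month.
  @Override
  public int compareTo(YearMonthKey other) {
    int byYear = Integer.compare(year, other.year);
    return byYear != 0 ? byYear : Integer.compare(month, other.month);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof YearMonthKey && compareTo((YearMonthKey) o) == 0;
  }

  @Override
  public int hashCode() {
    return 31 * year + month;
  }

  @Override
  public String toString() {
    return year + "-" + month;
  }

  // Serializes a key to bytes and reads it back into a fresh instance.
  static YearMonthKey roundTrip(YearMonthKey key) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      key.write(new DataOutputStream(bytes));
      YearMonthKey copy = new YearMonthKey();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    YearMonthKey copy = roundTrip(new YearMonthKey(2024, 9));
    System.out.println(copy);                                          // prints 2024-9
    System.out.println(copy.compareTo(new YearMonthKey(2024, 11)) < 0); // prints true
  }
}
```

The round trip mirrors what happens when a key is written out on one node and read back on another, and `compareTo` is the ordering the sort step would rely on.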

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
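
The combine step in this pipeline keeps the `<k2, v2>` types but can shrink the number of records before they reach the reducers. The sketch below imitates that for word counting in plain Java. It assumes two input splits and a sum as the aggregation, which works as a combiner because addition is associative and commutative. The class `CombinerSketch` and its methods are hypothetical names, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of map -> combine -> reduce for word counts.
class CombinerSketch {

  // Map + combine for one split: every token counts as (word, 1), and the
  // counts are summed locally so each word appears at most once per split.
  static TreeMap<String, Integer> mapAndCombine(String split) {
    TreeMap<String, Integer> local = new TreeMap<>();
    for (String token : split.trim().split("\\s+")) {
      if (!token.isEmpty()) {
        local.merge(token, 1, Integer::sum);
      }
    }
    return local;
  }

  // Reduce: sum the partial counts coming from all splits.
  static TreeMap<String, Integer> reduce(List<TreeMap<String, Integer>> partials) {
    TreeMap<String, Integer> total = new TreeMap<>();
    for (Map<String, Integer> partial : partials) {
      for (Map.Entry<String, Integer> entry : partial.entrySet()) {
        total.merge(entry.getKey(), entry.getValue(), Integer::sum);
      }
    }
    return total;
  }

  public static void main(String[] args) {
    List<TreeMap<String, Integer>> partials = new ArrayList<>();
    for (String split : Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop")) {
      partials.add(mapAndCombine(split));
    }
    System.out.println(partials.get(0));   // prints {Bye=1, Hello=1, World=2}
    System.out.println(partials.get(1));   // prints {Goodbye=1, Hadoop=2, Hello=1}
    System.out.println(reduce(partials));  // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
  }
}
```

Each split has four tokens but only three combined records, so less data would need to move between the map and reduce sides while the final counts stay the same.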

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

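  // Mapper: splits each input line into tokens and emits a <word, 1> pair per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    // Reused across calls so that no new objects are allocated per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // value holds one line of input; tokens are separated by whitespace.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }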

  // Reducer (also used as the combiner): sums all counts received for one word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      // Emit <word, total count>.
      context.write(key, result);
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as the combiner for local aggregation of map output.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submit the job, wait for it to finish, and exit with 0 on success.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}