1 The Basics of a MapReduce Job
This chapter provides an overall introduction to MapReduce jobs. After reading it, you will be able to write and run MapReduce job programs on a single machine in local (standalone) mode.
The examples in this chapter assume you have completed the setup described in Chapter 1. You can run them on a single machine using a dedicated local-mode configuration, without starting the Hadoop Core framework; this local-mode configuration is also ideal for debugging and unit testing. The sample code is available from this book's page on the Apress website (http://www.apress.com), and the download also includes a JAR file for running the examples.
Let's begin by looking at the essential parts of a MapReduce job.
1.1 The Parts of a Hadoop MapReduce Job
The user configures a MapReduce job (a job, for short) and submits it to the framework. A job consists of the map tasks, the shuffle, the sort, and a set of reduce tasks. The framework then manages the distribution and execution of the job, collects the output, and reports the job's status to the user.
The parts of a job are listed in Table 2-1 and shown in Figure 2-1.
Table 2-1. Parts of a MapReduce Job
Part | Handled by |
Configuring the job | User |
Input splitting and distribution | Hadoop framework |
Starting the individual map tasks with their assigned input splits | Hadoop framework |
The map function, called once for each key/value pair | User |
The shuffle, which partitions and sorts the per-map output into chunks | Hadoop framework |
The sort, which merges and sorts the shuffled chunks | Hadoop framework |
Starting the individual reduce tasks with their assigned sorted chunks | Hadoop framework |
The reduce function, called once for each unique key with all of the values for that key | User |
Collecting the output and storing it in the output directory, in N parts, where N is the number of reduce tasks | Hadoop framework |
Figure 2-1. Parts of a MapReduce job
The user is responsible for handling the job setup, specifying the input location, specifying the input, and ensuring that the input is in the expected format and location. The framework is responsible for distributing the job among the TaskTracker nodes of the cluster; running the map, shuffle, sort, and reduce phases; placing the output in the output directory; and informing the user of the job-completion status.
All the examples in this chapter are based on the file MapReduceIntro.java, shown in Listing 2-1. The job created by the code in MapReduceIntro.java reads its input one line at a time and sorts the lines on the portion of each line that precedes the first tab character; if a line contains no tab character, the framework sorts on the entire line. MapReduceIntro.java is a simple example program that configures and runs a MapReduce job.
Listing 2-1. MapReduceIntro.java
package com.apress.hadoopbook.examples.ch2;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.log4j.Logger;

/** A very simple MapReduce example that reads textual input where
 * each record is a single line, and sorts all of the input lines into
 * a single output file.
 *
 * The records are parsed into Key and Value using the first TAB
 * character as a separator. If there is no TAB character the entire
 * line is the Key.
 *
 * @author Jason Venner
 */
public class MapReduceIntro {
    protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

    /**
     * Configure and run the MapReduceIntro job.
     *
     * @param args
     *            Not used.
     */
    public static void main(final String[] args) {
        try {
            /** Construct the job conf object that will be used to submit this job
             * to the Hadoop framework. Ensure that the jar or directory that
             * contains MapReduceIntroConfig.class is made available to all of the
             * Tasktracker nodes that will run maps or reduces for this job.
             */
            final JobConf conf = new JobConf(MapReduceIntro.class);

            /**
             * Take care of some housekeeping to ensure that this simple example
             * job will run.
             */
            MapReduceIntroConfig.exampleHouseKeeping(conf,
                MapReduceIntroConfig.getInputDirectory(),
                MapReduceIntroConfig.getOutputDirectory());

            /**
             * This section is the actual job configuration portion.
             *
             * Configure the inputDirectory and the type of input. In this case
             * we are stating that the input is text, and each record is a
             * single line, and the first TAB is the separator between the key
             * and the value of the record.
             */
            conf.setInputFormat(KeyValueTextInputFormat.class);
            FileInputFormat.setInputPaths(conf,
                MapReduceIntroConfig.getInputDirectory());

            /** Inform the framework that the mapper class will be the
             * {@link IdentityMapper}. This class simply passes the
             * input Key Value pairs directly to its output, which in
             * our case will be the shuffle.
             */
            conf.setMapperClass(IdentityMapper.class);

            /** Configure the output of the job to go to the output
             * directory. Inform the framework that the Output Key
             * and Value classes will be {@link Text} and the output
             * file format will be {@link TextOutputFormat}. The
             * TextOutputFormat class produces a record of output for
             * each Key,Value pair, with the following format:
             * Formatter.format( "%s\t%s%n", key.toString(),
             * value.toString() );
             *
             * In addition, indicate to the framework that there will be
             * 1 reduce. This results in all input keys being placed
             * into the same, single, partition, and the final output
             * being a single sorted file.
             */
            FileOutputFormat.setOutputPath(conf,
                MapReduceIntroConfig.getOutputDirectory());
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);

            /** Inform the framework that the reducer class will be the
             * {@link IdentityReducer}. This class simply writes an
             * output record key, value record for each value in the
             * key, value set it receives as input. The value ordering
             * is arbitrary.
             */
            conf.setReducerClass(IdentityReducer.class);

            logger.info("Launching the job.");
            /** Send the job configuration to the framework and request that the
             * job be run.
             */
            final RunningJob job = JobClient.runJob(conf);
            logger.info("The job has completed.");
            if (!job.isSuccessful()) {
                logger.error("The job failed.");
                System.exit(1);
            }
            logger.info("The job completed successfully.");
            System.exit(0);
        } catch (final IOException e) {
            logger.error("The job has failed due to an IO Exception", e);
            e.printStackTrace();
        }
    }
}
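The parsing rule that this job relies on, via KeyValueTextInputFormat, is that everything before the first TAB on a line is the key and everything after it is the value; a line with no TAB becomes a key with an empty value. As an illustration only (this helper is not part of Hadoop), the rule can be sketched in plain Java:

```java
// Illustrative sketch of the key/value rule applied by
// KeyValueTextInputFormat: the text before the first TAB is the key,
// the text after it is the value, and a line with no TAB yields the
// whole line as the key with an empty value.
class KeyValueLineParser {

    /** Returns a two-element array: { key, value } for one input line. */
    public static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No TAB: the entire line is the key, and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("alpha\tbeta");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

Because the identity mapper and reducer pass records through unchanged, this parsing step is what determines the sort key for the whole example job.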
1.1.1 Input Splitting
For the framework to be able to distribute pieces of the job to multiple machines, it needs to fragment the input into individual pieces and hand each piece to an individual distributed task as its input. Each piece of the fragmented input is called an input split. How the framework builds input splits from the actual input files is determined by a combination of configuration parameters and the capabilities of the class that actually reads the input records. These parameters are covered in Chapter 6.
An input split is usually a contiguous group of records from a single input file, in which case there will be at least N input splits, where N is the number of input files. If the number of requested map tasks is larger than the number of input files, or if an individual file is larger than the maximum input split size, the framework will construct more than one input split from an individual input file. The number and size of the input splits strongly influence overall job performance.
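As a rough mental model only (the real computation also involves HDFS block boundaries and the configuration parameters covered in Chapter 6), the split count for one file can be estimated as a ceiling division of the file size by the maximum split size, with at least one split per file:

```java
// Back-of-the-envelope estimate of input split count for a single file:
// at least one split per file, and files larger than maxSplitSize are
// broken into multiple splits. This is a simplification; Hadoop's actual
// split computation also considers block locations and other settings.
class SplitEstimator {

    public static long splitsForFile(long fileSize, long maxSplitSize) {
        if (fileSize == 0) {
            return 1; // even an empty file contributes one split
        }
        // Ceiling division: e.g. a 200 MB file with 64 MB splits -> 4 splits.
        return (fileSize + maxSplitSize - 1) / maxSplitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(splitsForFile(200 * mb, 64 * mb));
    }
}
```

This is why a job over many small files produces many small splits (one per file at minimum), while a job over one huge file produces splits sized by configuration; both shapes affect how evenly the work spreads across map tasks.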
1.1.2 A Simple Map Function: IdentityMapper
The Hadoop framework provides a very simple map function, called IdentityMapper. It is used in jobs that only need to reduce the input and do not need to transform the raw input. In this section, we examine the code of the IdentityMapper class, shown in Listing 2-2. If you have downloaded a Hadoop Core installation by following the instructions in Chapter 1, you can find this code in the installation directory at ${HADOOP_HOME}/src/mapred/org/apache/hadoop/mapred/lib/IdentityMapper.java.
Listing 2-2. IdentityMapper.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapred.lib;

import java.io.IOException;

import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.MapReduceBase;

/** Implements the identity function, mapping inputs directly to outputs. */
public class IdentityMapper<K, V>
    extends MapReduceBase implements Mapper<K, V, K, V> {

    /** The identity function. Input key/value pair is written directly to
     * output. */
    public void map(K key, V val,
                    OutputCollector<K, V> output, Reporter reporter)
        throws IOException {
        output.collect(key, val);
    }
}
The magic piece of this code is the line output.collect(key, val), which passes a key/value pair back to the framework for further processing.
All map functions must implement the Mapper interface, which guarantees that the map function will always be called with a key (an instance of a WritableComparable class), a value (an instance of a Writable class), an output object, and a reporter object. For now, just remember that the reporter is useful; it is discussed in detail in the "Creating a Custom Mapper and Reducer" section later in this chapter.
Note: You can find the code for the Mapper.java and Reducer.java interfaces, along with the rest of the sample code, on this book's download page on the Apress website (http://www.apress.com).
The framework calls your map function once for each record in your input. There will be multiple instances of your map function running, potentially in multiple Java Virtual Machines, and potentially on multiple distributed machines. The framework coordinates all of this for you.
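Setting the distribution machinery aside, the contract here is simple: the framework drives the loop, and your mapper sees one record per call. A minimal plain-Java sketch of that contract (the SimpleMapper interface below is a stand-in for illustration, not Hadoop's real Mapper) looks like this:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the map phase contract: the "framework"
// (the drive() loop) calls map() exactly once per input record and
// gathers whatever the mapper emits. Hadoop does the same thing, except
// the records may be spread across many JVMs on many machines.
class MapDriverSketch {

    public interface SimpleMapper {
        void map(String key, String value, List<String[]> output);
    }

    public static List<String[]> drive(List<String[]> records,
                                       SimpleMapper mapper) {
        List<String[]> output = new ArrayList<>();
        for (String[] record : records) {
            mapper.map(record[0], record[1], output); // one call per record
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] { "k1", "v1" });
        records.add(new String[] { "k2", "v2" });
        // An identity mapper, analogous to Hadoop's IdentityMapper.
        List<String[]> out =
            drive(records, (k, v, o) -> o.add(new String[] { k, v }));
        System.out.println(out.size());
    }
}
```

Because the mapper never sees more than one record at a time, mappers should not assume any particular record ordering or keep per-job state across records.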
Here are two common kinds of mapper.
- A mapper implementation that ignores the value and passes only the key to the framework:
public void map(K key, V val,
                OutputCollector<K, V> output, Reporter reporter)
    throws IOException {
    output.collect(key, null); /** Note, no value, just a null */
}
- A mapper implementation that transforms the key to lowercase:
/** Put the keys in lowercase. */
public void map(Text key, V val,
                OutputCollector<Text, V> output, Reporter reporter)
    throws IOException {
    Text lowerCaseKey = new Text(key.toString().toLowerCase());
    output.collect(lowerCaseKey, val);
}
1.1.3 A Simple Reduce Function: IdentityReducer
The Hadoop framework calls the reduce function once for each unique key. With each call, the framework provides the key and the complete set of values that share that key.
The framework-supplied class IdentityReducer, shown in Listing 2-3, is a sample implementation that produces one output record for each input value.
Listing 2-3. IdentityReducer.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.MapReduceBase;

/** Performs no reduction, writing all input values directly to the output. */
public class IdentityReducer<K, V>
    extends MapReduceBase implements Reducer<K, V, K, V> {

    /** Writes all keys and values directly to output. */
    public void reduce(K key, Iterator<V> values,
                       OutputCollector<K, V> output, Reporter reporter)
        throws IOException {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}
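The one-call-per-key contract that IdentityReducer relies on can be mimicked in plain Java (the SimpleReducer interface below is an illustrative stand-in, not Hadoop's real Reducer): after the shuffle and sort, the framework groups the sorted records by key and invokes reduce() once per distinct key with an iterator over all of that key's values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the reduce phase contract: group the (sorted)
// records by key, then call reduce() once per distinct key with an
// iterator over every value that arrived for that key.
class ReduceDriverSketch {

    public interface SimpleReducer {
        void reduce(String key, Iterator<String> values, List<String[]> output);
    }

    public static List<String[]> drive(List<String[]> sortedRecords,
                                       SimpleReducer reducer) {
        // Group values by key; LinkedHashMap preserves the sorted key order.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] record : sortedRecords) {
            groups.computeIfAbsent(record[0], k -> new ArrayList<>())
                  .add(record[1]);
        }
        List<String[]> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            // One reduce() call per distinct key.
            reducer.reduce(e.getKey(), e.getValue().iterator(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] { "a", "1" },
            new String[] { "a", "2" },
            new String[] { "b", "3" });
        // A counting reducer: emits how many values each key received.
        List<String[]> out = drive(records, (key, values, o) -> {
            int n = 0;
            while (values.hasNext()) { values.next(); n++; }
            o.add(new String[] { key, String.valueOf(n) });
        });
        for (String[] kv : out) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```

IdentityReducer corresponds to a reducer body that simply re-emits the key with each value; the counting reducer in main() shows how the same contract supports aggregation.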
If you require the output of your job to be sorted, the reducer function must pass the key objects to output.collect() unchanged. The reduce phase is, however, free to output any number of records, including zero, with the same key and different values.
Here are two common kinds of reducer.
A reducer implementation that ignores the values and passes only the key to the framework:
public void reduce(K key, Iterator<V> values,
                   OutputCollector<K, V> output, Reporter reporter)
    throws IOException {
    output.collect(key, null); /** Note, no value, just a null */
}
A reducer implementation that provides a count of the values for each key:
protected Text count = new Text();

/** Writes the key and the number of values for that key to output. */
public void reduce(K key, Iterator<V> values,
                   OutputCollector<K, Text> output, Reporter reporter)
    throws IOException {
    int i = 0;
    while (values.hasNext()) {
        values.next(); // consume the value; only the count matters
        i++;
    }
    count.set("" + i);
    output.collect(key, count);
}