
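As the volume of data we capture grows each year, so does our need for storage. Many companies are realizing that "data is king," but how do we analyze all of that data? The answer is Hadoop. In this second article in the series, Java programming expert Steven Haines explains what a MapReduce application is and how to build a simple one.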



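Before you can use Hadoop, you need to install Java 6 (or later), which you can download for your platform from Oracle's website. In addition, if you are running on Windows, you will need Cygwin to run Hadoop, because Hadoop's official development and deployment platform is Linux. Mac OS X users can run Hadoop natively without problems.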

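Hadoop can be downloaded from its Releases page, but its version numbering is a bit of a challenge to interpret. In short, the 1.x branch contains the current stable release, the 2.x.x branch contains the alpha code for version 2 of Hadoop, the 0.22.x branch is the same as 2.x.x but without security, and the 0.23.x branch excludes high availability. The 0.20.x branches are legacy and you can ignore them. For the examples in this article I will use the 0.23.x branch, whose latest version at the time of writing is 0.23.5, but for production you will probably want to download a 1.x or 2.x.x release.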




Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


hadoop jar <jar-file-name>


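In any programming language, the first program you write is usually a "Hello, World" program. For Hadoop and MapReduce, the standard program everyone writes is the Word Count application. Word Count counts the number of times each word occurs in a large body of text. It is a perfect example for learning MapReduce because its mapping and reducing steps are trivial, yet it gets you thinking the MapReduce way. The following is a summary of the components of the Word Count application and what they do: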

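  • FileInputFormat: We define a FileInputFormat to read all of the files in a specified directory (passed as the first argument to the MapReduce application) and pass them to a TextInputFormat (see Listing 1) for distribution to our mappers.
  • TextInputFormat: Hadoop's default InputFormat is TextInputFormat, which reads one line at a time and returns the byte offset of the line as the key (LongWritable) and the line of text as the value (Text).
  • Word Count Mapper: This is a class that we write. It tokenizes the single line of text passed to it by the InputFormat into words and then emits each word together with the number 1, which records that we have seen the word once.
  • Combiner: In a development environment we do not need a combiner, but here the combiner's job is done by the reducer (described later in this article), which runs on the local node before key/value pairs are passed on to the reducer. Using a combiner can improve performance dramatically, but you need to make sure that combining your results does not break your reducer: for a reducer to double as a combiner, its operation must be associative (and, since partial results can arrive in any order, commutative), otherwise the partial results sent to the reducer will not produce the correct final result.
  • Word Count Reducer: The word count reducer receives a map from each word to a list of the counts that the mappers observed for that word. Without a combiner, the reducer would receive a word and a collection of 1s, but because we are using the reducer as a combiner, we receive a collection of numbers that need to be added together.
  • TextOutputFormat: In this example we use the TextOutputFormat class and tell it that the keys are Text and the values are IntWritable.
  • FileOutputFormat: The TextOutputFormat sends its formatted output to a FileOutputFormat, which writes the results to an "output" directory that it creates.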
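The flow through these components can be mimicked in plain Java without Hadoop. The following is only a rough sketch under stated assumptions: the class `LocalWordCountSketch` and its methods are hypothetical names, the grouping step stands in for Hadoop's shuffle, and `List.of` and `Map.entry` assume Java 9 or later. It is not how Hadoop itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free sketch of the word count data flow.
class LocalWordCountSketch {

    // "Map" step: emit a (word, 1) pair for every token in one line of text.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line.toLowerCase());
        while (st.hasMoreTokens()) {
            pairs.add(Map.entry(st.nextToken(), 1));
        }
        return pairs;
    }

    // Stand-in for the shuffle: group the emitted values by word, sorted by word.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: add up the values collected for one word.
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    // Run map, shuffle, and reduce over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {the=2, whale=2, white=1}
        System.out.println(wordCount(List.of("the whale", "The white whale")));
    }
}
```

In a real job, Hadoop performs the grouping across machines; the sketch only illustrates which data each step sees.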


Listing 1 shows the source code for our first MapReduce application.

package com.geekcap.hadoopexamples;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // Reused for every token to avoid allocating a new Text per word
        private Text word = new Text();
        private final static IntWritable one = new IntWritable( 1 );

        public void map( LongWritable key, // Offset into the file
                         Text value,       // One line of text
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
            // Get the value as a lowercase String
            String text = value.toString().toLowerCase();

            // Remove apostrophes, then replace every non-letter with a space
            text = text.replaceAll( "'", "" );
            text = text.replaceAll( "[^a-zA-Z]", " " );

            // Iterate over all of the words in the string
            StringTokenizer st = new StringTokenizer( text );
            while( st.hasMoreTokens() ) {
                // Get the next token and set it as the text for our "word" variable
                word.set( st.nextToken() );

                // Output this word as the key and 1 as the value
                output.collect( word, one );
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce( Text key, Iterator<IntWritable> values,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
            // Iterate over all of the values (counts of occurrences of this word)
            int count = 0;
            while( values.hasNext() ) {
                // Add the value to our count
                count += values.next().get();
            }

            // Output the word with its count (wrapped in an IntWritable)
            output.collect( key, new IntWritable( count ) );
        }
    }

    public int run(String[] args) throws Exception {
        // Create a configuration
        Configuration conf = getConf();

        // Create a job from the default configuration that will use the WordCount class
        JobConf job = new JobConf( conf, WordCount.class );

        // Define our input path as the first command line argument and our output path as the second
        Path in = new Path( args[0] );
        Path out = new Path( args[1] );

        // Register the input and output paths with the job
        FileInputFormat.setInputPaths( job, in );
        FileOutputFormat.setOutputPath( job, out );

        // Configure the job: name, mapper, reducer, and combiner
        job.setJobName( "WordCount" );
        job.setMapperClass( MapClass.class );
        job.setReducerClass( Reduce.class );
        job.setCombinerClass( Reduce.class );

        // Configure the output
        job.setOutputFormat( TextOutputFormat.class );
        job.setOutputKeyClass( Text.class );
        job.setOutputValueClass( IntWritable.class );

        // Run the job and wait for it to complete
        JobClient.runJob( job );
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Start the WordCount MapReduce application
        int res = ToolRunner.run( new Configuration(),
                new WordCount(),
                args );
        System.exit( res );
    }
}





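The run() method sets up the job by defining the input and output paths and then registering them with the job through FileInputFormat and FileOutputFormat. Setting the input and output paths differs a little from the rest of the configuration: instead of calling a setter on the job, we call static methods on FileInputFormat and FileOutputFormat and pass them a reference to the job. The rest of the configuration is done by invoking the job's setter methods.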


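The map() method of MapClass is passed the following parameters:

  • key: The byte offset into the file.
  • value: A single line of text from the file.
  • output: The OutputCollector is the mechanism through which we emit the key/value pairs that we want to pass to the reducer.
  • reporter: Used to report the task's progress back to the Hadoop server. It is not used in this example.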
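To make the key and value parameters concrete, here is a small Hadoop-free sketch of how a block of text turns into (byte offset, line) records. The class `LineRecordSketch` is a hypothetical name, and the sketch assumes UTF-8 text with `\n` line endings; Hadoop's own record reader additionally handles other line endings and split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive (byte offset, line) records from a block of text.
class LineRecordSketch {

    // Returns the records in file order, keyed by the byte offset where each line starts.
    static Map<Long, String> records(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // prints {0=Call me Ishmael., 17=Some years ago}
        System.out.println(records("Call me Ishmael.\nSome years ago\n"));
    }
}
```

The mapper in Listing 1 ignores the offset and only works with the line, which is why the key plays no role in the word counts.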

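MapClass extracts the value into a String by calling the value's toString() method and then applies a few transformations: it converts the String to lowercase so that words such as "Apple" match "apple", it deletes apostrophes, and it replaces every non-letter character with a space. It then splits the String on white space and iterates over the resulting tokens. For each token it finds, it sets the text of the word variable to the token and emits word as the key and the static IntWritable holding the number 1 as the value. We could create a new Text object each time, but given how many times this code runs, keeping word as a member variable instead of recreating it each time improves performance.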
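The normalization steps can be tried in isolation, outside Hadoop. The sketch below applies the same calls as the mapper in Listing 1 to a plain String; the class name `TokenizeSketch` and the sample sentence are illustrative assumptions, and `List.of` assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of the mapper's text normalization, without Hadoop types.
class TokenizeSketch {

    static List<String> tokenize(String line) {
        // Lowercase so that "Apple" and "apple" count as the same word
        String text = line.toLowerCase();

        // Delete apostrophes, then turn every remaining non-letter into a space
        text = text.replaceAll("'", "");
        text = text.replaceAll("[^a-zA-Z]", " ");

        // Split on white space
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [youre, aboard, apple, pie]
        System.out.println(tokenize("You're aboard, Apple-pie 42!"));
    }
}
```

Because apostrophes are deleted rather than replaced with spaces, contractions such as "you're" are counted as the single word "youre".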

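The reduce() method of the Reduce class receives the same set of parameters as the map() method, except that its key is the word and, instead of a single value, it receives an Iterator over a list of values. In this example it would receive something like the word "apple" and an Iterator over a collection containing the values 1, 1, 1, 1. But because we also want to use this reducer as a combiner, we do not simply count the number of entries; we extract each value by calling IntWritable's get() method and add it to our sum. Finally, the reduce() method emits the same key that it received (the word) together with the sum of the occurrences.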
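Why summing works with a combiner while counting entries does not can be seen in a few lines of plain Java. This is a sketch only: the class `CombinerSketch`, its methods, and the way the four 1s are split into two partial lists are hypothetical, and `List.of` assumes Java 9 or later.

```java
import java.util.List;

// Hypothetical sketch: summing survives a combine step, counting entries does not.
class CombinerSketch {

    // What the reducer in Listing 1 does: add up the values
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // A tempting but combiner-unsafe alternative: count the entries
    static int countEntries(List<Integer> values) {
        return values.size();
    }

    public static void main(String[] args) {
        // Suppose the four 1s for "apple" are combined locally as two partial lists
        List<Integer> partA = List.of(1, 1, 1);
        List<Integer> partB = List.of(1);

        // Summing the partial sums gives the true total
        System.out.println(sum(List.of(sum(partA), sum(partB)))); // prints 4

        // Counting the partial counts only counts how many partial results arrived
        System.out.println(countEntries(List.of(countEntries(partA), countEntries(partB)))); // prints 2
    }
}
```

The second result is 2 rather than 4 because the final step sees two partial results, not four original 1s.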


Listing 2 shows the Maven POM file used to compile this code.

Listing 2 pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">








   mvn clean install

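To put all of this together, we need a sizable text file whose words we can count. A great source of large text files is Project Gutenberg, which hosts more than 100,000 free ebooks. For my example I chose Moby Dick. Download one of these ebooks and put it in a directory on your hard drive (it should be the only file in that directory). Once that is done, you can run your MapReduce project by executing the hadoop command, passing it the path to the directory that contains your ebook and the path of the output directory. For example: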

hadoop jar hadoop-examples-1.0-SNAPSHOT.jar com.geekcap.hadoopexamples.WordCount  ~/apps/hadoop-0.23.5/test-data output


2012-12-11 22:27:08.929 java[37044:1203] Unable to load realm info from SCDynamicStore
2012-12-11 22:27:09.023 java[37044:1203] Unable to load realm info from SCDynamicStore
12/12/11 22:27:09 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
12/12/11 22:27:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/12/11 22:27:09 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/12/11 22:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/11 22:27:09 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/11 22:27:09 INFO mapred.FileInputFormat: Total input paths to process : 1
12/12/11 22:27:10 INFO mapreduce.JobSubmitter: number of splits:1
12/12/11 22:27:10 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
12/12/11 22:27:10 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
12/12/11 22:27:10 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/12/11 22:27:10 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
12/12/11 22:27:10 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
12/12/11 22:27:10 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/12/11 22:27:10 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
12/12/11 22:27:10 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
12/12/11 22:27:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local_0001
12/12/11 22:27:10 INFO mapreduce.Job: The url to track the job:http://localhost:8080/
12/12/11 22:27:10 INFO mapred.LocalJobRunner: OutputCommitter set in config null
12/12/11 22:27:10 INFO mapreduce.Job: Running job: job_local_0001
12/12/11 22:27:10 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
12/12/11 22:27:10 INFO mapred.LocalJobRunner: Waiting for map tasks
12/12/11 22:27:10 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
12/12/11 22:27:10 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/12/11 22:27:10 INFO mapred.MapTask: numReduceTasks: 1
12/12/11 22:27:10 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/12/11 22:27:10 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/12/11 22:27:10 INFO mapred.MapTask: soft limit at 83886080
12/12/11 22:27:10 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/12/11 22:27:10 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/12/11 22:27:11 INFO mapred.LocalJobRunner: 
12/12/11 22:27:11 INFO mapred.MapTask: Starting flush of map output
12/12/11 22:27:11 INFO mapred.MapTask: Spilling map output
12/12/11 22:27:11 INFO mapred.MapTask: bufstart = 0; bufend = 2027118; bufvoid = 104857600
12/12/11 22:27:11 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25353164(101412656); length = 861233/6553600
12/12/11 22:27:11 INFO mapreduce.Job: Job job_local_0001 running in uber mode : false
12/12/11 22:27:11 INFO mapreduce.Job:  map 0% reduce 0%
12/12/11 22:27:12 INFO mapred.MapTask: Finished spill 0
12/12/11 22:27:12 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of committing
12/12/11 22:27:12 INFO mapred.LocalJobRunner: file:/Users/shaines/apps/hadoop-0.23.5/test-data/mobydick.txt:0+1212132
12/12/11 22:27:12 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/12/11 22:27:12 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000000_0
12/12/11 22:27:12 INFO mapred.LocalJobRunner: Map task executor complete.
12/12/11 22:27:12 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
12/12/11 22:27:12 INFO mapred.Merger: Merging 1 sorted segments
12/12/11 22:27:12 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 247166 bytes
12/12/11 22:27:12 INFO mapred.LocalJobRunner: 
12/12/11 22:27:12 INFO mapreduce.Job:  map 100% reduce 0%
12/12/11 22:27:12 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of committing
12/12/11 22:27:12 INFO mapred.LocalJobRunner: 
12/12/11 22:27:12 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/12/11 22:27:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/Users/shaines/Documents/Workspace/hadoop-examples/target/output/_temporary/0/task_local_0001_r_000000
12/12/11 22:27:12 INFO mapred.LocalJobRunner: reduce > reduce
12/12/11 22:27:12 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/12/11 22:27:13 INFO mapreduce.Job:  map 100% reduce 100%
12/12/11 22:27:13 INFO mapreduce.Job: Job job_local_0001 completed successfully
12/12/11 22:27:13 INFO mapreduce.Job: Counters: 24
    File System Counters
        FILE: Number of bytes read=2683488
        FILE: Number of bytes written=974132
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=21573
        Map output records=215309
        Map output bytes=2027118
        Map output materialized bytes=247174
        Input split bytes=113
        Combine input records=215309
        Combine output records=17107
        Reduce input groups=17107
        Reduce shuffle bytes=0
        Reduce input records=17107
        Reduce output records=17107
        Spilled Records=34214
        Shuffled Maps =0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=32
        Total committed heap usage (bytes)=264110080
    File Input Format Counters 
        Bytes Read=1212132
    File Output Format Counters 
        Bytes Written=182624


a       4687
aback   2
abaft   2
abandon 3
abandoned       7
abandonedly     1
abandonment     2

your    251
youre   6
youve   1
zephyr  1
zeuglodon       1
zones   3
zoology 2
zoroaster       1

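The output contains every word that was found along with the number of times it occurred. The word "a" occurs 4687 times in Moby Dick, whereas the word "your" occurs only 251 times.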
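Because the output is sorted by word rather than by count, finding the most frequent words takes one more step. The following sketch parses lines in the word/count layout shown above and returns the top entries. The class `TopWordsSketch` and its method are hypothetical names, the sample lines are copied from the output above, and `Map.entry` and `List.of` assume Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rank words from word-count output lines by frequency.
class TopWordsSketch {

    static List<Map.Entry<String, Integer>> topWords(List<String> outputLines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : outputLines) {
            // Each line holds a word and its count separated by white space
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            entries.add(Map.entry(parts[0], Integer.parseInt(parts[1])));
        }
        // Highest count first
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a\t4687", "aback\t2", "your\t251", "zones\t3");
        // prints [a=4687, your=251]
        System.out.println(topWords(lines, 2));
    }
}
```

For a full ebook the same idea applies after reading the lines of the job's output file into a list.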




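If you are looking for a good book to help you think the MapReduce way, O'Reilly's MapReduce Design Patterns is a great choice. I read through several books to help me get Hadoop installed and configured, but MapReduce Design Patterns was the first book I found that helped me really understand how to approach MapReduce problems. I highly recommend it!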

Original article: Steven Haines, "Building a MapReduce Application with Hadoop"

