"Hadoop: The Definitive Guide" study notes: Developing a MapReduce Application (Part 1)

Latest recommended article published on 2022-05-15 16:23:19

summerDG

Tags: Hadoop, mapreduce, application testing tools

Article link: https://blog.csdn.net/summerDG/article/details/15636327

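Part 1 mainly covers testing MapReduce applications and handling command-line options from code.

The conf folder used in this post has to be created by hand in the current directory, with three .xml files inside it whose contents are copied from the book. I explain what those contents mean below.

First, be clear about what the -conf option means: it names a configuration file whose settings are layered over the existing configuration. Take this command from the book: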

hadoop fs -conf conf/hadoop-localhost.xml -ls .

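This command uses the settings in conf/hadoop-localhost.xml to override the existing configuration. Only the properties that the XML file sets are changed; all others keep their values. Because -conf changes the effective configuration, editing fs.default.name in hadoop-localhost.xml is enough to point the command at a different HDFS, and if the address does not exist the connection simply fails. You can try this yourself: when I set fs.default.name to some other address (such as the one in the book) I could not list the directory; it only worked with my own address.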
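The overlay behaviour described above can be modelled in a few lines of plain Java. This is only a sketch that uses java.util.Properties as a stand-in for Hadoop's Configuration; the class name ConfOverlaySketch, the helper overlay, and the property values are illustrative assumptions, not Hadoop code.

```java
import java.util.Properties;

public class ConfOverlaySketch {

    // Copies base, then applies overlay on top of it.
    // Keys that overlay does not mention keep their value from base.
    static Properties overlay(Properties base, Properties overlay) {
        Properties merged = new Properties();
        merged.putAll(base);
        merged.putAll(overlay);
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for the configuration that is already in effect (illustrative values).
        Properties existing = new Properties();
        existing.setProperty("fs.default.name", "file:///");
        existing.setProperty("dfs.replication", "3");

        // Stand-in for the single property set in the file passed to -conf.
        Properties fromConfFile = new Properties();
        fromConfFile.setProperty("fs.default.name", "hdfs://localhost/");

        Properties effective = overlay(existing, fromConfFile);
        System.out.println("fs.default.name = " + effective.getProperty("fs.default.name"));
        System.out.println("dfs.replication = " + effective.getProperty("dfs.replication"));
    }
}
```

Only fs.default.name changes in the merged result; dfs.replication is untouched because the overlay never mentions it.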

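Next, the roles of GenericOptionsParser, Tool and ToolRunner, and how they relate.

According to its class documentation, GenericOptionsParser is a utility for parsing the command-line arguments that are generic to Hadoop. Its constructor is simple, GenericOptionsParser(Configuration conf, String[] args): pass in the current Configuration and the command-line arguments and it parses them.

Tool is just an interface. It extends Configurable and declares a single method, run(String[] args). Configurable declares two methods, setConf and getConf, whose names say what they do.

ToolRunner does not extend any class or implement any interface. Its main method is run, so let's look at its source: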

  public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    // set the configuration back, so that Tool can configure itself
    tool.setConf(conf);

    // get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);
  }

  /**
   * Runs the <code>Tool</code> with its <code>Configuration</code>.
   *
   * Equivalent to <code>run(tool.getConf(), tool, args)</code>.
   *
   * @param tool <code>Tool</code> to run.
   * @param args command-line arguments to the tool.
   * @return exit code of the {@link Tool#run(String[])} method.
   */
  public static int run(Tool tool, String[] args)
    throws Exception {
    return run(tool.getConf(), tool, args);
  }

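The two important parameters of run are an object that implements Tool and the command-line arguments. As the source shows, if conf is null a new Configuration is created; a GenericOptionsParser then parses the command line, and whatever it does not consume is handed to the Tool (those arguments are most likely meant for the Tool itself). GenericOptionsParser.getRemainingArgs() returns the arguments that were not consumed: if the command line is empty or contains only generic Hadoop options, the result is an empty array rather than null; if it contains anything beyond the generic options, those leftover arguments are returned. Finally Tool.run is called to do the remaining work.

Now look at Configured, a concrete implementation of Configurable: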
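To make the order of operations concrete, here is a small stand-alone model of that flow. It deliberately does not use Hadoop: MiniTool, parse and run are hypothetical names, a plain Map stands in for Configuration, and only the "-D key=value" form (with a space) is recognised, whereas the real GenericOptionsParser understands more options.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenericArgsSketch {

    /** Hypothetical stand-in for Tool: receives the settings and the leftover arguments. */
    interface MiniTool {
        int run(Map<String, String> conf, String[] toolArgs);
    }

    // Consumes "-D key=value" pairs into conf and returns all other arguments in order.
    // With nothing left over, the result is an empty array, never null.
    static String[] parse(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            boolean isProperty = "-D".equals(args[i])
                && i + 1 < args.length
                && args[i + 1].contains("=");
            if (isProperty) {
                String[] keyValue = args[++i].split("=", 2);
                conf.put(keyValue[0], keyValue[1]);
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining.toArray(new String[0]);
    }

    // Same sequence as ToolRunner.run: parse the generic options first, then delegate.
    static int run(MiniTool tool, String[] args) {
        Map<String, String> conf = new HashMap<>();
        String[] toolArgs = parse(conf, args);
        return tool.run(conf, toolArgs);
    }

    public static void main(String[] args) {
        MiniTool tool = (conf, rest) -> {
            System.out.println("settings: " + conf);
            System.out.println("tool args: " + String.join(" ", rest));
            return rest.length == 2 ? 0 : -1; // same usage check as a two-argument driver
        };
        int exitCode = run(tool, new String[] {"-D", "color=yellow", "input", "output"});
        System.out.println("exit code: " + exitCode);
    }
}
```

The tool never sees the "-D" pair; it receives only the input and output arguments, which is why a driver can check args.length without caring about generic options.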

public class Configured implements Configurable {

  private Configuration conf;

  /** Construct a Configured. */
  public Configured() {
    this(null);
  }

  /** Construct a Configured. */
  public Configured(Configuration conf) {
    setConf(conf);
  }

  // inherit javadoc
  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  // inherit javadoc
  @Override
  public Configuration getConf() {
    return conf;
  }

}

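The implementation is very simple!

The -D option sets a specific property, creating it if it does not already exist. A -D option takes priority over the configuration files, so it gives a true override. A common pattern is to keep default values in the configuration files and change them with -D when needed.

Recently I have also been learning to run Hadoop programs in IDEA. I have only just started, so only part of this works; here I run this chapter's test code in IDEA.

I chose IDEA simply because Maven builds in Eclipse feel awkward to me, whereas IDEA's own project layout already follows Maven conventions. I also started using it because the experts in my discussion group said it was good. IDEA does feel more flexible and cooler, and in my personal experience it is a little faster and smoother.

For testing, I show how to run the tests both from the command line and in IDEA.

First, the first test. The second edition of "Hadoop: The Definitive Guide" is out of date, and the new API is less cumbersome than the one used there, so the code here comes from the third edition. The testing approach the book describes is well worth borrowing: start with a small amount of data, then scale up, and finally test on a cluster.

Code 1: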

package v1;
// cc MaxTemperatureMapperTestV1 Unit test for MaxTemperatureMapper
// == MaxTemperatureMapperTestV1Missing
// vv MaxTemperatureMapperTestV1
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInputValue(value)
      .withOutput(new Text("1950"), new IntWritable(-11))
      .runTest();
  }
// ^^ MaxTemperatureMapperTestV1
  @Ignore // since we are showing a failing test in the book
// vv MaxTemperatureMapperTestV1Missing
  @Test
  public void ignoresMissingTemperatureRecord() throws IOException,
      InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInputValue(value)
      .runTest();
  }
// ^^ MaxTemperatureMapperTestV1Missing
  @Test
  public void processesMalformedTemperatureRecord() throws IOException,
      InterruptedException {
    Text value = new Text("0335999999433181957042302005+37950+139117SAO  +0004" +
                                  // Year ^^^^
        "RJSN V02011359003150070356999999433201957010100005+353");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInputValue(value)
      .withOutput(new Text("1957"), new IntWritable(1957))
      .runTest();
  }
// vv MaxTemperatureMapperTestV1
}

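MRUnit is the testing library for MapReduce; it plays a role similar to JUnit and is used alongside it here. This example shows well how to use it.

The main class involved is MRUnit's MapDriver. In the first test, the type parameters of MapDriver match those of the Mapper. The test sets the Mapper under test, sets the input value, declares the expected output (the comparison works much like JUnit's assertThat), and finally calls runTest to execute the test.

In the second test the temperature value +9999 marks a missing reading, so the Mapper should filter the record out. No output is expected, and therefore there is nothing to compare.

The third test works like the first, except that the characters at the temperature position are not a temperature: they happen to spell the year again (01957), which is why the expected output is written as (1957, 1957).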
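The Mapper itself is not listed in this post, so the following is only a sketch of the field extraction that these tests exercise, written without Hadoop types. The character offsets (15 to 19 for the year, 87 to 92 for the temperature) are inferred from the test records above, and the class and method names are assumptions made for illustration.

```java
import java.util.Optional;

public class NcdcFieldSketch {

    // Marker that the records use for a missing temperature reading.
    private static final String MISSING = "+9999";

    // The year occupies characters 15..18 of the fixed-width record.
    static String year(String line) {
        return line.substring(15, 19);
    }

    // The signed temperature occupies characters 87..91.
    // Returns empty when the field holds the missing-value marker.
    static Optional<Integer> temperature(String line) {
        String field = line.substring(87, 92);
        if (MISSING.equals(field)) {
            return Optional.empty();
        }
        return Optional.of(Integer.parseInt(field));
    }

    public static void main(String[] args) {
        String valid = "0043011990999991950051518004+68750+023550FM-12+0382"
            + "99999V0203201N00261220001CN9999999N9-00111+99999999999";
        String missing = "0043011990999991950051518004+68750+023550FM-12+0382"
            + "99999V0203201N00261220001CN9999999N9+99991+99999999999";

        System.out.println(year(valid) + " -> " + temperature(valid));
        System.out.println(year(missing) + " -> " + temperature(missing));
    }
}
```

A Mapper built on this idea would write a (year, temperature) pair only when temperature is present, which is the behaviour the second test expects.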

Now for the Reducer test.

Code 2:

package v1;
// == MaxTemperatureReducerTestV1
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.*;

public class MaxTemperatureReducerTest {
  
  //vv MaxTemperatureReducerTestV1
  @Test
  public void returnsMaximumIntegerInValues() throws IOException,
      InterruptedException {
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
      .withReducer(new MaxTemperatureReducer())
      .withInputKey(new Text("1950"))
      .withInputValues(Arrays.asList(new IntWritable(10), new IntWritable(5)))
      .withOutput(new Text("1950"), new IntWritable(10))
      .runTest();
  }
  //^^ MaxTemperatureReducerTestV1
}

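The Reducer test relies mainly on ReduceDriver, which closely resembles MapDriver. Its type parameters match those of the Reducer. The test sets the Reducer under test, sets the input key and values, declares the expected output, and finally runs the test.

If you are interested, look at the API documentation for MapDriver and ReduceDriver. Taking MapDriver as the example: runTest is inherited from TestDriver, where it is declared abstract, and is implemented in MapDriverBase. Likewise run is declared as an abstract method in MapDriverBase and implemented in MapDriver. The two differ a lot: runTest compares the results with the expected key-value pairs (the ones added through withOutput; addOutput is a similar method, and as far as I can tell withOutput simply calls it and returns the driver so that calls can be chained), whereas run just returns the output key-value pairs without comparing them with any expectation.

A note on testing from the command line. After downloading the book's source code, the README explains that you build it with

mvn package -DskipTests -Dhadoop.version=1.2.1 (I chose this version because it is the one I run; you also have to edit pom.xml under hadoop-meta and change the Hadoop version there, and you must be online). Once the build finishes, go into ch05 (this chapter) and run mvn test -Dhadoop.version=1.2.1. Be sure to pass the version, otherwise none of the Hadoop packages are found. To run just one test class, use

mvn test -Dhadoop.version=1.2.1 -Dtest=MaxTemperatureReducerTest. Here is the result: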
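MaxTemperatureReducer is not listed in this post. Judging from the test, which feeds in 10 and 5 and expects 10, its core is a maximum over the values for one key. The sketch below shows that logic in plain Java, without Hadoop types; MaxValueSketch and max are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

public class MaxValueSketch {

    // Scans the values once and keeps the largest one seen so far.
    // Starts from Integer.MIN_VALUE, so an empty input yields that sentinel.
    static int max(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    public static void main(String[] args) {
        List<Integer> temperatures = Arrays.asList(10, 5); // same inputs as the test
        System.out.println(max(temperatures));
    }
}
```

Because taking a maximum is associative and commutative, the same class can also serve as a combiner, which is how the driver later in this post uses MaxTemperatureReducer.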

summerdg@summerdg-virtual-machine:~/hadoop-test/ch05$ mvn test -Dhadoop.version=1.2.1 -Dtest=MaxTemperatureReducerTest
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building Chapter 5: Developing a MapReduce Application 3.0
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-enforcer-plugin:1.0.1:enforce (enforce-versions) @ ch05 ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ ch05 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 7 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ ch05 ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ ch05 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ ch05 ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ ch05 ---
[INFO] Surefire report directory: /home/summerdg/hadoop-test/ch05/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running v1.MaxTemperatureReducerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.3 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11.118s
[INFO] Finished at: Tue Nov 12 16:20:07 CST 2013
[INFO] Final Memory: 7M/18M
[INFO] ------------------------------------------------------------------------
summerdg@summerdg-virtual-machine:~/hadoop-test/ch05$

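Now for testing in IDEA. First run mvn idea:idea to generate an .ipr file (mine is root.ipr), then in IDEA choose to open a project and select that file; the project loads. One problem remains: the project has no Hadoop jars. Go to File -> Project Structure -> Modules -> Ch05 -> Dependencies, click "+", choose to add jars or directories, and add hadoop-core-1.2.1.jar from the Hadoop directory, then hadoop-test-1.2.1.jar from the same directory, then that directory's lib folder, and click OK. Finally right-click the Java file you want to test and choose Run (or press Ctrl+Shift+F10).

Running test data locally

Below is the code of a driver. The driver is what finally sets up and submits the job; this version also builds on what we covered earlier about command-line parsing.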

package v2;

// cc MaxTemperatureDriverV2 Application to find the maximum temperature
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import v1.MaxTemperatureReducer;

// vv MaxTemperatureDriverV2
public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }
    
    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    
    return job.waitForCompletion(true) ? 0 : 1;
  }
  
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}

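The body of run is written exactly like our earlier MapReduce programs, so I will not repeat it. Instead, here is how to run this code in IDEA, which is also useful later for debugging MapReduce programs.

Click Run -> Edit Configurations, click "+", choose Application, and pick a Name of your own.

My configuration should make most of it clear: Working directory is your working directory, the module setting is the module that contains the class to run, and Main class is the class to run. The most important part is how the program arguments are written, which is the same as what I set up in Eclipse before. This now looks considerably more convenient than Eclipse, since at least no extra plugin has to be installed. Then just click Run.

Below is the test code for the driver: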

package v3;
// cc MaxTemperatureDriverTestV3 A test for MaxTemperatureDriver that uses a local, in-process job runner
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.nullValue;
import static org.junit.Assert.assertThat;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.junit.Test;

public class MaxTemperatureDriverTest {
  
  public static class OutputLogFilter implements PathFilter {
    public boolean accept(Path path) {
      return !path.getName().startsWith("_");
    }
  }
  
//vv MaxTemperatureDriverTestV3
  @Test
  public void test() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "local");
    
    Path input = new Path("input/ncdc/micro");
    Path output = new Path("output");
    
    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true); // delete old output
    
    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);
    
    int exitCode = driver.run(new String[] {
        input.toString(), output.toString() });
    assertThat(exitCode, is(0));
    
    checkOutput(conf, output);
  }
//^^ MaxTemperatureDriverTestV3

  private void checkOutput(Configuration conf, Path output) throws IOException {
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] outputFiles = FileUtil.stat2Paths(
        fs.listStatus(output, new OutputLogFilter()));
    assertThat(outputFiles.length, is(1));
    
    BufferedReader actual = asBufferedReader(fs.open(outputFiles[0]));
    BufferedReader expected = asBufferedReader(
        getClass().getResourceAsStream("/expected.txt"));
    String expectedLine;
    while ((expectedLine = expected.readLine()) != null) {
      assertThat(actual.readLine(), is(expectedLine));
    }
    assertThat(actual.readLine(), nullValue());
    actual.close();
    expected.close();
  }
  
  private BufferedReader asBufferedReader(InputStream in) throws IOException {
    return new BufferedReader(new InputStreamReader(in));
  }
}

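This code may look confusing, mostly because of the OutputLogFilter class. The PathFilter interface it implements declares a single method, accept, which OutputLogFilter implements. Then look at FileSystem: we covered this class earlier, but the methods used then differ from the ones used here, so it is worth reading the source to understand listStatus. The overload below is not the one we call, but the filtering overloads all end up calling it, so we look at this "ancestor" of listStatus. The source makes it clear how it works.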

  private void listStatus(ArrayList<FileStatus> results, Path f,
      PathFilter filter) throws FileNotFoundException, IOException {
    FileStatus listing[] = listStatus(f);
    if (listing == null) {
      throw new IOException("Error accessing " + f);
    }

    for (int i = 0; i < listing.length; i++) {
      if (filter.accept(listing[i].getPath())) {
        results.add(listing[i]);
      }
    }
  }

accept tests whether a given path should be included in the listing; paths that pass are added to the result list.
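The same filtering loop can be tried without a Hadoop installation. In the sketch below, NameFilter is a hypothetical stand-in for PathFilter that works on plain file names, and the directory contents are made-up examples of what a job output folder might hold.

```java
import java.util.ArrayList;
import java.util.List;

public class NameFilterSketch {

    /** Hypothetical stand-in for PathFilter, operating on file names only. */
    interface NameFilter {
        boolean accept(String name);
    }

    // Same shape as the listStatus loop: keep only the entries the filter accepts.
    static List<String> list(String[] listing, NameFilter filter) {
        List<String> results = new ArrayList<>();
        for (String name : listing) {
            if (filter.accept(name)) {
                results.add(name);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Illustrative directory contents; the names are assumptions.
        String[] outputDir = {"_SUCCESS", "_logs", "part-r-00000"};

        // Same rule as OutputLogFilter: skip names that start with an underscore.
        NameFilter outputLogFilter = name -> !name.startsWith("_");

        System.out.println(list(outputDir, outputLogFilter));
    }
}
```

With the underscore-prefixed entries filtered out, only the data file remains, which is why the driver test can assert that exactly one output file exists.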