How Hadoop's Tool and ToolRunner Work

Latest recommended article published 2022-10-18 08:00:00

yaoyaostep


public interface Configurable {
    void setConf(Configuration conf);
    Configuration getConf();
}

The Configurable interface defines only two methods: setConf and getConf.

The Configured class implements the Configurable interface:

public class Configured implements Configurable {
    private Configuration conf;

    public Configured() {
        this(null);
    }

    public Configured(Configuration conf) {
        setConf(conf);
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return conf;
    }
}

The Tool interface extends Configurable and adds a single run() method (an interface extending an interface).

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

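The inheritance relationships are as follows: Configured implements Configurable, and Tool extends Configurable, so a class that extends Configured and implements Tool inherits setConf and getConf and only has to supply run().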
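To make the hierarchy concrete, here is a minimal, self-contained sketch that compiles without Hadoop. The Configuration class below is only a stand-in backed by a map, and GreetingTool, HierarchyDemo, and the "greeting.name" property are hypothetical names made up for illustration; only the shape of Configurable, Configured, and Tool follows the code shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration (assumption: a plain string map is enough here).
class Configuration {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key, String defaultValue) {
        return props.getOrDefault(key, defaultValue);
    }
}

interface Configurable {
    void setConf(Configuration conf);
    Configuration getConf();
}

// Holds the configuration so that subclasses do not have to.
class Configured implements Configurable {
    private Configuration conf;

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return conf;
    }
}

interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

// Hypothetical tool: setConf/getConf come from Configured, only run() is written here.
class GreetingTool extends Configured implements Tool {
    @Override
    public int run(String[] args) {
        String name = getConf().get("greeting.name", "world");
        System.out.println("hello, " + name);
        return 0;
    }
}

class HierarchyDemo {
    // Configures the tool and runs it, returning the tool's exit code.
    static int runWith(String name) {
        Configuration conf = new Configuration();
        conf.set("greeting.name", name);
        GreetingTool tool = new GreetingTool();
        tool.setConf(conf); // inherited from Configured
        return tool.run(new String[0]);
    }

    public static void main(String[] args) {
        System.exit(runWith("hadoop"));
    }
}
```

The point of the design is that the tool never creates its own configuration: whoever runs it injects one through setConf, and run() reads it back through getConf().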

Now look at part of the ToolRunner class:

public class ToolRunner {
    public static int run(Configuration conf, Tool tool, String[] args)
            throws Exception {
        if (conf == null) {
            conf = new Configuration();
        }
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        // set the configuration back, so that Tool can configure itself
        tool.setConf(conf);

        // get the args w/o generic hadoop args
        String[] toolArgs = parser.getRemainingArgs();
        return tool.run(toolArgs);
    }
}

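As the static run() method shows, ToolRunner uses GenericOptionsParser to read the conf and the command-line args passed to run and to handle Hadoop's generic command-line options. It then hands the remaining, job-specific arguments (toolArgs = parser.getRemainingArgs();) to the tool, which executes its own run method.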
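The split between generic options and leftover arguments can be illustrated with a deliberately simplified sketch. This is not Hadoop's GenericOptionsParser: the class name GenericArgsSketch is hypothetical, a plain Map stands in for Configuration, and only the -D option is handled (in both the "-D key=value" and "-Dkey=value" spellings). Every other argument is passed through untouched.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GenericArgsSketch {

    /**
     * Copies every -D property into conf and returns the arguments that are left,
     * in their original order. Malformed pairs (no '=') are dropped; that is a
     * simplification of this sketch, not documented Hadoop behavior.
     */
    static String[] parse(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // form: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);        // not a -D option: belongs to the tool
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                conf.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        String[] toolArgs = parse(conf, args);
        System.out.println("conf = " + conf);
        System.out.println("toolArgs = " + String.join(" ", toolArgs));
    }
}
```

In the real ToolRunner the same idea applies to all generic options: by the time the tool's run method is called, its args array no longer contains them.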

Generic command-line options are those that can be added to any job, for example:

-conf <configuration file>                     specify a configuration file
-D <property=value>                            use value for given property
-fs <local|namenode:port>                      specify a namenode
-jt <local|jobtracker:port>                    specify a job tracker
-files <comma separated list of files>         specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>        specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>   specify comma separated archives to be unarchived on the compute machines.

A typical program that implements Tool:

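/*
 * MyApp reads its parameters from the command line. The user runs, for example:
 *   $ bin/hadoop jar MyApp.jar -archives test.tgz arg1 arg2
 * -archives is a generic Hadoop option; arg1 and arg2 are the job's own arguments.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyApp extends Configured implements Tool {

    // The nested MyMapper and MyReducer classes are omitted here.

    // Implement Tool's run(); args holds only what is left after the generic options are stripped
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, MyApp.class);

        // Process custom command-line options (arg1 and arg2 in the command above)
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        // Very old releases used job.setInputPath/setOutputPath; these static helpers replace them
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(MyApp.MyMapper.class);
        job.setReducerClass(MyApp.MyReducer.class);

        // runJob blocks until the job finishes and throws an IOException if it fails
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // args are processed by ToolRunner
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }
}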

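Using ToolRunner to simplify parameter passing

When running and configuring MapReduce jobs, have you run into any of the following annoyances?

MapReduce job parameters are written into the Java code, so every change means editing the source, compiling, packaging, and deploying.
When a MapReduce job depends on a configuration file, you have to write Java code by hand that uses DistributedCache to upload it to HDFS so that the map and reduce functions can read it.
Your map or reduce function depends on third-party jar files, and you specify them with the "-libjars" option on the command line, but it has no effect at all.

Hadoop has a class for this, ToolRunner, and it is simple and convenient. Both "Hadoop: The Definitive Guide" and the examples shipped with the Hadoop source recommend using it.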

Let's look at WordCount.java under the src/example directory. Its code is structured like this:
public class WordCount {
    // omitted...
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, 
                                            args).getRemainingArgs();
        // omitted...
        Job job = new Job(conf, "word count");
        // omitted...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
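WordCount.java uses the GenericOptionsParser class, which automatically applies options given on the command line to the conf variable. For example, to set the number of reduce tasks from the command line, write:
bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5
That is all it takes. The value no longer has to be hard-coded in Java, so parameters and code are cleanly separated.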

Two other commonly used options are "-libjars" and "-files". Here is how to use them together:
bin/hadoop jar MyJob.jar com.xxx.MyJobDriver -Dmapred.reduce.tasks=5 \
    -files ./dict.conf  \
    -libjars lib/commons-beanutils-1.8.3.jar,lib/commons-digester-2.1.jar
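The "-libjars" option uploads local jar files to the MapReduce temporary directory in HDFS and adds them to the classpath of the map and reduce tasks. The "-files" option uploads the specified files to the MapReduce temporary directory in HDFS and lets the map and reduce tasks read them. Both options are implemented on top of DistributedCache.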
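On the task side, a file shipped with "-files" is typically made available in the task's working directory under its base name, so it can be opened with ordinary Java file I/O. The sketch below shows one way a task might load dict.conf. The DictLoader class and the file format (one entry per line, blank lines and lines starting with '#' ignored) are assumptions made for illustration; the source does not describe the contents of dict.conf.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictLoader {

    // Turns raw lines into a set of entries, skipping blanks and '#' comments (hypothetical format).
    static Set<String> parse(List<String> lines) {
        Set<String> entries = new HashSet<>();
        for (String line : lines) {
            String entry = line.trim();
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                entries.add(entry);
            }
        }
        return entries;
    }

    // Reads the whole file as UTF-8 and parses it.
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Inside a job, this call would usually sit in the mapper's or reducer's one-time setup code.
        // The relative path assumes the file is present in the current working directory.
        Set<String> dict = load(Paths.get("dict.conf"));
        System.out.println(dict.size() + " entries loaded");
    }
}
```

Loading the file once per task rather than once per record keeps the cost of the lookup table negligible.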

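So far we have not mentioned ToolRunner. The code above used GenericOptionsParser to parse the command-line arguments. The programmer who wrote ToolRunner was lazier still and hid the GenericOptionsParser call inside ToolRunner's own run method, where it is executed automatically. The modified code looks like this: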
public class WordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "word count");
        // omitted...
        // Return the status instead of calling System.exit() here; main() decides the exit code
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}
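Here is what changed in the code:

WordCount now extends Configured and implements the Tool interface.
It overrides the run method of the Tool interface. run is an instance method rather than a static one, which is good.
Inside WordCount, the Configuration object is obtained through getConf().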
