Hadoop Configured, Configurable, Configuration, and Tool: A Source Code Walkthrough

Project GitHub repo: bitcarmanlee easy-algorithm-interview-and-practice
Stars and comments are welcome; let's learn and improve together.

When writing a MapReduce job in Java, the class declaration usually looks like this:

public class XXX extends Configured implements Tool

A typical run method looks like this:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    GenericOptionsParser optionparser = new GenericOptionsParser(conf, args);
    conf = optionparser.getConfiguration();

    @SuppressWarnings("deprecation")
    Job job = new Job(conf, "JudgeIfOrder");
    ...
    ...
    ...
    return (job.waitForCompletion(true) ? 0 : 1);
}

And the main method:

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new JudgeIfOrder(), args);
    System.exit(res);
}

Why write it this way? How are these classes and interfaces implemented internally, and how do they relate to one another? I suspect many readers have wondered about this. So, with the help of the relevant source code, let's trace how a Hadoop job actually gets configured, and figure out what Configured, Configurable, Tool, and the rest really are.

First, some source code.

The Configurable interface:

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface Configurable {

  /** Set the configuration to be used by this object. */
  void setConf(Configuration conf);

  /** Return the configuration used by this object. */
  Configuration getConf();
}

The Configured class:

/** Base class for things that may be configured with a {@link Configuration}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Configured implements Configurable {

  private Configuration conf;

  /** Construct a Configured. */
  public Configured() {
    this(null);
  }
  
  /** Construct a Configured. */
  public Configured(Configuration conf) {
    setConf(conf);
  }

  // inherit javadoc
  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  // inherit javadoc
  @Override
  public Configuration getConf() {
    return conf;
  }
}

The Tool interface:

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface Tool extends Configurable {
  /**
   * Execute the command with the given arguments.
   * 
   * @param args command specific arguments.
   * @return exit code.
   * @throws Exception
   */
  int run(String [] args) throws Exception;
}

From the source, one thing is immediately clear:
Configurable is the base interface of the three. Configured is a simple class that implements Configurable, and Tool is an interface that extends Configurable.
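To see what this buys you at runtime, here is a tiny sketch (the class name ConfiguredDemo is made up for illustration): Configured does nothing more than store and hand back whatever Configuration it is given, and that trivial behavior is exactly the hook ToolRunner relies on later.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

public class ConfiguredDemo {
  public static void main(String[] args) {
    Configured holder = new Configured();  // the no-arg constructor leaves conf as null
    Configuration conf = new Configuration();
    holder.setConf(conf);  // store the configuration
    System.out.println(holder.getConf() == conf);  // true: getConf() just returns the stored field
  }
}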

The first line of the run method is:

Configuration conf = getConf();

Since class XXX extends Configured, this actually calls Configured's getConf() method and returns a Configuration object. Configuration is the lowest-level class in the configuration module, as its package declaration already suggests:

package org.apache.hadoop.conf;

Just how fundamental is Configuration? Let's look at a small piece of its source:

  static{
    //print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if(cL.getResource("hadoop-site.xml")!=null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
          + "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
  }

Note the last two lines: they load core-default.xml and core-site.xml, the two most fundamental configuration files.
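As a quick sanity check (a minimal sketch; what it prints depends on whether your core-site.xml overrides anything), any freshly constructed Configuration already carries those defaults:

import org.apache.hadoop.conf.Configuration;

public class ConfDefaults {
  public static void main(String[] args) {
    // By the time the constructor returns, the static block above has run,
    // so core-default.xml and core-site.xml are registered as default resources.
    Configuration conf = new Configuration();
    // fs.defaultFS is defined in core-default.xml (file:/// out of the box)
    // unless core-site.xml overrides it.
    System.out.println(conf.get("fs.defaultFS"));
  }
}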

As its name suggests, the GenericOptionsParser class parses command-line arguments.

GenericOptionsParser optionparser = new GenericOptionsParser(conf, args);
conf = optionparser.getConfiguration();

These two lines parse the input arguments and hand back the (possibly updated) Configuration.
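Concretely, this is what lets you pass generic Hadoop options such as -D key=value on the command line and have them folded into the Configuration before your own arguments are handed back. A small sketch (the class name ParserDemo and the paths /input and /output are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParserDemo {
  public static void main(String[] args) throws Exception {
    // Suppose this was launched as:
    //   hadoop jar demo.jar ParserDemo -D mapreduce.job.reduces=4 /input /output
    Configuration conf = new Configuration();
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    // Generic options like -D have been applied to conf ...
    System.out.println(conf.get("mapreduce.job.reduces"));  // prints 4
    // ... and stripped from the argument list.
    for (String arg : parser.getRemainingArgs()) {
      System.out.println(arg);  // prints /input, then /output
    }
  }
}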

Finally, let's look at this line from main:

int res = ToolRunner.run(new Configuration(), new JudgeIfOrder(), args);

Here is the relevant ToolRunner source:

public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  if (conf == null) {
    conf = new Configuration();
  }
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  // set the configuration back, so that Tool can configure itself
  tool.setConf(conf);

  // get the args w/o generic hadoop args
  String[] toolArgs = parser.getRemainingArgs();
  return tool.run(toolArgs);
}

public static int run(Tool tool, String[] args)
    throws Exception {
  return run(tool.getConf(), tool, args);
}

These are the two run-related methods in ToolRunner. Since our code passes three arguments, it is the first overload that executes. ToolRunner sets the Configuration on the Tool via setConf(), strips out the generic Hadoop options, and then calls the Tool's run method. Because our XXX class implements Tool and overrides run, what ultimately gets invoked is our own run() method!
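Putting the whole chain together, here is a minimal end-to-end skeleton (a sketch only; the class name MinimalTool and the job settings are placeholders): main() hands control to ToolRunner.run(), which parses the generic options, injects the Configuration via the setConf() inherited from Configured, and finally invokes our overridden run().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner.run()
    // injected via setConf() before calling this method.
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "MinimalTool");
    job.setJarByClass(MinimalTool.class);
    // ... set mapper/reducer classes and input/output paths here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options into the Configuration,
    // then delegates to run() with the remaining arguments.
    int res = ToolRunner.run(new Configuration(), new MinimalTool(), args);
    System.exit(res);
  }
}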

By this point, you should have a pretty clear picture of how it all fits together.

I came across a related explanation on Stack Overflow, which was included here as a screenshot.

[Stack Overflow screenshot]
