How Hadoop MultipleOutputs Is Implemented

wf1982

Category: Cloud Computing. Tags: hadoop, mapreduce

Article link: https://blog.csdn.net/wf1982/article/details/19816061

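A colleague at work asked why MultipleOutputs still produces many result files even when the number of reduce tasks is set to 1. The answer lies in how it is implemented.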

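In MapReduce, a job normally writes its results through the default OutputCollector passed to map or reduce, into the output whose format was fixed when the job was set up, so only one output format is available. When results need to be split into categories, or stored in different formats across several files, MultipleOutputs is the tool for it.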

The MultipleOutputs API for adding outputs:

 /**
   * Adds a named output for the job.
   * <p/>
   *
   * @param conf              job conf to add the named output
   * @param namedOutput       named output name, it has to be a word, letters
   *                          and numbers only, cannot be the word 'part' as
   *                          that is reserved for the
   *                          default output.
   * @param outputFormatClass OutputFormat class.
   * @param keyClass          key class
   * @param valueClass        value class
   */
  public static void addNamedOutput(JobConf conf, String namedOutput,
                                Class<? extends OutputFormat> outputFormatClass,
                                Class<?> keyClass, Class<?> valueClass) {
    addNamedOutput(conf, namedOutput, false, outputFormatClass, keyClass,
      valueClass);
  }

  /**
   * Adds a multi named output for the job.
   * <p/>
   *
   * @param conf              job conf to add the named output
   * @param namedOutput       named output name, it has to be a word, letters
   *                          and numbers only, cannot be the word 'part' as
   *                          that is reserved for the
   *                          default output.
   * @param outputFormatClass OutputFormat class.
   * @param keyClass          key class
   * @param valueClass        value class
   */
  public static void addMultiNamedOutput(JobConf conf, String namedOutput,
                               Class<? extends OutputFormat> outputFormatClass,
                               Class<?> keyClass, Class<?> valueClass) {
    addNamedOutput(conf, namedOutput, true, outputFormatClass, keyClass,
      valueClass);
  }

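These two methods set the file name prefix, the output format, and the key and value classes of each named output; every named output can use a different format.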
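The naming rule stated in the Javadoc above (letters and digits only, and not the reserved word `part`) can be sketched as a small stand-alone check. This only illustrates the documented rule and is not Hadoop's own validation code; the class `NamedOutputNameRule` and the method `isValid` are hypothetical names.

```java
// Hypothetical stand-alone sketch of the documented naming rule; not part of Hadoop.
final class NamedOutputNameRule {

    // Returns true when the name is non-empty, consists only of ASCII letters
    // and digits, and is not "part" (reserved for the default output).
    static boolean isValid(String name) {
        if (name == null || name.isEmpty() || name.equals("part")) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            boolean letter = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
            boolean digit = ch >= '0' && ch <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("text"));      // true
        System.out.println(isValid("part"));      // false: reserved name
        System.out.println(isValid("my_output")); // false: '_' is not allowed
    }
}
```

A name such as `my_output` is rejected because the underscore is not a letter or a digit, so names like `myoutput` or `output1` are the safe choice.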

Inside MultipleOutputs, the getCollector method instantiates an OutputCollector that performs the actual output:

public OutputCollector getCollector(String namedOutput, String multiName,
                                      Reporter reporter)
    throws IOException {

    checkNamedOutputName(namedOutput);
    if (!namedOutputs.contains(namedOutput)) {
      throw new IllegalArgumentException("Undefined named output '" +
        namedOutput + "'");
    }
    boolean multi = isMultiNamedOutput(conf, namedOutput);

    if (!multi && multiName != null) {
      throw new IllegalArgumentException("Name output '" + namedOutput +
        "' has not been defined as multi");
    }
    if (multi) {
      checkTokenName(multiName);
    }

    String baseFileName = (multi) ? namedOutput + "_" + multiName : namedOutput;

    final RecordWriter writer =
      getRecordWriter(namedOutput, baseFileName, reporter);

    return new OutputCollector() {

      @SuppressWarnings({"unchecked"})
      public void collect(Object key, Object value) throws IOException {
        writer.write(key, value);
      }

    };
  }

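The RecordWriter above comes from InternalFileOutputFormat, an inner class of MultipleOutputs. baseFileName is the file name prefix that is later used to build the actual file name: for a multi named output it is namedOutput + "_" + multiName, otherwise it is just namedOutput. The key step in that class's getRecordWriter() is how the file name is obtained: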

String nameOutput = job.get(CONFIG_NAMED_OUTPUT, null);
String fileName = getUniqueName(job, baseFileName);

Here getUniqueName reuses the implementation from FileOutputFormat:

public static String getUniqueName(JobConf conf, String name) {
    int partition = conf.getInt("mapred.task.partition", -1);
    if (partition == -1) {
      throw new IllegalArgumentException(
        "This method can only be called from within a Job");
    }

    String taskType = (conf.getBoolean("mapred.task.is.map", true)) ? "m" : "r";

    NumberFormat numberFormat = NumberFormat.getInstance();
    numberFormat.setMinimumIntegerDigits(5);
    numberFormat.setGroupingUsed(false);

    return name + "-" + taskType + "-" + numberFormat.format(partition);
  }
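To make the naming concrete, here is a minimal stand-alone sketch that mirrors the two steps above without Hadoop on the classpath. The class `FileNameSketch` and its methods are hypothetical; the multi flag, task type, and partition number are passed as plain arguments instead of being read from the job configuration, and `Locale.ROOT` is used (an assumption that differs from the original) so the digits do not depend on the machine's locale.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical stand-alone sketch; not part of Hadoop.
final class FileNameSketch {

    // Mirrors the base-name choice in getCollector: a multi named output
    // appends "_" + multiName, a plain named output uses its name as-is.
    static String baseFileName(String namedOutput, boolean multi, String multiName) {
        return multi ? namedOutput + "_" + multiName : namedOutput;
    }

    // Mirrors getUniqueName, with the JobConf lookups replaced by parameters.
    static String uniqueName(String name, boolean isMap, int partition) {
        if (partition < 0) {
            throw new IllegalArgumentException("partition must be >= 0");
        }
        String taskType = isMap ? "m" : "r";
        NumberFormat numberFormat = NumberFormat.getInstance(Locale.ROOT);
        numberFormat.setMinimumIntegerDigits(5); // pad to at least 5 digits
        numberFormat.setGroupingUsed(false);     // no thousands separators
        return name + "-" + taskType + "-" + numberFormat.format(partition);
    }

    public static void main(String[] args) {
        // Plain named output written by reduce task 0.
        System.out.println(uniqueName(baseFileName("text", false, null), false, 0));
        // prints "text-r-00000"

        // Multi named output "seq" with multiName "A", written by map task 12.
        System.out.println(uniqueName(baseFileName("seq", true, "A"), true, 12));
        // prints "seq_A-m-00012"
    }
}
```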

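Whether MultipleOutputs is used in map or in reduce, the file name is built from mapred.task.partition: a file name consists of name, the task type (m for map, r for reduce), and the partition number of the task doing the writing.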

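So with this class the number of files is not determined by the number of reduce tasks alone. It is determined by how many distinct values name takes (each named output, and for a multi named output each multiName) together with mapred.task.partition. With a single reducer the partition is always 0 on the reduce side, so the extra files come from distinct names, or from map tasks that write through MultipleOutputs, each with its own partition number.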
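As a rough illustration of this conclusion, the sketch below enumerates the file names that would result from a given set of base names and a given number of writing tasks. Everything here is made up for illustration (the class `OutputFileCount`, the example names `text`, `seq_A`, `seq_B`); it assumes every task writes at least one record to every base name, follows the name pattern of getUniqueName, and does not count the job's default part file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch; not part of Hadoop.
final class OutputFileCount {

    // Lists one file name per (task partition, base name) pair, assuming
    // each task writes to each base name at least once.
    static List<String> files(List<String> baseNames, boolean isMap, int numTasks) {
        String taskType = isMap ? "m" : "r";
        List<String> result = new ArrayList<>();
        for (int partition = 0; partition < numTasks; partition++) {
            for (String name : baseNames) {
                result.add(String.format(Locale.ROOT, "%s-%s-%05d",
                        name, taskType, partition));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("text", "seq_A", "seq_B");

        // One reducer, three base names: still three files.
        List<String> oneReducer = files(names, false, 1);
        System.out.println(oneReducer.size()); // prints 3
        System.out.println(oneReducer);
        // prints [text-r-00000, seq_A-r-00000, seq_B-r-00000]

        // Four map tasks writing the same three base names: twelve files.
        System.out.println(files(names, true, 4).size()); // prints 12
    }
}
```

Under these assumptions the file count is the number of base names times the number of tasks that write, which is why setting the reducer count to 1 does not by itself collapse the output into a single file.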