Apache Hadoop Mapreduce作业执行前篇之任务执行前准备（上）

最新推荐文章于 2020-07-01 11:54:56 发布

猫君之上

最新推荐文章于 2020-07-01 11:54:56 发布

阅读量587

点赞数 1

分类专栏： # Apache Hadoop 文章标签： Hadoop MapReduce 任务提交 job.waitForCompletion

本文链接：https://blog.csdn.net/qq_33713328/article/details/88572940

版权

一.创建Job作业过程

1.获取Job作业对象

方式一：

Configuration conf = new Configuration();

Job job=Job.getInstance(conf);

public static Job getInstance(Configuration conf) throws IOException {
   
        JobConf jobConf = new JobConf(conf);
        return new Job(jobConf);
}

这种方式可以自定义配置，可以通过Configuration类的对象来设置文件系统,而不是使用本地文件系统(当然你可以通过创建配置文件，自动加载的形式来使用hfds分布式文件系统)，但是上述代码没有对conf做任何设置，因此默认使用的是本地文件系统。可以通过如下设置来设置为hdfs文件系统。

conf.set("fs.defaultFS", "http://mycat01:9000");

方式二：

Job job=Job.getInstance();

public static Job getInstance() throws IOException {
   
    // create with a null Cluster
    return getInstance(new Configuration());
}

这种方式与方式一不应用set方法设置一样，毕竟底层还是调用的方式的获取实例的方法。此时，两者都是使用Configuration conf = new Configuration();来确定配置，而这个配置，我们知道，当创建conf的时候就会加载很多默认的配置，例如：hdfs-default.xml中的配置等。而这个配置应用的是本地文件系统，在windows上默认是file:///,当然你可以在项目的src目录下创建对应的类似core-default.xml这种文件，让系统自动加载这个配置文件。那么这样的话都可以应用自定义配置了。只不过方式一更灵活一点。

2.Job实例的动作（一）之普通方法与构造器

使用自定义配置来创建一个Job(作业实例)，创建过程中，job会复制一份conf的配置，以至于间歇性的修改不会影响正在配置的参数，有必要的时候，会根据conf的参数配置创建一个集群。

public static Job getInstance(Configuration conf) throws IOException {
   
    // create with a null Cluster
    JobConf jobConf = new JobConf(conf);
    return new Job(jobConf);
}

接下来我们看看上面第一行代码new JobConf(conf)：

public JobConf(Configuration conf) {
   
    super(conf);
    
    if (conf instanceof JobConf) {
   // 从配置文件中获取秘钥和表示的读写对象的实例，并进行设置
      JobConf that = (JobConf)conf;
      credentials = that.credentials;
    }
    
    checkAndWarnDeprecation();  //对配置文件中过时属性的日志warn级别的输出
}

重要的是第一行：调用父类Configuration的构造方法：

public Configuration(Configuration other) {
   
   this.resources = (ArrayList<Resource>) other.resources.clone();
   synchronized(other) {
   
     if (other.properties != null) {
   
       this.properties = (Properties)other.properties.clone();
     }

     if (other.overlay!=null) {
   
       this.overlay = (Properties)other.overlay.clone();
     }

     this.restrictSystemProps = other.restrictSystemProps;
     this.updatingResource = new ConcurrentHashMap<String, String[]>(
         other.updatingResource);
     this.finalParameters = Collections.newSetFromMap(
         new ConcurrentHashMap<String, Boolean>());
     this.finalParameters.addAll(other.finalParameters);
   }
   
    synchronized(Configuration.class) {
   
      REGISTRY.put(this, null);
    }
    this.classLoader = other.classLoader;
    this.loadDefaults = other.loadDefaults;
    setQuietMode(other.getQuietMode());
  }

从代码中三个clone方法的调用，我想你应该已经明白了，这里只是对你从Job中传递过来的conf做了一个克隆，然后放到resources，properties等容器里面。

读到这里我们发现，我们大体上只干了一件事，那就是复制一份用户自定义的conf对象的配置信息到Configuration的一些属性中。

3.Job实例动作（二）之静态代码块

单看Job类的静态代码块：

static {
   
  ConfigUtil.loadResources();
}

在往内部：

public static void loadResources() {
   
  addDeprecatedKeys();  //  添加过时的配置属性
  Configuration.addDefaultResource("mapred-default.xml");
  Configuration.addDefaultResource("mapred-site.xml");
  Configuration.addDefaultResource("yarn-default.xml");
  Configuration.addDefaultResource("yarn-site.xml");
}

后面几个则是加载对应的xml配置文件中的配置，这个是在Job作业对象创建前就加载的

另外在JobConf中静态代码块中执行的动作和Job中的一样，这里不做说明。

再来看看Configuration类的：

static{
   
  //print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
   
    cL = Configuration.class.getClassLoader();
  }
  if(cL.getResource("hadoop-site.xml")!=null) {
   
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

上述代码意思：去获取当前线程的类加载器对象，如果没有获得的话，就是用当前Configuration的类加载器。因为hadoop-site.xml是之前的版本支持的默认配置文件，所以在此判断，然后日志输出相关过时信息。最后的话会加载类路径下core-default.xml和core-site.xml两个配置文件.

4.创建Job实例的过程中执行的动作总结

1.加载Job,JobConf,Configuration三大核心的类

2.加载默认的配置文件中的配置(包括mapred-default.xml,mapred-site.xml,yarn-site.xml,yarn-default.xml,core-site.xml和core-default.xml)

3.对job构造器传递过来的新的Configuration的对象的配置信息进行拷贝，存到Configuration的某些属性里。（相当于用户传递过来的新的配置并没有应用，只是拷贝了一份然后存下来）

二.Mapper类与Reducer类的配置

1.Mapper类的设置

Driver中代码：

job.setMapperClass(WordCountMapper.class);

job中的setMapperClass方法：

public void setMapperClass(Class<? extends Mapper> cls) throws IllegalStateException {
   
    ensureState(JobState.DEFINE);  // 确认Job的状态
    conf.setClass(MAP_CLASS_ATTR, cls, Mapper.class);  //设置Mapper类
}

其中ensureState(JobState.DEFINE):

private void ensureState(JobState state) throws IllegalStateException {
   
    if (state != this.state) {
   
      throw new IllegalStateException("Job in state "+ this.state + 
                                      " instead of " + state);
    }

    if (state == JobState.RUNNING && cluster == null) {
   
      throw new IllegalStateException
        ("Job in state " + this.state
         + ", but it isn't attached to any job tracker!");
    }
}

可见ensureState用于确认当时时候是未提交(非运行)状态。只有处于DEFINE状态的作业才可以设置MapperClass或者ReducerClass。

作业未提交状态为：DEFINE

作业提交状态为：RUNING

再来看看conf.setClass(MAP_CLASS_ATTR, cls, Mapper.class):

MAP_CLASS_ATTR # mapreduce.job.map.class

cls # 自己传入的Mapper类型

Mapper.class # 框架本身自带的Mapper默认类

public void setClass(String name, Class<?> theClass, Class<?> xface) {
   
    if (!xface.isAssignableFrom(theClass))
      throw new RuntimeException(theClass+" not "+xface.getName());
    set(name, theClass.getName());
}

即如果传过来的自定义Mapper类不是Mapper的子类，那么会跑一个运行时异常，否则的话则通过Configuration中属性设置的MAP_CLASS_ATTR即mapreduce.job.map.class的key来设置运行需要的Mapper类。

2.Reducer类的设置

job.setReducerClass(WordCountReducer.class)

对于job来说传递的类不同，但是在Configuration中两者调用的都是一个方法，即如果不设置Reducer类的话，会使用默认的Reducer类作为Reducer类。

3.配置Map或者Reducer输出类型

//指定自定义的mapper类的输出键值类型
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass

最低0.47元/天解锁文章

猫君之上

关注

1
点赞
踩
2

收藏

觉得还不错? 一键收藏
2
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录