Getting Started with Nutch (3): Loading Configuration Files

iteye_4063


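/**
 * I am a beginner myself, so please point out anything that is incorrect. Thank you!
 * Some of this material draws on sources from the Internet; my apologies if that causes any offense.
 **/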

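Nutch has three main kinds of configuration files:

1. Hadoop's configuration files: hadoop-default.xml and hadoop-site.xml.

2. Nutch's own configuration files: nutch-default.xml and nutch-site.xml.

3. Configuration files for Nutch plugins. Each plugin loads its own configuration files when the plugin is loaded, for example the configuration files for the filters.

The load order of the configuration files determines their precedence: a file loaded earlier has lower precedence, a file loaded later has higher precedence, and a lower-precedence setting is overridden by a higher-precedence one. Knowing the order in which Nutch loads its configuration files is therefore essential to learning Nutch. Below we walk through the Nutch source code to see how the files are loaded.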
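The "later load wins" rule can be illustrated with a minimal sketch in plain Java. This is not Nutch or Hadoop code: the class LoadOrderDemo and the method applyInOrder are hypothetical names invented here, and the property values are placeholders chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of layered configuration; not part of Nutch or Hadoop. */
final class LoadOrderDemo {

    /** Applies each layer in load order; a later layer overwrites keys set by an earlier one. */
    static Map<String, String> applyInOrder(List<Map<String, String>> layers) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) {
            effective.putAll(layer); // same key: the later value replaces the earlier one
        }
        return effective;
    }

    public static void main(String[] args) {
        // Illustrative values only; they are assumptions, not copied from a real install.
        Map<String, String> defaults = Map.of("http.agent.name", "", "http.timeout", "10000");
        Map<String, String> site = Map.of("http.agent.name", "my-crawler");

        Map<String, String> effective = applyInOrder(List.of(defaults, site));
        System.out.println(effective.get("http.agent.name")); // value from the later layer
        System.out.println(effective.get("http.timeout"));    // untouched default survives
    }
}
```

The layer listed last corresponds to the file loaded last, which is why it has the highest precedence.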

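Part 1 of this series (on Nutch 1.0) introduced crawl, Nutch's main command. We start the analysis from the main method of the class behind crawl (org.apache.nutch.crawl.Crawl).

The code in Crawl's main method that loads the configuration files is: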

Configuration conf = NutchConfiguration.create();
conf.addResource("crawl-tool.xml");
JobConf job = new NutchJob(conf);

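In the code above, NutchConfiguration.create() returns a Configuration object. NutchConfiguration is the class Nutch uses to manage its own configuration files, and Configuration is the class Hadoop uses to manage its configuration files. Next, we step into the create() method of NutchConfiguration.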

  /** Create a {@link Configuration} for Nutch. */
  public static Configuration create() {
    Configuration conf = new Configuration();
    addNutchResources(conf);
    return conf;
  }

create() first constructs a Configuration object. Here is the relevant source in the Configuration class:

  /** A new configuration. */
  public Configuration() {
    this(true);
  }

  /** A new configuration where the behavior of reading from the default 
   * resources can be turned off.
   * 
   * If the parameter {@code loadDefaults} is false, the new instance
   * will not load resources from the default files. 
   * @param loadDefaults specifies whether to load from the default files
   */
  public Configuration(boolean loadDefaults) {
    if (LOG.isDebugEnabled()) {
      LOG.debug(StringUtils.stringifyException(new IOException("config()")));
    }
    if (loadDefaults) {
      resources.add("hadoop-default.xml");
      resources.add("hadoop-site.xml");
    }
  }

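As the code shows, constructing a Configuration object adds hadoop-default.xml and then hadoop-site.xml to its resource list, so the two files are loaded in that order and settings in hadoop-site.xml override those in hadoop-default.xml. With Hadoop's configuration loading understood, we return to create(). After the Configuration object is created, create() calls addNutchResources(conf).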
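The constructor quoted above only records resource names in a list. A simplified model of how such a list could drive lookups is sketched below. The class ResourceListSketch is hypothetical, the in-memory map stands in for files on the classpath, and resolving the list lazily on the first lookup is an assumption of this sketch rather than something the quoted source shows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a Configuration-like class. */
final class ResourceListSketch {
    private final List<String> resources = new ArrayList<>();
    // Stands in for the XML files on the classpath: file name -> its properties.
    private final Map<String, Map<String, String>> fileContents;
    private Map<String, String> properties; // built on first lookup (assumption)

    ResourceListSketch(Map<String, Map<String, String>> fileContents, boolean loadDefaults) {
        this.fileContents = fileContents;
        if (loadDefaults) {
            resources.add("hadoop-default.xml");
            resources.add("hadoop-site.xml");
        }
    }

    /** Appends a resource; it is applied after everything already in the list. */
    void addResource(String name) {
        resources.add(name);
        properties = null; // force a rebuild on the next lookup
    }

    String get(String key) {
        if (properties == null) {
            properties = new HashMap<>();
            for (String name : resources) {
                // Later resources overwrite keys set by earlier ones.
                properties.putAll(fileContents.getOrDefault(name, Map.of()));
            }
        }
        return properties.get(key);
    }

    List<String> resourceOrder() {
        return List.copyOf(resources);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> files = new HashMap<>();
        // Illustrative values only.
        files.put("hadoop-default.xml", Map.of("fs.default.name", "file:///"));
        files.put("hadoop-site.xml", Map.of("fs.default.name", "hdfs://localhost:9000"));

        ResourceListSketch conf = new ResourceListSketch(files, true);
        System.out.println(conf.resourceOrder());
        System.out.println(conf.get("fs.default.name")); // the site value wins
    }
}
```

In this model, the position of a name in the list is what decides precedence, which matches the order of the resources.add calls in the constructor.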

  /** Add the standard Nutch resources to {@link Configuration}. */
  public static Configuration addNutchResources(Configuration conf) {
    conf.addResource("nutch-default.xml");
    conf.addResource("nutch-site.xml");
    return conf;
  }

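We can see that nutch-default.xml is loaded first and nutch-site.xml second, so settings in nutch-site.xml override those in nutch-default.xml. Back in the main method of the Crawl class, the next call is conf.addResource("crawl-tool.xml");, which means crawl-tool.xml is loaded last.

From this brief source analysis, the precedence of the Nutch configuration files is easy to see.

Nutch's own configuration files: crawl-tool.xml > nutch-site.xml > nutch-default.xml

Hadoop's configuration files: hadoop-site.xml > hadoop-default.xml

Of course, because Nutch's configuration files are loaded after Hadoop's, Nutch settings also override settings from the Hadoop configuration files. Note that what gets overridden is not a whole configuration file but each individual property.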
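To make the per-property point concrete, here is a minimal sketch that parses two small XML snippets in the property/name/value layout used by these configuration files and merges them one property at a time. It uses only the JDK's XML parser. The class PropertyMergeDemo and the method loadInto are hypothetical names, the snippets and values are made up for illustration, and none of this is Hadoop's actual parsing code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical demo of property-level overriding; not Hadoop's implementation. */
final class PropertyMergeDemo {

    /** Parses one configuration snippet and puts each property into target, one by one. */
    static void loadInto(Map<String, String> target, String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                // Assumes every property has both a name and a value element.
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                target.put(name, value); // only this one key is replaced
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration snippet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up snippets standing in for nutch-default.xml and nutch-site.xml.
        String defaults = "<configuration>"
                + "<property><name>http.agent.name</name><value></value></property>"
                + "<property><name>http.timeout</name><value>10000</value></property>"
                + "</configuration>";
        String site = "<configuration>"
                + "<property><name>http.agent.name</name><value>my-crawler</value></property>"
                + "</configuration>";

        Map<String, String> effective = new LinkedHashMap<>();
        loadInto(effective, defaults); // loaded first, lower precedence
        loadInto(effective, site);     // loaded second, higher precedence
        System.out.println(effective);
    }
}
```

Only http.agent.name is replaced by the second snippet; http.timeout keeps the value from the first, because the second snippet never mentions it.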
