A Brief Analysis of the Nutch 2.0 Crawl Workflow

amuseme_lu

Article link: https://blog.csdn.net/amuseme_lu/article/details/7777426

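This article analyzes the Nutch 2.0 crawl workflow, covering source-level details of InjectorJob, GeneratorJob, FetcherJob, ParserJob, DbUpdaterJob and SolrIndexerJob. Nutch 2.0 abstracts the data storage layer, supports large-scale crawling, lets users plug in their own data stores, and simplifies some operations that previously required MapReduce jobs. Although it is still immature, it shows the direction in which Nutch is heading.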

Nutch 2.0 Crawl Workflow Overview
---------------------------------

1. Overall Workflow

InjectorJob => GeneratorJob => FetcherJob => ParserJob => DbUpdaterJob => SolrIndexerJob
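The chain above can be read as a fixed sequence of jobs. The sketch below builds that sequence as a list of job names. The class name CrawlPlanSketch and the rounds parameter are hypothetical; repeating the four middle jobs for several rounds is an assumption about how a crawl is typically driven, not something shown in the source.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlPlanSketch {
    /** Returns the job names in execution order for the given number of crawl rounds. */
    static List<String> plan(int rounds) {
        List<String> jobs = new ArrayList<>();
        jobs.add("InjectorJob");              // seeds are injected once
        for (int i = 0; i < rounds; i++) {    // assumed: the middle jobs repeat per round
            jobs.add("GeneratorJob");
            jobs.add("FetcherJob");
            jobs.add("ParserJob");
            jobs.add("DbUpdaterJob");
        }
        jobs.add("SolrIndexerJob");           // indexing runs after the last round
        return jobs;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" => ", plan(1)));
    }
}
```

With rounds set to 1, the list is exactly the six-job chain shown above.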

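InjectorJob: reads a batch of seed pages from a file and puts them into the crawl database
GeneratorJob: selects the pages to be fetched from the crawl database and puts them into the fetch queue
FetcherJob: fetches the pages in the fetch queue; its reducer uses a producer/consumer model
ParserJob: parses the fetched pages, producing new links and the parsed page content
DbUpdaterJob: writes the newly discovered links back to the crawl database
SolrIndexerJob: builds the index from the parsed content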
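
FetcherJob's reducer is described above as using a producer/consumer model. The following is a minimal sketch of that pattern with a bounded BlockingQueue; it is not Nutch's actual implementation. The class name FetchQueueSketch, the POISON sentinel, the queue capacity of 16 and the fake "fetch" step are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class FetchQueueSketch {
    // Hypothetical sentinel telling a consumer thread to stop; not a Nutch constant.
    private static final String POISON = "__END__";

    /** Feeds URLs into a bounded queue and drains it with the given number of consumer threads. */
    static List<String> fetchAll(List<String> urls, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(16);
        List<String> fetched = Collections.synchronizedList(new ArrayList<>());
        List<Thread> consumers = new ArrayList<>();

        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();        // blocks while the queue is empty
                        if (url.equals(POISON)) break;
                        fetched.add("fetched:" + url);    // stand-in for a real HTTP fetch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            consumers.add(t);
        }

        try {
            for (String url : urls) queue.put(url);       // producer: blocks while the queue is full
            for (int i = 0; i < workers; i++) queue.put(POISON); // one stop signal per consumer
            for (Thread t : consumers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of("http://a/", "http://b/"), 2));
    }
}
```

The bounded queue is the point of the design: the producer cannot run arbitrarily far ahead of the fetch threads, so memory use stays flat regardless of how many URLs the reducer receives. The order of the results depends on thread scheduling.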

2. InjectorJob Analysis

Below is the entry function of InjectorJob:

  public Map<String,Object> run(Map<String,Object> args) throws Exception {
    getConf().setLong("injector.current.time", System.currentTimeMillis());
    Path input;
    Object path = args.get(Nutch.ARG_SEEDDIR);
    if (path instanceof Path) {
      input = (Path)path;
    } else {
      input = new Path(path.toString());
    }
    numJobs = 2;
    currentJobNum = 0;
    status.put(Nutch.STAT_PHASE, "convert input");
    currentJob = new NutchJob(getConf(), "inject-p1 " + input);
    FileInputFormat.addInputPath(currentJob, input);
    // Mapper: parses URLs out of the seed file and writes them to the database
    currentJob.setMapperClass(UrlMapper.class);
    currentJob.setMapOutputKeyClass(String.class);
    // The map output value is WebPage, a class generated by the Gora compiler;
    // Gora can map it onto different databases
    currentJob.setMapOutputValueClass(WebPage.class);
    // Output goes to GoraOutputFormat
    currentJob.setOutputFormatClass(GoraOutputFormat.class);
    DataStore<String, WebPage> store = StorageUtils.createWebStore(currentJob.getConfiguration(),
        String.class, WebPage.class);
    GoraOutputFormat.setOutput(currentJob, store, true);
    currentJob.setReducerClass(Reducer.class);
    currentJob.setNumReduceTasks(0);
    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);
    currentJob = null;


    status.put(Nutch.STAT_PHASE, "merge input with db");
    status.put(Nutch.STAT_PROGRESS, 0.5f);
    currentJobNum = 1;
    currentJob = new NutchJob(getConf(), "inject-p2 " + input);
    StorageUtils.initMapperJob(currentJob, FIELDS, String.class,
        WebPage.class, InjectorMapper.class);
    currentJob.setNumReduceTasks(0);
    ToolUtil.recordJobStatus(null, currentJob, results);
    status.put(Nutch.STAT_PROGRESS, 1.0f);
    return results;
  }

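InjectorJob extends NutchTool, and the function above is its implementation of NutchTool's run method.

As the code shows, there are two MR jobs here. The first reads the seed pages from the input file and writes them to the DataStore database. The second sets the score and fetch interval of the WebPage objects in the database. It relies on an initMapperJob method, whose code is as follows: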

  public static <K, V> void initMapperJob(Job job,
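
The two-phase injection described above can be illustrated without Hadoop or Gora. The sketch below uses an in-memory map as a stand-in for the DataStore and a tiny Page class as a stand-in for WebPage. All names here (InjectSketch, Page, convertInput, mergeWithDb, the injected flag) are hypothetical, as are the rule of skipping blank and "#" lines and the default values passed in; this is not the code of UrlMapper or InjectorMapper.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InjectSketch {
    /** Hypothetical stand-in for WebPage, holding only the fields this sketch needs. */
    static final class Page {
        boolean injected;   // set in phase 1, cleared in phase 2
        float score;
        int fetchInterval;  // seconds
    }

    /** Phase 1 ("convert input"): turn each usable seed line into a row keyed by URL. */
    static void convertInput(List<String> seedLines, Map<String, Page> store) {
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty() || url.startsWith("#")) continue; // assumed skip rule
            store.computeIfAbsent(url, k -> new Page()).injected = true;
        }
    }

    /** Phase 2 ("merge input with db"): give each freshly injected row a score and fetch interval. */
    static int mergeWithDb(Map<String, Page> store, float defaultScore, int defaultInterval) {
        int updated = 0;
        for (Page p : store.values()) {
            if (!p.injected) continue;       // rows not touched by phase 1 are left alone
            p.score = defaultScore;
            p.fetchInterval = defaultInterval;
            p.injected = false;
            updated++;
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Page> store = new LinkedHashMap<>();
        convertInput(List.of("http://nutch.apache.org/", "", "# a comment"), store);
        System.out.println(mergeWithDb(store, 1.0f, 2592000) + " page(s) initialized");
    }
}
```

Splitting the work this way keeps the first pass a plain file-to-store conversion, while the second pass only has to scan the store itself, which is the kind of job the quoted code sets up through StorageUtils.initMapperJob.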
