Nutch Source Code Analysis: solrindex

2021SC@SDUSC

This chapter begins the analysis of the final stage of the Nutch source code: building an index on the Solr server with the command "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize".

First, look at the relevant fragment of the nutch launcher script:

elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
  shift
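
After the shift, the remaining command-line arguments are passed straight through to the class, so the command at the top of this post effectively becomes (spacing approximate):

org.apache.nutch.indexer.IndexingJob -D solr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize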

solrindex thus ends up running IndexingJob's main function, with the parameter http://localhost:8983/solr stored in a property named solr.server.url.


  public static void main(String[] args) throws Exception {
    final int res = ToolRunner.run(NutchConfiguration.create(),
        new IndexingJob(), args);
    System.exit(res);
  }

  public int run(String[] args) throws Exception {
    // argument parsing omitted: the command-line flags fill in crawlDb,
    // linkDb, segments and the various boolean switches used below
    index(crawlDb, linkDb, segments, noCommit, deleteGone, params, filter,
        normalize, addBinaryContent, base64);
    return 0;
  }

  public void index(Path crawlDb, Path linkDb, List<Path> segments,
      boolean noCommit, boolean deleteGone, String params,
      boolean filter, boolean normalize, boolean addBinaryContent,
      boolean base64) throws IOException {


    final JobConf job = new NutchJob(getConf());
    job.setJobName("Indexer");

    IndexWriters writers = new IndexWriters(getConf());
    IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job, addBinaryContent);

    ...

    final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
        + new Random().nextInt());

    FileOutputFormat.setOutputPath(job, tmp);
    RunningJob indexJob = JobClient.runJob(job);
    writers.open(job, "commit");
    writers.commit();
  }
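
The tmp path given to FileOutputFormat is just a scratch location; the documents actually flow through IndexerOutputFormat to the configured index writers, and writers.commit() ends by sending the Solr server a commit request so the newly added documents become searchable. As a rough sketch of that end effect in plain SolrJ (not the Nutch code path; assumes SolrJ 6+ on the classpath):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrCommitSketch {
  public static void main(String[] args) throws Exception {
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/");  // Nutch uses the URL as the document id
    doc.addField("title", "Example");
    solr.add(doc);
    solr.commit();  // make the newly added documents visible to searches
    solr.close();
  }
}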
  

IndexerMapReduce::initMRJob

public static void initMRJob(Path crawlDb, Path linkDb,
      Collection<Path> segments, JobConf job, boolean addBinaryContent) {

    for (final Path segment : segments) {
      FileInputFormat.addInputPath(job, new Path(segment,
          CrawlDatum.FETCH_DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segment,
          CrawlDatum.PARSE_DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));

      if (addBinaryContent) {
        FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
      }
    }
    FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));

    if (linkDb != null) {
      Path currentLinkDb = new Path(linkDb, LinkDb.CURRENT_NAME);
      FileInputFormat.addInputPath(job, currentLinkDb);
    }

    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(IndexerMapReduce.class);
    job.setReducerClass(IndexerMapReduce.class);

    job.setOutputFormat(IndexerOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NutchWritable.class);
    job.setOutputValueClass(NutchWritable.class);
  }

This sets the job's input to the crawl_fetch, crawl_parse, parse_data, parse_text and (optionally) content directories under crawl/segments/*/, plus the current directory under crawl/crawldb and, if present, the current directory under crawl/linkdb. The Mapper and Reducer are both set to IndexerMapReduce and the output goes through IndexerOutputFormat; let's look at each of these in turn.
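
For reference, a segment left behind by a complete fetch/parse cycle typically contains these subdirectories (the timestamp-style name below is illustrative):

$ ls crawl/segments/20211101123456/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text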

IndexerMapReduce::map

public void map(Text key, Writable value,
      OutputCollector<Text, NutchWritable> output, Reporter reporter)
          throws IOException {

    // normalize the URL, then run it through the configured filters;
    // either step may return null, meaning the record should be skipped
    String urlString = filterUrl(normalizeUrl(key.toString()));
    if (urlString == null) {
      return;
    } else {
      key.set(urlString);
    }

    // wrap the heterogeneous segment values (CrawlDatum, ParseData,
    // ParseText, Content) in NutchWritable so a single reduce can
    // receive all of them under one key
    output.collect(key, new NutchWritable(value));
  }
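
Here filterUrl and normalizeUrl are thin wrappers over Nutch's URLFilters and URLNormalizers plugin chains, active only when -filter and -normalize were passed on the command line. A standalone sketch of the same pattern (the UrlCleaner class name is made up; the Nutch APIs it calls are real):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

public class UrlCleaner {
  private final URLNormalizers normalizers;
  private final URLFilters filters;

  public UrlCleaner(Configuration conf) {
    // the scope selects which normalizer plugins apply during indexing
    normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_INDEXER);
    filters = new URLFilters(conf);
  }

  /** Returns the cleaned URL, or null if a filter rejects it. */
  public String clean(String url) {
    try {
      url = normalizers.normalize(url, URLNormalizers.SCOPE_INDEXER);
      return url == null ? null : filters.filter(url);
    } catch (Exception e) {
      return null; // treat malformed URLs the same as filtered-out ones
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    System.out.println(new UrlCleaner(conf).clean("HTTP://Example.COM/a/../b"));
  }
}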

The remaining pieces, the reduce side and IndexerOutputFormat, will be covered in the next post.
