2021SC@SDUSC
This chapter begins the analysis of the last step of the Nutch source code: building the index on the Solr server via the command "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize".
First, look at the relevant section of the nutch launch script:
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
shift
For solrindex, the script therefore runs the main function of IndexingJob, with the argument http://localhost:8983/solr stored in the property named solr.server.url (shift removes it from the remaining arguments).
public static void main(String[] args) throws Exception {
  // ToolRunner applies the generic Hadoop options (including the
  // -D solr.server.url=... pair) to the configuration, then calls run(args)
  final int res = ToolRunner.run(NutchConfiguration.create(),
      new IndexingJob(), args);
  System.exit(res);
}
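Under the hood, ToolRunner delegates this to Hadoop's GenericOptionsParser. A minimal standalone sketch of the mechanism (DOptionDemo is my own illustration, not Nutch code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class DOptionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. args = { "-D", "solr.server.url=http://localhost:8983/solr", "crawl/crawldb/" }
    String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
    // the -D pair has become a Configuration property, not a program argument
    System.out.println(conf.get("solr.server.url")); // http://localhost:8983/solr
    System.out.println(remaining[0]);                // crawl/crawldb/
  }
}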
public int run(String[] args) throws Exception {
  // argument parsing elided: the command-line options are mapped onto the
  // parameters below and then forwarded to index(...)
  index(crawlDb, linkDb, segments, noCommit, deleteGone, params, filter,
      normalize, addBinaryContent, base64);
  return 0;
}
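How might those parameters be filled in? A hypothetical sketch of the elided parsing, based only on the flags in the command at the top (the names and control flow are my assumptions, not the Nutch source):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// hypothetical helper, not the actual run() body
static void parseIndexArgs(String[] args, Configuration conf) throws IOException {
  Path crawlDb = new Path(args[0]);           // crawl/crawldb/
  Path linkDb = null;
  List<Path> segments = new ArrayList<>();
  boolean filter = false, normalize = false;
  for (int i = 1; i < args.length; i++) {
    if ("-linkdb".equals(args[i])) {
      linkDb = new Path(args[++i]);           // crawl/linkdb/
    } else if ("-dir".equals(args[i])) {
      // treat every child directory of crawl/segments/ as one segment
      Path dir = new Path(args[++i]);
      for (FileStatus f : dir.getFileSystem(conf).listStatus(dir)) {
        if (f.isDirectory()) {
          segments.add(f.getPath());
        }
      }
    } else if ("-filter".equals(args[i])) {
      filter = true;
    } else if ("-normalize".equals(args[i])) {
      normalize = true;
    }
  }
  // the parsed values would then be handed to index(crawlDb, linkDb, segments, ...)
}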
public void index(Path crawlDb, Path linkDb, List<Path> segments,
    boolean noCommit, boolean deleteGone, String params,
    boolean filter, boolean normalize, boolean addBinaryContent,
    boolean base64) throws IOException {
  final JobConf job = new NutchJob(getConf());
  job.setJobName("Indexer");
  IndexWriters writers = new IndexWriters(getConf());
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job, addBinaryContent);
  ...
  // the job writes to a uniquely named temporary directory; the documents
  // themselves reach Solr through IndexerOutputFormat
  final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
      + new Random().nextInt());
  FileOutputFormat.setOutputPath(job, tmp);
  RunningJob indexJob = JobClient.runJob(job);
  writers.open(job, "commit");
  writers.commit();
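Since the temporary output directory holds nothing of value once the writers have committed, it has to be removed. A hedged sketch of that cleanup (cleanupTmp is my own helper name; the full index() method presumably does the equivalent in a finally block):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// hypothetical helper, not part of IndexingJob
static void cleanupTmp(Path tmp, Configuration conf) throws IOException {
  FileSystem fs = tmp.getFileSystem(conf);
  if (fs.exists(tmp)) {
    fs.delete(tmp, true); // recursive delete of the throwaway job output
  }
}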
Next, IndexerMapReduce::initMRJob:
public static void initMRJob(Path crawlDb, Path linkDb,
    Collection<Path> segments, JobConf job, boolean addBinaryContent) {
  for (final Path segment : segments) {
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.FETCH_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.PARSE_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
    if (addBinaryContent) {
      FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
    }
  }
  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
  if (linkDb != null) {
    Path currentLinkDb = new Path(linkDb, LinkDb.CURRENT_NAME);
    FileInputFormat.addInputPath(job, currentLinkDb);
  }
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);
  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class);
  job.setOutputValueClass(NutchWritable.class);
}
initMRJob sets the job's input to the crawl_fetch, crawl_parse, parse_data, and parse_text directories under each segment in crawl/segments/*/ (plus the content directory when addBinaryContent is set), together with the current directory under crawl/crawldb and, since a link database was passed, the current directory under crawl/linkdb. Both the Mapper and the Reducer are set to IndexerMapReduce, and the output format to IndexerOutputFormat. Let's look at each in turn.
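For the crawl directory used by the command at the top, and assuming a single hypothetical segment named 20211101123456, the resulting input paths would be:

crawl/segments/20211101123456/crawl_fetch
crawl/segments/20211101123456/crawl_parse
crawl/segments/20211101123456/parse_data
crawl/segments/20211101123456/parse_text
crawl/crawldb/current
crawl/linkdb/current

(crawl/segments/20211101123456/content would be added as well if -addBinaryContent were passed; the command above does not pass it.)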
IndexerMapReduce::map:
public void map(Text key, Writable value,
    OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {
  // normalize the URL and run it through the URL filters; a null result
  // means the record is dropped from indexing
  String urlString = filterUrl(normalizeUrl(key.toString()));
  if (urlString == null) {
    return;
  } else {
    key.set(urlString);
  }
  output.collect(key, new NutchWritable(value));
}
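normalizeUrl and filterUrl are private helpers of IndexerMapReduce that wrap Nutch's URLNormalizers and URLFilters plugin chains, set up when the task is configured. A hedged sketch of their shape; the exact bodies in the source may differ:

// normalize and filter correspond to the -normalize / -filter flags;
// urlNormalizers and urlFilters are assumed fields holding the plugin chains
private String normalizeUrl(String url) {
  if (!normalize) {
    return url;
  }
  try {
    return urlNormalizers.normalize(url, URLNormalizers.SCOPE_INDEXER);
  } catch (Exception e) {
    return null; // malformed URL: drop the record
  }
}

private String filterUrl(String url) {
  if (!filter) {
    return url;
  }
  try {
    return urlFilters.filter(url);
  } catch (Exception e) {
    return null; // rejected by a filter plugin
  }
}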
The rest will be covered in the next post.