Nutch source code analysis - 3

Nutch source code analysis: fetch

As analyzed in the previous chapter, the command "bin/nutch fetch crawl/segments/*" eventually invokes the main function of org.apache.nutch.fetcher.Fetcher.

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Fetcher(), args);
    System.exit(res);
  }

ToolRunner's run function in turn calls Fetcher's run function.

Fetcher::run

  public int run(String[] args) throws Exception {

    Path segment = new Path(args[0]);
    int threads = getConf().getInt("fetcher.threads.fetch", 10);

    for (int i = 1; i < args.length; i++) {
      if (args[i].equals("-threads")) {
        threads = Integer.parseInt(args[++i]);
      }
    }

    getConf().setInt("fetcher.threads.fetch", threads);

    fetch(segment, threads);
    return 0;
  }

run reads the number of fetcher threads (threads, defaulting to 10 and overridable with the -threads option), takes segment as the path of a crawl/segments/2* directory, and finally calls the fetch function.
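For example, a hypothetical invocation raising the thread count would be "bin/nutch fetch crawl/segments/20160701120000 -threads 50", where the segment timestamp is only a placeholder for whatever directory the generate command produced.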

Fetcher::run->fetch

  public void fetch(Path segment, int threads) throws IOException {
    checkConfiguration();

    JobConf job = new NutchJob(getConf());
    job.setJobName("fetch " + segment);
    job.setInt("fetcher.threads.fetch", threads);
    job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
    job.setSpeculativeExecution(false);
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.GENERATE_DIR_NAME));
    job.setInputFormat(InputFormat.class);
    job.setMapRunnerClass(Fetcher.class);

    FileOutputFormat.setOutputPath(job, segment);
    job.setOutputFormat(FetcherOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NutchWritable.class);

    JobClient.runJob(job);
  }

checkConfiguration verifies that the http.agent.name property is set in the configuration and throws an exception if it is not. The method then creates a Hadoop job whose input is the crawl_generate directory under crawl/segments/2* (produced by the generate command); Fetcher itself is registered as the map runner, so its run function does the processing, and the output goes back into the same crawl/segments/2* directory. FetcherOutputFormat defines how the results are finally written out.
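
The agent-name check amounts to a fail-fast guard roughly along these lines (a sketch based on the description above, not a verbatim quote of the Nutch source):

  private void checkConfiguration() {
    // fail fast before submitting the job if no crawler identity is configured
    String agentName = getConf().get("http.agent.name");
    if (agentName == null || agentName.trim().length() == 0) {
      throw new IllegalArgumentException(
          "Fetcher: No agents listed in 'http.agent.name' property.");
    }
  }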

Fetcher::run

  public void run(RecordReader<Text, CrawlDatum> input,
      OutputCollector<Text, NutchWritable> output, Reporter reporter)
      throws IOException {

    ...

    feeder = new QueueFeeder(input, fetchQueues, threadCount
        * queueDepthMuliplier);
    feeder.start();

    for (int i = 0; i < threadCount; i++) {
      FetcherThread t = new FetcherThread(getConf(), getActiveThreads(), fetchQueues, 
          feeder, spinWaiting, lastRequestStart, reporter, errors, segmentName,
          parsing, output, storingContent, pages, bytes);
      fetcherThreads.add(t);
      t.start();
    }

    ...

  }

Fetcher's run function first creates the shared queue FetchItemQueues (fetchQueues), then creates a QueueFeeder (feeder) that reads urls and CrawlDatum records from the crawl_generate directory under crawl/segments/2* and puts them into that shared queue.
It then creates the FetcherThreads and calls their start functions to begin fetching pages.
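
The capacity handed to the QueueFeeder is threadCount * queueDepthMuliplier; with the default of 10 fetcher threads, and assuming fetcher.queue.depth.multiplier is left at its usual default of 50, at most 10 * 50 = 500 FetchItems sit in the shared queue at any one time.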

Fetcher::run->QueueFeeder::run

  public void run() {
    boolean hasMore = true;

    while (hasMore) {

      ...

      int feed = size - queues.getTotalSize();
      if (feed <= 0) {
        Thread.sleep(1000);
        continue;
      } else {
        while (feed > 0 && hasMore) {
          Text url = new Text();
          CrawlDatum datum = new CrawlDatum();
          hasMore = reader.next(url, datum);
          if (hasMore) {
            queues.addFetchItem(url, datum);
            feed--;
          }
        }
      }
    }
  }

The feed variable is the number of free slots left in the shared FetchItemQueues for urls and CrawlDatum records waiting to be fetched. If feed is less than or equal to 0 there is no room, so the thread sleeps for a second and checks again; otherwise the RecordReader (reader) reads url/CrawlDatum pairs one by one from the crawl_generate directory under crawl/segments/2*, and addFetchItem wraps each pair into a FetchItem and adds it to the shared queue.
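
It is worth noting that FetchItemQueues is not a single FIFO: it normally groups FetchItems into one queue per host (the queue mode is configurable, e.g. by domain or IP) so the fetcher stays polite to each server. The simplified, self-contained sketch below only illustrates that grouping idea; the names are stand-ins, not the actual Nutch classes.

  import java.net.URL;
  import java.util.ArrayDeque;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Queue;

  // Illustrative stand-in for the grouping behaviour of FetchItemQueues.
  class SimpleFetchQueues {
    private final Map<String, Queue<String>> queuesByHost = new HashMap<String, Queue<String>>();
    private int totalSize = 0;

    // roughly the idea behind addFetchItem: derive a queue id from the url and append
    public synchronized void addFetchItem(String url) throws Exception {
      String host = new URL(url).getHost();
      queuesByHost.computeIfAbsent(host, h -> new ArrayDeque<String>()).add(url);
      totalSize++;
    }

    // the counter consulted by the feeder via getTotalSize()
    public synchronized int getTotalSize() {
      return totalSize;
    }
  }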

Fetcher::run->FetcherThread::run

  public void run() {

    FetchItem fit = null;
    try {
      while (true) {

        ...

        fit = ((FetchItemQueues) fetchQueues).getFetchItem();

        ...

        try {
          do {
            Protocol protocol = this.protocolFactory.getProtocol(fit.url
                .toString());
            BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum);

            ...

            ProtocolOutput output = protocol.getProtocolOutput(fit.url,
                fit.datum);
            ProtocolStatus status = output.getStatus();
            Content content = output.getContent();
            ParseStatus pstatus = null;
            ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
            String urlString = fit.url.toString();
            switch (status.getCode()) {

            ...

            case ProtocolStatus.SUCCESS:
              pstatus = output(fit.url, fit.datum, content, status,
                  CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
              ...
              break;

            ...

            }

            ...

          } while (redirecting && (redirectCount <= maxRedirect));

        } catch (Throwable t) {

        }
      }

    } catch (Throwable e) {

    } finally {

    }
  }

protocolFactory is a ProtocolFactory; based on the url's scheme (http, ftp, and so on) it looks up the matching protocol class from the plugin repository, here org.apache.nutch.protocol.http.Http. Http's getRobotRules function fetches the robots.txt file of the url's site; robots.txt is the robots exclusion protocol, which a search engine is expected to honor when deciding which pages it may or may not crawl. For this walkthrough we assume it places no restrictions.
Next, Http's getProtocolOutput function reads the content at the url's address and returns a ProtocolOutput, which wraps both the data read from the url and a status code.
Finally, output is called to write the data just obtained to files.
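
The rules returned by getRobotRules come from crawler-commons, whose BaseRobotRules exposes an isAllowed check; the code elided right after the getRobotRules call in FetcherThread::run performs a check of this kind before the page is actually fetched. A minimal sketch of such a check (not the Nutch source itself):

  import crawlercommons.robots.BaseRobotRules;

  // returns true when robots.txt does not forbid fetching this url
  static boolean mayFetch(BaseRobotRules rules, String urlString) {
    return rules == null || rules.isAllowed(urlString);
  }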

Fetcher::run->FetcherThread::run->Http::getProtocolOutput

 public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {

    String urlString = url.toString();
    try {
      URL u = new URL(urlString);

      long startTime = System.currentTimeMillis();
      Response response = getResponse(u, datum, false);

      if (this.responseTime) {
        int elapsedTime = (int) (System.currentTimeMillis() - startTime);
        datum.getMetaData().put(RESPONSE_TIME, new IntWritable(elapsedTime));
      }

      int code = response.getCode();
      datum.getMetaData().put(Nutch.PROTOCOL_STATUS_CODE_KEY,
        new Text(Integer.toString(code)));

      byte[] content = response.getContent();
      Content c = new Content(u.toString(), u.toString(),
          (content == null ? EMPTY_CONTENT : content),
          response.getHeader("Content-Type"), response.getHeaders(), this.conf);

      if (code == 200) { 
        return new ProtocolOutput(c);
      }

      ...
    } catch (Throwable e) {

    }
  }

getProtocolOutput's main job is to retrieve data from the url according to the protocol (normally, to download the page the url points to), update the information stored in the CrawlDatum, and finally build and return a ProtocolOutput. The ProtocolOutput holds two important things: the content found at the url (for example the html of the page) and the status of this request, such as whether it succeeded.
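
On the caller side (already shown in FetcherThread::run) those two pieces are read back out through exactly two getters; the single-argument ProtocolOutput constructor used for a 200 response presumably carries a success status, which is what the ProtocolStatus.SUCCESS branch matches on:

  ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
  Content content = output.getContent();      // page bytes, headers, content type
  ProtocolStatus status = output.getStatus(); // e.g. ProtocolStatus.SUCCESS for code 200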

Fetcher::run->FetcherThread::run->output

  private ParseStatus output(Text key, CrawlDatum datum, Content content,
      ProtocolStatus pstatus, int status, int outlinkDepth) {

    datum.setStatus(status);
    datum.setFetchTime(System.currentTimeMillis());
    if (pstatus != null)
      datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);

    ParseResult parseResult = null;
    if (content != null) {
      Metadata metadata = content.getMetadata();

      if (content.getContentType() != null)
        datum.getMetaData().put(new Text(Metadata.CONTENT_TYPE),
            new Text(content.getContentType()));

      metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);
      try {
        scfilters.passScoreBeforeParsing(key, datum, content);
      } catch (Exception e) {

      }

      ...

      content.getMetadata().add(Nutch.FETCH_STATUS_KEY,
          Integer.toString(status));
    }

    try {
      output.collect(key, new NutchWritable(datum));
      if (content != null && storingContent)
        output.collect(key, new NutchWritable(content));

      ...

    } catch (IOException e) {

    }
    return null;
  }

The omitted code is related to parsing and is covered in the next chapter. The output function records the various pieces of information in the CrawlDatum, then collects the CrawlDatum and the Content via collect; how they are finally written out is defined by FetcherOutputFormat.

FetcherOutputFormat::getRecordWriter

  public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
      final JobConf job, final String name, final Progressable progress)
      throws IOException {

    Path out = FileOutputFormat.getOutputPath(job);
    final Path fetch = new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME), name);
    final Path content = new Path(new Path(out, Content.DIR_NAME), name);

    final CompressionType compType = SequenceFileOutputFormat
        .getOutputCompressionType(job);

    Option fKeyClassOpt = MapFile.Writer.keyClass(Text.class);
    org.apache.hadoop.io.SequenceFile.Writer.Option fValClassOpt = SequenceFile.Writer.valueClass(CrawlDatum.class);
    org.apache.hadoop.io.SequenceFile.Writer.Option fProgressOpt = SequenceFile.Writer.progressable(progress);
    org.apache.hadoop.io.SequenceFile.Writer.Option fCompOpt = SequenceFile.Writer.compression(compType);

    final MapFile.Writer fetchOut = new MapFile.Writer(job,
        fetch, fKeyClassOpt, fValClassOpt, fCompOpt, fProgressOpt);

    return new RecordWriter<Text, NutchWritable>() {
      private MapFile.Writer contentOut;
      private RecordWriter<Text, Parse> parseOut;

      {
        if (Fetcher.isStoringContent(job)) {
          Option cKeyClassOpt = MapFile.Writer.keyClass(Text.class);
          org.apache.hadoop.io.SequenceFile.Writer.Option cValClassOpt = SequenceFile.Writer.valueClass(Content.class);
          org.apache.hadoop.io.SequenceFile.Writer.Option cProgressOpt = SequenceFile.Writer.progressable(progress);
          org.apache.hadoop.io.SequenceFile.Writer.Option cCompOpt = SequenceFile.Writer.compression(compType);
          contentOut = new MapFile.Writer(job, content,
              cKeyClassOpt, cValClassOpt, cCompOpt, cProgressOpt);
        }

        if (Fetcher.isParsing(job)) {
          parseOut = new ParseOutputFormat().getRecordWriter(fs, job, name,
              progress);
        }
      }

      public void write(Text key, NutchWritable value) throws IOException {

        Writable w = value.get();

        if (w instanceof CrawlDatum)
          fetchOut.append(key, w);
        else if (w instanceof Content && contentOut != null)
          contentOut.append(key, w);
        else if (w instanceof Parse && parseOut != null)
          parseOut.write(key, (Parse) w);
      }

      public void close(Reporter reporter) throws IOException {
        fetchOut.close();
        if (contentOut != null) {
          contentOut.close();
        }
        if (parseOut != null) {
          parseOut.close(reporter);
        }
      }

    };

  }

The constants FETCH_DIR_NAME and DIR_NAME are crawl_fetch and content respectively; getRecordWriter creates these two directories under crawl/segments/2* along with their output writers. getRecordWriter finally returns a RecordWriter whose write function routes each record into a different file according to its type.
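
Putting the paths together, a segment directory ends up looking roughly like this after a fetch (the timestamp is only illustrative, and the parse output directories appear only when parsing is enabled, which the next chapter covers):

  crawl/segments/20160701120000/
    crawl_generate/   input written by the generate command
    crawl_fetch/      CrawlDatum records appended by fetchOut   (CrawlDatum.FETCH_DIR_NAME)
    content/          Content records appended by contentOut    (Content.DIR_NAME)
    ...               directories written by ParseOutputFormat when Fetcher.isParsing is true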
