Nutch Fetcher Source Code Analysis

2021SC@SDUSC

Fetcher::run->FetcherThread::run

public void run() {
  FetchItem fit = null;
  try {
    while (true) {

      ...

      // take the next fetch item (url + CrawlDatum) from the shared queues
      fit = ((FetchItemQueues) fetchQueues).getFetchItem();

      ...

      try {
        do {
          // choose the protocol plugin (http, ftp, ...) that matches the url's scheme
          Protocol protocol = this.protocolFactory.getProtocol(fit.url
              .toString());
          // fetch and parse the site's robots.txt
          BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum);

          ...

          // download the page and unpack the result
          ProtocolOutput output = protocol.getProtocolOutput(fit.url,
              fit.datum);
          ProtocolStatus status = output.getStatus();
          Content content = output.getContent();
          ParseStatus pstatus = null;
          ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
          String urlString = fit.url.toString();
          switch (status.getCode()) {

          ...

          case ProtocolStatus.SUCCESS:
            // persist the fetched content and the updated CrawlDatum
            pstatus = output(fit.url, fit.datum, content, status,
                CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
            ...
            break;

          ...

          }

          ...

        } while (redirecting && (redirectCount <= maxRedirect));

      } catch (Throwable t) {
        // per-url error handling elided
      }
    }

  } catch (Throwable e) {
    // thread-level error handling elided
  } finally {
    // cleanup elided
  }
}

protocolFactory is a ProtocolFactory. Given the URL's scheme (http, ftp, and so on), it fetches the matching protocol class from the plugin repository, here org.apache.nutch.protocol.http.Http. Http's getRobotRules function fetches the robots.txt file under the corresponding URL; robots.txt carries the robots exclusion protocol, which a search engine should honor when setting its crawl policy, for example which pages may or may not be fetched. For this walkthrough we assume the returned rules are null.
Next, Http's getProtocolOutput function reads the content at the URL's address and returns a ProtocolOutput, which internally wraps the data read from the URL along with a status code.
Finally, output is called to write the data just obtained to file.
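The rules object comes from the crawler-commons library. As a minimal, self-contained sketch of how such rules are typically consulted (the robots.txt text and the crawler name below are made up for illustration; this is not the exact Nutch code path):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheckDemo {
  public static void main(String[] args) {
    // Hypothetical robots.txt content, for illustration only.
    byte[] robotsTxt = "User-agent: *\nDisallow: /private/\n".getBytes();
    BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
        "http://example.com/robots.txt", robotsTxt, "text/plain", "my-crawler");

    // A fetcher consults the rules before downloading a page.
    System.out.println(rules.isAllowed("http://example.com/index.html")); // true
    System.out.println(rules.isAllowed("http://example.com/private/a"));  // false
  }
}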

Fetcher::run->FetcherThread::run->Http::getProtocolOutput

public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
  String urlString = url.toString();
  try {
    URL u = new URL(urlString);

    // issue the http request
    long startTime = System.currentTimeMillis();
    Response response = getResponse(u, datum, false);

    // optionally record the elapsed time in the CrawlDatum metadata
    if (this.responseTime) {
      int elapsedTime = (int) (System.currentTimeMillis() - startTime);
      datum.getMetaData().put(RESPONSE_TIME, new IntWritable(elapsedTime));
    }

    // record the http status code
    int code = response.getCode();
    datum.getMetaData().put(Nutch.PROTOCOL_STATUS_CODE_KEY,
        new Text(Integer.toString(code)));

    // wrap the downloaded bytes, content type and headers in a Content object
    byte[] content = response.getContent();
    Content c = new Content(u.toString(), u.toString(),
        (content == null ? EMPTY_CONTENT : content),
        response.getHeader("Content-Type"), response.getHeaders(), this.conf);

    if (code == 200) {
      return new ProtocolOutput(c);
    }

    ...
  } catch (Throwable e) {
    // error handling elided
  }
}

The main task of getProtocolOutput is to fetch the data at the URL according to its protocol (in the common case, simply downloading the page the URL points to) and to refresh the information in the CrawlDatum. It finally creates and returns a ProtocolOutput, which holds two important things: the content found at the URL's address (for example the HTML of the page), and the status of this request, such as whether it succeeded.
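getResponse hides the actual networking. A rough, self-contained sketch of what that step amounts to, using the JDK's plain HttpURLConnection instead of Nutch's own HTTP client (so details such as redirect handling differ):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class MiniFetch {
  public static void main(String[] args) throws Exception {
    URL u = new URL("http://example.com/");
    long start = System.currentTimeMillis();

    HttpURLConnection conn = (HttpURLConnection) u.openConnection();
    int code = conn.getResponseCode();          // e.g. 200 on success

    // Read the body into a byte[], as Response.getContent() returns one.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (InputStream in = conn.getInputStream()) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1)
        buf.write(chunk, 0, n);
    }

    long elapsed = System.currentTimeMillis() - start;
    System.out.println(code + " " + buf.size() + " bytes in " + elapsed + " ms");
  }
}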

Fetcher::run->FetcherThread::run->output

private ParseStatus output(Text key, CrawlDatum datum, Content content,
    ProtocolStatus pstatus, int status, int outlinkDepth) {

  // record the fetch status, fetch time and protocol status in the CrawlDatum
  datum.setStatus(status);
  datum.setFetchTime(System.currentTimeMillis());
  if (pstatus != null)
    datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);

  ParseResult parseResult = null;
  if (content != null) {
    Metadata metadata = content.getMetadata();

    // copy the content type into the CrawlDatum metadata
    if (content.getContentType() != null)
      datum.getMetaData().put(new Text(Metadata.CONTENT_TYPE),
          new Text(content.getContentType()));

    metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);
    // let the scoring filters adjust the score before parsing
    try {
      scfilters.passScoreBeforeParsing(key, datum, content);
    } catch (Exception e) {
      // error handling elided
    }

    ...

    content.getMetadata().add(Nutch.FETCH_STATUS_KEY,
        Integer.toString(status));
  }

  try {
    // emit the CrawlDatum, and the Content if storage is enabled
    output.collect(key, new NutchWritable(datum));
    if (content != null && storingContent)
      output.collect(key, new NutchWritable(content));

    ...

  } catch (IOException e) {
    // error handling elided
  }
  return null;
}

The elided code is related to parsing and will be analyzed in the next chapter. The output function records the various pieces of information in the CrawlDatum, then collects the CrawlDatum and the Content through collect; how they are finally written out is defined by FetcherOutputFormat.
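Both records are wrapped in a NutchWritable before being collected. NutchWritable is a GenericWritable, which lets a single OutputCollector&lt;Text, NutchWritable&gt; carry values of several concrete classes. A small sketch of the wrap/unwrap round trip, assuming the Nutch 1.x class org.apache.nutch.crawl.NutchWritable:

import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.NutchWritable;

public class NutchWritableDemo {
  public static void main(String[] args) {
    // wrap a CrawlDatum, as output() does before calling collect
    NutchWritable wrapped = new NutchWritable(new CrawlDatum());

    // unwrap it again, as FetcherOutputFormat's write method does,
    // and dispatch on the concrete type
    Writable w = wrapped.get();
    System.out.println(w instanceof CrawlDatum); // true
  }
}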

FetcherOutputFormat::getRecordWriter

public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
    final JobConf job, final String name, final Progressable progress)
    throws IOException {

  // crawl_fetch and content directories under the segment's output path
  Path out = FileOutputFormat.getOutputPath(job);
  final Path fetch = new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME), name);
  final Path content = new Path(new Path(out, Content.DIR_NAME), name);

  final CompressionType compType = SequenceFileOutputFormat
      .getOutputCompressionType(job);

  // writer for crawl_fetch: Text keys, CrawlDatum values
  Option fKeyClassOpt = MapFile.Writer.keyClass(Text.class);
  org.apache.hadoop.io.SequenceFile.Writer.Option fValClassOpt = SequenceFile.Writer.valueClass(CrawlDatum.class);
  org.apache.hadoop.io.SequenceFile.Writer.Option fProgressOpt = SequenceFile.Writer.progressable(progress);
  org.apache.hadoop.io.SequenceFile.Writer.Option fCompOpt = SequenceFile.Writer.compression(compType);

  final MapFile.Writer fetchOut = new MapFile.Writer(job,
      fetch, fKeyClassOpt, fValClassOpt, fCompOpt, fProgressOpt);

  return new RecordWriter<Text, NutchWritable>() {
    private MapFile.Writer contentOut;
    private RecordWriter<Text, Parse> parseOut;

    {
      // writer for content: only created when content storage is enabled
      if (Fetcher.isStoringContent(job)) {
        Option cKeyClassOpt = MapFile.Writer.keyClass(Text.class);
        org.apache.hadoop.io.SequenceFile.Writer.Option cValClassOpt = SequenceFile.Writer.valueClass(Content.class);
        org.apache.hadoop.io.SequenceFile.Writer.Option cProgressOpt = SequenceFile.Writer.progressable(progress);
        org.apache.hadoop.io.SequenceFile.Writer.Option cCompOpt = SequenceFile.Writer.compression(compType);
        contentOut = new MapFile.Writer(job, content,
            cKeyClassOpt, cValClassOpt, cCompOpt, cProgressOpt);
      }

      // writer for parse data: only created when parsing during the fetch
      if (Fetcher.isParsing(job)) {
        parseOut = new ParseOutputFormat().getRecordWriter(fs, job, name,
            progress);
      }
    }

    public void write(Text key, NutchWritable value) throws IOException {
      // unwrap the NutchWritable and route the record by its concrete type
      Writable w = value.get();

      if (w instanceof CrawlDatum)
        fetchOut.append(key, w);
      else if (w instanceof Content && contentOut != null)
        contentOut.append(key, w);
      else if (w instanceof Parse && parseOut != null)
        parseOut.write(key, (Parse) w);
    }

    public void close(Reporter reporter) throws IOException {
      fetchOut.close();
      if (contentOut != null) {
        contentOut.close();
      }
      if (parseOut != null) {
        parseOut.close(reporter);
      }
    }

  };
}

The constants FETCH_DIR_NAME and DIR_NAME are crawl_fetch and content respectively; getRecordWriter creates these two directories and the corresponding output streams under the crawl/segments/2* directory. getRecordWriter finally returns a RecordWriter whose write method routes each record to a different file according to its type.
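Because crawl_fetch is written as a Hadoop MapFile, the fetch results can be read back directly. A minimal sketch, assuming Hadoop 2.x's MapFile.Reader and taking a part directory (for example crawl/segments/2*/crawl_fetch/part-00000; the path is hypothetical) as a command-line argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class ReadFetchDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MapFile.Reader reader = new MapFile.Reader(new Path(args[0]), conf);
    Text key = new Text();
    CrawlDatum value = new CrawlDatum();
    // iterate over all url -> CrawlDatum entries written by the fetcher
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + CrawlDatum.getStatusName(value.getStatus()));
    }
    reader.close();
  }
}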
