Understanding how FetchTime is set on CrawlDatum in Nutch

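I misread this yesterday. In fact, for a URL that was fetched successfully, the update() phase takes the URL's FetchTime + FetchInterval as the final next FetchTime. From that point on, FetchTime no longer represents the time the page was successfully fetched; it is the time of the next fetch. If a crawl of this URL is attempted before the new FetchTime is reached, the program filters the URL out.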
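The rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not Nutch's code: the class `FetchDueSketch` and its methods `nextFetchTime` and `isDue` are hypothetical names, timestamps are assumed to be in milliseconds, and the interval is assumed to be in seconds.

```java
// Hypothetical illustration of the rule described above; not taken from the Nutch source.
final class FetchDueSketch {

    /** Next fetch time in ms: time of the successful fetch plus the interval (assumed seconds). */
    static long nextFetchTime(long fetchedAtMillis, int intervalSeconds) {
        return fetchedAtMillis + (long) intervalSeconds * 1000L;
    }

    /** A URL becomes eligible again only once the current time reaches its stored FetchTime. */
    static boolean isDue(long storedFetchTimeMillis, long nowMillis) {
        return storedFetchTimeMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long fetchedAt = 1_000_000L;  // made-up timestamp for the example
        int interval = 60;            // made-up interval, in seconds
        long next = nextFetchTime(fetchedAt, interval);
        System.out.println(next);                              // 1060000
        System.out.println(isDue(next, fetchedAt + 59_000L));  // false: still inside the interval
        System.out.println(isDue(next, fetchedAt + 60_000L));  // true: interval has elapsed
    }
}
```

With these made-up numbers, a crawl attempted 59 seconds after the fetch skips the URL, while one attempted 60 seconds after it picks the URL up again.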

The reduce function in CrawlDbReducer:

    case CrawlDatum.STATUS_FETCH_SUCCESS:         // successful fetch
    case CrawlDatum.STATUS_FETCH_REDIR_TEMP:      // successful fetch, redirected
    case CrawlDatum.STATUS_FETCH_REDIR_PERM:
    case CrawlDatum.STATUS_FETCH_NOTMODIFIED:     // successful fetch, notmodified
      // determine the modification status
      int modified = FetchSchedule.STATUS_UNKNOWN;
      if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
        modified = FetchSchedule.STATUS_NOTMODIFIED;
      } else {
        if (oldSet && old.getSignature() != null && signature != null) {
          if (SignatureComparator._compare(old.getSignature(), signature) != 0) {
            modified = FetchSchedule.STATUS_MODIFIED;
          } else {
            modified = FetchSchedule.STATUS_NOTMODIFIED;
          }
        }
      }
      // set the schedule
      System.err.println("1:result.fetchtime="+result.getFetchTime());
      result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
      // set the result status and signature
      System.err.println("2:result.fetchtime="+result.getFetchTime());

      if (modified == FetchSchedule.STATUS_NOTMODIFIED) {
        result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);
        if (oldSet) result.setSignature(old.getSignature());
      } else {
        switch (fetch.getStatus()) {
        case CrawlDatum.STATUS_FETCH_SUCCESS:
          result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
          break;
        case CrawlDatum.STATUS_FETCH_REDIR_PERM:
          result.setStatus(CrawlDatum.STATUS_DB_REDIR_PERM);
          break;
        case CrawlDatum.STATUS_FETCH_REDIR_TEMP:
          result.setStatus(CrawlDatum.STATUS_DB_REDIR_TEMP);
          break;
        default:
          LOG.warn("Unexpected status: " + fetch.getStatus() + " resetting to old status.");
          if (oldSet) result.setStatus(old.getStatus());
          else result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        }
        result.setSignature(signature);
        if (metaFromParse != null) {
            for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {
              result.getMetaData().put(e.getKey(), e.getValue());
            }
          }
      }
      // if fetchInterval is larger than the system-wide maximum, trigger
      // an unconditional recrawl. This prevents the page to be stuck at
      // NOTMODIFIED state, when the old fetched copy was already removed with
      // old segments.
      if (maxInterval < result.getFetchInterval())
        result = schedule.forceRefetch((Text)key, result, false);
      break;

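Tracing the printed FetchTime of result shows that the value changes after the call to schedule.setFetchSchedule(), so it is certain that this function modifies the FetchTime of CrawlDatum, the class that holds the state of the current URL.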

The FetchSchedule implementation called in CrawlDbReducer is the DefaultFetchSchedule class. Its source code:

public class DefaultFetchSchedule extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
          long prevFetchTime, long prevModifiedTime,
          long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    if (datum.getFetchInterval() == 0 ) {
      datum.setFetchInterval(defaultInterval);
    }
    datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}

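As you can see, the class has only one method, setFetchSchedule(), and it ultimately sets the datum's FetchTime with datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000); that is, the time of the fetch plus the fetch interval, with the interval multiplied by 1000 to convert it to milliseconds.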
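The arithmetic in that method can be tried in isolation. The sketch below is a standalone approximation, not the Nutch class: `DefaultScheduleSketch` and `DEFAULT_INTERVAL_SECONDS` are hypothetical names, and the 30-day default is a value assumed here for illustration. It mirrors the two steps shown above, the fallback to a default when the interval is 0 and the widening to long before multiplying by 1000.

```java
// Standalone approximation of the arithmetic in setFetchSchedule(); names are hypothetical.
final class DefaultScheduleSketch {

    // Assumed default interval for this example: 30 days, expressed in seconds.
    static final int DEFAULT_INTERVAL_SECONDS = 2_592_000;

    /** Returns fetchTime plus the interval, falling back to the default when the interval is 0. */
    static long nextFetchTime(long fetchTimeMillis, int intervalSeconds) {
        int effective = (intervalSeconds == 0) ? DEFAULT_INTERVAL_SECONDS : intervalSeconds;
        // Widen to long BEFORE multiplying: 2_592_000 * 1000 does not fit in an int.
        return fetchTimeMillis + (long) effective * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(nextFetchTime(0L, 0));      // 2592000000 (default interval applied)
        System.out.println(nextFetchTime(1_000L, 10)); // 11000
    }
}
```

The cast matters because a 30-day interval in milliseconds exceeds Integer.MAX_VALUE (2147483647); without it the product would wrap around to a negative number and the next fetch time would land in the past.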
