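I misread the code yesterday. In fact, for a successfully fetched url, the update() phase sets the url's next FetchTime to FetchTime + FetchInterval. From then on this FetchTime no longer records when the page was successfully fetched; it is the time of the next fetch. If the url is crawled again before the new FetchTime is reached, the program filters it out.
In the reduce function of CrawlDbReducer: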
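To make the filtering rule concrete, here is a minimal stand-alone sketch. It does not use the Nutch classes: FetchDueSketch, nextFetchTime and isDue are hypothetical names, and the timestamps and the 30-day interval are assumed example values.

```java
// Hypothetical stand-alone sketch; none of these names belong to the Nutch API.
final class FetchDueSketch {

    /** Next fetch time in ms: time of this fetch plus the interval given in seconds. */
    static long nextFetchTime(long fetchTimeMs, int fetchIntervalSec) {
        return fetchTimeMs + (long) fetchIntervalSec * 1000L;
    }

    /** A url is due only once the current time has reached its stored fetch time. */
    static boolean isDue(long storedFetchTimeMs, long nowMs) {
        return storedFetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long fetchedAt = 1_000_000L;          // assumed time of the successful fetch (ms)
        int intervalSec = 30 * 24 * 60 * 60;  // assumed interval: 30 days in seconds
        long next = nextFetchTime(fetchedAt, intervalSec);

        // One minute after the fetch the url is not due yet, so it would be skipped.
        System.out.println(isDue(next, fetchedAt + 60_000L));
        // Once the clock reaches the new FetchTime the url becomes eligible again.
        System.out.println(isDue(next, next));
    }
}
```

The stored value plays both roles described above: after update() it is a "not before" timestamp rather than a record of when the page was last fetched.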
case CrawlDatum.STATUS_FETCH_SUCCESS: // successful fetch
case CrawlDatum.STATUS_FETCH_REDIR_TEMP: // successful fetch, redirected
case CrawlDatum.STATUS_FETCH_REDIR_PERM:
case CrawlDatum.STATUS_FETCH_NOTMODIFIED: // successful fetch, notmodified
  // determine the modification status
  int modified = FetchSchedule.STATUS_UNKNOWN;
  if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
    modified = FetchSchedule.STATUS_NOTMODIFIED;
  } else {
    if (oldSet && old.getSignature() != null && signature != null) {
      if (SignatureComparator._compare(old.getSignature(), signature) != 0) {
        modified = FetchSchedule.STATUS_MODIFIED;
      } else {
        modified = FetchSchedule.STATUS_NOTMODIFIED;
      }
    }
  }
  // set the schedule
  System.err.println("1:result.fetchtime="+result.getFetchTime());
  result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
  // set the result status and signature
  System.err.println("2:result.fetchtime="+result.getFetchTime());
  if (modified == FetchSchedule.STATUS_NOTMODIFIED) {
    result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);
    if (oldSet) result.setSignature(old.getSignature());
  } else {
    switch (fetch.getStatus()) {
    case CrawlDatum.STATUS_FETCH_SUCCESS:
      result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
      break;
    case CrawlDatum.STATUS_FETCH_REDIR_PERM:
      result.setStatus(CrawlDatum.STATUS_DB_REDIR_PERM);
      break;
    case CrawlDatum.STATUS_FETCH_REDIR_TEMP:
      result.setStatus(CrawlDatum.STATUS_DB_REDIR_TEMP);
      break;
    default:
      LOG.warn("Unexpected status: " + fetch.getStatus() + " resetting to old status.");
      if (oldSet) result.setStatus(old.getStatus());
      else result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    }
    result.setSignature(signature);
    if (metaFromParse != null) {
      for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {
        result.getMetaData().put(e.getKey(), e.getValue());
      }
    }
  }
  // if fetchInterval is larger than the system-wide maximum, trigger
  // an unconditional recrawl. This prevents the page from being stuck in the
  // NOTMODIFIED state when the old fetched copy was already removed with
  // old segments.
  if (maxInterval < result.getFetchInterval())
    result = schedule.forceRefetch((Text)key, result, false);
  break;
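Tracing the printed FetchTime values of result shows that the value changes after the call to schedule.setFetchSchedule(), so it is certainly this function that changes the FetchTime of the CrawlDatum, the class that holds the url's current state.
In CrawlDbReducer, the FetchSchedule implementation being called is the DefaultFetchSchedule class. Its source code: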
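The first part of the case block above, which decides the modified value passed to setFetchSchedule(), can be sketched on its own. This is only an illustration under assumptions: the class name, the method name and the constant values are hypothetical, java.util.Arrays.equals stands in for SignatureComparator._compare, and a missing old record (oldSet == false) is modelled as a null old signature.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; constants and names are illustrative, not Nutch's.
final class ModifiedStatusSketch {
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    /**
     * Mirrors the decision in the reduce() excerpt:
     * a NOTMODIFIED fetch status wins, otherwise the signatures are compared,
     * and if either signature is missing the status stays unknown.
     */
    static int modifiedStatus(boolean fetchNotModified, byte[] oldSignature, byte[] newSignature) {
        if (fetchNotModified) {
            return STATUS_NOTMODIFIED;
        }
        if (oldSignature == null || newSignature == null) {
            return STATUS_UNKNOWN;
        }
        // Arrays.equals is a stand-in for the byte-wise signature comparison.
        return Arrays.equals(oldSignature, newSignature) ? STATUS_NOTMODIFIED : STATUS_MODIFIED;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 4};
        System.out.println(modifiedStatus(false, a, a.clone())); // same content
        System.out.println(modifiedStatus(false, a, b));         // changed content
        System.out.println(modifiedStatus(false, null, b));      // no old record
    }
}
```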
public class DefaultFetchSchedule extends AbstractFetchSchedule {
  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    // fall back to the default interval if none has been set yet
    if (datum.getFetchInterval() == 0 ) {
      datum.setFetchInterval(defaultInterval);
    }
    // next fetch time = time of this fetch + interval (multiplied by 1000)
    datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}
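As you can see, the class has only one method, setFetchSchedule(), which ultimately sets the datum's FetchTime with datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000); that is, the time of this fetch plus the fetch interval multiplied by 1000.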
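The (long) cast in that expression is worth a closer look. The sketch below assumes that getFetchInterval() returns an int number of seconds (the Nutch version in use may differ) and that the interval is 30 days; IntervalCastSketch, withCast and withoutCast are hypothetical names used only for illustration.

```java
// Hypothetical stand-alone sketch; assumes an int interval in seconds.
final class IntervalCastSketch {

    /** Same shape as the expression in setFetchSchedule(): widen to long first, then multiply. */
    static long withCast(long fetchTimeMs, int intervalSec) {
        return fetchTimeMs + (long) intervalSec * 1000;
    }

    /** Without the cast the multiplication is done in int and can overflow before the addition. */
    static long withoutCast(long fetchTimeMs, int intervalSec) {
        return fetchTimeMs + intervalSec * 1000;
    }

    public static void main(String[] args) {
        int thirtyDaysSec = 30 * 24 * 60 * 60; // 2,592,000 seconds, an assumed interval

        // 2,592,000 * 1000 exceeds Integer.MAX_VALUE, so only the cast version is correct.
        System.out.println(withCast(0L, thirtyDaysSec));
        System.out.println(withoutCast(0L, thirtyDaysSec)); // wraps around to a negative number
    }
}
```

With an interval this large, dropping the cast would move the next FetchTime into the past, so the url would never be filtered out.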