Nutch source code analysis: updatedb
The command "bin/nutch updatedb crawl/crawldb crawl/segments/2*" ultimately executes the main function of org.apache.nutch.crawl.CrawlDb.
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new CrawlDb(), args);
  System.exit(res);
}
ToolRunner's run function eventually calls CrawlDb's run function.
CrawlDb::run
public int run(String[] args) throws Exception {
  ...
  update(new Path(args[0]), dirs.toArray(new Path[dirs.size()]), normalize,
      filter, additionsAllowed, force);
  return 0;
}
public void update(Path crawlDb, Path[] segments, boolean normalize,
    boolean filter, boolean additionsAllowed, boolean force)
    throws IOException {
  ...
  JobConf job = CrawlDb.createJob(getConf(), crawlDb);
  for (int i = 0; i < segments.length; i++) {
    Path fetch = new Path(segments[i], CrawlDatum.FETCH_DIR_NAME);
    Path parse = new Path(segments[i], CrawlDatum.PARSE_DIR_NAME);
    FileInputFormat.addInputPath(job, fetch);
    FileInputFormat.addInputPath(job, parse);
  }
  JobClient.runJob(job);
  CrawlDb.install(job, crawlDb);
}
public static JobConf createJob(Configuration config, Path crawlDb)
    throws IOException {
  Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random()
      .nextInt(Integer.MAX_VALUE)));
  JobConf job = new NutchJob(config);
  job.setJobName("crawldb " + crawlDb);
  Path current = new Path(crawlDb, CURRENT_NAME);
  if (FileSystem.get(job).exists(current)) {
    FileInputFormat.addInputPath(job, current);
  }
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(CrawlDbFilter.class);
  job.setReducerClass(CrawlDbReducer.class);
  FileOutputFormat.setOutputPath(job, newCrawlDb);
  job.setOutputFormat(MapFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  return job;
}
The update function creates a Hadoop job, setting the map function to CrawlDbFilter's map and the reduce function to CrawlDbReducer's reduce. The job's input consists of current under crawl/crawldb plus crawl_fetch and crawl_parse under crawl/segments/2*/; its output is a directory under crawl/crawldb whose name is a random integer. Finally, CrawlDb's install function updates the files under crawl/crawldb: the randomly named directory is renamed to current, the previous current is renamed to old, and the previous old directory is deleted.
Next, let's look at CrawlDbFilter's map function.
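The rename rotation that install performs can be sketched with plain java.io.File operations. This is a simplified illustration only: the real CrawlDb.install works on Hadoop's FileSystem API, and the class and method names here are hypothetical.

```java
import java.io.File;

public class InstallSketch {
    // Rotate crawldb directories: drop "old", rename "current" to "old",
    // then rename the job's temporary output directory to "current".
    public static void install(File crawlDb, File tmpOutput) {
        File current = new File(crawlDb, "current");
        File old = new File(crawlDb, "old");
        if (old.exists()) deleteRecursive(old);       // remove the previous "old"
        if (current.exists()) current.renameTo(old);  // current becomes old
        tmpOutput.renameTo(current);                  // new output becomes current
    }

    // Recursively delete a directory tree (no-op if it does not exist).
    static void deleteRecursive(File f) {
        File[] children = f.listFiles();
        if (children != null) for (File c : children) deleteRecursive(c);
        f.delete();
    }
}
```

The point of the two-step rename is that the previous database survives as old, so a failed update can be rolled back by renaming old back to current.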
CrawlDbFilter::map
public void map(Text key, CrawlDatum value,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  String url = key.toString();
  url = normalizers.normalize(url, scope);
  url = filters.filter(url);
  if (url != null) {
    newKey.set(url);
    output.collect(newKey, value);
  }
}
The map function is simple: it standardizes the URL's format via the normalize function, then filters the URL via the filter function. If the filter rejects the URL (returns null), the key/value pair is not emitted.
Now for the reduce function. Since it is fairly long, it is examined in two parts below.
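The normalize-then-filter pipeline can be illustrated with a toy stand-in. The rules below are hypothetical (Nutch's actual URLNormalizers and URLFilters are pluggable and far richer); the sketch only shows the control flow: normalize first, then drop the URL when the filter returns null.

```java
public class UrlPipelineSketch {
    // Hypothetical normalizer: lower-case the URL and strip a trailing slash.
    static String normalize(String url) {
        url = url.toLowerCase();
        if (url.endsWith("/")) url = url.substring(0, url.length() - 1);
        return url;
    }

    // Hypothetical filter: keep only http(s) URLs, reject everything else.
    static String filter(String url) {
        return url.startsWith("http") ? url : null;
    }

    // Mirrors the map-side logic: null means "dropped, not collected".
    public static String process(String url) {
        url = normalize(url);
        url = filter(url);
        return url;
    }
}
```

For example, process("HTTP://Example.COM/") yields "http://example.com", while an ftp:// URL is rejected and nothing would be emitted.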
CrawlDbReducer::reduce
Part 1
public void reduce(Text key, Iterator<CrawlDatum> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  CrawlDatum fetch = new CrawlDatum();
  CrawlDatum old = new CrawlDatum();
  boolean fetchSet = false;
  boolean oldSet = false;
  byte[] signature = null;
  boolean multiple = false;
  linked.clear();
  org.apache.hadoop.io.MapWritable metaFromParse = null;
  while (values.hasNext()) {
    CrawlDatum datum = values.next();
    if (!multiple && values.hasNext())
      multiple = true;
    // keep the newest datum with a DB-related status in "old"
    if (CrawlDatum.hasDbStatus(datum)) {
      if (!oldSet) {
        if (multiple) {
          old.set(datum);
        } else {
          // only one value for this key, no need to copy
          old = datum;
        }
        oldSet = true;
      } else {
        if (old.getFetchTime() < datum.getFetchTime())
          old.set(datum);
      }
      continue;
    }
    // keep the newest datum with a FETCH-related status in "fetch"
    if (CrawlDatum.hasFetchStatus(datum)) {
      if (!fetchSet) {
        if (multiple) {
          fetch.set(datum);
        } else {
          fetch = datum;
        }
        fetchSet = true;
      } else {
        if (fetch.getFetchTime() < datum.getFetchTime())
          fetch.set(datum);
      }
      continue;
    }
    // the remaining statuses carry auxiliary information
    switch (datum.getStatus()) {
    case CrawlDatum.STATUS_LINKED:
      CrawlDatum link;
      if (multiple) {
        link = new CrawlDatum();
        link.set(datum);
      } else {
        link = datum;
      }
      linked.insert(link);
      break;
    case CrawlDatum.STATUS_SIGNATURE:
      signature = datum.getSignature();
      break;
    case CrawlDatum.STATUS_PARSE_META:
      metaFromParse = datum.getMetaData();
      break;
    default:
    }
  }
  // drain the priority queue of inlinks into a list
  int numLinks = linked.size();
  List<CrawlDatum> linkList = new ArrayList<CrawlDatum>(numLinks);
  for (int i = numLinks - 1; i >= 0; i--) {
    linkList.add(linked.pop());
  }
  // an unknown URL, and new additions are not allowed: drop it
  if (!oldSet && !additionsAllowed)
    return;
  if (!fetchSet && linkList.size() > 0) {
    fetch = linkList.get(0);
    fetchSet = true;
  }
  // no fetch information: emit the existing DB entry unchanged
  if (!fetchSet) {
    if (oldSet) {
      output.collect(key, old);
    }
    return;
  }
  ...
}
The main job of the first part of reduce is to collect all CrawlDatum records for a given URL. Among those with a DB-related status it keeps the most recent one in old, and among those with a FETCH-related status it keeps the most recent one in fetch; all CrawlDatum records with status STATUS_LINKED are stored in linked, to be used for the score update later. Roughly speaking, DB and FETCH statuses correspond to where the record was stored: records under crawl/crawldb mostly carry DB statuses, while records under crawl/segments mostly carry FETCH statuses. If a URL has only a DB status and no FETCH status, it was either just written to crawl/crawldb for the first time, or has not been fetched since the last update; in that case old is simply output and the function returns.
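The "keep the newest datum per status family" selection above can be sketched in isolation. Datum here is a hypothetical stand-in for CrawlDatum with just the two fields the logic needs; the real code also handles the copy-vs-reference optimization driven by the multiple flag.

```java
import java.util.ArrayList;
import java.util.List;

public class NewestDatumSketch {
    static class Datum {
        final boolean dbStatus;  // true = DB-family status, false = FETCH-family
        final long fetchTime;
        Datum(boolean dbStatus, long fetchTime) {
            this.dbStatus = dbStatus;
            this.fetchTime = fetchTime;
        }
    }

    // Returns { newest DB datum, newest FETCH datum }; an entry is null if
    // no datum of that family was seen, mirroring oldSet/fetchSet.
    public static Datum[] newest(List<Datum> values) {
        Datum old = null, fetch = null;
        for (Datum d : values) {
            if (d.dbStatus) {
                if (old == null || old.fetchTime < d.fetchTime) old = d;
            } else {
                if (fetch == null || fetch.fetchTime < d.fetchTime) fetch = d;
            }
        }
        return new Datum[] { old, fetch };
    }
}
```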
CrawlDbReducer::reduce
Part 2
public void reduce(Text key, Iterator<CrawlDatum> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  ...
  if (signature == null)
    signature = fetch.getSignature();
  long prevModifiedTime = oldSet ? old.getModifiedTime() : 0L;
  long prevFetchTime = oldSet ? old.getFetchTime() : 0L;
  // start from the newest fetch datum
  result.set(fetch);
  if (oldSet) {
    // copy old's meta data first, then fetch's, so fetch's entries win
    if (old.getMetaData().size() > 0) {
      result.putAllMetaData(old);
      if (fetch.getMetaData().size() > 0)
        result.putAllMetaData(fetch);
    }
    if (old.getModifiedTime() > 0 && fetch.getModifiedTime() == 0) {
      result.setModifiedTime(old.getModifiedTime());
    }
  }
  switch (fetch.getStatus()) {
  ...
  case CrawlDatum.STATUS_FETCH_SUCCESS:
    // merge in the meta data collected at parse time
    if (metaFromParse != null) {
      for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {
        result.getMetaData().put(e.getKey(), e.getValue());
      }
    }
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
    } else if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_SUCCESS) {
      // compare the new signature against the old one to detect changes
      if (oldSet && old.getSignature() != null && signature != null) {
        if (SignatureComparator._compare(old.getSignature(), signature) != 0) {
          modified = FetchSchedule.STATUS_MODIFIED;
        } else {
          modified = FetchSchedule.STATUS_NOTMODIFIED;
        }
      }
    }
    result = schedule.setFetchSchedule(key, result, prevFetchTime,
        prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(),
        modified);
    if (modified == FetchSchedule.STATUS_NOTMODIFIED) {
      // content unchanged: keep the old modified time and signature
      result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);
      result.setModifiedTime(prevModifiedTime);
      if (oldSet)
        result.setSignature(old.getSignature());
    } else {
      switch (fetch.getStatus()) {
      case CrawlDatum.STATUS_FETCH_SUCCESS:
        result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
        break;
      ...
      default:
        if (oldSet)
          result.setStatus(old.getStatus());
        else
          result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      }
      result.setSignature(signature);
    }
    break;
  default:
  }
  // adjust the score using the collected inlinks, then emit
  scfilters.updateDbScore(key, oldSet ? old : null, result, linkList);
  output.collect(key, result);
}
The second part of reduce first sets the final result to fetch, the most recent FETCH-status CrawlDatum. It then merges the meta information of old and fetch into result; old is written first and fetch second, so that fetch's meta entries can override old's. Next, if fetch has no modified time, old's modified time is used. Now assume fetch's status is STATUS_FETCH_SUCCESS. metaFromParse holds the meta information of the STATUS_PARSE_META CrawlDatum collected in part 1; if present, it is copied directly into result. reduce then compares the signatures: if they are equal, the page was fetched but is identical to the previous copy, so result's final status is STATUS_DB_NOTMODIFIED; otherwise it becomes STATUS_DB_FETCHED. Finally, reduce updates the score based on all STATUS_LINKED CrawlDatum records and writes result to the temporary directory under crawl/crawldb. CrawlDb's install function then performs the swap; that function was already analyzed in the first chapter.
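The signature-based change detection can be distilled into a small sketch. The constants mirror the roles of FetchSchedule.STATUS_UNKNOWN / STATUS_MODIFIED / STATUS_NOTMODIFIED, but the class and method names here are hypothetical and do not depend on Nutch; the real code also uses SignatureComparator rather than Arrays.equals.

```java
import java.util.Arrays;

public class SignatureCheckSketch {
    static final int UNKNOWN = 0, MODIFIED = 1, NOTMODIFIED = 2;

    // Compare the previously stored content signature with the newly fetched
    // one: equal signatures mean the page did not change; a missing signature
    // on either side leaves the status undecided.
    public static int modifiedStatus(byte[] oldSig, byte[] newSig) {
        if (oldSig == null || newSig == null) return UNKNOWN;
        return Arrays.equals(oldSig, newSig) ? NOTMODIFIED : MODIFIED;
    }
}
```

NOTMODIFIED then maps to STATUS_DB_NOTMODIFIED in the emitted result, and MODIFIED (or UNKNOWN after a successful fetch) leads to STATUS_DB_FETCHED with the new signature stored.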