2021SC@SDUSC
Nutch source code analysis: updatedb
The command "bin/nutch updatedb crawl/crawldb crawl/segments/2*" ultimately executes the main function of org.apache.nutch.crawl.CrawlDb.
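Before reading the code, it helps to keep the on-disk layout in mind. A typical layout looks like this (segment directory names are timestamps, which is why the 2* glob matches them; the names below are only illustrative):

crawl/
  crawldb/
    current/          <- the current CrawlDb
  segments/
    20211115123456/   <- one directory per segment
      crawl_fetch/    <- fetch status of each URL
      crawl_parse/    <- outlinks, signatures and parse metadata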
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new CrawlDb(), args);
  System.exit(res);
}
ToolRunner's run function ultimately calls CrawlDb's run function.
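ToolRunner.run is standard Hadoop plumbing. Conceptually it does something like the following (a simplified sketch of org.apache.hadoop.util.ToolRunner, not the verbatim source):

// simplified sketch of ToolRunner.run
public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  if (conf == null) {
    conf = new Configuration();
  }
  // strip generic Hadoop options (-D, -conf, -fs, ...) and hand the
  // remaining arguments to the tool
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  tool.setConf(conf);
  return tool.run(parser.getRemainingArgs());
}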
CrawlDb::run
public int run(String[] args) throws Exception {
  ...
  update(new Path(args[0]), dirs.toArray(new Path[dirs.size()]), normalize,
      filter, additionsAllowed, force);
  return 0;
}
public void update(Path crawlDb, Path[] segments, boolean normalize,
    boolean filter, boolean additionsAllowed, boolean force)
    throws IOException {
  ...
  JobConf job = CrawlDb.createJob(getConf(), crawlDb);
  for (int i = 0; i < segments.length; i++) {
    Path fetch = new Path(segments[i], CrawlDatum.FETCH_DIR_NAME);
    Path parse = new Path(segments[i], CrawlDatum.PARSE_DIR_NAME);
    FileInputFormat.addInputPath(job, fetch);
    FileInputFormat.addInputPath(job, parse);
  }
  JobClient.runJob(job);
  CrawlDb.install(job, crawlDb);
}
public static JobConf createJob(Configuration config, Path crawlDb)
    throws IOException {
  Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random()
      .nextInt(Integer.MAX_VALUE)));
  JobConf job = new NutchJob(config);
  job.setJobName("crawldb " + crawlDb);
  Path current = new Path(crawlDb, CURRENT_NAME);
  if (FileSystem.get(job).exists(current)) {
    FileInputFormat.addInputPath(job, current);
  }
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(CrawlDbFilter.class);
  job.setReducerClass(CrawlDbReducer.class);
  FileOutputFormat.setOutputPath(job, newCrawlDb);
  job.setOutputFormat(MapFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  return job;
}
The update function creates a Hadoop job: the map function is CrawlDbFilter's map, the reduce function is CrawlDbReducer's reduce, the input consists of current under crawl/crawldb plus crawl_fetch and crawl_parse under each crawl/segments/2*/ directory, and the output goes to a folder under crawl/crawldb whose name is a random integer. Finally, CrawlDb's install is called to update the files under crawl/crawldb: the randomly named folder is renamed to current, the previous current is renamed to old, and the previous old folder is deleted, as sketched below.
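CrawlDb.install itself is not shown above. A simplified sketch of the rename sequence, assuming the shape of the Nutch 1.x code (the real method also manages a lock file and a db.preserve.backup option), looks like this:

// simplified sketch of CrawlDb.install, not the verbatim source
public static void install(JobConf job, Path crawlDb) throws IOException {
  Path newCrawlDb = FileOutputFormat.getOutputPath(job); // random-integer dir
  FileSystem fs = FileSystem.get(job);
  Path old = new Path(crawlDb, "old");
  Path current = new Path(crawlDb, CURRENT_NAME);
  if (fs.exists(current)) {
    if (fs.exists(old))
      fs.delete(old, true);  // drop the previous backup
    fs.rename(current, old); // current becomes the backup
  }
  fs.rename(newCrawlDb, current); // the job output becomes the new current
}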
Now let's look at CrawlDbFilter's map function.
CrawlDbFilter::map
public void map(Text key, CrawlDatum value,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  String url = key.toString();
  url = normalizers.normalize(url, scope);
  url = filters.filter(url);
  if (url != null) {
    newKey.set(url);
    output.collect(newKey, value);
  }
}
The map function is simple: first it standardizes the URL format via the normalize function, then it filters the URL via the filter function; a URL that is filtered out comes back as null and is dropped, as illustrated below.
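To make the two steps concrete, here is an illustrative fragment (the actual result depends on which normalizer and filter plugins are enabled in the configuration; the example URL is made up):

URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_CRAWLDB);
URLFilters filters = new URLFilters(conf);

String url = "HTTP://Example.COM/a/../b";
// with the basic normalizer enabled this becomes e.g. "http://example.com/b"
url = normalizers.normalize(url, URLNormalizers.SCOPE_CRAWLDB);
// returns null if any configured filter (regex, prefix, ...) rejects the
// URL, which is why map only collects when url != null
url = filters.filter(url);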
Next, let's look at the reduce function. Since it is fairly long, we will examine it in two parts.
CrawlDbReducer::reduce
Part 1
public void reduce(Text key, Iterator<CrawlDatum> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  CrawlDatum fetch = new CrawlDatum();
  CrawlDatum old = new CrawlDatum();
  boolean fetchSet = false;
  boolean oldSet = false;
  byte[] signature = null;
  boolean multiple = false;
  linked.clear();
  org.apache.hadoop.io.MapWritable metaFromParse = null;
  while (values.hasNext()) {
    CrawlDatum datum = values.next();
    if (!multiple && values.hasNext())
      multiple = true;
    if (CrawlDatum.hasDbStatus(datum)) {
      if (!oldSet) {
        if (multiple) {
          old.set(datum);
        } else {
          // no copying needed: the instance is not reused
          old = datum;
        }
        oldSet = true;
      } else {
        // keep only the most recent DB record
        if (old.getFetchTime() < datum.getFetchTime())
          old.set(datum);
      }
      continue;
    }
    if (CrawlDatum.hasFetchStatus(datum)) {
      if (!fetchSet) {
        if (multiple) {
          fetch.set(datum);
        } else {
          fetch = datum;
        }
        fetchSet = true;
      } else {
        // keep only the most recent fetch record
        if (fetch.getFetchTime() < datum.getFetchTime())
          fetch.set(datum);
      }
      continue;
    }
    switch (datum.getStatus()) {
    case CrawlDatum.STATUS_LINKED:
      CrawlDatum link;
      if (multiple) {
        link = new CrawlDatum();
        link.set(datum);
      } else {
        link = datum;
      }
      linked.insert(link);
      break;
    case CrawlDatum.STATUS_SIGNATURE:
      signature = datum.getSignature();
      break;
    case CrawlDatum.STATUS_PARSE_META:
      metaFromParse = datum.getMetaData();
      break;
    default:
    }
  }
  // copy the inlinks out of the priority queue into a list
  int numLinks = linked.size();
  List<CrawlDatum> linkList = new ArrayList<CrawlDatum>(numLinks);
  for (int i = numLinks - 1; i >= 0; i--) {
    linkList.add(linked.pop());
  }
  // unknown URL and new additions are not allowed: skip it
  if (!oldSet && !additionsAllowed)
    return;
  // no fetch record: fall back to the highest-priority inlink
  if (!fetchSet && linkList.size() > 0) {
    fetch = linkList.get(0);
    fetchSet = true;
  }
  if (!fetchSet) {
    if (oldSet) {
      // still no fetch record: re-emit the existing DB record
      output.collect(key, old);
    } else {
      // nothing to output
    }
    return;
  }
  ...
}
The first part of the reduce function mainly collects all CrawlDatum records for a given URL. From them it picks the most recent CrawlDatum among those with a DB-related status and keeps it in old, and the most recent CrawlDatum among those with a FETCH-related status and keeps it in fetch; all CrawlDatum records with status STATUS_LINKED are stored in linked, to be used later for score adjustment. Roughly speaking, DB status versus FETCH status corresponds to which folder the record came from: records stored in crawl/crawldb mostly carry DB statuses, while records stored in crawl/segments mostly carry FETCH statuses. If a URL has only a DB status and no FETCH status, it is either being stored in crawl/crawldb for the first time or has not been fetched since the last update to crawl/crawldb; in that case the old record is simply emitted as-is. The two status families are sketched below.
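For reference, CrawlDatum.hasDbStatus and CrawlDatum.hasFetchStatus test which family the status byte belongs to. Conceptually (a simplified sketch; the real implementation consults internal status tables):

// simplified: true for the STATUS_DB_* family
static boolean hasDbStatusSketch(CrawlDatum datum) {
  byte s = datum.getStatus();
  return s == CrawlDatum.STATUS_DB_UNFETCHED
      || s == CrawlDatum.STATUS_DB_FETCHED
      || s == CrawlDatum.STATUS_DB_GONE
      || s == CrawlDatum.STATUS_DB_REDIR_TEMP
      || s == CrawlDatum.STATUS_DB_REDIR_PERM
      || s == CrawlDatum.STATUS_DB_NOTMODIFIED;
}

// hasFetchStatus is the analogous test for the STATUS_FETCH_* family
// (FETCH_SUCCESS, FETCH_RETRY, FETCH_REDIR_TEMP, FETCH_REDIR_PERM,
// FETCH_GONE, FETCH_NOTMODIFIED)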