Nutch Source Code Analysis: generate
Following the analysis in the previous chapter, the command "bin/nutch generate crawl/crawldb crawl/segments" ultimately calls the main function of org.apache.nutch.crawl.Generator.
Generator::main
public static void main(String args[]) throws Exception {
  int res = ToolRunner
      .run(NutchConfiguration.create(), new Generator(), args);
  System.exit(res);
}
ToolRunner is a Hadoop utility class; it eventually calls the run function of the Generator class.
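For reference, ToolRunner.run roughly does the following (a simplified sketch, not the actual Hadoop source): it parses generic Hadoop options, applies them to the Configuration, injects the Configuration into the tool, and delegates to the tool's run with the remaining arguments.

// Simplified sketch of ToolRunner.run (not the actual Hadoop source):
public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  // strip generic options (-D, -fs, -jt, ...) and apply them to conf
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  tool.setConf(conf);
  // call Generator.run with whatever arguments remain
  return tool.run(parser.getRemainingArgs());
}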
Generator::run
public int run(String[] args) throws Exception {
  Path dbDir = new Path(args[0]);
  Path segmentsDir = new Path(args[1]);
  long curTime = System.currentTimeMillis();
  long topN = Long.MAX_VALUE;
  int numFetchers = -1;
  boolean filter = true;
  boolean norm = true;
  boolean force = false;
  String expr = null;
  int maxNumSegments = 1;
  for (int i = 2; i < args.length; i++) {
    if ("-topN".equals(args[i])) {
      topN = Long.parseLong(args[i + 1]);
      i++;
    } else if ("-numFetchers".equals(args[i])) {
      numFetchers = Integer.parseInt(args[i + 1]);
      i++;
    } else if ("-adddays".equals(args[i])) {
      long numDays = Integer.parseInt(args[i + 1]);
      curTime += numDays * 1000L * 60 * 60 * 24;
    } else if ("-noFilter".equals(args[i])) {
      filter = false;
    } else if ("-noNorm".equals(args[i])) {
      norm = false;
    } else if ("-force".equals(args[i])) {
      force = true;
    } else if ("-maxNumSegments".equals(args[i])) {
      maxNumSegments = Integer.parseInt(args[i + 1]);
    } else if ("-expr".equals(args[i])) {
      expr = args[i + 1];
    }
  }
  Path[] segs = generate(dbDir, segmentsDir, numFetchers, topN, curTime,
      filter, norm, force, maxNumSegments, expr);
  return 0; // simplified here: the actual source checks segs for errors
}
The run function sets up its variables from the arguments passed to main: dbDir is crawl/crawldb, the data directory holding the initial URLs after inject, and segmentsDir is crawl/segments, the data directory for the URLs of each fetch round. The generate function eventually creates a subdirectory for one fetch round under segmentsDir.
Generator::run->generate
Part 1
public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
    long curTime, boolean filter, boolean norm, boolean force,
    int maxNumSegments, String expr) throws IOException {
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
      + "/generate-temp-" + java.util.UUID.randomUUID().toString());
  Path lock = new Path(dbDir, CrawlDb.LOCK_NAME);
  FileSystem fs = FileSystem.get(getConf());
  LockUtil.createLockFile(fs, lock, force);
  JobConf job = new NutchJob(getConf());
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(Selector.class);
  job.setPartitionerClass(Selector.class);
  job.setReducerClass(Selector.class);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(FloatWritable.class);
  job.setOutputKeyComparatorClass(DecreasingFloatComparator.class);
  job.setOutputValueClass(SelectorEntry.class);
  // overrides the SequenceFileOutputFormat set a few lines above
  job.setOutputFormat(GeneratorOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(dbDir, CrawlDb.CURRENT_NAME));
  FileOutputFormat.setOutputPath(job, tempDir);
  ...
  JobClient.runJob(job);
  ...
}
tempDir names the directory "<mapred.temp.dir>/generate-temp-<uuid>". Next, a .locked lock file is created under crawl/crawldb, and a Hadoop Job is set up whose Mapper, Partitioner and Reducer are all the Selector class; the Mapper reads the files under the crawl/crawldb/current directory, and the Reducer's output is saved into the temporary directory tempDir created above.
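The semantics of LockUtil.createLockFile can be sketched as follows (an approximation of its observed behavior, not the actual Nutch source): an existing lock aborts the run unless force is true.

// Approximate behavior of LockUtil.createLockFile (a sketch; the real
// implementation may differ in details):
public static void createLockFile(FileSystem fs, Path lockFile, boolean force)
    throws IOException {
  if (fs.exists(lockFile) && !force) {
    // another generate/update is (or was) running on this crawldb
    throw new IOException("lock file " + lockFile + " already exists.");
  }
  fs.createNewFile(lockFile); // creates crawl/crawldb/.locked
}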
generate then calls JobClient's runJob function to execute the Job. Note that runJob is synchronous, whereas the submitJob function executes asynchronously. The net effect of the Job is to take the CrawlDatum records of the start URLs stored under crawl/crawldb/current, process and filter them, and write the result into tempDir.
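The difference between the two calls can be illustrated as below (a minimal sketch using the old mapred API; the polling loop is illustrative only and assumes an enclosing method that throws Exception):

// runJob blocks until the job finishes (and throws if the job fails):
RunningJob done = JobClient.runJob(job);

// submitJob returns immediately; the caller must poll for completion:
JobClient client = new JobClient(job);
RunningJob handle = client.submitJob(job);
while (!handle.isComplete()) {
  Thread.sleep(1000); // do other work, report progress, ...
}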
Let us look first at Selector's map function.
Selector::map
public void map(Text key, CrawlDatum value,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  Text url = key;
  if (filter) {
    // a null result means some URLFilter rejected the URL
    if (filters.filter(url.toString()) == null)
      return;
  }
  CrawlDatum crawlDatum = value;
  if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
    return; // not due for fetching yet
  }
  LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData().get(
      Nutch.WRITABLE_GENERATE_TIME_KEY);
  if (oldGenTime != null) {
    // already generated recently and that entry has not expired yet
    if (oldGenTime.get() + genDelay > curTime)
      return;
  }
  float sort = 1.0f;
  sort = scfilters.generatorSortValue(key, crawlDatum, sort);
  if (expr != null) {
    if (!crawlDatum.evaluate(expr)) {
      return;
    }
  }
  if (restrictStatus != null
      && !restrictStatus.equalsIgnoreCase(CrawlDatum
          .getStatusName(crawlDatum.getStatus())))
    return;
  // note: x != Float.NaN is always true in Java, and sort < NaN is
  // always false, so an unset (NaN) threshold filters nothing
  if (scoreThreshold != Float.NaN && sort < scoreThreshold)
    return;
  if (intervalThreshold != -1
      && crawlDatum.getFetchInterval() > intervalThreshold)
    return;
  sortValue.set(sort);
  crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
  entry.datum = crawlDatum;
  entry.url = key;
  output.collect(sortValue, entry);
}
The core job of the map function is to decide whether a URL qualifies for fetching.
The records map reads from crawl/crawldb/current have the URL as key and a CrawlDatum structure wrapping the URL's metadata as value.
filter defaults to true unless the generate command was given the "-noFilter" flag, so the URL is first passed through URLFilters' filter function; a null result means some filter rejected it.
schedule defaults to DefaultFetchSchedule. Its shouldFetch decides whether the URL is due for another fetch: it returns false while the datum's next scheduled fetch time still lies in the future, which enforces a minimum interval between two successive fetches.
Selector::map->DefaultFetchSchedule::shouldFetch
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // if the scheduled fetch time lies too far in the future, clamp the
  // interval and make the URL fetchable now
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false; // not due yet
  }
  return true;
}
Further down, if the URL's metadata contains a non-null value for WRITABLE_GENERATE_TIME_KEY and that previous generate has not yet expired, the URL is skipped; the expiry period genDelay defaults to 7 days.
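genDelay is read from the configuration when the Selector is set up; a one-line sketch of that initialization, assuming the property name is crawl.gen.delay (value in days, converted to milliseconds):

// Sketch of Selector.configure initializing genDelay (assumed property
// name "crawl.gen.delay", default 7 days, stored in milliseconds):
genDelay = job.getLong("crawl.gen.delay", 7L) * 3600L * 24 * 1000;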
Next, generatorSortValue assigns the URL a score, and CrawlDatum's evaluate function evaluates the Jexl expression expr against the CrawlDatum; we will not descend into that here. The map function then checks that the status, score and fetch interval satisfy the configured thresholds.
The output of Selector's map function is distributed across the reducers by Selector's getPartition function.
Selector::getPartition
public int getPartition(FloatWritable key, Writable value,
    int numReduceTasks) {
  return partitioner.getPartition(((SelectorEntry) value).url, key,
      numReduceTasks);
}
partitioner defaults to URLPartitioner, and the parameter numReduceTasks gives the number of reduce tasks. The integer returned by URLPartitioner's getPartition selects which Reducer processes a given piece of the Mapper's output.
Selector::getPartition->URLPartitioner::getPartition
public int getPartition(Text key, Writable value, int numReduceTasks) {
  String urlString = key.toString();
  URL url = null;
  int hashCode = urlString.hashCode();
  try {
    urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_PARTITION);
    url = new URL(urlString);
    hashCode = url.getHost().hashCode();
  } catch (MalformedURLException e) {
    // fall back to the hash of the raw URL string
  }
  if (mode.equals(PARTITION_MODE_DOMAIN) && url != null)
    hashCode = URLUtil.getDomainName(url).hashCode();
  else if (mode.equals(PARTITION_MODE_IP)) {
    try {
      InetAddress address = InetAddress.getByName(url.getHost());
      hashCode = address.getHostAddress().hashCode();
    } catch (UnknownHostException e) {
      // keep the host-based hash
    }
  }
  hashCode ^= seed;
  return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
}
Depending on mode, getPartition extracts the host name, the domain name or the host's IP address from the URL, hashes it, and takes the hash modulo numReduceTasks to pick a Reducer, so all URLs sharing that host, domain or IP end up on the same Reducer.
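A tiny standalone illustration of the final two lines (hypothetical host and values, not Nutch code):

// Hypothetical illustration of the host-based partitioning math:
String host = "example.com";
int seed = 12345;          // "partition.url.seed", set per job
int numReduceTasks = 4;
int hashCode = host.hashCode() ^ seed;
// mask the sign bit so the modulo result is non-negative
int reducer = (hashCode & Integer.MAX_VALUE) % numReduceTasks;
// every URL of example.com maps to the same reducer within this job

Because partitionSegment seeds "partition.url.seed" with a fresh random value per job, the same host can land on a different reducer in the next run, spreading load across runs.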
Finally, let us look at Selector's reduce function.
Selector::reduce
public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  while (values.hasNext()) {
    if (count == limit) {
      // current segment is full: open the next one, or stop
      if (currentsegmentnum < maxNumSegments) {
        count = 0;
        currentsegmentnum++;
      } else
        break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
    } catch (Exception e) {
      continue; // malformed URL, skip it
    }
    hostordomain = hostordomain.toLowerCase();
    if (maxCount > 0) {
      // hostCount[0] = segment currently assigned to this host/domain,
      // hostCount[1] = number of its URLs already in that segment
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[] { 1, 0 };
        hostCounts.put(hostordomain, hostCount);
      }
      hostCount[1]++;
      // skip segments that are already full
      while (segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      }
      // this host/domain reached maxCount in its segment: spill over
      if (hostCount[1] >= maxCount) {
        if (hostCount[0] < maxNumSegments) {
          hostCount[0]++;
          hostCount[1] = 0;
        } else {
          continue;
        }
      }
      entry.segnum = new IntWritable(hostCount[0]);
      segCounts[hostCount[0] - 1]++;
    } else {
      entry.segnum = new IntWritable(currentsegmentnum);
      segCounts[currentsegmentnum - 1]++;
    }
    output.collect(key, entry);
    count++;
  }
}
The core job of the reduce function is to determine which segment each URL belongs to; URLs generated in different rounds are stored in different folders under crawl/segments.
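When maxCount > 0, the hostCounts bookkeeping spills a host's excess URLs into later segments. The following standalone toy simulation mirrors that spill logic (not Nutch code; maxCount and maxNumSegments are made-up values, and the segCounts/limit check is omitted for brevity):

import java.util.HashMap;
import java.util.Map;

// Toy simulation of the per-host spill logic above (not Nutch code).
public class HostSpillDemo {
  public static void main(String[] args) {
    int maxCount = 2, maxNumSegments = 3;   // made-up values
    Map<String, int[]> hostCounts = new HashMap<String, int[]>();
    String[] hosts = { "a.com", "a.com", "a.com", "a.com", "b.com" };
    for (String host : hosts) {
      // [0] = segment assigned to this host, [1] = URLs in that segment
      int[] hc = hostCounts.get(host);
      if (hc == null) {
        hc = new int[] { 1, 0 };
        hostCounts.put(host, hc);
      }
      hc[1]++;
      if (hc[1] >= maxCount) {        // segment quota reached for host
        if (hc[0] < maxNumSegments) {
          hc[0]++;                    // spill into the next segment
          hc[1] = 0;
        } else {
          System.out.println(host + " -> dropped");
          continue;
        }
      }
      System.out.println(host + " -> segment " + hc[0]);
    }
  }
}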
Generator::run->generate
Part 2
public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
    long curTime, boolean filter, boolean norm, boolean force,
    int maxNumSegments, String expr) throws IOException {
  ...
  List<Path> generatedSegments = new ArrayList<Path>();
  FileStatus[] status = fs.listStatus(tempDir);
  for (FileStatus stat : status) {
    Path subfetchlist = stat.getPath();
    if (!subfetchlist.getName().startsWith("fetchlist-"))
      continue;
    Path newSeg = partitionSegment(fs, segments, subfetchlist, numLists);
    generatedSegments.add(newSeg);
  }
  if (getConf().getBoolean(GENERATE_UPDATE_CRAWLDB, false)) {
    Path tempDir2 = new Path(getConf().get("mapred.temp.dir", ".")
        + "/generate-temp-" + java.util.UUID.randomUUID().toString());
    job = new NutchJob(getConf());
    job.setJobName("generate: updatedb " + dbDir);
    job.setLong(Nutch.GENERATE_TIME_KEY, generateTime);
    for (Path segmpaths : generatedSegments) {
      Path subGenDir = new Path(segmpaths, CrawlDatum.GENERATE_DIR_NAME);
      FileInputFormat.addInputPath(job, subGenDir);
    }
    FileInputFormat.addInputPath(job, new Path(dbDir, CrawlDb.CURRENT_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(CrawlDbUpdater.class);
    job.setReducerClass(CrawlDbUpdater.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    FileOutputFormat.setOutputPath(job, tempDir2);
    JobClient.runJob(job);
    CrawlDb.install(job, dbDir);
    fs.delete(tempDir2, true);
  }
  LockUtil.removeLockFile(fs, lock);
  fs.delete(tempDir, true);
  Path[] patharray = new Path[generatedSegments.size()];
  return generatedSegments.toArray(patharray);
}
tempDir holds the output of the Reducer described above. generate iterates over the subdirectories of tempDir whose names start with "fetchlist-"; for each one, partitionSegment runs another Hadoop job that writes its contents into crawl/segments and returns the new segment path.
The following if block merges the data in the crawlDb with the data in the crawl_generate folders; this chapter does not examine it further.
Finally, the lock file and the temporary directory tempDir are deleted, and the segment paths under crawl/segments are returned, e.g. "crawl/segments/2016*", each of which contains a crawl_generate subdirectory.
Generator::run->generate->partitionSegment
private Path partitionSegment(FileSystem fs, Path segmentsDir, Path inputDir,
    int numLists) throws IOException {
  Path segment = new Path(segmentsDir, generateSegmentName());
  Path output = new Path(segment, CrawlDatum.GENERATE_DIR_NAME);
  NutchJob job = new NutchJob(getConf());
  job.setJobName("generate: partition " + segment);
  job.setInt("partition.url.seed", new Random().nextInt());
  FileInputFormat.addInputPath(job, inputDir);
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(SelectorInverseMapper.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SelectorEntry.class);
  job.setPartitionerClass(URLPartitioner.class);
  job.setReducerClass(PartitionReducer.class);
  job.setNumReduceTasks(numLists);
  FileOutputFormat.setOutputPath(job, output);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  job.setOutputKeyComparatorClass(HashComparator.class);
  JobClient.runJob(job);
  return segment;
}
generateSegmentName produces a name string from the current time.
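From memory it looks roughly like the following (the exact date pattern is an assumption):

// Sketch of generateSegmentName (assumed pattern "yyyyMMddHHmmss";
// needs java.text.SimpleDateFormat and java.util.Date):
private static SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");

public static synchronized String generateSegmentName() {
  try {
    Thread.sleep(1000); // ensure successive calls yield distinct names
  } catch (Throwable t) {
  }
  return sdf.format(new Date(System.currentTimeMillis()));
}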
partitionSegment creates a folder named with that timestamp under the crawl/segments directory referenced by segmentsDir, and then a crawl_generate folder inside it.
inputDir points to one of the fetchlist-* folders under tempDir.
The job set up next is straightforward: SelectorInverseMapper's map does no real work, simply re-emitting each SelectorEntry keyed by its URL; URLPartitioner's getPartition was analyzed above; PartitionReducer's reduce extracts the CrawlDatum from each SelectorEntry and writes it into the crawl_generate folder.
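A sketch of what PartitionReducer.reduce amounts to, assuming it only unwraps each entry (the actual source may differ in details):

// Sketch of PartitionReducer.reduce: unwrap each SelectorEntry into
// (url, CrawlDatum) pairs for the final crawl_generate output.
public void reduce(Text key, Iterator<SelectorEntry> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  while (values.hasNext()) {
    SelectorEntry entry = values.next();
    output.collect(entry.url, entry.datum);
  }
}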
To summarize, Nutch's generate command processes and filters the URL records stored in crawl/crawldb and finally writes them into crawl/segments/<time>/crawl_generate.