Nutch Source Code Analysis: the generate Command (Part 2)

2021SC@SDUSC

Selector::map->DefaultFetchSchedule::shouldFetch

public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // If the scheduled fetch time lies too far in the future, clamp the
  // fetch interval to 0.9 * maxInterval and re-schedule the page now.
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  // Not due yet: skip this url in the current generate run.
  if (datum.getFetchTime() > curTime) {
    return false;
  }
  return true;
}

The shouldFetch function above clamps fetch intervals that exceed maxInterval and returns false when a url's next fetch time has not yet arrived. Further down in Selector's map function, if the url's CrawlDatum carries a value under WRITABLE_GENERATE_TIME_KEY (i.e. it is not null) and the last generate of this url has not yet expired (the expiry period genDelay defaults to 7 days), the url is not selected again; the relevant lines are sketched below.
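
A near-verbatim sketch of that guard inside Selector::map (crawlDatum, genDelay, and curTime are fields and arguments of the surrounding map function):

LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData()
    .get(Nutch.WRITABLE_GENERATE_TIME_KEY);
if (oldGenTime != null) { // url was selected recently and is awaiting fetch & update
  if (oldGenTime.get() + genDelay > curTime) // still within genDelay: skip it
    return;
}
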
Next, generatorSortValue assigns the url a score; internally, CrawlDatum's evaluate function runs a Jexl expression against the data held in the CrawlDatum, which we will not follow further here. The rest of the map function checks whether the url's status, score, and fetch interval meet the configured requirements.
The output of Selector's map function is dispatched to the reducers by Selector's getPartition function.

Selector::getPartition

public int getPartition(FloatWritable key, Writable value,
    int numReduceTasks) {
  return partitioner.getPartition(((SelectorEntry) value).url, key,
      numReduceTasks);
}

The partitioner defaults to URLPartitioner, and the numReduceTasks parameter gives the number of reduce tasks. The integer returned by URLPartitioner's getPartition determines which Reducer processes a given record of the Mapper's output.
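
For reference, this is the contract URLPartitioner implements, the old-API Hadoop Partitioner interface (org.apache.hadoop.mapred):

public interface Partitioner<K2, V2> extends JobConfigurable {
  // Return a partition number in [0, numPartitions) for the given record.
  int getPartition(K2 key, V2 value, int numPartitions);
}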

Selector::getPartition->URLPartitioner::getPartition

public int getPartition(Text key, Writable value, int numReduceTasks) {
  String urlString = key.toString();
  URL url = null;
  int hashCode = urlString.hashCode();

  try {
    urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_PARTITION);
    url = new URL(urlString);
    hashCode = url.getHost().hashCode();
  } catch (MalformedURLException e) {
    LOG.warn("Malformed URL: '" + urlString + "'");
  }

  // Depending on the partition mode, hash the domain name or the
  // resolved IP address instead of the host name.
  if (mode.equals(PARTITION_MODE_DOMAIN) && url != null)
    hashCode = URLUtil.getDomainName(url).hashCode();
  else if (mode.equals(PARTITION_MODE_IP)) {
    try {
      InetAddress address = InetAddress.getByName(url.getHost());
      hashCode = address.getHostAddress().hashCode();
    } catch (UnknownHostException e) {
      LOG.info("Couldn't find IP for host: " + url.getHost());
    }
  }

  // The per-job random seed makes hosts land on different reducers across runs.
  hashCode ^= seed;
  return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
}

Depending on mode, getPartition extracts the host name, domain name, or host IP address from the url, hashes it, XORs the hash with a per-job random seed, and finally takes it modulo numReduceTasks to decide which Reducer handles the url. Hashing by host (or domain, or IP) keeps all urls of the same site on the same reducer.
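
A minimal standalone illustration of the same scheme (not Nutch code; the class and method names are made up): all urls of one host map to one partition, and masking the sign bit keeps the result non-negative even when hashCode is negative.

import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitionDemo {

  static int partition(String urlString, int seed, int numReduceTasks)
      throws MalformedURLException {
    int hashCode = new URL(urlString).getHost().hashCode();
    hashCode ^= seed; // spread hosts across different reducers on different runs
    // hashCode may be negative; masking the sign bit keeps the result in [0, numReduceTasks)
    return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Same host, so both urls land on the same reducer.
    System.out.println(partition("http://example.com/a", 42, 5));
    System.out.println(partition("http://example.com/b", 42, 5));
  }
}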

Finally, let's look at Selector's reduce function.

Selector::reduce

public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {

  while (values.hasNext()) {

    // The current segment is full: move on to the next one, or stop
    // once maxNumSegments segments have been filled.
    if (count == limit) {
      if (currentsegmentnum < maxNumSegments) {
        count = 0;
        currentsegmentnum++;
      } else
        break;
    }

    SelectorEntry entry = values.next();
    Text url = entry.url;
    String urlString = url.toString();
    URL u = null;

    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      // Count either by domain or by host, depending on the configuration.
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
    } catch (Exception e) {
      LOG.warn("Malformed URL: '" + urlString + "', skipping");
      continue;
    }

    hostordomain = hostordomain.toLowerCase();

    // A per-host/domain limit on the number of urls is in effect.
    if (maxCount > 0) {
      // hostCount[0] = segment currently assigned to this host/domain,
      // hostCount[1] = number of its urls already in that segment.
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[] { 1, 0 };
        hostCounts.put(hostordomain, hostCount);
      }

      hostCount[1]++;
      // Skip over segments that have already reached the global limit.
      while (segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      }

      // The host/domain exhausted its quota in the current segment:
      // spill it into the next segment, or drop the url if none is left.
      if (hostCount[1] >= maxCount) {
        if (hostCount[0] < maxNumSegments) {
          hostCount[0]++;
          hostCount[1] = 0;
        } else {
          continue;
        }
      }
      entry.segnum = new IntWritable(hostCount[0]);
      segCounts[hostCount[0] - 1]++;
    } else {
      entry.segnum = new IntWritable(currentsegmentnum);
      segCounts[currentsegmentnum - 1]++;
    }

    output.collect(key, entry);
    count++;
  }
}

The core job of the reduce function is to decide which segment each url belongs to; urls generated at different times end up in different subdirectories under crawl/segments. When a per-host limit is in effect (maxCount > 0), hostCounts tracks, for each host or domain, the segment currently being filled and how many of that host's urls it already holds.
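
To make the per-host bookkeeping concrete, here is a small self-contained simulation (all names are made up; it omits the global per-segment limit check and simplifies the off-by-one counting): each host may contribute at most MAX_PER_HOST urls to a segment before spilling into the next one, and urls are dropped once the last segment is full for that host.

import java.util.HashMap;
import java.util.Map;

public class SegmentAssignDemo {

  static final int MAX_PER_HOST = 2; // stands in for generate.max.count
  static final int MAX_SEGMENTS = 3; // stands in for maxNumSegments

  static Map<String, int[]> hostCounts = new HashMap<>();

  // Returns the 1-based segment number for the host's next url, or -1 to drop it.
  static int assignSegment(String host) {
    int[] hc = hostCounts.computeIfAbsent(host, h -> new int[] { 1, 0 });
    hc[1]++;
    if (hc[1] > MAX_PER_HOST) { // quota in the current segment exhausted
      if (hc[0] >= MAX_SEGMENTS)
        return -1; // no segment left for this host: drop the url
      hc[0]++; // spill into the next segment
      hc[1] = 1;
    }
    return hc[0];
  }

  public static void main(String[] args) {
    for (int i = 0; i < 7; i++)
      System.out.println("url" + i + " -> segment " + assignSegment("example.com"));
    // Prints segments 1, 1, 2, 2, 3, 3, then -1 (dropped).
  }
}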

Generator::run->generate
Part 2

public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
    long curTime, boolean filter, boolean norm, boolean force,
    int maxNumSegments, String expr) throws IOException {

  ...

  List<Path> generatedSegments = new ArrayList<Path>();

  // Partition each "fetchlist-*" directory produced by the Selector job
  // into a new segment under the segments directory.
  FileStatus[] status = fs.listStatus(tempDir);
  for (FileStatus stat : status) {
    Path subfetchlist = stat.getPath();
    if (!subfetchlist.getName().startsWith("fetchlist-"))
      continue;
    Path newSeg = partitionSegment(fs, segments, subfetchlist, numLists);
    generatedSegments.add(newSeg);
  }

  // Optionally record the generate time of the selected urls back into the
  // crawldb, so they are not selected again before genDelay expires.
  if (getConf().getBoolean(GENERATE_UPDATE_CRAWLDB, false)) {
    Path tempDir2 = new Path(getConf().get("mapred.temp.dir", ".")
        + "/generate-temp-" + java.util.UUID.randomUUID().toString());

    job = new NutchJob(getConf());
    job.setJobName("generate: updatedb " + dbDir);
    job.setLong(Nutch.GENERATE_TIME_KEY, generateTime);
    for (Path segmpaths : generatedSegments) {
      Path subGenDir = new Path(segmpaths, CrawlDatum.GENERATE_DIR_NAME);
      FileInputFormat.addInputPath(job, subGenDir);
    }
    FileInputFormat.addInputPath(job, new Path(dbDir, CrawlDb.CURRENT_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(CrawlDbUpdater.class);
    job.setReducerClass(CrawlDbUpdater.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    FileOutputFormat.setOutputPath(job, tempDir2);

    JobClient.runJob(job);
    CrawlDb.install(job, dbDir);

    fs.delete(tempDir2, true);
  }

  LockUtil.removeLockFile(fs, lock);
  fs.delete(tempDir, true);

  Path[] patharray = new Path[generatedSegments.size()];
  return generatedSegments.toArray(patharray);
}

The tempDir directory holds the output of the earlier reducers. The loop walks the subdirectories of tempDir whose names start with "fetchlist-"; for each one, partitionSegment runs a Hadoop job that writes its contents into a new segment under crawl/segments and returns the segment path.
The following if statement, entered when GENERATE_UPDATE_CRAWLDB is enabled, merges the data in crawlDb with the data in the crawl_generate directories so that the selected urls are not generated again too soon; we will not follow it further in this chapter.
Finally the lock file and the temporary directory tempDir are removed, and the segment paths under crawl/segments (e.g. "crawl/segments/2016*", each containing a crawl_generate subdirectory) are returned.

Generator::run->generate->partitionSegment

private Path partitionSegment(FileSystem fs, Path segmentsDir, Path inputDir,
    int numLists) throws IOException {
  Path segment = new Path(segmentsDir, generateSegmentName());
  Path output = new Path(segment, CrawlDatum.GENERATE_DIR_NAME);

  NutchJob job = new NutchJob(getConf());
  job.setJobName("generate: partition " + segment);

  job.setInt("partition.url.seed", new Random().nextInt());

  FileInputFormat.addInputPath(job, inputDir);
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setMapperClass(SelectorInverseMapper.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SelectorEntry.class);
  job.setPartitionerClass(URLPartitioner.class);
  job.setReducerClass(PartitionReducer.class);
  job.setNumReduceTasks(numLists);

  FileOutputFormat.setOutputPath(job, output);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  job.setOutputKeyComparatorClass(HashComparator.class);
  JobClient.runJob(job);
  return segment;
}

The generateSegmentName function builds a string from the current time.
partitionSegment creates, under the crawl/segments directory that segmentsDir points to, a directory named after that timestamp, and inside it a crawl_generate directory.
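
A minimal sketch of how such a timestamped segment path is assembled (SegmentNameDemo is a made-up name; Nutch's generateSegmentName formats the current time with a "yyyyMMddHHmmss" SimpleDateFormat):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.fs.Path;

public class SegmentNameDemo {
  public static void main(String[] args) {
    String name = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    Path segment = new Path(new Path("crawl/segments"), name);
    Path output = new Path(segment, "crawl_generate"); // CrawlDatum.GENERATE_DIR_NAME
    System.out.println(output); // e.g. crawl/segments/20211024153000/crawl_generate
  }
}
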
inputDir points to one of the fetchlist-* directories under tempDir.
The job is then configured: SelectorInverseMapper's map function does essentially nothing, it just re-emits each SelectorEntry keyed by its url; URLPartitioner's getPartition was analyzed above; PartitionReducer's reduce extracts the CrawlDatum from each SelectorEntry and writes it into the crawl_generate directory, as sketched below.
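
Reconstructed from the Nutch source, PartitionReducer's reduce is roughly:

public void reduce(Text key, Iterator<SelectorEntry> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  // With HashComparator, distinct urls can share one input key after a hash
  // collision, so the url is taken from each value rather than from the key.
  while (values.hasNext()) {
    SelectorEntry entry = values.next();
    output.collect(entry.url, entry.datum);
  }
}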

To sum up, nutch's generate command takes the url addresses stored in crawl/crawldb, processes and filters them, and finally writes them into a crawl/segments/<timestamp>/crawl_generate directory.
