Nutch Source Code Analysis: Generator

MapReduce 1: select the URLs to fetch
[list]
[*] Input: the crawl database (CrawlDb) files

public Path generate(...) {
  ...
  // read <url, CrawlDatum> pairs from the current crawl database
  job.setInputPath(new Path(dbDir, CrawlDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);
}


[*] Map(): if the fetch date <= now, invert the pair to <CrawlDatum, url>


/** Selects entries due for fetch. */
public static class Selector implements Mapper ... {

  private SelectorEntry entry = new SelectorEntry();

  /** Select & invert subset due for fetch. */
  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    Text url = (Text) key;
    ...
    CrawlDatum crawlDatum = (CrawlDatum) value;

    if (crawlDatum.getStatus() == CrawlDatum.STATUS_DB_GONE ||
        crawlDatum.getStatus() == CrawlDatum.STATUS_DB_REDIR_PERM)
      return;                                    // don't retry

    if (crawlDatum.getFetchTime() > curTime)
      return;                                    // not time yet

    LongWritable oldGenTime =
        (LongWritable) crawlDatum.getMetaData().get(Nutch.WRITABLE_GENERATE_TIME_KEY);
    if (oldGenTime != null) {                    // awaiting fetch & update
      if (oldGenTime.get() + genDelay > curTime) // still waiting for the update
        return;
    }
    ...
    // record generation time
    crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
    entry.datum = crawlDatum;
    entry.url = (Text) key;
    output.collect(sortValue, entry);            // invert for sort by score
  }
}

[*] Partition the entries with a hash function seeded by a random integer



/**
 * Generate fetchlists in a segment.
 * @return Path to generated segment, or null if no entries were selected.
 */
public Path generate(...) {
  ...
  // random seed so hosts land in different partitions on different runs
  job.setInt("partition.url.by.host.seed", new Random().nextInt());
}

public static class Selector implements Mapper, Partitioner, Reducer {

  private Partitioner hostPartitioner = new PartitionUrlByHost();
  ...
  /** Partition by host. */
  public int getPartition(WritableComparable key, Writable value,
                          int numReduceTasks) {
    // delegate: partition on the entry's url, not on the score key
    return hostPartitioner.getPartition(((SelectorEntry) value).url, key,
                                        numReduceTasks);
  }
  ...
}


/** Partition urls by hostname. */
public class PartitionUrlByHost implements Partitioner {

  private int seed;
  ...

  public void configure(JobConf job) {
    seed = job.getInt("partition.url.by.host.seed", 0);
    ...
  }

  /** Hash by hostname. */
  public int getPartition(WritableComparable key, Writable value,
                          int numReduceTasks) {
    ...
    // fall back to the raw string when the URL could not be parsed
    int hashCode = (url == null ? urlString : url.getHost()).hashCode();

    // make hosts wind up in different partitions on different runs
    hashCode ^= seed;

    // clear the sign bit so the index is non-negative, then take the modulus
    return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
  }
}
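The last line of getPartition() masks off the sign bit before taking the modulus, because Java's `%` can return a negative result when the hash is negative. A standalone sketch of that arithmetic (the host names and seed below are arbitrary examples):

```java
public class PartitionDemo {
    // Same non-negative index trick as getPartition() above:
    // clear the sign bit with Integer.MAX_VALUE, then take the modulus.
    static int partition(String host, int seed, int numReduceTasks) {
        int hashCode = host.hashCode();
        hashCode ^= seed; // spread hosts differently on each run
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] hosts = {"example.org", "nutch.apache.org", "a.b.c"};
        for (String h : hosts) {
            int p = partition(h, 12345, 4);
            if (p < 0 || p >= 4) throw new AssertionError("out of range: " + p);
        }
        System.out.println("ok"); // every host maps into [0, numReduceTasks)
    }
}
```

All URLs from the same host hash to the same partition, which keeps per-host politeness limits enforceable within a single reduce task.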


[*] Reduce() is the identity function
[*] Sort by CrawlDatum.linkCount in descending order
[*] Output the N CrawlDatum entries with the most links
[/list]
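The reduce side of this job is not shown above. Because the shuffle delivers entries in decreasing sort order, an identity reduce that simply stops after N entries emits the top N. A self-contained sketch of that selection logic (the Scored record and its field names are illustrative stand-ins, not Nutch classes):

```java
import java.util.*;

public class TopNDemo {
    // Stand-in for a scored fetchlist entry; field names are illustrative.
    record Scored(float score, String url) {}

    // Mimics the sort + cutoff: order by decreasing score, emit at most `limit`.
    static List<String> topN(List<Scored> entries, int limit) {
        List<Scored> sorted = new ArrayList<>(entries);
        // descending comparator, like sorting the map output keys by decreasing score
        sorted.sort((a, b) -> Float.compare(b.score(), a.score()));
        List<String> out = new ArrayList<>();
        for (Scored s : sorted) {
            if (out.size() >= limit) break; // reduce stops once the segment is full
            out.add(s.url());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Scored> entries = List.of(
            new Scored(1.5f, "http://b/"),
            new Scored(3.0f, "http://a/"),
            new Scored(0.2f, "http://c/"));
        System.out.println(topN(entries, 2)); // [http://a/, http://b/]
    }
}
```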

MapReduce 2: prepare for fetching
[list]
[*] Map() inverts back; Partition() partitions by host; Reduce() is the identity
[*] Reduce: merge each url's CrawlDatums into a single entry
[*] Output: a set of <url, CrawlDatum> files, to be fetched in parallel
[/list]
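The map of this second job only has to un-invert the SelectorEntry produced in MapReduce 1, restoring <url, CrawlDatum> pairs keyed by URL. A plain-Java sketch of that inversion (the Entry record stands in for Nutch's SelectorEntry; all names here are illustrative):

```java
import java.util.*;

public class InvertDemo {
    // Stand-in for Nutch's SelectorEntry (url + datum); names are illustrative.
    record Entry(String url, String datum) {}

    // MapReduce 2's map un-inverts: <score, Entry> back to <url, datum>.
    static Map<String, String> invert(List<Map.Entry<Float, Entry>> scored) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<Float, Entry> e : scored)
            out.put(e.getValue().url(), e.getValue().datum());
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Float, Entry>> scored = List.of(
            Map.entry(2.0f, new Entry("http://a/", "datumA")),
            Map.entry(1.0f, new Entry("http://b/", "datumB")));
        System.out.println(invert(scored)); // {http://a/=datumA, http://b/=datumB}
    }
}
```

With the URL restored as the key and the data partitioned by host, each fetcher task can process its own <url, CrawlDatum> file independently.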