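The Fetch phase overrides the partition method so that URLs are routed to a specific Reducer according to their host, domain, or IP, which keeps all URLs of one site on the same Reducer.
The code is as follows: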
public static class FetchEntryPartitioner
    extends Partitioner<IntWritable, FetchEntry> implements Configurable {
  // Delegate that does the actual host/domain/IP hashing
  private URLPartitioner partitioner = new URLPartitioner();
  private Configuration conf;

  @Override
  public int getPartition(IntWritable intWritable, FetchEntry fetchEntry, int numReduces) {
    // The entry key is a reversed URL; restore the normal form before partitioning
    String key = fetchEntry.getKey();
    String url = TableUtil.unreverseUrl(key);
    return partitioner.getPartition(url, numReduces);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Pass the configuration on to the delegate as well
    partitioner.setConf(conf);
  }
}
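getPartition(Text key, Text value, int numPartitions): the input is a <key, value> pair emitted by the Map phase together with the number of Reducers; the output is the integer index of the Reducer assigned to that pair, in the range 0 to numPartitions - 1.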
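The contract described above can be illustrated with a small, self-contained sketch. It does not depend on Hadoop; the class name PartitionContractSketch and the helper partitionFor are hypothetical names chosen for this illustration. The sketch only shows the two properties a partition function needs: the result always falls in the range 0 to numPartitions - 1, and the same key always yields the same index. Masking with Integer.MAX_VALUE is the same idiom Hadoop's default HashPartitioner uses to keep negative hash codes from producing a negative index.

```java
final class PartitionContractSketch {

    // Hypothetical helper: maps any key to an index in [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        // Clearing the sign bit guarantees a non-negative value before the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the value itself, so this key has a negative hash.
        Object negativeKey = Integer.valueOf(-7);

        // A plain modulo on a negative hash gives a negative, unusable index.
        System.out.println("plain modulo:  " + (negativeKey.hashCode() % 4));

        // The masked version stays inside the valid range.
        System.out.println("masked modulo: " + partitionFor(negativeKey, 4));
    }
}
```

A plain `hashCode % numPartitions` would be enough if hash codes were never negative; the mask is what makes the function safe for arbitrary keys.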
The concrete implementation of getPartition in Nutch 2.0:
public int getPartition(String urlString, int numReduceTasks) {
  if (numReduceTasks == 1) {
    //this check can be removed when we use Hadoop with MAPREDUCE-1287
    return 0;
  }

  int hashCode;
  URL url = null;
  try {
    // Normalize first so that equivalent URLs hash identically
    urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_PARTITION);
    hashCode = urlString.hashCode();
    url = new URL(urlString);
  } catch (MalformedURLException e) {
    // Unparseable URL: fall back to hashing the raw string
    LOG.warn("Malformed URL: '" + urlString + "'");
    hashCode = urlString.hashCode();
  }

  if (url != null) {
    if (mode.equals(PARTITION_MODE_HOST)) {
      hashCode = url.getHost().hashCode();
    } else if (mode.equals(PARTITION_MODE_DOMAIN)) {
      hashCode = URLUtil.getDomainName(url).hashCode();
    } else { // MODE IP
      try {
        InetAddress address = InetAddress.getByName(url.getHost());
        hashCode = address.getHostAddress().hashCode();
      } catch (UnknownHostException e) {
        // DNS lookup failed: hashCode keeps the hash of the whole URL string
        GeneratorJob.LOG.info("Couldn't find IP for host: " + url.getHost());
      }
    }
  }

  // make hosts wind up in different partitions on different runs
  hashCode ^= seed;
  // Clear the sign bit, then map the hash into [0, numReduceTasks)
  return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
}
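To make the effect of host mode concrete, here is a minimal standalone sketch that mirrors only the host branch of the method above. It is a simplification under stated assumptions, not the Nutch code: URL normalization, the domain and IP modes, and logging are left out, the seed is passed in as a parameter instead of being read from the configuration, and the names HostPartitionSketch and partitionByHost are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class HostPartitionSketch {

    // Simplified stand-in for host mode: hash the host, mix in the seed,
    // then map the result into [0, numReduceTasks).
    static int partitionByHost(String urlString, int numReduceTasks, int seed) {
        if (numReduceTasks == 1) {
            return 0;
        }
        int hashCode;
        try {
            hashCode = new URL(urlString).getHost().hashCode();
        } catch (MalformedURLException e) {
            // Unparseable URL: fall back to hashing the whole string.
            hashCode = urlString.hashCode();
        }
        hashCode ^= seed;
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int seed = 42; // arbitrary value chosen for this illustration
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/docs/page.html",
            "http://example.org/index.html"
        };
        for (String u : urls) {
            System.out.println(u + " -> reducer " + partitionByHost(u, 8, seed));
        }
    }
}
```

The two example.com URLs always land on the same reducer because only the host contributes to the hash, which is what allows a single reducer to handle all requests to one site. Changing the seed may move a host to a different reducer, but all URLs of that host still move together.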