2021SC@SDUSC
IndexerMapReduce::reduce
public void reduce(Text key, Iterator<NutchWritable> values,
    OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    throws IOException {
  Inlinks inlinks = null;
  CrawlDatum dbDatum = null;
  CrawlDatum fetchDatum = null;
  Content content = null;
  ParseData parseData = null;
  ParseText parseText = null;
  while (values.hasNext()) {
    final Writable value = values.next().get(); // unwrap
    if (value instanceof Inlinks) {
      inlinks = (Inlinks) value;
    } else if (value instanceof CrawlDatum) {
      final CrawlDatum datum = (CrawlDatum) value;
      if (CrawlDatum.hasDbStatus(datum)) {
        dbDatum = datum;
      } else if (CrawlDatum.hasFetchStatus(datum)) {
        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
          fetchDatum = datum;
        }
      } else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
          || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
          || CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
        continue;
      }
    } else if (value instanceof ParseData) {
      parseData = (ParseData) value;
      if (deleteRobotsNoIndex) {
        String robotsMeta = parseData.getMeta("robots");
        if (robotsMeta != null
            && robotsMeta.toLowerCase().indexOf("noindex") != -1) {
          output.collect(key, DELETE_ACTION);
          return;
        }
      }
    } else if (value instanceof ParseText) {
      parseText = (ParseText) value;
    } else if (value instanceof Content) {
      content = (Content) value;
    }
  }
  ...
  NutchDocument doc = new NutchDocument();
  doc.add("id", key.toString());
  final Metadata metadata = parseData.getContentMeta();
  doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));
  doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));
  final Parse parse = new ParseImpl(parseText, parseData);
  float boost = 1.0f;
  boost = this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum, parse,
      inlinks, boost);
  doc.setWeight(boost);
  doc.add("boost", Float.toString(boost));
  fetchDatum.setSignature(dbDatum.getSignature());
  final Text url = (Text) dbDatum.getMetaData().get(
      Nutch.WRITABLE_REPR_URL_KEY);
  String urlString = filterUrl(normalizeUrl(url.toString()));
  url.set(urlString);
  fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
  doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
  if (content != null) {
    String binary;
    if (base64) {
      binary = Base64.encodeBase64String(content.getContent());
    } else {
      binary = new String(content.getContent());
    }
    doc.add("binaryContent", binary);
  }
  NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
  output.collect(key, action);
}
The reduce function first gathers the inputs from the various directories: the CrawlDatum stored in crawl_fetch under crawl/segments/*/, the CrawlDatum stored in crawl_parse, the ParseData in parse_data, the ParseText in parse_text, the Content in content, the CrawlDatum under crawl/crawldb/current, and the Inlinks under crawl/linkdb.
The elided portion checks whether the record should be deleted or skipped.
The reduce function then creates a NutchDocument and populates the Lucene fields, setting each field name and value: id is the URL, segment is the segment name (the directory name under crawl/segments), digest is the signature, boost is the document score computed by the indexerScore function, and binaryContent is the raw, unparsed content of the document, i.e., with its markup. Finally, all of this information is wrapped in a NutchIndexAction.
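The field layout described above can be sketched with a plain Map standing in for NutchDocument (the field names match the real ones, but the buildDoc helper and its arguments are hypothetical):

```java
// Sketch of the field layout built in reduce, assuming a plain Map
// in place of NutchDocument. Field names mirror the real ones.
import java.util.LinkedHashMap;
import java.util.Map;

public class DocSketch {
  public static Map<String, String> buildDoc(String url, String segment,
      String digest, float boost, String rawContent) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("id", url);                      // the page URL is the key
    doc.put("segment", segment);             // directory under crawl/segments
    doc.put("digest", digest);               // content signature
    doc.put("boost", Float.toString(boost)); // indexerScore result
    doc.put("binaryContent", rawContent);    // unparsed content, markup included
    return doc;
  }

  public static void main(String[] args) {
    Map<String, String> doc = buildDoc("http://example.com/", "20211101",
        "abc123", 1.0f, "<html>...</html>");
    System.out.println(doc.get("id"));
    System.out.println(doc.get("boost"));
  }
}
```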
Next, let's look at how IndexerOutputFormat writes the NutchIndexAction into the temporary files.
IndexerOutputFormat::getRecordWriter
public RecordWriter<Text, NutchIndexAction> getRecordWriter(
    FileSystem ignored, JobConf job, String name, Progressable progress)
    throws IOException {
  final IndexWriters writers = new IndexWriters(job);
  writers.open(job, name);
  return new RecordWriter<Text, NutchIndexAction>() {
    public void close(Reporter reporter) throws IOException {
      writers.close();
    }

    public void write(Text key, NutchIndexAction indexAction)
        throws IOException {
      if (indexAction.action == NutchIndexAction.ADD) {
        writers.write(indexAction.doc);
      } else if (indexAction.action == NutchIndexAction.DELETE) {
        writers.delete(key.toString());
      }
    }
  };
}
Here writers is created as a SolrIndexWriter. Its open function establishes the connection to the Solr server, and its write function finally sends the data to the Solr server to build the index. Let's examine each in turn.
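The ADD/DELETE dispatch in the RecordWriter above can be sketched in isolation; the Writers class and the action constants below are hypothetical stand-ins for IndexWriters and NutchIndexAction:

```java
// Sketch of the ADD/DELETE dispatch in the RecordWriter: an ADD action
// forwards the document, a DELETE forwards the key. Writers and the
// constants are stand-ins, not the real Nutch classes.
import java.util.ArrayList;
import java.util.List;

public class ActionDispatchSketch {
  static final int ADD = 0;    // hypothetical stand-in values
  static final int DELETE = 1;

  static class Writers {       // stand-in for IndexWriters
    List<String> log = new ArrayList<>();
    void write(String doc) { log.add("write:" + doc); }
    void delete(String key) { log.add("delete:" + key); }
  }

  static void dispatch(Writers writers, String key, String doc, int action) {
    if (action == ADD) {
      writers.write(doc);
    } else if (action == DELETE) {
      writers.delete(key);
    }
  }

  public static void main(String[] args) {
    Writers w = new Writers();
    dispatch(w, "http://example.com/a", "docA", ADD);
    dispatch(w, "http://example.com/b", null, DELETE);
    System.out.println(w.log);
  }
}
```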
SolrIndexWriter::open
public void open(JobConf job, String name) throws IOException {
  solrClients = SolrUtils.getSolrClients(job);
  init(solrClients, job);
}

public static ArrayList<SolrClient> getSolrClients(JobConf job)
    throws MalformedURLException {
  String[] urls = job.getStrings(SolrConstants.SERVER_URL);
  ArrayList<SolrClient> solrClients = new ArrayList<SolrClient>();
  for (int i = 0; i < urls.length; i++) {
    SolrClient sc = new HttpSolrClient(urls[i]);
    solrClients.add(sc);
  }
  return solrClients;
}
The main job of SolrIndexWriter's open function is to create an HttpSolrClient connection for each configured Solr server address and then call init to initialize it.
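The one-client-per-URL setup in getSolrClients can be sketched without SolrJ on the classpath; the Client class below is a hypothetical stand-in for HttpSolrClient:

```java
// Sketch of the per-URL client setup in getSolrClients, assuming a
// hypothetical Client class in place of HttpSolrClient. One client is
// created per configured Solr endpoint.
import java.util.ArrayList;
import java.util.List;

public class ClientListSketch {
  static class Client {          // stand-in for HttpSolrClient
    final String baseUrl;
    Client(String baseUrl) { this.baseUrl = baseUrl; }
  }

  public static List<Client> getClients(String[] urls) {
    List<Client> clients = new ArrayList<>();
    for (String url : urls) {
      clients.add(new Client(url)); // one connection per server address
    }
    return clients;
  }

  public static void main(String[] args) {
    List<Client> clients = getClients(
        new String[] { "http://localhost:8983/solr/nutch" });
    System.out.println(clients.size());
    System.out.println(clients.get(0).baseUrl);
  }
}
```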
SolrIndexWriter::write
public void write(NutchDocument doc) throws IOException {
  final SolrInputDocument inputDoc = new SolrInputDocument();
  for (final Entry<String, NutchField> e : doc) {
    for (final Object val : e.getValue().getValues()) {
      Object val2 = val;
      if (val instanceof Date) {
        val2 = DateUtil.getThreadLocalDateFormat().format(val);
      }
      if (e.getKey().equals("content") || e.getKey().equals("title")) {
        val2 = SolrUtils.stripNonCharCodepoints((String) val);
      }
      inputDoc.addField(solrMapping.mapKey(e.getKey()), val2, e.getValue()
          .getWeight());
      String sCopy = solrMapping.mapCopyKey(e.getKey());
      if (sCopy != e.getKey()) {
        inputDoc.addField(sCopy, val);
      }
    }
  }
  inputDoc.setDocumentBoost(doc.getWeight());
  inputDocs.add(inputDoc);
  totalAdds++;
  if (inputDocs.size() + numDeletes >= batchSize) {
    push();
  }
}
The write function copies each NutchDocument field into a SolrInputDocument, buffers it, and calls push once a full batch has accumulated. Inside the request path, the processor passed in defaults to BinaryResponseParser; the createMethod function wraps up an HTTP request, and executeMethod executes that request, then receives and processes the response.
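The batch-and-flush behavior of write can be sketched in isolation; the push method here just counts flushes instead of issuing a real HTTP request, and the class is a hypothetical simplification of SolrIndexWriter:

```java
// Sketch of the batch-and-flush logic in SolrIndexWriter.write:
// documents accumulate in a buffer and are pushed once batchSize
// is reached. push() is a stand-in for the real HTTP round trip.
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
  private final int batchSize;
  private final List<String> inputDocs = new ArrayList<>();
  int totalAdds = 0;
  int pushes = 0;

  BatchSketch(int batchSize) { this.batchSize = batchSize; }

  void write(String doc) {
    inputDocs.add(doc);
    totalAdds++;
    if (inputDocs.size() >= batchSize) {
      push();
    }
  }

  void push() {      // stand-in: clear the buffer, count the flush
    pushes++;
    inputDocs.clear();
  }

  public static void main(String[] args) {
    BatchSketch w = new BatchSketch(3);
    for (int i = 0; i < 7; i++) {
      w.write("doc" + i);
    }
    System.out.println(w.totalAdds); // 7 documents written
    System.out.println(w.pushes);    // 2 full batches flushed
  }
}
```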