2021SC@SDUSC
This chapter begins the analysis of the last step of the Nutch source code: building the index on the Solr server via the command "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize".
First, look at the relevant section of the nutch launch script:
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
shift
For solrindex, the script therefore runs the main function of IndexingJob, with the argument http://localhost:8983/solr stored in the property named solr.server.url (shift removes it from the remaining arguments).
public static void main(String[] args) throws Exception {
  // ToolRunner applies the generic Hadoop options (including the
  // -D solr.server.url=... pair) to the configuration, then calls run(args)
  final int res = ToolRunner.run(NutchConfiguration.create(),
      new IndexingJob(), args);
  System.exit(res);
}
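Under the hood, ToolRunner delegates this to Hadoop's GenericOptionsParser. A minimal standalone sketch of the mechanism (DOptionDemo is my own illustration, not Nutch code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class DOptionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. args = { "-D", "solr.server.url=http://localhost:8983/solr", "crawl/crawldb/" }
    String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
    // the -D pair has become a Configuration property, not a program argument
    System.out.println(conf.get("solr.server.url")); // http://localhost:8983/solr
    System.out.println(remaining[0]);                // crawl/crawldb/
  }
}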
public int run(String[] args) throws Exception {
  // argument parsing elided: the command-line options are mapped onto the
  // parameters below and then forwarded to index(...)
  index(crawlDb, linkDb, segments, noCommit, deleteGone, params, filter,
      normalize, addBinaryContent, base64);
  return 0;
}
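How might those parameters be filled in? A hypothetical sketch of the elided parsing, based only on the flags in the command at the top (the names and control flow are my assumptions, not the Nutch source):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// hypothetical helper, not the actual run() body
static void parseIndexArgs(String[] args, Configuration conf) throws IOException {
  Path crawlDb = new Path(args[0]);           // crawl/crawldb/
  Path linkDb = null;
  List<Path> segments = new ArrayList<>();
  boolean filter = false, normalize = false;
  for (int i = 1; i < args.length; i++) {
    if ("-linkdb".equals(args[i])) {
      linkDb = new Path(args[++i]);           // crawl/linkdb/
    } else if ("-dir".equals(args[i])) {
      // treat every child directory of crawl/segments/ as one segment
      Path dir = new Path(args[++i]);
      for (FileStatus f : dir.getFileSystem(conf).listStatus(dir)) {
        if (f.isDirectory()) {
          segments.add(f.getPath());
        }
      }
    } else if ("-filter".equals(args[i])) {
      filter = true;
    } else if ("-normalize".equals(args[i])) {
      normalize = true;
    }
  }
  // the parsed values would then be handed to index(crawlDb, linkDb, segments, ...)
}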
public void index(Path crawlDb, Path linkDb, List<Path> segments,
    boolean noCommit, boolean deleteGone, String params,
    boolean filter, boolean normalize, boolean addBinaryContent,
    boolean base64) throws IOException {
  final JobConf job = new NutchJob(getConf());
  job.setJobName("Indexer");
  IndexWriters writers = new IndexWriters(getConf());
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job, addBinaryContent);
  ...
  // the job writes to a uniquely named temporary directory; the documents
  // themselves reach Solr through IndexerOutputFormat
  final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
      + new Random().nextInt());
  FileOutputFormat.setOutputPath(job, tmp);
  RunningJob indexJob = JobClient.runJob(job);
  writers.open(job, "commit");
  writers.commit();
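Since the temporary output directory holds nothing of value once the writers have committed, it has to be removed. A hedged sketch of that cleanup (cleanupTmp is my own helper name; the full index() method presumably does the equivalent in a finally block):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// hypothetical helper, not part of IndexingJob
static void cleanupTmp(Path tmp, Configuration conf) throws IOException {
  FileSystem fs = tmp.getFileSystem(conf);
  if (fs.exists(tmp)) {
    fs.delete(tmp, true); // recursive delete of the throwaway job output
  }
}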
Next, IndexerMapReduce::initMRJob:
public static void initMRJob(Path crawlDb, Path linkDb,
    Collection<Path> segments, JobConf job, boolean addBinaryContent) {
  for (final Path segment : segments) {
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.FETCH_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.PARSE_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
    if (addBinaryContent) {
      FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
    }
  }
  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
  if (linkDb != null) {
    Path currentLinkDb = new Path(linkDb, LinkDb.CURRENT_NAME);
    FileInputFormat.addInputPath(job, currentLinkDb);
  }
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);
  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class);
  job.setOutputValueClass(NutchWritable.class);
}
initMRJob sets the job's input to the crawl_fetch, crawl_parse, parse_data, and parse_text directories under each segment in crawl/segments/*/ (plus the content directory when addBinaryContent is set), together with the current directory under crawl/crawldb and, since a link database was passed, the current directory under crawl/linkdb. Both the Mapper and the Reducer are set to IndexerMapReduce, and the output format to IndexerOutputFormat. Let's look at each in turn.
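For the crawl directory used by the command at the top, and assuming a single hypothetical segment named 20211101123456, the resulting input paths would be:

crawl/segments/20211101123456/crawl_fetch
crawl/segments/20211101123456/crawl_parse
crawl/segments/20211101123456/parse_data
crawl/segments/20211101123456/parse_text
crawl/crawldb/current
crawl/linkdb/current

(crawl/segments/20211101123456/content would be added as well if -addBinaryContent were passed; the command above does not pass it.)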
IndexerMapReduce::map:
public void map(Text key, Writable value,
    OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {
  // normalize the URL and run it through the URL filters; a null result
  // means the record is dropped from indexing
  String urlString = filterUrl(normalizeUrl(key.toString()));
  if (urlString == null) {
    return;
  } else {
    key.set(urlString);
  }
  output.collect(key, new NutchWritable(value));
}
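normalizeUrl and filterUrl are private helpers of IndexerMapReduce that wrap Nutch's URLNormalizers and URLFilters plugin chains, set up when the task is configured. A hedged sketch of their shape; the exact bodies in the source may differ:

// normalize and filter correspond to the -normalize / -filter flags;
// urlNormalizers and urlFilters are assumed fields holding the plugin chains
private String normalizeUrl(String url) {
  if (!normalize) {
    return url;
  }
  try {
    return urlNormalizers.normalize(url, URLNormalizers.SCOPE_INDEXER);
  } catch (Exception e) {
    return null; // malformed URL: drop the record
  }
}

private String filterUrl(String url) {
  if (!filter) {
    return url;
  }
  try {
    return urlFilters.filter(url);
  } catch (Exception e) {
    return null; // rejected by a filter plugin
  }
}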
The rest will be covered in the next post.