Nutch 1.2 index explained

First, if the crawl/index and crawl/indexes directories already exist, they are deleted.

[img]http://dl.iteye.com/upload/attachment/0070/9519/a430b9dc-5f53-30cf-8a29-9fdcfd640db8.jpg[/img]
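map: IndexerMapReduce
The map input directories are crawl_fetch, crawl_parse, parse_data and parse_text of every segment, plus crawl/crawldb/current and crawl/linkdb/current.
1 The map task only merges these inputs: each value is wrapped in a NutchWritable and emitted under its original key, so that all records for the same URL reach the same reduce call. The code is as follows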
output.collect(key, new NutchWritable(value));
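The merge step can be modelled without Hadoop. The following is a minimal sketch, not Nutch code: `Record`, `Wrapped` and `mapAndGroup` are hypothetical names, and a `TreeMap` stands in for the shuffle that groups map output by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergeByUrlSketch {

  /** Hypothetical input record: a url key plus a value of any type. */
  static final class Record {
    final String url;
    final Object value;
    Record(String url, Object value) { this.url = url; this.value = value; }
  }

  /** Hypothetical stand-in for NutchWritable: one wrapper type for all value types. */
  static final class Wrapped {
    final Object value;
    Wrapped(Object value) { this.value = value; }
  }

  /** Emits (url, wrapped value) for every record and groups the results by url. */
  static Map<String, List<Wrapped>> mapAndGroup(List<Record> records) {
    Map<String, List<Wrapped>> grouped = new TreeMap<>();
    for (Record r : records) {
      grouped.computeIfAbsent(r.url, k -> new ArrayList<>()).add(new Wrapped(r.value));
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<Record> records = Arrays.asList(
        new Record("http://a/", "fetch record"),
        new Record("http://b/", "db record"),
        new Record("http://a/", "parse record"));
    for (Map.Entry<String, List<Wrapped>> e : mapAndGroup(records).entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().size() + " value(s)");
    }
  }
}
```

Wrapping every value in one type is what lets a single reduce call see records that originally came from different directories.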
reduce: IndexerMapReduce
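1 Loop over the values and extract four objects (fetchDatum, dbDatum, parseText, parseData). All four are present only when the page was both fetched and parsed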
if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return; // only have inlinks
}

2 Continue only if both the fetch and the parse succeeded
if (!parseData.getStatus().isSuccess() ||
    fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
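The two checks above can be put together in a small, self-contained model. This is a sketch under assumptions: `Datum`, `ParseData` and `ParseText` are simplified, hypothetical stand-ins for the Nutch classes, and the `fromDb` flag replaces the status-based test that tells a crawldb entry from a fetch entry.

```java
import java.util.Arrays;
import java.util.List;

class ReduceFilterSketch {

  /** Hypothetical stand-in for CrawlDatum; fromDb marks a crawldb entry. */
  static final class Datum {
    final boolean fromDb;
    final boolean fetchSuccess;
    Datum(boolean fromDb, boolean fetchSuccess) {
      this.fromDb = fromDb;
      this.fetchSuccess = fetchSuccess;
    }
  }

  /** Hypothetical stand-in for ParseData, reduced to its success flag. */
  static final class ParseData {
    final boolean success;
    ParseData(boolean success) { this.success = success; }
  }

  /** Hypothetical stand-in for ParseText. */
  static final class ParseText {
    final String text;
    ParseText(String text) { this.text = text; }
  }

  /** Returns true only if all four objects are present and fetch and parse both succeeded. */
  static boolean shouldIndex(List<Object> values) {
    Datum fetchDatum = null;
    Datum dbDatum = null;
    ParseData parseData = null;
    ParseText parseText = null;
    // Sort the mixed values for one url into the four slots.
    for (Object v : values) {
      if (v instanceof Datum) {
        Datum d = (Datum) v;
        if (d.fromDb) { dbDatum = d; } else { fetchDatum = d; }
      } else if (v instanceof ParseData) {
        parseData = (ParseData) v;
      } else if (v instanceof ParseText) {
        parseText = (ParseText) v;
      }
    }
    if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
      return false; // something is missing, e.g. only inlinks were seen
    }
    return parseData.success && fetchDatum.fetchSuccess;
  }

  public static void main(String[] args) {
    List<Object> complete = Arrays.<Object>asList(
        new Datum(true, false), new Datum(false, true),
        new ParseData(true), new ParseText("body"));
    System.out.println(shouldIndex(complete));
  }
}
```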
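3 Create a NutchDocument and add the segment, the signature and the other fields
4 Pass the document through IndexingFilters, which calls the filter method of each configured filter, here BasicIndexingFilter and AnchorIndexingFilter
5 BasicIndexingFilter sets host, site, url, content and title (a title longer than indexer.max.title.length is truncated), and also sets tstamp
6 AnchorIndexingFilter sets anchor
7 If doc is not null, call ScoringFilters to set the boost (weight)
8 Write the document. The job is configured with job.setOutputFormat(IndexerOutputFormat.class);
The getRecordWriter method of IndexerOutputFormat is as follows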

@Override
public RecordWriter<Text, NutchDocument> getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) throws IOException {

  // populate JobConf with field indexing options
  IndexingFilters filters = new IndexingFilters(job);

  [b]final NutchIndexWriter[] writers =
    NutchIndexWriterFactory.getNutchIndexWriters(job);[/b]

  for (final NutchIndexWriter writer : writers) {
    writer.open(job, name);
  }
  return new RecordWriter<Text, NutchDocument>() {

    public void close(Reporter reporter) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.close();
      }
    }

    public void write(Text key, NutchDocument doc) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.write(doc);
      }
    }
  };
}
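The record writer returned by getRecordWriter is a fan-out: every open, write and close call is forwarded to each configured writer. A minimal sketch of that pattern, with a hypothetical `DocWriter` interface in place of NutchIndexWriter and an in-memory `RecordingWriter` in place of a real index backend, might look like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FanOutWriterSketch {

  /** Hypothetical, reduced version of the writer contract. */
  interface DocWriter {
    void open(String name);
    void write(String doc);
    void close();
  }

  /** Test double that records every call it receives. */
  static final class RecordingWriter implements DocWriter {
    final List<String> events = new ArrayList<>();
    public void open(String name) { events.add("open:" + name); }
    public void write(String doc) { events.add("write:" + doc); }
    public void close() { events.add("close"); }
  }

  /** Forwards each call to all delegates, in registration order. */
  static final class FanOut implements DocWriter {
    private final List<DocWriter> writers;
    FanOut(List<DocWriter> writers) { this.writers = writers; }
    public void open(String name) { for (DocWriter w : writers) { w.open(name); } }
    public void write(String doc) { for (DocWriter w : writers) { w.write(doc); } }
    public void close() { for (DocWriter w : writers) { w.close(); } }
  }

  public static void main(String[] args) {
    RecordingWriter a = new RecordingWriter();
    RecordingWriter b = new RecordingWriter();
    DocWriter out = new FanOut(Arrays.<DocWriter>asList(a, b));
    out.open("part-00000");
    out.write("doc1");
    out.close();
    System.out.println(a.events);
    System.out.println(b.events);
  }
}
```

The benefit of this shape is that adding a second index backend only means registering another writer class; the output format itself does not change.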
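As the bold line in getRecordWriter shows, the writers come from NutchIndexWriterFactory. LuceneWriter is the one used here; it is registered through the following code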
@SuppressWarnings("unchecked")
public static NutchIndexWriter[] getNutchIndexWriters(Configuration conf) {
  final String[] classes = conf.getStrings("indexer.writer.classes");
  final NutchIndexWriter[] writers = new NutchIndexWriter[classes.length];
  for (int i = 0; i < classes.length; i++) {
    final String clazz = classes[i];
    try {
      final Class<NutchIndexWriter> implClass =
        (Class<NutchIndexWriter>) Class.forName(clazz);
      writers[i] = implClass.newInstance();
    } catch (final Exception e) {
      throw new RuntimeException("Couldn't create " + clazz, e);
    }
  }
  return writers;
}

public static void addClassToConf(Configuration conf,
    Class<? extends NutchIndexWriter> clazz) {
  final String classes = conf.get("indexer.writer.classes");
  final String newClass = clazz.getName();

  if (classes == null) {
    conf.set("indexer.writer.classes", newClass);
  } else {
    conf.set("indexer.writer.classes", classes + "," + newClass);
  }

}

NutchIndexWriterFactory.addClassToConf(job, LuceneWriter.class);
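The registration mechanism is a comma-separated list of class names stored under one configuration key and instantiated later by reflection. The sketch below reproduces that idea with `java.util.Properties` in place of the Hadoop Configuration; the class name `WriterClassListSketch` and the method `instantiate` are hypothetical, and ordinary JDK classes stand in for writer implementations. Unlike the factory above, this sketch returns an empty list when the key is unset and uses `getDeclaredConstructor().newInstance()` rather than `Class.newInstance()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class WriterClassListSketch {

  static final String KEY = "indexer.writer.classes";

  /** Appends a class name to the comma-separated list stored under KEY. */
  static void addClassToConf(Properties conf, Class<?> clazz) {
    String classes = conf.getProperty(KEY);
    String newClass = clazz.getName();
    conf.setProperty(KEY, classes == null ? newClass : classes + "," + newClass);
  }

  /** Creates one instance per listed class through its no-argument constructor. */
  static List<Object> instantiate(Properties conf) {
    List<Object> instances = new ArrayList<>();
    String classes = conf.getProperty(KEY);
    if (classes == null || classes.trim().isEmpty()) {
      return instances; // nothing registered (a choice made in this sketch)
    }
    for (String name : classes.split(",")) {
      try {
        instances.add(Class.forName(name.trim()).getDeclaredConstructor().newInstance());
      } catch (ReflectiveOperationException e) {
        throw new RuntimeException("Couldn't create " + name, e);
      }
    }
    return instances;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    addClassToConf(conf, ArrayList.class);     // stand-in for a writer class
    addClassToConf(conf, StringBuilder.class); // stand-in for a second writer class
    System.out.println(conf.getProperty(KEY));
    System.out.println(instantiate(conf).size());
  }
}
```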
The index writers are opened by this loop
[b]for (final NutchIndexWriter writer : writers) {
  writer.open(job, name);
}[/b]

The open method (of LuceneWriter) is as follows
public void open(JobConf job, String name)
    throws IOException {
  this.fs = FileSystem.get(job);
  perm = new Path(FileOutputFormat.getOutputPath(job), name);
  temp = job.getLocalPath("index/_" +
      Integer.toString(new Random().nextInt()));

  fs.delete(perm, true); // delete old, if any
  analyzerFactory = new AnalyzerFactory(job);
  writer = new IndexWriter(
      FSDirectory.open(new File(fs.startLocalOutput(perm, temp).toString())),
      new NutchDocumentAnalyzer(job), true, MaxFieldLength.UNLIMITED);

  writer.setMergeFactor(job.getInt("indexer.mergeFactor", 10));
  writer.setMaxBufferedDocs(job.getInt("indexer.minMergeDocs", 100));
  writer.setMaxMergeDocs(job
      .getInt("indexer.maxMergeDocs", Integer.MAX_VALUE));
  writer.setTermIndexInterval(job.getInt("indexer.termIndexInterval", 128));
  writer.setMaxFieldLength(job.getInt("indexer.max.tokens", 10000));
  writer.setInfoStream(LogUtil.getDebugStream(Indexer.LOG));
  writer.setUseCompoundFile(false);
  writer.setSimilarity(new NutchSimilarity());

  processOptions(job);
}
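In open, the index is not built directly at its permanent location: the old output is deleted and writing starts in a local temporary directory obtained through startLocalOutput. The article does not show the closing step, so the sketch below assumes that the finished files are moved to the permanent path afterwards. It uses only `java.nio.file`; `openLocal`, `publish` and `deleteRecursively` are hypothetical helper names, and the index directory is assumed to be flat (files only).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LocalThenPublishSketch {

  /** Deletes any old output at perm and returns a fresh local working directory. */
  static Path openLocal(Path perm) throws IOException {
    deleteRecursively(perm); // delete old, if any
    return Files.createTempDirectory("index_");
  }

  /** Assumed closing step: moves every file from temp into perm, then removes temp. */
  static void publish(Path temp, Path perm) throws IOException {
    Files.createDirectories(perm);
    try (DirectoryStream<Path> files = Files.newDirectoryStream(temp)) {
      for (Path f : files) {
        Files.move(f, perm.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
      }
    }
    Files.delete(temp);
  }

  /** Removes a file or directory tree; does nothing if the path does not exist. */
  static void deleteRecursively(Path root) throws IOException {
    if (!Files.exists(root)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(root)) {
      // Reverse order so children are deleted before their parent directory.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  public static void main(String[] args) throws IOException {
    Path base = Files.createTempDirectory("job_output_");
    Path perm = base.resolve("part-00000");
    Path temp = openLocal(perm);
    Files.write(temp.resolve("segments_1"), "demo".getBytes(StandardCharsets.UTF_8));
    publish(temp, perm);
    System.out.println(Files.exists(perm.resolve("segments_1")));
    deleteRecursively(base);
  }
}
```

Writing locally first keeps the many small writes of index construction off the shared filesystem; only the finished files are transferred.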
The write code is as follows

public void write(NutchDocument doc) throws IOException {
  final Document luceneDoc = createLuceneDoc(doc);
  final NutchAnalyzer analyzer = analyzerFactory.get(luceneDoc.get("lang"));
  if (Indexer.LOG.isDebugEnabled()) {
    Indexer.LOG.debug("Indexing [" + luceneDoc.get("url")
        + "] with analyzer " + analyzer + " (" + luceneDoc.get("lang")
        + ")");
  }
  writer.addDocument(luceneDoc, analyzer);

}
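The write method picks an analyzer from the document's lang field before adding the document. The sketch below illustrates a per-language lookup with a default; it is not the AnalyzerFactory implementation. Analyzers are modelled as plain functions from text to tokens, the fallback behaviour for a missing or unknown language is an assumption, and the "zh" per-character tokenizer is purely illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class AnalyzerLookupSketch {

  private final Map<String, Function<String, List<String>>> byLang = new HashMap<>();
  private final Function<String, List<String>> fallback;

  AnalyzerLookupSketch(Function<String, List<String>> fallback) {
    this.fallback = fallback;
  }

  void register(String lang, Function<String, List<String>> analyzer) {
    byLang.put(lang, analyzer);
  }

  /** Returns the analyzer registered for lang, or the fallback if there is none. */
  Function<String, List<String>> get(String lang) {
    if (lang == null) {
      return fallback; // document carries no lang field
    }
    return byLang.getOrDefault(lang, fallback);
  }

  /** Default tokenizer of this sketch: lower-case, split on whitespace. */
  static List<String> whitespace(String text) {
    String t = text.trim().toLowerCase(Locale.ROOT);
    return t.isEmpty() ? new ArrayList<String>() : Arrays.asList(t.split("\\s+"));
  }

  /** Illustrative tokenizer: one token per non-whitespace code point. */
  static List<String> perCharacter(String text) {
    List<String> tokens = new ArrayList<>();
    text.codePoints()
        .filter(cp -> !Character.isWhitespace(cp))
        .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
    return tokens;
  }

  public static void main(String[] args) {
    AnalyzerLookupSketch factory = new AnalyzerLookupSketch(AnalyzerLookupSketch::whitespace);
    factory.register("zh", AnalyzerLookupSketch::perCharacter);
    System.out.println(factory.get("en").apply("Hello Nutch Index"));
    System.out.println(factory.get("zh").apply("ab c"));
  }
}
```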
With the steps above, the index has been written.