- Environment Setup
Requirements (my choices):
- Java 1.5
- Apache Tomcat 5.x
- Win32 with Cygwin
- Nutch, heh
The Nutch 0.7.2 release didn't feel right to me personally (some say feelings can't be trusted; whatever). These days you can check out the source from the svn repository and build it with ant package, which leaves you with a 0.8 version of Nutch. The directory layout after the build:
nutch
bin -- launcher scripts
build -- compiled classes, the plugin directory, and the war file to deploy into Tomcat
conf -- the various configuration files
...
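The checkout and build themselves boil down to a few commands; the svn URL below is my assumption based on Nutch's location as a Lucene subproject at the time, so check the Nutch site for the current one:
svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch
ant package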
There is an English tutorial for 0.8 at http://lucene.apache.org/nutch/tutorial8.html. It is fairly detailed, but it contains a few small bugs that can really hurt; generally speaking, bugs in documentation are more infuriating than bugs in code, and the documents I write are full of bugs too, to the annoyance of anyone who reads them. If you spot a Chinese tutorial, please post the link. There is also an unofficial page at
http://wiki.media-style.com/display/nutchDocu/Home with an illustrated guide to debugging Nutch in Eclipse.
- Tips
1. Run bin/nutch with no arguments to list all of its commands. A possible output is:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  segread           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
2. Run bin/nutch followed by a bare command (no parameters) to see that command's usage. For example, bin/nutch crawl followed by Enter may print:
Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
3. Commands such as readdb, readlinkdb, and segread help you inspect your data.
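For example, the following prints summary statistics for a crawl db (the crawl/crawldb path is just an illustration of the layout the crawl command produces):
bin/nutch readdb crawl/crawldb -stats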
- Crawl Modes
Intranet Crawling:
Suited to cases where the expected page total is up to around a million and the number of sites is limited. The one-step bin/nutch crawl command is comfortable to use, and for many vertical search domains it is already enough.
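Matching the Usage line shown in the tips above, a typical invocation might look like this (urls is a hypothetical directory holding your seed URL lists):
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000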
Whole-web Crawling:
For crawling massive WWW-scale data; the process generally breaks down into
inject -- inject the urls,
generate -- generate the fetch list,
fetch -- fetch the pages,
updatedb -- update the crawldb,
invertlinks -- build the link database,
index -- build the index,
dedup -- remove duplicates,
merge -- merge the indexes
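Spelled out as commands, one round looks roughly like this, following the 0.8 tutorial linked above; <segment> stands for the timestamped directory that generate creates under crawl/segments, and each bare command prints its exact arguments (Tip 2):
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
bin/nutch invertlinks crawl/linkdb crawl/segments/*
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes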
In practice the two modes are basically the same and interchangeable; the difference is only in the configuration files (my personal take!). The crawl command may involve crawl-urlfilter.txt, regex-urlfilter.txt, prefix-urlfilter.txt, suffix-urlfilter.txt, and automaton-urlfilter.txt. Note that these configuration files must be on the classpath; if you start the programs through the bin/nutch script, they should all be found in the conf directory. One more thing to watch: whether these filter files take effect at all depends on your plugin configuration. Good grief, everything needs configuring!
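As an illustration, the stock crawl-urlfilter.txt for an intranet crawl contains rules along these lines (MY.DOMAIN.NAME is the placeholder used by the official tutorial; replace it with the domain you actually want to crawl):
# skip image and other binary suffixes
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.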
- Updating
Run the following loop:
generate
fetch
updatedb
invertlinks
index
dedup
merge
I made a small modification to org.apache.nutch.crawl.Crawl, producing a new class that performs a convenient one-step update:
package org.apache.nutch.crawl;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.logging.Logger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.LogFormatter;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.DeleteDuplicates;
import org.apache.nutch.indexer.IndexMerger;
import org.apache.nutch.indexer.Indexer;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

/** One-step update of an existing crawl directory: generate, fetch,
 *  parse, update the crawldb, invert links, index, dedup and merge. */
public class CrawlUpdate {

  public static final Logger LOG = LogFormatter
      .getLogger("org.apache.nutch.crawl.CrawlUpdate");

  /** Timestamp used to name the new indexes directory. */
  private static String getDate() {
    return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(System
        .currentTimeMillis()));
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.out
          .println("Usage: CrawlUpdate [-dir d] [-threads n] [-topN N]");
      return;
    }

    Configuration conf = NutchConfiguration.create();
    conf.addDefaultResource("crawl-tool.xml");
    JobConf job = new NutchJob(conf);

    Path dir = new Path("crawl-" + getDate());
    int threads = job.getInt("fetcher.threads.fetch", 10);
    int topN = Integer.MAX_VALUE;

    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      }
    }

    // unlike Crawl, the target directory must already exist
    FileSystem fs = FileSystem.get(job);
    if (!fs.exists(dir)) {
      throw new RuntimeException(dir + " doesn't exist.");
    }

    LOG.info("crawl update started in: " + dir);
    LOG.info("threads = " + threads);
    if (topN != Integer.MAX_VALUE)
      LOG.info("topN = " + topN);

    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    // each update writes into a fresh "indexes<timestamp>" directory
    Path indexes = new Path(dir + "/indexes" + getDate());
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

    // generate one new segment from the crawldb (-1 = default number
    // of fetch lists), then fetch it
    Path segment = new Generator(job).generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());
    new Fetcher(job).fetch(segment, threads, Fetcher.isParsing(job)); // fetch
    if (!Fetcher.isParsing(job)) {
      new ParseSegment(job).parse(segment); // parse it, if needed
    }
    new CrawlDb(job).update(crawlDb, segment); // update crawldb
    new LinkDb(job).invert(linkDb, new Path[] { segment }); // invert links

    // index the new segment, dedup across all indexes* directories
    // (the original crawl's and every previous update's), then merge
    // everything into a single index
    new Indexer(job).index(indexes, crawlDb, linkDb, new Path[] { segment });
    Path[] indexesDirs = fs.listPaths(dir, new PathFilter() {
      public boolean accept(Path p) {
        return p.getName().startsWith("indexes");
      }
    });
    new DeleteDuplicates(job).dedup(indexesDirs);
    List indexesParts = new ArrayList();
    for (int i = 0; i < indexesDirs.length; i++) {
      indexesParts.addAll(Arrays.asList(fs.listPaths(indexesDirs[i])));
    }
    new IndexMerger(fs, (Path[]) indexesParts.toArray(new Path[indexesParts
        .size()]), index, tmpDir, job).merge();

    LOG.info("crawl update finished: " + dir);
  }
}
This lets me refresh my search data periodically in the following pattern:
Crawl urlsdir -dir crawl -topN 1000 -- initial crawl
CrawlUpdate -dir crawl -topN 1000 -- update
CrawlUpdate -dir crawl -topN 1000 -- update again
...
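Since CrawlUpdate is not registered in the bin/nutch command table, it can be launched through the CLASSNAME form shown in the usage output above, e.g. bin/nutch org.apache.nutch.crawl.CrawlUpdate -dir crawl -topN 1000.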
I still haven't figured out whether it is a Lucene limitation or some other design consideration, but Tomcat has to be stopped during an update (more precisely, while the index is being updated), which feels a little awkward. My guess is that the search webapp keeps the old index files open, and on Win32 in particular an open file cannot be deleted or replaced.