What exactly happens at each step of a crawl.
==============Preparation======================
(Cygwin is required on Windows.)
Check out the code from SVN;
cd into the crawler directory;
==============inject==========================
$ bin/nutch inject crawl/crawldb urls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
The crawldb directory is created at this point.
To inspect its contents:
$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
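The inject step above reads its seed URLs from the urls directory, and the statistics report exactly one URL. As a minimal sketch of how such a directory could be prepared, assuming plain-text seed files with one URL per line: the helper name make_seed_dir and the file name seed.txt are illustrative choices, not part of Nutch.

```shell
# make_seed_dir is a hypothetical helper, not part of Nutch.
# Usage: make_seed_dir DIR URL...
make_seed_dir() {
    dir=$1
    shift
    # Refuse to create an empty seed list.
    [ $# -gt 0 ] || return 1
    mkdir -p "$dir" || return 1
    # One URL per line; the file name seed.txt is an arbitrary choice.
    printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For the crawl in this post it could be called as `make_seed_dir urls http://www.complaints.com/directory/directory.htm` before running inject.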
===============generate=========================
$ bin/nutch generate crawl/crawldb crawl/segments
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080112224520
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
$ s1=`ls -d crawl/segments/2* | tail -1`
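The segments directory is created at this point, but the new segment contains only a crawl_generate directory:
$ bin/nutch readseg -list $s1
NAME            GENERATED  FETCHER START  FETCHER END  FETCHED  PARSED
20080112224520  1          ?              ?            ?        ?
The crawldb contents are unchanged at this point: still 1 unfetched URL.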
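The generate step stores the path of the new segment in a shell variable by listing the segments directory and taking the last entry. That works because segment names are timestamps such as 20080112224520, so the name that sorts last is also the newest. A small sketch of the same idea as a reusable function; latest_segment is a hypothetical helper, not a Nutch command.

```shell
# latest_segment is a hypothetical helper, not part of Nutch.
# Segment names are timestamps (e.g. 20080112224520), so the
# lexicographically last name is also the most recent one.
# Prints nothing if the directory holds no segment.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -1
}
```

With it, the assignment becomes `s1=$(latest_segment crawl/segments)`. An empty result means that generate produced no segment.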
=================fetch==============================
$ bin/nutch fetch $s1
Fetcher: starting
Fetcher: segment: crawl/segments/20080112224520
Fetcher: threads: 10
fetching http://www.complaints.com/directory/directory.htm
Fetcher: done
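The segment now contains several additional subdirectories.
$ bin/nutch readseg -list $s1
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20080112224520  1          2008-01-12T22:52:00  2008-01-12T22:52:00  1        1
The crawldb contents are still unchanged at this point (1 unfetched URL), because the fetch results exist only in the segment so far.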
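Since a segment starts with only crawl_generate and gains further subdirectories after fetch, its progress can be read off the directory layout without calling readseg. The sketch below assumes that the fetcher writes a subdirectory named crawl_fetch; that name does not appear in the transcript above and should be checked against your Nutch version. segment_state is a hypothetical helper.

```shell
# segment_state is a hypothetical helper, not part of Nutch.
# Assumption: the fetcher writes its results into a crawl_fetch
# subdirectory of the segment, next to crawl_generate.
segment_state() {
    if [ -d "$1/crawl_fetch" ]; then
        echo fetched
    elif [ -d "$1/crawl_generate" ]; then
        echo generated
    else
        echo unknown
    fi
}
```

Calling `segment_state "$s1"` right after generate would print `generated`, and `fetched` once the fetch step has completed, if the assumption about crawl_fetch holds.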
================updatedb=============================
$ bin/nutch updatedb crawl/crawldb $s1
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080112224520]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Now the crawldb contents have changed:
$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 97
retry 0: 97
min score: 0.01
avg score: 0.02
max score: 1.0
status 1 (db_unfetched): 96
status 2 (db_fetched): 1
CrawlDb statistics: done
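After updatedb the statistics list one line per status, so the number of URLs still waiting to be fetched can be pulled out with a small filter. This is a sketch that assumes the statistics arrive on standard output in exactly the format shown above; status_count is a hypothetical helper, not part of Nutch.

```shell
# status_count is a hypothetical helper, not part of Nutch.
# Reads `readdb -stats` output on stdin and prints the URL count
# for the status name given as $1 (0 if that status is absent).
status_count() {
    # Stats lines look like: "status 1 (db_unfetched): 96"
    awk -v name="($1):" '
        $1 == "status" && $3 == name { print $4; found = 1 }
        END { if (!found) print 0 }
    '
}
```

For example, `bin/nutch readdb crawl/crawldb -stats | status_count db_unfetched` would print 96 for the statistics above, which is one way to decide whether another generate/fetch/updatedb round is worthwhile.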
==============invertlinks ==============================
$ bin/nutch invertlinks crawl/linkdb crawl/segments/*
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080112224520
LinkDb: done
The linkdb directory is created at this point.
===============index====================================
$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080112224520
Indexing [http://www.complaints.com/directory/directory.htm] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@ba4211 (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
The indexes directory is created at this point.
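The steps from inject through index can be chained into one function. The sketch below uses only the commands shown in this post. The NUTCH variable is a hypothetical override hook, not a Nutch feature: setting it to "echo bin/nutch" prints the commands instead of running them.

```shell
# One round of the crawl cycle: inject, generate, fetch, updatedb,
# invertlinks, index. crawl_round is a hypothetical helper.
# NUTCH is an override hook for dry runs, e.g. NUTCH="echo bin/nutch".
NUTCH=${NUTCH:-bin/nutch}

crawl_round() {
    crawl_dir=$1   # e.g. crawl
    url_dir=$2     # e.g. urls

    # $NUTCH is left unquoted on purpose so "echo bin/nutch" splits into words.
    $NUTCH inject "$crawl_dir/crawldb" "$url_dir" || return 1
    $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1

    # The newest segment sorts last because its name is a timestamp.
    segment=$(ls -d "$crawl_dir"/segments/2* 2>/dev/null | tail -1)
    [ -n "$segment" ] || { echo "no segment generated" >&2; return 1; }

    $NUTCH fetch "$segment" || return 1
    $NUTCH updatedb "$crawl_dir/crawldb" "$segment" || return 1
    $NUTCH invertlinks "$crawl_dir/linkdb" "$crawl_dir"/segments/* || return 1
    $NUTCH index "$crawl_dir/indexes" "$crawl_dir/crawldb" "$crawl_dir/linkdb" "$crawl_dir"/segments/*
}
```

Running `crawl_round crawl urls` from the crawler directory would repeat the walkthrough above in one go. Each step stops the round if the previous command fails, so a failed fetch is never merged into the crawldb.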
================Testing the crawl results==========================
$ bin/nutch org.apache.nutch.searcher.NutchBean complaints
Total hits: 1
0 20080112224520/http://www.complaints.com/directory/directory.htm
Complaints.com - Sitemap by date ?Complaints ...
References:
[1] Nutch version 0.8.x tutorial
http://lucene.apache.org/nutch/tutorial8.html
[2] Introduction to Nutch, Part 1: Crawling
http://today.java.net/lpt/a/255
[Actually written on Jan 13, 2008, 12:10 am]