Please credit the source when reposting this article: http://blog.csdn.net/pwlazy
Command "generate": net.nutch.tools.FetchListTool
> "generate: generate new segments to fetch"
> Usage: FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
FetchListTool is used to create one or more "segments". From the tutorial:
<blockquote>
Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
- a "fetchlist": file that names the pages to be fetched
- the "fetcher output": set of files containing the fetched pages
- the "index" is a Lucene-format index of the fetcher output
</blockquote>
Within CrawlTool.main(), FetchListTool.main() is invoked once per "depth" value with two arguments: (dir + "/db", dir + "/segments"). After processing its arguments, FetchListTool.main() creates an instance of itself, calls "flt.emitFetchList()", then returns.
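The once-per-depth invocation can be sketched from the command line as follows. This is only an illustration under assumptions: CrawlTool calls FetchListTool.main() in-process rather than through the launcher script, it presumably does other work between rounds that is omitted here, and both the function name `generate_per_depth` and the `NUTCH` override variable are hypothetical names introduced for this example.

```shell
# Sketch only: run "generate" once per depth level, passing the same two
# arguments described in the text (<dir>/db and <dir>/segments).
# NUTCH is a hypothetical override for the launcher path (default: bin/nutch).
generate_per_depth() {
    dir=$1      # crawl directory that holds db/ and segments/
    depth=$2    # number of rounds to run
    i=0
    while [ "$i" -lt "$depth" ]; do
        # Stop at the first failing round instead of continuing blindly.
        "${NUTCH:-bin/nutch}" generate "$dir/db" "$dir/segments" || return 1
        i=$((i + 1))
    done
}
```

For example, `generate_per_depth crawl 3` would issue three `generate` calls against `crawl/db` and `crawl/segments`.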
Let's run FetchListTool to see what it changes on disk. Note that we have to specify the webdb directory, plus a second directory where the segments are written.
$ bin/nutch generate spam spam_segments
$ find spam -type f | xargs ls -l
-rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbreadlock
-rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbwritelock
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/data
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/index
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/data
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/index
-rw-r--r-- 1 kangas users 89 Oct 25 20:18 spam/webdb/pagesByMD5/data
-rw-r--r-- 1 kangas users 97 Oct 25 20:18 spam/webdb/pagesByMD5/index
-rw-r--r-- 1 kangas users 115 Oct 25 20:18 spam/webdb/pagesByURL/data
-rw-r--r-- 1 kangas users 58 Oct 25 20:18 spam/webdb/pagesByURL/index
-rw-r--r-- 1 kangas users 17 Oct 25 20:18 spam/webdb/stats
$ find spam_segments/ -type f | xargs ls -l
-rw-r--r-- 1 kangas users 113 Oct 25 20:18 spam_segments/20041026001828/fetchlist/data
-rw-r--r-- 1 kangas users 40 Oct 25 20:18 spam_segments/20041026001828/fetchlist/index
Note that nothing changed under the webdb directory ("spam"), but a new segment directory was created under "spam_segments", containing a fetchlist with its own data and index files.
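Comparing file sizes and timestamps by eye works for a tiny database, but a checksum comparison is a more reliable way to confirm that a command left a directory untouched. The sketch below is not part of Nutch; `snapshot` and `unchanged_by` are hypothetical helper names, and it relies only on the standard `find`, `cksum`, and `sort` utilities.

```shell
# Print one "checksum size path" line per regular file under a directory,
# sorted so that two snapshots can be compared as plain strings.
snapshot() {
    find "$1" -type f -exec cksum {} + | sort
}

# Hypothetical helper: run a command, then succeed only if no file under
# <dir> was added, removed, or modified by it.
# Usage: unchanged_by <dir> <command> [args...]
unchanged_by() {
    dir=$1
    shift
    before=$(snapshot "$dir")
    # The command's own output and exit status are ignored here;
    # only its effect on <dir> is checked.
    "$@" >/dev/null 2>&1
    after=$(snapshot "$dir")
    [ "$before" = "$after" ]
}
```

With the directories used above, this could be run as `unchanged_by spam bin/nutch generate spam spam_segments && echo "webdb untouched"`.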
The "generate" command corresponds to the class net.nutch.tools.FetchListTool.
This command generates new segments to fetch.
The class is invoked as follows:
FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
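FetchListTool creates one or more segments. As the tutorial puts it:
Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
- "fetchlist": a file that names the pages to be fetched
- "fetcher output": a set of files containing the fetched pages
- "index": a Lucene-format index of the fetcher output
Let's run FetchListTool and see what it changes on disk. Note that we specify both the webdb directory and the segments directory.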
$ bin/nutch generate spam spam_segments
$ find spam -type f | xargs ls -l
-rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbreadlock
-rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbwritelock
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/data
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/index
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/data
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/index
-rw-r--r-- 1 kangas users 89 Oct 25 20:18 spam/webdb/pagesByMD5/data
-rw-r--r-- 1 kangas users 97 Oct 25 20:18 spam/webdb/pagesByMD5/index
-rw-r--r-- 1 kangas users 115 Oct 25 20:18 spam/webdb/pagesByURL/data
-rw-r--r-- 1 kangas users 58 Oct 25 20:18 spam/webdb/pagesByURL/index
-rw-r--r-- 1 kangas users 17 Oct 25 20:18 spam/webdb/stats
$ find spam_segments/ -type f | xargs ls -l
-rw-r--r-- 1 kangas users 113 Oct 25 20:18 spam_segments/20041026001828/fetchlist/data
-rw-r--r-- 1 kangas users 40 Oct 25 20:18 spam_segments/20041026001828/fetchlist/index
The result: nothing changed under the webdb directory, but a new segments directory was created, along with its data and index files.