Dissecting The Nutch Crawler - Command "generate": net.nutch.tools.FetchListTool

pwlazy

Category: search engine. Tags: output, lucene, processing, file, command, types


Original English source: DissectingTheNutchCrawler
When reposting this article, please credit the source: http://blog.csdn.net/pwlazy

Command "generate": net.nutch.tools.FetchListTool

> "generate: generate new segments to fetch"
> Usage: FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]

FetchListTool is used to create one or more "segments". From the tutorial:

> Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
> a "fetchlist": a file that names the pages to be fetched
> the "fetcher output": a set of files containing the fetched pages
> the "index": a Lucene-format index of the fetcher output

Within CrawlTool.main(), FetchListTool.main() is invoked once per "depth" value with two arguments: (dir + "/db", dir + "/segments"). After processing args, it creates an instance of itself, calls "flt.emitFetchList()", then returns.
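The call pattern described above can be sketched in a few lines of Java. This is only an illustration of the control flow, not Nutch's actual code: the class `CrawlLoopSketch` and its `generate` and `crawl` methods are hypothetical stand-ins, and `generate` merely records the two arguments that the text says are passed to FetchListTool.main().

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlLoopSketch {
    /** Records every (dbDir, segmentsDir) pair handed to the stand-in generate step. */
    public static final List<String[]> generateCalls = new ArrayList<>();

    /** Hypothetical stand-in for FetchListTool.main(); it only remembers its arguments. */
    public static void generate(String dbDir, String segmentsDir) {
        generateCalls.add(new String[] { dbDir, segmentsDir });
    }

    /** Calls the generate step once per depth value, as the text describes for CrawlTool.main(). */
    public static void crawl(String dir, int depth) {
        for (int i = 0; i < depth; i++) {
            generate(dir + "/db", dir + "/segments");
            // Any other per-depth work done by the real crawl is omitted from this sketch.
        }
    }

    public static void main(String[] args) {
        crawl("spam", 3);
        for (String[] call : generateCalls) {
            System.out.println("generate " + call[0] + " " + call[1]);
        }
    }
}
```

Run with a depth of 3, the sketch prints three identical lines, `generate spam/db spam/segments`, which mirrors the "once per depth" behaviour: the arguments stay the same on every pass.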

Let's run FetchListTool to see what it changes on disk. Note that we have to specify the webdb directory, plus another directory where segments are written to.

$ bin/nutch generate spam spam_segments
$ find spam -type f | xargs ls -l
-rw-r--r--  1 kangas  users    0 Oct 25 20:18 spam/dbreadlock
-rw-r--r--  1 kangas  users    0 Oct 25 20:18 spam/dbwritelock
-rw-r--r--  1 kangas  users   16 Oct 25 20:18 spam/webdb/linksByMD5/data
-rw-r--r--  1 kangas  users   16 Oct 25 20:18 spam/webdb/linksByMD5/index
-rw-r--r--  1 kangas  users   16 Oct 25 20:18 spam/webdb/linksByURL/data
-rw-r--r--  1 kangas  users   16 Oct 25 20:18 spam/webdb/linksByURL/index
-rw-r--r--  1 kangas  users   89 Oct 25 20:18 spam/webdb/pagesByMD5/data
-rw-r--r--  1 kangas  users   97 Oct 25 20:18 spam/webdb/pagesByMD5/index
-rw-r--r--  1 kangas  users  115 Oct 25 20:18 spam/webdb/pagesByURL/data
-rw-r--r--  1 kangas  users   58 Oct 25 20:18 spam/webdb/pagesByURL/index
-rw-r--r--  1 kangas  users   17 Oct 25 20:18 spam/webdb/stats
$ find spam_segments/ -type f | xargs ls -l
-rw-r--r--  1 kangas  users  113 Oct 25 20:18 spam_segments/20041026001828/fetchlist/data
-rw-r--r--  1 kangas  users   40 Oct 25 20:18 spam_segments/20041026001828/fetchlist/index

Note that no changes occurred under the webdb dir ("spam"), but a new segments directory was created, and data+index files created therein.
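The new segment directory in the listing is named 20041026001828, which reads like a timestamp in the form yyyyMMddHHmmss. The sketch below shows how such a name could be produced from a point in time. It is an inference from the listing, not code taken from Nutch: the class `SegmentNameSketch`, the method `segmentName`, and the choice of UTC as the time zone are all assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SegmentNameSketch {
    // Assumed pattern: year, month, day, hour, minute, second with no separators,
    // rendered in UTC (the time zone is a guess, not something the listing proves).
    private static final DateTimeFormatter SEGMENT_NAME =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Formats an instant as a 14-digit directory name such as 20041026001828. */
    public static String segmentName(Instant when) {
        return SEGMENT_NAME.format(when);
    }

    public static void main(String[] args) {
        Instant example = Instant.parse("2004-10-26T00:18:28Z");
        System.out.println(segmentName(example)); // prints 20041026001828
    }
}
```

Under this reading, each run of generate creates a directory whose name sorts chronologically, so a plain directory listing shows segments in the order they were created.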

Translation summary:

The "generate" command corresponds to the class net.nutch.tools.FetchListTool. It generates the segments that are to be fetched.

The class is invoked as follows:
FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]

FetchListTool creates one or more segments. According to the tutorial, each segment is a set of pages that are fetched and indexed as a unit, and segment data consists of the following types:

"fetchlist": a file that names the pages to be fetched
"fetcher output": a set of files containing the fetched pages
"index": a Lucene-format index of the fetcher output

Within the main method of CrawlTool, the main method of FetchListTool is called once per depth, with two arguments: dir + "/db" and dir + "/segments". (Translator's note: dir is the value of the -dir argument passed when CrawlTool is invoked.) After processing its arguments, the method creates an instance of its own class, calls emitFetchList, and returns.

Running FetchListTool with the webdb directory and the segments directory specified (bin/nutch generate spam spam_segments) shows that nothing under the webdb directory changes, but a new segments directory is created, and the data and index files are created inside it.
