Please credit the source when reposting this article: http://blog.csdn.net/pwlazy
Command "fetch": net.nutch.fetcher.Fetcher
> "fetch: fetch a segment's pages"
> Usage: Fetcher [-logLevel level] [-showThreadID] [-threads n] dir
So far we've created a webdb, primed it with URLs, and created a segment that a Fetcher can write to. Now let's look at the Fetcher itself, and try running it to see what comes out.
net.nutch.fetcher.Fetcher relies on several other classes:
- FetcherThread, an inner class
- net.nutch.parse.ParserFactory
- net.nutch.plugin.PluginRepository
- and, of course, any "plugin" classes loaded by the PluginRepository
Fetcher.main() reads arguments, instantiates a new Fetcher object, sets options, then calls run(). The Fetcher constructor is similarly simple; it just instantiates all of the input/output streams:
| instance variable | class | arguments |
|---|---|---|
| fetchList | ArrayFile.Reader | (dir, "fetchlist") |
| fetchWriter | ArrayFile.Writer | (dir, "fetcher", FetcherOutput.class) |
| contentWriter | ArrayFile.Writer | (dir, "content", Content.class) |
| parseTextWriter | ArrayFile.Writer | (dir, "parse_text", ParseText.class) |
| parseDataWriter | ArrayFile.Writer | (dir, "parse_data", ParseData.class) |
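To make the table concrete, here is a minimal sketch of how the constructor might set these streams up. It assumes the old net.nutch.io.ArrayFile API with (path) and (path, valueClass) constructors, and the package locations shown in the imports; exact signatures vary across early Nutch versions, so treat this as an approximation rather than the source.

```java
import java.io.File;
import java.io.IOException;

import net.nutch.io.ArrayFile;
import net.nutch.fetcher.FetcherOutput;
import net.nutch.parse.ParseData;
import net.nutch.parse.ParseText;
import net.nutch.protocol.Content;

public class Fetcher {
  private ArrayFile.Reader fetchList;
  private ArrayFile.Writer fetchWriter;
  private ArrayFile.Writer contentWriter;
  private ArrayFile.Writer parseTextWriter;
  private ArrayFile.Writer parseDataWriter;

  // One reader for the fetchlist, four writers for the fetcher's outputs,
  // all rooted in the segment directory "dir" (see the table above).
  public Fetcher(String dir) throws IOException {
    fetchList       = new ArrayFile.Reader(new File(dir, "fetchlist").toString());
    fetchWriter     = new ArrayFile.Writer(new File(dir, "fetcher").toString(), FetcherOutput.class);
    contentWriter   = new ArrayFile.Writer(new File(dir, "content").toString(), Content.class);
    parseTextWriter = new ArrayFile.Writer(new File(dir, "parse_text").toString(), ParseText.class);
    parseDataWriter = new ArrayFile.Writer(new File(dir, "parse_data").toString(), ParseData.class);
  }
}
```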
Fetcher.run() instantiates 1..threadCount FetcherThread objects (threadCount is set by the -threads option; when the fetcher is driven by CrawlTool it defaults to 10), calls thread.start() on each, sleeps until all threads are gone or a fatal error is logged, then calls close() on the i/o streams.
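In outline, run() looks something like the sketch below; activeThreads() and fatalError() are illustrative stand-ins for whatever bookkeeping the real class uses, not actual Nutch names.

```java
// Condensed sketch of Fetcher.run(); activeThreads() and fatalError()
// are hypothetical helpers standing in for the real bookkeeping.
public void run() throws IOException {
  for (int i = 0; i < threadCount; i++) {   // spawn 1..threadCount workers
    new FetcherThread().start();
  }
  while (activeThreads() > 0 && !fatalError()) {
    try {
      Thread.sleep(1000);                   // poll until workers finish
    } catch (InterruptedException e) {
    }
  }
  fetchList.close();                        // close the input stream...
  fetchWriter.close();                      // ...and all four output streams
  contentWriter.close();
  parseTextWriter.close();
  parseDataWriter.close();
}
```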
FetcherThread is an inner class of net.nutch.fetcher.Fetcher that extends java.lang.Thread. It has one instance method, run(), and three static methods: handleFetch(), handleNoFetch(), and logError().
FetcherThread.run() instantiates a new FetchListEntry called "fle", then runs the following in an infinite loop (a condensed sketch follows this list):
- If a fatal error was logged, break
- Get the next entry in the FetchList; break if none remain
- Extract the url from the FetchListEntry
- If the FetchListEntry is not tagged "fetch", call this.handleNoFetch() with status=1. This in turn does:
  - Get MD5Hash.digest() of the url
  - Build a FetcherOutput(fle, hash, status)
  - Call Fetcher.outputPage() with all of these objects
- If it is tagged "fetch", call ProtocolFactory to get Protocol and Content objects for this url
- Call this.handleFetch(url, fle, content). This in turn does:
  - Call ParserFactory.getParser(contentType, url), where contentType is content.getContentType()
  - Call parser.getParse(content) on the Parser it returns
  - Call Fetcher.outputPage() with a new FetcherOutput (including the url's MD5 hash), the fetched Content object, a new ParseText built from parse.getText(), and parse.getData()
- On every 100th pass through the loop, write a status message to the log
- Catch any exceptions and log as necessary
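Put together, the loop looks roughly like this sketch. Accessor names such as getFetch() and getPage() are best-effort approximations of the FetchListEntry API, and the error flag and page counter are likewise illustrative.

```java
// Schematic FetcherThread.run(); a paraphrase of the steps above,
// not the verbatim Nutch source.
public void run() {
  FetchListEntry fle = new FetchListEntry();
  while (true) {
    if (fatalError()) break;                      // a fatal error was logged
    if (fetchList.next(fle) == null) break;       // no entries remain
    String url = fle.getPage().getURL().toString();
    try {
      if (!fle.getFetch()) {                      // not tagged "fetch"
        handleNoFetch(fle, 1);                    // status=1
      } else {
        Protocol protocol = ProtocolFactory.getProtocol(url);
        Content content = protocol.getContent(url);
        handleFetch(url, fle, content);
      }
      if (++pages % 100 == 0) {                   // every 100th pass
        LOG.info("fetched " + pages + " pages");
      }
    } catch (Exception e) {
      logError(url, fle, e);                      // log and move on
    }
  }
}
```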
As we can see here, the fetcher relies on factory classes to choose the code it uses for different content types: ProtocolFactory finds a Protocol instance for a given url, and ParserFactory finds a Parser for a given contentType.
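The dispatch inside handleFetch() therefore reduces to a pair of factory lookups followed by a single output call; the original translator's notes spell the sequence out, roughly as follows (protocolStatus is whatever status the protocol layer reported):

```java
// Inside handleFetch(): pick a Parser by content type, parse the content,
// then emit all outputs in one call (per the steps listed above).
String contentType = content.getContentType();
Parser parser = ParserFactory.getParser(contentType, url);
Parse parse = parser.getParse(content);
outputPage(new FetcherOutput(fle, MD5Hash.digest(url), protocolStatus),
           content, new ParseText(parse.getText()), parse.getData());
```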
It should now be apparent that implementing a custom crawler with Nutch will revolve around creating new Protocol/Parser classes, and updating ProtocolFactory/ParserFactory to load them as needed. Let's examine these classes now.
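As a preview of that extension point, here is a skeletal custom parser. The Parser interface and the ParseData/ParseImpl constructors are reconstructed from the method names used above and may not match a given Nutch release exactly; plugin registration (plugin.xml and the factory wiring) is omitted.

```java
import java.util.Properties;

import net.nutch.parse.Outlink;
import net.nutch.parse.Parse;
import net.nutch.parse.ParseData;
import net.nutch.parse.ParseException;
import net.nutch.parse.ParseImpl;
import net.nutch.parse.Parser;
import net.nutch.protocol.Content;

// Hypothetical parser for some new content type; only the parsing logic
// is sketched, not the plugin wiring that lets ParserFactory find it.
public class FooParser implements Parser {
  public Parse getParse(Content content) throws ParseException {
    // Naive extraction: treat the raw bytes as plain text.
    String text = new String(content.getContent());
    // Empty title, no outlinks, no metadata; a real parser would fill these in.
    ParseData data = new ParseData("", new Outlink[0], new Properties());
    return new ParseImpl(text, data);
  }
}
```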