Dissecting The Nutch Crawler - Factory classes: ParserFactory, ProtocolFactory

pwlazy

Original English source: DissectingTheNutchCrawler
When reposting this article, please credit the source: http://blog.csdn.net/pwlazy

Factory classes: ParserFactory, ProtocolFactory

> Class net.nutch.parser.ParserFactory
> used by:
> - net.nutch.db.WebDBInjector
> - net.nutch.fetcher.Fetcher
> - net.nutch.parser.ParserChecker
>
> Class net.nutch.protocol.ProtocolFactory
> used by:
> - net.nutch.fetcher.Fetcher
> - net.nutch.parser.ParserChecker
>
> Class net.nutch.plugin.PluginRepository: used by all of the above

ParserFactory and ProtocolFactory are called directly from net.nutch.fetcher.Fetcher to get the appropriate Parser and Protocol objects for a given content type and URL. Both use an instance of net.nutch.plugin.PluginRepository to find and load Java classes.

By default, nutch-default.xml tells PluginRepository to look for classes in a directory called "plugins" somewhere on the Java classpath. Normally you'll just use the one in your Nutch install directory.

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
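
The rule in the description (absolute paths are used as is, relative paths are searched for on the classpath) can be sketched in a few lines of Java. This is only an illustration of the rule, not Nutch's own code: the class PluginFolderSketch and its resolve method are hypothetical names.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: resolves one plugin.folders entry.
final class PluginFolderSketch {

    // Returns the directory for one entry, or null when a relative entry
    // cannot be found as a plain directory on the classpath.
    static File resolve(String entry) {
        File dir = new File(entry);
        if (dir.isAbsolute()) {
            return dir; // absolute: used as is
        }
        // relative: searched for on the classpath
        URL url = PluginFolderSketch.class.getClassLoader().getResource(entry);
        if (url == null || !"file".equals(url.getProtocol())) {
            return null;
        }
        try {
            return new File(url.toURI());
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Prints the resolved directory, or null if "plugins" is not on the classpath.
        System.out.println(resolve("plugins"));
    }
}
```

Returning null for a missing folder is a choice made for this sketch; a real loader might log a warning or fail instead.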

Inside the plugin directory you will find a handful of sub-directories, each containing a file called "plugin.xml" and one or more Java archive (.jar) files. Directories include:

parse-html
parse-text
parse-msword
parse-pdf
protocol-file
protocol-ftp
protocol-http

One directory, plus the "plugin.xml" and .jar file contents, constitutes one "plugin".
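
As a rough illustration of that layout, the sketch below lists every sub-directory of a plugin folder that contains a "plugin.xml" file. The class and method names are hypothetical, and the real PluginRepository may discover plugins differently.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: finds plugin directories by layout.
final class PluginScanSketch {

    // Returns the names of the sub-directories of pluginRoot that contain
    // a plugin.xml descriptor, in sorted order.
    static List<String> findPlugins(File pluginRoot) {
        List<String> names = new ArrayList<>();
        File[] children = pluginRoot.listFiles();
        if (children == null) {
            return names; // not a directory, or not readable
        }
        Arrays.sort(children);
        for (File child : children) {
            if (child.isDirectory() && new File(child, "plugin.xml").isFile()) {
                names.add(child.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : "plugins");
        System.out.println(findPlugins(root));
    }
}
```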

The XML file is a descriptor that is read by PluginRepository to determine two main things:

a. what "extension point" (Java interface) the plugin implements, and

b. how to load its contents.

Here is the plugin.xml file for "protocol-file":

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="protocol-file"
   name="File Protocol Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="net.nutch.protocol.Protocol"
      name="Nutch Protocol"/>

   <runtime>
      <library name="protocol-file.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="net.nutch.protocol.file"
              name="FileProtocol"
              point="net.nutch.protocol.Protocol">

      <implementation id="net.nutch.protocol.file.File"
                      class="net.nutch.protocol.file.File"
                      protocolName="file"/>
   </extension>
</plugin>

Since the plugin is named "protocol-file", you probably guessed already that this is a protocol handler for loading files on disk. But this descriptor tells us -- and PluginRepository -- precisely what it does:

the extension-point (Java interface) name is "net.nutch.protocol.Protocol"
the protocolName is "file"

Thus, when Nutch sees a URL that starts with "file://", it will know to call this plugin to fetch that page.
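
To make the two facts concrete, here is a small sketch that reads them out of a descriptor with the standard Java DOM parser. It is not how PluginRepository itself is implemented. The class DescriptorSketch is a hypothetical name, and SAMPLE is a trimmed copy of the protocol-file descriptor shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch: reads two attributes from a descriptor.
final class DescriptorSketch {

    // Trimmed copy of the protocol-file descriptor.
    static final String SAMPLE =
        "<plugin id=\"protocol-file\" name=\"File Protocol Plug-in\">"
        + "<extension-point id=\"net.nutch.protocol.Protocol\" name=\"Nutch Protocol\"/>"
        + "<extension id=\"net.nutch.protocol.file\" name=\"FileProtocol\""
        + " point=\"net.nutch.protocol.Protocol\">"
        + "<implementation id=\"net.nutch.protocol.file.File\""
        + " class=\"net.nutch.protocol.file.File\" protocolName=\"file\"/>"
        + "</extension></plugin>";

    // Returns { extension point, protocolName } for the first <extension> element.
    static String[] pointAndProtocol(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element extension = (Element) doc.getElementsByTagName("extension").item(0);
        Element impl = (Element) extension.getElementsByTagName("implementation").item(0);
        return new String[] {
            extension.getAttribute("point"),
            impl.getAttribute("protocolName")
        };
    }

    public static void main(String[] args) throws Exception {
        String[] result = pointAndProtocol(SAMPLE);
        System.out.println(result[0] + " handles protocol " + result[1]);
    }
}
```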

Look at the descriptors for "protocol-http" and "protocol-ftp". You should see that the extension-point is exactly the same as for protocol-file, but the protocolName is different: "http" and "ftp", respectively.

Now let's examine the descriptor for parse-text:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-text"
   name="Text Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="net.nutch.parse.Parser"
      name="Nutch Content Parser"/>

   <runtime>
      <library name="parse-text.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="net.nutch.parse.text"
              name="TextParse"
              point="net.nutch.parse.Parser">

      <implementation id="net.nutch.parse.text.TextParser"
                      class="net.nutch.parse.text.TextParser"
                      contentType="text/plain"
                      pathSuffix="txt"/>
   </extension>
</plugin>

Note that the extension-point is now net.nutch.parse.Parser. And this time, <extension><implementation> doesn't specify a protocolName. Instead, we see "contentType" and "pathSuffix".

So now we see how PluginRepository chooses which plugin to use for a given task:

It finds the set of plugins that implement a certain extension-point
Then, from that set, it finds one that works for the content at hand (protocolName, contentType, or pathSuffix).
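
The two steps can be sketched as follows. This is a simplified model for illustration only: the Impl class, the SAMPLE list and the select method are hypothetical and do not mirror PluginRepository's real API. The sample entries use only the class names that appear in the two descriptors above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not part of Nutch: two-step plugin selection.
final class SelectionSketch {

    // Simplified view of one <implementation> element.
    static final class Impl {
        final String point;     // extension point of the enclosing <extension>
        final String className; // value of the class attribute
        final String attrName;  // e.g. "protocolName" or "contentType"
        final String attrValue; // e.g. "file" or "text/plain"

        Impl(String point, String className, String attrName, String attrValue) {
            this.point = point;
            this.className = className;
            this.attrName = attrName;
            this.attrValue = attrValue;
        }
    }

    static final List<Impl> SAMPLE = Arrays.asList(
        new Impl("net.nutch.protocol.Protocol", "net.nutch.protocol.file.File",
                 "protocolName", "file"),
        new Impl("net.nutch.parse.Parser", "net.nutch.parse.text.TextParser",
                 "contentType", "text/plain"));

    // Returns the class name of the first matching implementation, or null.
    static String select(List<Impl> all, String point, String attrName, String wanted) {
        // Step 1: keep only the implementations of the requested extension point.
        List<Impl> candidates = new ArrayList<>();
        for (Impl impl : all) {
            if (impl.point.equals(point)) {
                candidates.add(impl);
            }
        }
        // Step 2: from that set, pick the one whose attribute matches the content at hand.
        for (Impl impl : candidates) {
            if (impl.attrName.equals(attrName) && impl.attrValue.equals(wanted)) {
                return impl.className;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "net.nutch.protocol.Protocol", "protocolName", "file"));
        System.out.println(select(SAMPLE, "net.nutch.parse.Parser", "contentType", "text/plain"));
    }
}
```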

Look at the descriptor for parse-html. You'll see that it follows these rules. It implements the same extension-point as parse-text (net.nutch.parse.Parser), but it has different values for contentType and pathSuffix:

    contentType="text/html"
    pathSuffix=""

This entry looks a bit strange with the empty pathSuffix value. But that just means that this plugin doesn't match any pathSuffix value. So, parse-html is only used when we fetch remote URLs, not anything residing on the local filesystem.
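
One way to picture the empty-suffix rule is the tiny sketch below. It assumes that a suffix matches when the path ends with a dot followed by the registered value, which is a guess at the matching rule rather than something taken from the Nutch source. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Nutch: path suffix matching.
final class SuffixMatchSketch {

    // An empty registered suffix never matches. Otherwise the path must end
    // with "." followed by the suffix (an assumed rule for this sketch).
    static boolean suffixMatches(String registeredSuffix, String path) {
        if (registeredSuffix.isEmpty()) {
            return false;
        }
        return path.endsWith("." + registeredSuffix);
    }

    public static void main(String[] args) {
        System.out.println(suffixMatches("txt", "notes.txt")); // parse-text style entry
        System.out.println(suffixMatches("", "index.html"));   // parse-html style entry
    }
}
```

Under this assumption, a local file such as index.html can never reach parse-html through its suffix, which is consistent with the behaviour described above.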

Factory classes: ParserFactory and ProtocolFactory

The class net.nutch.parser.ParserFactory is used by the following classes:

net.nutch.db.WebDBInjector
net.nutch.fetcher.Fetcher
net.nutch.parser.ParserChecker

The class net.nutch.protocol.ProtocolFactory is used by the following classes:

net.nutch.fetcher.Fetcher
net.nutch.parser.ParserChecker

The class net.nutch.plugin.PluginRepository is used by all of the classes above.

net.nutch.fetcher.Fetcher calls ParserFactory and ProtocolFactory directly to obtain the appropriate Parser and Protocol objects for the given content type and URL. Both factory classes use an instance of net.nutch.plugin.PluginRepository to find and load the relevant Java classes.

By default, nutch-default.xml tells PluginRepository to look for classes in the plugins directory on the classpath. Normally you should use the plugins directory in your Nutch install directory.

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

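Under the plugin directory you will see a number of sub-directories. Each one contains a file named plugin.xml and one or more .jar files. The directories include: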

parse-html
parse-text
parse-msword
parse-pdf
protocol-file
protocol-ftp
protocol-http

A directory, together with the plugin.xml and .jar files inside it, makes up one plugin.

The XML file is a descriptor that PluginRepository reads to determine two main things:

which extension point (Java interface) the plugin implements
how to load its contents

Here is the plugin.xml in the protocol-file directory (translator's note: in other words, the plugin.xml of the protocol-file plugin):

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="protocol-file"
   name="File Protocol Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="net.nutch.protocol.Protocol"
      name="Nutch Protocol"/>

   <runtime>
      <library name="protocol-file.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="net.nutch.protocol.file"
              name="FileProtocol"
              point="net.nutch.protocol.Protocol">

      <implementation id="net.nutch.protocol.file.File"
                      class="net.nutch.protocol.file.File"
                      protocolName="file"/>
   </extension>
</plugin>

Because this plugin is called protocol-file, you have probably already guessed that it is a protocol handler for loading files from disk. But the XML descriptor tells us, and PluginRepository, precisely what the plugin does:

the extension point (Java interface) name is net.nutch.protocol.Protocol
the protocol name is "file"

So when Nutch sees a URL that starts with file://, it uses this plugin to fetch that page.

Look at the XML descriptors for "protocol-http" and "protocol-ftp". You will see that their extension point is the same, but the protocol names differ: one is http, the other is ftp.

Now let's look at the descriptor for parse-text:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-text"
   name="Text Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="net.nutch.parse.Parser"
      name="Nutch Content Parser"/>

   <runtime>
      <library name="parse-text.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="net.nutch.parse.text"
              name="TextParse"
              point="net.nutch.parse.Parser">

      <implementation id="net.nutch.parse.text.TextParser"
                      class="net.nutch.parse.text.TextParser"
                      contentType="text/plain"
                      pathSuffix="txt"/>
   </extension>
</plugin>

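Note that the extension point above is net.nutch.parse.Parser. This time <extension><implementation> has nothing to do with a protocol. Instead we see contentType and pathSuffix.

Now let's see how PluginRepository chooses a plugin for a given task:

It finds the group of plugins that implement a certain extension point
Then, from that group, it picks the one that suits the input at hand (for example by protocol name, content type, or path suffix)

Let's look at the descriptor for parse-html. You will find that it follows these rules: it implements the same extension point as parse-text (net.nutch.parse.Parser), but it has a different content type and path suffix:

    contentType="text/html"
    pathSuffix=""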

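The empty path suffix in the last line above looks a little strange. But it simply means that this plugin does not match any suffix. So the parse-html plugin is only used when we fetch remote URLs, not for anything that resides on the local file system.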
