Please credit the source when reposting this article: http://blog.csdn.net/pwlazy
Factory classes: '''URLFilterFactory'''
> Class net.nutch.net.URLFilterFactory
> used by:
> - net.nutch.db.WebDBInjector
> - net.nutch.tools.UpdateDatabaseTool
URLFilterFactory is not strictly part of the crawler, but it is a good extension point within Nutch. Here's how it works:
- When the class is loaded, URLFILTER_CLASS is set to the value returned by NutchConf for the key "urlfilter.class".
- When getFilter() is called, it checks whether the filter class has already been loaded. If not, it loads the class using Class.forName(URLFILTER_CLASS). Either way, the filter is then returned.
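The two steps above can be sketched in a few lines of Java. This is a minimal illustration of the lazy-loading pattern, not the actual Nutch source: `UrlFilter`, `HttpOnlyFilter`, and `SketchFilterFactory` are hypothetical stand-ins, a plain `java.util.Properties` object takes the place of NutchConf, and the filter contract (return the URL to keep it, `null` to reject it) is an assumption.

```java
import java.util.Properties;

// Stand-in for Nutch's URLFilter interface (assumed contract:
// return the URL to accept it, null to reject it).
interface UrlFilter {
    String filter(String url);
}

// Hypothetical filter used only for this sketch: accepts http:// URLs.
class HttpOnlyFilter implements UrlFilter {
    public String filter(String url) {
        return url.startsWith("http://") ? url : null;
    }
}

// Sketch of the factory mechanism described above.
final class SketchFilterFactory {
    // Stand-in for NutchConf; a real setup would read nutch-default.xml.
    private static final Properties CONF = new Properties();
    static {
        CONF.setProperty("urlfilter.class", HttpOnlyFilter.class.getName());
    }

    // Step 1: resolved once, when the factory class itself is loaded.
    private static final String URLFILTER_CLASS = CONF.getProperty("urlfilter.class");

    // Cached filter; stays null until the first getFilter() call.
    private static UrlFilter filter;

    private SketchFilterFactory() {}

    // Step 2: load and instantiate the configured class on first use,
    // then hand back the cached instance on every later call.
    static synchronized UrlFilter getFilter() {
        if (filter == null) {
            try {
                Class<?> filterClass = Class.forName(URLFILTER_CLASS);
                filter = (UrlFilter) filterClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException("Couldn't create " + URLFILTER_CLASS, e);
            }
        }
        return filter;
    }
}
```

Callers would simply write `SketchFilterFactory.getFilter().filter(url)` and never name the concrete filter class, which is what makes the configuration key a usable extension point.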
URLFilterFactory loads a single class, whose name is configurable via "urlfilter.class". By default, nutch-default.xml specifies it as follows:
<!-- urlfilter properties -->
<property>
<name>urlfilter.class</name>
<value>net.nutch.net.RegexURLFilter</value>
<description>Name of the class used to filter URLs.</description>
</property>
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing default regular
expressions used by RegexURLFilter.</description>
</property>
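To illustrate why this property works as an extension point, here is a hedged sketch of swapping in a custom filter by changing the value of "urlfilter.class". Everything in it is hypothetical: `PluggableUrlFilter` stands in for Nutch's URLFilter interface (assumed to return the URL to accept it and `null` to reject it), `HostSuffixFilter` is an invented example filter, and `java.util.Properties` replaces NutchConf.

```java
import java.net.URI;
import java.util.Properties;

// Stand-in for Nutch's URLFilter interface (assumed contract:
// return the URL to accept it, null to reject it).
interface PluggableUrlFilter {
    String filter(String url);
}

// Hypothetical custom filter: keeps only URLs whose host ends with a fixed suffix.
class HostSuffixFilter implements PluggableUrlFilter {
    private static final String SUFFIX = ".example.org";

    public String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && host.endsWith(SUFFIX) ? url : null;
        } catch (IllegalArgumentException e) {
            return null; // malformed URLs are rejected
        }
    }
}

class CustomFilterDemo {
    // Instantiates whatever class the "urlfilter.class" property names.
    static PluggableUrlFilter load(Properties conf) throws ReflectiveOperationException {
        String className = conf.getProperty("urlfilter.class");
        return (PluggableUrlFilter) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Equivalent in spirit to overriding <value> in the XML above.
        Properties conf = new Properties();
        conf.setProperty("urlfilter.class", HostSuffixFilter.class.getName());

        PluggableUrlFilter f = load(conf);
        System.out.println(f.filter("http://www.example.org/index.html"));
        System.out.println(f.filter("http://other.test/"));
    }
}
```

In a real deployment the override would go in the site configuration file rather than in code, and the custom class would have to be on the classpath.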
Now let's look at the crawler factories, which are a bit more complex.