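1. First, create a directory under src/plugin. This directory is the plugin itself; here we name it urlFilter.
2. Inside it, create two files and one directory.
build.xml: the file Ant uses when compiling the plugin.
plugin.xml: registers the plugin. Nutch reads this file internally, and every plugin has one.
src/java/* directory: * is the package directory of the plugin source. The leading src/java is required. For example, if your source is in the package com.plugin, the directory is src/java/com/plugin. The plugin source goes in this directory (the layout is the same as in Eclipse, except that Eclipse's src becomes src/java).
3. Write a plugin that implements the URLFilter interface (this example comes from http://www.javaeye.com/topic/550277).
Source code: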
package com.plugin;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class UrlLengthFilter implements URLFilter {
    private static final Log LOG = LogFactory.getLog(UrlLengthFilter.class);
    private Configuration conf;

    // URLFilter contract: return the URL to let it pass, or null to reject it.
    // This demo returns only the host part, which is enough to see in the log
    // that the plugin is being called.
    @Override
    public String filter(String inUrl) {
        System.out.println("Plugin loaded successfully!");
        LOG.info("begin UrlLengthFilter is .....");
        String urlFilter = "";
        // Compare string contents with isEmpty(), not with ==
        if (inUrl == null || inUrl.isEmpty()) return urlFilter;
        String url = inUrl.toLowerCase();
        // Skip the scheme prefix, if there is one
        int start = url.startsWith("http://") ? 7 : (url.startsWith("https://") ? 8 : 0);
        url = url.substring(start);
        // Keep everything up to the first '/', or the whole rest if there is none
        int end = url.indexOf("/");
        end = end < 0 ? url.length() : end;
        urlFilter = url.substring(0, end);
        LOG.info("urlFilter is " + urlFilter);
        return urlFilter;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
Then put this source file into the src/java/com/plugin folder.
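The string handling inside filter() can be tried without a Nutch installation. The following is a minimal sketch that uses only the JDK; the class name HostExtractor and the method extractHost are hypothetical names introduced for illustration and are not part of Nutch or of the original plugin.

```java
import java.util.Locale;

/** Hypothetical stand-alone helper that mirrors the string handling in UrlLengthFilter.filter(). */
final class HostExtractor {

    private HostExtractor() {}

    /** Returns the lower-cased text between the scheme and the first '/', or "" for null or empty input. */
    static String extractHost(String inUrl) {
        if (inUrl == null || inUrl.isEmpty()) {
            return "";
        }
        String url = inUrl.toLowerCase(Locale.ROOT);
        // Skip "http://" (7 characters) or "https://" (8 characters) when present
        int start = 0;
        if (url.startsWith("http://")) {
            start = 7;
        } else if (url.startsWith("https://")) {
            start = 8;
        }
        url = url.substring(start);
        // Cut at the first '/', if any
        int end = url.indexOf('/');
        return end < 0 ? url : url.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://www.Example.com/docs/index.html")); // www.example.com
        System.out.println(extractHost("https://example.org"));                    // example.org
    }
}
```

Running main prints the host part of each sample URL, which is the same value the plugin writes to the log as "urlFilter is ...".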
build.xml:
<?xml version="1.0"?>
<project name="urlFilter" default="jar-core">
  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-xml"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-xml/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-xml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
  </target>
</project>
plugin.xml:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="urlFilter"
   name="urlFilter"
   version="0.0.1"
   provider-name="nutch.org">
   <runtime>
      <!-- Must match the jar that build.xml produces for this plugin -->
      <library name="urlFilter.jar">
         <export name="*"/>
      </library>
   </runtime>
   <!-- Registers UrlLengthFilter as an implementation of the URLFilter extension point -->
   <extension id="com.plugin"
              name="urlFilter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="urlFilter"
                      class="com.plugin.UrlLengthFilter"/>
   </extension>
</plugin>
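4. Edit build.xml in the src/plugin directory under the Nutch root.
Add <ant dir="urlFilter" target="deploy"/> to the deploy target, next to the entries for the other plugins.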
5. Build the whole project with ant. (I compiled the urlFilter plugin first and then the whole project, but building everything directly also seems to work.)
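6. Then create a folder named urlFilter (the name must be identical) in the plugins directory under the Nutch root. Copy the jar file generated in build/urlFilter under the Nutch root into this folder, and then copy the plugin.xml written above into it as well. (This is what I ran into. Apparently, for many people ant creates these files under plugins automatically after the build; it did not for me, so I moved them by hand.)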
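If the manual copy has to be repeated after every rebuild, it can be scripted. The sketch below uses only java.nio.file; the class name PluginDeployer is hypothetical, and the paths in main (in particular the jar name urlFilter.jar, taken from the library name in plugin.xml) are assumptions about the layout that may need adjusting.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that performs the manual copy described in step 6. */
final class PluginDeployer {

    private PluginDeployer() {}

    /** Copies the jar and plugin.xml into pluginsDir/pluginName, creating that folder if needed. */
    static Path deploy(Path jarFile, Path pluginXml, Path pluginsDir, String pluginName)
            throws IOException {
        Path target = pluginsDir.resolve(pluginName);
        Files.createDirectories(target);
        Files.copy(jarFile, target.resolve(jarFile.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
        Files.copy(pluginXml, target.resolve("plugin.xml"),
                StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Assumed layout relative to the Nutch root; pass the root as the first argument
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Path jar = root.resolve("build/urlFilter/urlFilter.jar");      // assumed jar name
        Path xml = root.resolve("src/plugin/urlFilter/plugin.xml");
        Path target = deploy(jar, xml, root.resolve("plugins"), "urlFilter");
        System.out.println("Copied plugin files to " + target);
    }
}
```

Existing files in the target folder are overwritten, so the helper can be rerun after each build.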
7. Then add the following to nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|urlFilter</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
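If you already have this property, just append "|urlFilter" to the value.
Then rebuild and run, and you will see that the plugin takes effect.
* When debugging, check the hadoop.log file in the logs directory under the Nutch root often. Some errors are reported only there and do not appear on the console.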
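When the plugin does not seem to load, one thing worth checking is whether its name is accepted by the plugin.includes expression from step 7. The sketch below uses only java.util.regex and assumes that the whole name has to match the expression (as with Matcher.matches()); that is an assumption about how Nutch applies the value, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical check of plugin names against the plugin.includes value from step 7. */
final class PluginIncludesCheck {

    // Value copied from the nutch-site.xml property in step 7
    static final String INCLUDES =
            "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
            + "|index-basic|query-(basic|site|url)|urlFilter";

    private PluginIncludesCheck() {}

    /** Returns true if the whole plugin name matches the includes expression. */
    static boolean isIncluded(String includes, String pluginName) {
        return Pattern.compile(includes).matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIncluded(INCLUDES, "urlFilter"));   // true
        System.out.println(isIncluded(INCLUDES, "urlfilter"));   // false: the match is case-sensitive
        System.out.println(isIncluded(INCLUDES, " urlFilter"));  // false: stray leading space
        System.out.println(isIncluded(INCLUDES, "parse-html"));  // true
        System.out.println(isIncluded(INCLUDES, "parse-pdf"));   // false
    }
}
```

Under that assumption, a difference in case or a stray space in the plugin name is enough to keep it from being included.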
8. Main references
http://www.javaeye.com/topic/550277
http://www.javaeye.com/topic/549962
http://www.javaeye.com/topic/549960
http://nhy520.javaeye.com/blog/394378
http://wiki.apache.org/nutch/WritingPluginExample-0.9?highlight=(plugin)|(nutch)
http://lucene.472066.n3.nabble.com/URLFilter-Plugin-ClassNotFoundException-td617694.html