Nutch 1.3 Study Notes: Plugin Extension (10-2)
---------------------------------
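1. Extending Nutch with a simple plugin
This section adds a URLFilter plugin to Nutch, named MyURLFilter.
1.1 Create a package
First create a package laid out like the one in urlfilter-regex, for example org.apache.nutch.urlfilter.my.
1.2 Create the extension class in this package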
Then create a MyURLFilter.java file with the following content:

    package org.apache.nutch.urlfilter.my;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Implements Nutch's URLFilter extension point.
    public class MyURLFilter implements URLFilter {

      private Configuration conf;

      public MyURLFilter() {
      }

      // Filters a URL string. For demonstration this only prepends a marker;
      // a real filter returns the URL to keep it or null to drop it.
      @Override
      public String filter(String urlString) {
        return "My Filter:" + urlString;
      }

      @Override
      public Configuration getConf() {
        return this.conf;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      // Reads URLs from standard input and prints each filtered result,
      // so the plugin can be exercised with bin/nutch plugin.
      public static void main(String[] args) throws IOException {
        MyURLFilter filter = new MyURLFilter();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
          String out = filter.filter(line);
          if (out != null) {
            System.out.println(out);
          }
        }
      }
    }
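The demo filter above returns a modified string for every input. As a minimal sketch of the usual contract (return the URL to keep it, null to drop it), the stand-alone class below rejects URLs by file suffix. The class name RejectingFilterSketch and the suffix list are hypothetical, and the code deliberately avoids Nutch imports so that it runs on its own; in a real plugin the same logic would sit inside filter(String) of a class implementing URLFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch; it does not depend on Nutch.
final class RejectingFilterSketch {

    // Hypothetical rule set: file suffixes that should not be crawled.
    private static final List<String> BLOCKED_SUFFIXES =
            Arrays.asList(".jpg", ".png", ".zip");

    // Same shape as URLFilter.filter(String):
    // return the URL unchanged to accept it, or null to reject it.
    static String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        String lower = urlString.toLowerCase(Locale.ROOT);
        for (String suffix : BLOCKED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return null; // rejected
            }
        }
        return urlString; // accepted
    }

    public static void main(String[] args) {
        List<String> kept = new ArrayList<>();
        for (String url : Arrays.asList(
                "http://example.com/index.html",
                "http://example.com/logo.PNG")) {
            String out = filter(url);
            if (out != null) {
                kept.add(out);
            }
        }
        System.out.println(kept); // prints [http://example.com/index.html]
    }
}
```

The comparison is done on a lower-cased copy so that logo.PNG is rejected as well, while the accepted URL is returned exactly as it came in.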
1.3 Build the jar and write the plugin.xml file
The jar can be built with ivy or with Eclipse. Every plugin has a descriptor file, plugin.xml, with the following content:

    <plugin
       id="urlfilter-my"
       name="My URL Filter"
       version="1.0.0"
       provider-name="nutch.org">
       <runtime>
          <library name="urlfilter-my.jar">
             <export name="*"/>
          </library>
          <!-- If your plugin depends on third-party libraries, list them like this:
          <library name="fontbox-1.4.0.jar"/>
          <library name="geronimo-stax-api_1.0_spec-1.jar"/>
          -->
       </runtime>
       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>
       <extension id="org.apache.nutch.net.urlfilter.my"
                  name="Nutch My URL Filter"
                  point="org.apache.nutch.net.URLFilter">
          <implementation id="MyURLFilter"
                          class="org.apache.nutch.urlfilter.my.MyURLFilter"/>
          <!-- by default, attribute "file" is undefined, to keep classic behavior.
          <implementation id="PrefixURLFilter"
                          class="org.apache.nutch.net.PrefixURLFilter">
             <parameter name="file" value="urlfilter-prefix.txt"/>
          </implementation>
          -->
       </extension>
    </plugin>
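1.4 Put the jar and the configuration file into the plugins directory
Finally, put the jar and plugin.xml into a folder named urlfilter-my, then place that folder under Nutch's plugins directory.
2. Testing with bin/nutch plugin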
Before running the bin/nutch plugin command, edit the nutch-site.xml configuration file and add the new plugin to the plugin.includes property, as follows:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
    </property>
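Since plugin.includes is a regular expression over plugin directory names, it is worth checking that the edited value really covers urlfilter-my. The sketch below does that with java.util.regex, assuming (this is an assumption, not something shown in these notes) that Nutch requires the whole plugin id to match the expression. The class name PluginIncludesSketch and the helper isIncluded are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking a plugin.includes value; not part of Nutch.
final class PluginIncludesSketch {

    // The value used in nutch-site.xml above.
    static final Pattern PLUGIN_INCLUDES = Pattern.compile(
            "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)"
            + "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)");

    // Assumption: a plugin is included only if its id matches the whole expression.
    static boolean isIncluded(String pluginId) {
        return PLUGIN_INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {
            "urlfilter-my", "urlfilter-regex", "urlfilter-suffix", "protocol-httpclient"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
        // prints:
        // urlfilter-my -> true
        // urlfilter-regex -> true
        // urlfilter-suffix -> false
        // protocol-httpclient -> false
    }
}
```

Under the full-match assumption, protocol-httpclient is not covered by the protocol-http alternative, which is consistent with the description asking you to enable it explicitly for HTTPS.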
The result of a test run on my machine:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch plugin urlfilter-my org.apache.nutch.urlfilter.my.MyURLFilter
    urlString1
    My Filter:urlString1
    urlString2
    My Filter:urlString2
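The command takes a plugin id and a class name, and the output comes from the main method of that class. As far as I understand it, the plugin command loads the class through the plugin's class loader and calls its main method with the remaining arguments. The sketch below only imitates that idea with plain reflection; PluginMainLauncherSketch, launch and EchoTool are hypothetical names, and the plugin id is merely carried along rather than resolved against a real plugin repository.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical imitation of "bin/nutch plugin <pluginId> <className> [args...]".
final class PluginMainLauncherSketch {

    // Stand-in for a plugin class with a main method, such as MyURLFilter.
    public static final class EchoTool {
        static String lastArgs = "";

        public static void main(String[] args) {
            lastArgs = String.join(",", args);
            System.out.println("EchoTool got: " + lastArgs);
        }
    }

    // args[0] = plugin id (unused in this sketch), args[1] = class name,
    // args[2..] = arguments passed on to that class's main method.
    static void launch(ClassLoader loader, String[] args) throws Exception {
        if (args.length < 2) {
            throw new IllegalArgumentException("usage: pluginId className [arg1 arg2 ...]");
        }
        Class<?> clazz = Class.forName(args[1], true, loader);
        Method main = clazz.getMethod("main", String[].class);
        String[] rest = Arrays.copyOfRange(args, 2, args.length);
        // The cast keeps the array from being spread as varargs.
        main.invoke(null, (Object) rest);
    }

    public static void main(String[] args) throws Exception {
        launch(PluginMainLauncherSketch.class.getClassLoader(),
               new String[] {"urlfilter-my", EchoTool.class.getName(),
                             "urlString1", "urlString2"});
        // prints: EchoTool got: urlString1,urlString2
    }
}
```

This is also why MyURLFilter needs its own public static main method: without one there is nothing for the command to invoke.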
3. Summary
This is only a simple plugin; you can write more complex ones to suit your own needs.