After two weeks of fiddling and trying every resource I could find online, I finally got this working this afternoon. Since I only need Nutch's indexing functionality, I chose Nutch 1.0 and set it up for debugging in MyEclipse. The steps are as follows:
1. Set up the Java environment. There is plenty of material online, so I won't repeat it here; if this step fails, there is no point continuing.
2. Download Cygwin from http://www.cygwin.com and install it. After installation, remember to configure the environment variable under My Computer > Properties: edit the Path variable and append D:\cygwin\bin. Note: if you skip this, later runs will fail with errors like "bash not found". (If you run under Cygwin you can cd into bin first, but when running from MyEclipse it is best to set the environment variable.)
3. Download the Nutch package from http://labs.renren.com/apache-mirror//nutch/ — version 1.0 is used here.
4. Download JavaCC (used to compile NutchAnalysis.jj when adding Chinese word segmentation).
5. Download imdict-chinese-analyzer (the Java word-segmentation tool from the Chinese Academy of Sciences).
6. Import Nutch into MyEclipse:
6.1 File > New > Project..., choose "Java Project from Existing Ant Buildfile", click "Next", browse to the build.xml inside the Nutch package as the Ant buildfile, pick any project name you like, then click "Finish".
6.2 Modify nutch-site.xml under the project's conf folder. Mine looks like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<!-- file properties -->
<property>
<name>searcher.dir</name>
<value>D:\nutch-1.0\localweb</value>
<description></description>
</property>
<property>
<name>http.robots.agents</name>
<value>Nutch,*</value>
<description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence.
You should put the value of http.agent.name as the first agent name,
and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch Search Engine</value>
<description>Further description of our bot - this text is used in the User-Agent header.
It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/bot.html</value>
<description>A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name.
Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>nutch-agent@lucene.apache.org</value>
<description>An email address to advertise in the HTTP 'From' request header and User-Agent header.
A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
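Each entry above follows the Hadoop-style <property> layout with <name>, <value>, and <description> children, which Nutch reads at startup. As a rough self-contained illustration of how such a file maps names to values (a minimal stand-in with a made-up class name, not the actual Hadoop Configuration API):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class NutchSiteDemo {
    // Look up a property value in Hadoop/Nutch-style configuration XML.
    // Toy illustration only; real Nutch code goes through Configuration.get().
    static String lookup(String xml, String key) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent();
                if (name.equals(key))
                    return p.getElementsByTagName("value").item(0).getTextContent();
            }
            return null; // property not defined in this file
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><property><name>http.agent.name</name>"
                   + "<value>Nutch</value></property></configuration>";
        System.out.println(lookup(xml, "http.agent.name")); // Nutch
    }
}
```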
6.3 Modify crawl-urlfilter.txt. Mine looks like this:
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^
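The filter's semantics are: rules are tried top to bottom, the first regex that matches decides, '+' accepts and '-' rejects, and the bare `+^` at the end accepts everything that survived the earlier rules. A sketch of that first-match logic in plain Java (a simplified toy with a shortened rule list, not Nutch's actual RegexURLFilter):

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // Shortened stand-in for crawl-urlfilter.txt: '+' accepts, '-' rejects,
    // first matching pattern wins.
    static final String[] RULES = {
        "-^(file|ftp|mailto):",
        "-\\.(gif|jpg|png|css|zip|exe)$",
        "-[?*!@=]",
        "+^"
    };

    static boolean accepts(String url) {
        for (String rule : RULES) {
            // rule.charAt(0) is the '+'/'-' prefix; the rest is the regex.
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // no pattern matched: the URL is ignored
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.baidu.com/"));       // true
        System.out.println(accepts("ftp://example.com/file.txt"));  // false
        System.out.println(accepts("http://example.com/logo.gif")); // false
        System.out.println(accepts("http://example.com/a?b=1"));    // false
    }
}
```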
6.4 Create weburls.txt in the Nutch project root and put the seed URLs to crawl in it, e.g. http://www.baidu.com. Multiple URLs are fine, one per line.
Note: if you run into problems with 6.2, 6.3 or 6.4, search online - there are plenty of resources covering this part.
6.5 Right-click the Nutch project, choose Build Path > Configure Build Path, and under Libraries use Add Class Folder... to add the conf folder (this step is essential).
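The reason this step matters is that Nutch resolves its configuration files (nutch-default.xml, nutch-site.xml) as classpath resources; if conf is not on the build path, those lookups silently return null. A quick way to verify from code, sketched here with a made-up helper class:

```java
import java.io.InputStream;

public class ClasspathCheck {
    // Returns true if the named resource is visible on the classpath,
    // which is how Nutch locates nutch-site.xml at runtime.
    public static boolean onClasspath(String resource) {
        InputStream in = ClasspathCheck.class.getClassLoader().getResourceAsStream(resource);
        return in != null;
    }

    public static void main(String[] args) {
        // true once the conf folder has been added as a class folder:
        System.out.println(onClasspath("nutch-site.xml"));
        // sanity check: .class resources are always visible
        System.out.println(onClasspath("java/lang/Object.class")); // true
    }
}
```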
7. Run it: in Run Configurations, set Main class to org.apache.nutch.crawl.Crawl; under Arguments, set Program arguments to: weburls.txt -dir localweb -depth 3 -topN 100, and VM arguments to: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log -Xms256m -Xmx1024m
8. That completes deploying and running Nutch inside MyEclipse. Once the run finishes, you can use the Luke tool to inspect the index generated in the localweb folder under the project root. Next, add Chinese word segmentation:
First, package imdict-chinese-analyzer into a jar in MyEclipse.
The downloaded imdict-chinese-analyzer-java5.zip is just an Eclipse project; we need to build a jar from it (alternatively, its sources can be dropped straight into the Nutch source project). Note that compiling it requires adding lucene-core-2.4.0.jar and junit-4.1.jar to the project. You also need to create a stopwords.txt file under the org.apache.lucene.analysis.cn package, and place bigramdict.mem and coredict.mem under org.apache.lucene.analysis.cn.smart.hhmm; both files can be downloaded online. Once everything is in place, export the project as chinese-analyzer.jar.
Second, put the jar into Nutch's lib folder and add it to the build path.
Third, wire in the Chinese word segmentation:
- Modify the NutchAnalysis.jj file under org.apache.nutch.analysis. Change
| <SIGRAM: <CJK> >
to
| <SIGRAM: (<CJK>)+ >
Then compile NutchAnalysis.jj with JavaCC, which produces 7 Java source files. (To compile: open CMD, cd into JavaCC's bin folder, and run javacc NutchAnalysis.jj.)
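To see what the grammar change does: `<CJK>` matched a single CJK character, so every Chinese character became its own token, while `(<CJK>)+` matches a whole run of consecutive CJK characters as one token, which the analyzer can then segment properly. A rough self-contained illustration of the difference using plain Java regexes (not JavaCC):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SigramDemo {
    // Before the change: <CJK> emits one token per CJK character.
    static List<String> perChar(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\u4e00-\u9fa5]").matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // After the change: (<CJK>)+ emits one token per run of CJK characters.
    static List<String> perRun(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\u4e00-\u9fa5]+").matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        String text = "nutch中文分词";
        System.out.println(perChar(text)); // [中, 文, 分, 词]
        System.out.println(perRun(text));  // [中文分词]
    }
}
```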
- Put those 7 files into the Nutch source project and modify org.apache.nutch.analysis.ParseException.java. Change
public class ParseException extends Exception
to
public class ParseException extends java.io.IOException
This is how Nutch's original source file declares it; the file generated from the .jj grammar gets it wrong, and without this change the compiler complains that ParseException is never caught. Make sure all 7 files still compile once they are in the project.
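The point of the change can be shown with a toy example (class and method names made up, not Nutch's actual code): callers that only catch IOException can handle the parser's exception precisely because ParseException is a subclass of IOException.

```java
import java.io.IOException;

public class ParseExceptionDemo {
    // Stand-in for the regenerated class: it must extend IOException,
    // because Nutch's call sites catch IOException, not a bare Exception.
    static class ParseException extends IOException {
        ParseException(String msg) { super(msg); }
    }

    static void parse(String query) throws ParseException {
        if (query.isEmpty()) throw new ParseException("empty query");
    }

    static String tryParse(String query) {
        try {
            parse(query);
            return "ok";
        } catch (IOException e) { // catches ParseException only because of the subclassing
            return "caught: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse(""));      // caught: empty query
        System.out.println(tryParse("nutch")); // ok
    }
}
```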
- Modify org.apache.nutch.analysis.NutchDocumentAnalyzer.java, replacing the last method in the file with the version below, which switches over to the new segmenter:

/** Returns a new token stream for text from the named field. */
public TokenStream tokenStream(String fieldName, Reader reader) {
  Analyzer analyzer;
  /*
  if ("anchor".equals(fieldName))
    analyzer = ANCHOR_ANALYZER;
  else
    analyzer = CONTENT_ANALYZER;
  */
  analyzer = new org.apache.lucene.analysis.cn.SmartChineseAnalyzer(true);
  return analyzer.tokenStream(fieldName, reader);
}
Finally, add chinese-analyzer.jar to build.xml, as follows:
<lib dir="${lib.dir}">
<include name="lucene*.jar"/>
<include name="taglibs-*.jar"/>
<include name="hadoop-*.jar"/>
<include name="dom4j-*.jar"/>
<include name="xerces-*.jar"/>
<include name="tika-*.jar"/>
<include name="apache-solr-*.jar"/>
<include name="commons-httpclient-*.jar"/>
<include name="commons-codec-*.jar"/>
<include name="commons-collections-*.jar"/>
<include name="commons-beanutils-*.jar"/>
<include name="commons-cli-*.jar"/>
<include name="commons-lang-*.jar"/>
<include name="commons-logging-*.jar"/>
<include name="log4j-*.jar"/>
<include name="chinese-analyzer.jar"/>
</lib>