Nutch 0.8 in Practice (1)
2007-5-21

Keywords: Nutch, Lucene
(I) Preface
1. Overview
To implement full-text search for local content, with the option of later extending it to full-text search of other websites, we are trying out the Lucene search engine. If the results are good, it can be rolled out to future products and projects.
Along the way we found that the material available online varies widely in quality, so we decided to write this document for reference, study, and discussion.

2. Production environment
Windows Server 2003 Enterprise Edition + WAS 6.0 (bundled JRE 1.4.2)

3. Test and development environment
Windows XP Pro + JRE 1.4.2_03 + Tomcat 5.0
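1. Prerequisites
Cygwin download: official Cygwin site, http://www.cygwin.com
Nutch 0.8 download: official Nutch site, http://lucene.apache.org/nutch/
Lukeall 0.6 download: http://www.getopt.org/luke/ (a tool for inspecting Nutch crawl results; optional)
Suggestion: download http://www.cygwin.com/setup.exe and run setup.exe. Choose the second option ("Download Without Installing") and pick a .tw mirror for the download. When the download finishes, run setup.exe again and choose the third option ("Install from Local Directory") to complete the Cygwin installation.
Interlude 1: Because both the production environment and the test/development environment run JRE 1.4, Nutch 0.9 reported a version incompatibility error as soon as it was installed. A look at the source showed that it is written for JRE 1.5, so we reluctantly gave up the recently updated Lucene 2.1.0 and went back to Nutch 0.8. lukeall-0.7.jar is also built for JRE 1.5, so use lukeall-0.6.jar instead.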
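The version constraint from Interlude 1 can be written down as a small check. This is only a sketch: the function names `java_minor` and `supports_nutch_09` are hypothetical, and it assumes a version string in the old `1.x.y_zz` form, such as the `1.4.2_03` used in this article.

```shell
# java_minor VERSION: print the second component of a "1.x.y_zz" version string.
java_minor() {
  printf '%s\n' "$1" | cut -d. -f2
}

# supports_nutch_09 VERSION: succeed only for Java 1.5 or later.
# Per Interlude 1, Nutch 0.9 and lukeall-0.7.jar need JRE 1.5,
# while Nutch 0.8 and lukeall-0.6.jar run on JRE 1.4.
supports_nutch_09() {
  [ "$(java_minor "$1")" -ge 5 ]
}

if supports_nutch_09 "1.4.2_03"; then
  echo "JRE is new enough for Nutch 0.9"
else
  echo "Stay on Nutch 0.8 and lukeall-0.6.jar"
fi
```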
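Interlude 2: By default Cygwin does not include vi, more, or crontab (a pity...). We recommend installing all of them, including the cron package, before continuing with the Nutch configuration.

2. Configuration
After downloading Nutch, extract the whole archive into $cygwin_home/home/$user/nutch, then set the environment variable NUTCH_JAVA_HOME=$JAVA_HOME.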
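One way to set the variable from a Cygwin shell is sketched below. The function name `set_nutch_java_home` is hypothetical, and the sketch assumes JAVA_HOME already points at your Java installation.

```shell
# set_nutch_java_home: copy JAVA_HOME into NUTCH_JAVA_HOME and export it.
# Fails with a message if JAVA_HOME is empty or unset.
set_nutch_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set; point it at your Java installation first" >&2
    return 1
  fi
  NUTCH_JAVA_HOME="$JAVA_HOME"
  export NUTCH_JAVA_HOME
}
```

Calling `set_nutch_java_home` from `~/.bashrc` would keep the setting across interactive Cygwin sessions.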
2.1. Edit $cygwin_home/home/$user/nutch/conf/crawl-urlfilter.txt

Shell code:
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://192.168.0.92:8080/
(Note: adjust this to your own environment.)
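To get a feel for which URLs the `+` rule above accepts, the pattern can be tried out with grep. This is a rough sketch only: the function name `url_allowed` is hypothetical, grep's extended regular expressions stand in for the Java regular expressions Nutch uses, and other rules in crawl-urlfilter.txt may reject a URL before this rule is reached.

```shell
# The regular expression from the '+' rule, without the leading '+'.
pattern='^http://192.168.0.92:8080/'

# url_allowed URL: succeed if URL matches the include pattern.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in \
  'http://192.168.0.92:8080/nbtravel/index.html' \
  'http://www.example.com/index.html'
do
  if url_allowed "$url"; then
    echo "match:    $url"
  else
    echo "no match: $url"
  fi
done
```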
2.2. Add the file $cygwin_home/home/$user/nutch/urls/url.txt
Add this line:
http://192.168.0.92:8080/nbtravel/index.html
(Note: adjust this to your own environment.)
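The seed file can also be created from the Cygwin shell. The sketch below uses a hypothetical helper, `write_seed_file`, that takes the Nutch directory and the seed URL as arguments; both values are assumptions to adapt.

```shell
# write_seed_file NUTCH_DIR URL: create NUTCH_DIR/urls/url.txt with one seed URL.
write_seed_file() {
  nutch_dir="$1"
  seed_url="$2"
  mkdir -p "$nutch_dir/urls" || return 1
  printf '%s\n' "$seed_url" > "$nutch_dir/urls/url.txt"
}
```

For the layout used in this article, the call would be `write_seed_file "$HOME/nutch" "http://192.168.0.92:8080/nbtravel/index.html"`, assuming `$HOME` is `$cygwin_home/home/$user`.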
2.3. Edit $cygwin_home/home/$user/nutch/conf/nutch-site.xml
Modified content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
    and set their values appropriately.
    </description>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>Nutch,*</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should
    put the value of http.agent.name as the first agent name, and keep the
    default * at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch Search Engineer</value>
    <description>Further description of our bot- this text is used in
    the User-Agent header. It appears in parenthesis after the agent name.
    </description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://lucene.apache.org/nutch/bot.html</value>
    <description>A URL to advertise in the User-Agent header. This will
    appear in parenthesis after the agent name. Custom dictates that this
    should be a URL of a page explaining the purpose and behavior of this
    crawler.
    </description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>nutch-agent@lucene.apache.org</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
  </property>
</configuration>
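Since the description of http.agent.name says the value must not be empty, a quick check after editing can catch a missing value. The sketch below is line-based rather than a real XML parser, the function name `agent_name` is hypothetical, and it assumes the `<value>` element sits on one line at or after the `<name>` line, as in the listing above.

```shell
# agent_name FILE: print the <value> that follows <name>http.agent.name</name>.
agent_name() {
  awk '
    /<name>http\.agent\.name<\/name>/ { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")      # drop everything up to the opening tag
      sub(/<\/value>.*/, "")    # drop the closing tag and anything after it
      print
      exit
    }
  ' "$1"
}

# check_agent_name FILE: fail if http.agent.name is missing or empty.
check_agent_name() {
  name=$(agent_name "$1")
  if [ -z "$name" ]; then
    echo "http.agent.name is empty or missing in $1" >&2
    return 1
  fi
  echo "http.agent.name = $name"
}
```

A typical call would be `check_agent_name "$HOME/nutch/conf/nutch-site.xml"`, assuming Nutch was extracted to the directory used in this article.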