[Original] Nutch 0.8 in Practice (1)

iteye_11670

2007-5-21

Keywords: Nutch, Lucene

(I) Preface

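1. Overview

The goal is full-text search over local content, with room to extend it later to full-text search over other websites. I am trying the Lucene search engine for this; if it works well, it can be rolled out to future products and projects.

Along the way I found that the material available online varies widely in quality, so I decided to write this document for everyone's reference, study, and discussion.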

2. Production environment

Windows Server 2003 Enterprise Edition + WAS 6.0 (bundled JRE 1.4.2)

3. Test and development environment

Windows XP Pro + JRE 1.4.2_03 + Tomcat 5.0

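(II) Development

1. Prerequisites

Cygwin download: official Cygwin site, http://www.cygwin.com

Nutch 0.8 download: official Nutch site, http://lucene.apache.org/nutch/

Lukeall 0.6 download: http://www.getopt.org/luke/ (a tool for inspecting the results of a Nutch crawl; optional)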

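Suggestion: download http://www.cygwin.com/setup.exe, run setup.exe, and choose the second option ("Download Without Installing") with a .tw mirror. Once the download finishes, run setup.exe again and choose the third option ("Install from Local Directory") to complete the Cygwin installation.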

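Aside 1: Both the production environment and the test/development environment run JRE 1.4. Nutch 0.9 reported a version-incompatibility error right after installation, and a look at the source showed that it is written for JRE 1.5. I reluctantly gave up the recently updated Lucene 2.1.0 and installed Nutch 0.8 instead. lukeall-0.7.jar is also built for JRE 1.5, so use lukeall-0.6.jar.

Aside 2: By default Cygwin does not provide vi, more, or crontab (a pity...). I recommend downloading and installing them, cron included, before continuing with the Nutch configuration.

2. Configuration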

First download Nutch and extract the whole archive to $cygwin_home/home/$user/nutch, then set the environment variable NUTCH_JAVA_HOME=$JAVA_HOME
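A minimal sketch of setting the variable from a Cygwin shell. The fallback JRE path below is an assumption for illustration only; replace it with the actual install location. The bin/nutch script uses NUTCH_JAVA_HOME, when set, as its Java home.

```shell
# Assumed fallback location of the JRE (hypothetical path, adjust as needed).
: "${JAVA_HOME:=/cygdrive/c/j2sdk1.4.2_03}"
export JAVA_HOME

# Point Nutch at the same Java installation.
export NUTCH_JAVA_HOME="$JAVA_HOME"

echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

To make the setting permanent, the two export lines could go into ~/.bash_profile.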

2.1. Edit $cygwin_home/home/$user/nutch/conf/crawl-urlfilter.txt

Shell code:

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://192.168.0.92:8080/

(Note: adjust to your actual setup.)
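Before crawling, it can help to check which URLs such a rule would accept. The sketch below only approximates the filter with grep's extended regular expressions (Nutch itself evaluates the rules with Java regular expressions), and the function name matches_filter is hypothetical. It also escapes the dots in the IP address; in the rule above the unescaped dots match any character, which is looser than intended but still accepts the target host.

```shell
# Accept rule from crawl-urlfilter.txt without the leading "+",
# with the dots escaped so they match literal dots only.
pattern='^http://192\.168\.0\.92:8080/'

# Hypothetical helper: succeeds when the URL matches the accept pattern.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in \
  "http://192.168.0.92:8080/nbtravel/index.html" \
  "http://www.example.com/index.html"
do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```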

2.2. Add the file $cygwin_home/home/$user/nutch/urls/url.txt

Add the following content (plain text):

http://192.168.0.92:8080/nbtravel/index.html

(Note: adjust to your actual setup.)
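The seed file can also be created from the Cygwin shell. This is a sketch under assumptions: NUTCH_DIR is a hypothetical variable introduced here, defaulting to ~/nutch, which is where $cygwin_home/home/$user/nutch normally appears inside Cygwin.

```shell
# Hypothetical variable for the Nutch install directory.
NUTCH_DIR="${NUTCH_DIR:-$HOME/nutch}"

# Create the seed directory and write one start URL per line.
mkdir -p "$NUTCH_DIR/urls"
printf '%s\n' 'http://192.168.0.92:8080/nbtravel/index.html' \
  > "$NUTCH_DIR/urls/url.txt"

# Show the result for a quick visual check.
cat "$NUTCH_DIR/urls/url.txt"
```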

2.3. Edit $cygwin_home/home/$user/nutch/conf/nutch-site.xml

Modified content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>Nutch,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch Search Engineer</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/bot.html</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>nutch-agent@lucene.apache.org</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
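Since the description above says http.agent.name must not be empty, a quick sanity check before crawling may save a failed run. The sketch below is a rough, line-based check, not an XML parser: it assumes each element sits on its own line, as in the listing above, and it works on a small sample file created on the spot (a stand-in for conf/nutch-site.xml).

```shell
# Hypothetical sample file standing in for conf/nutch-site.xml.
conf_file=$(mktemp)
cat > "$conf_file" <<'EOF'
<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch</value>
</property>
</configuration>
EOF

# Take the line following the <name> element and strip the <value> tags.
agent_name=$(grep -A1 '<name>http.agent.name</name>' "$conf_file" \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')

if [ -n "$agent_name" ]; then
  echo "http.agent.name is set to: $agent_name"
else
  echo "http.agent.name is empty; set it before crawling" >&2
fi

rm -f "$conf_file"
```

To check the real configuration, point conf_file at the actual nutch-site.xml and drop the here-document and the rm line.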

2.4. Edit $tomcat_home/conf/server.xml

2.4.1. Connector

XML code

2.4.2. Context

XML code

(Note: adjust to your actual setup; pub and nbtravel are my target projects.)

2.5. Start Tomcat

2.6. Edit $tomcat_home/webapps/nutch/WEB-INF/classes/nutch-site.xml

Modified content:

XML code

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>searcher.dir</name>
<value>D:\cygwin\home\Howard\nutch\crawl</value>
</property>
</configuration>