Building Your Own Search Engine on Windows with Nutch

IT局


Article link: https://blog.csdn.net/liuliming3000/article/details/1804036


Nutch Windows Installation Guide

--By Liming Liu

1 Install Cygwin

2 Install JDK

3 Install Tomcat

4 Pre-Install nutch

5 Configure and run nutch

6 Begin search

7 References

1 Install Cygwin

Download and install the latest version of Cygwin. When selecting packages, be sure to select GCC.

2 Install JDK

Download jdk-1_5_0_06-windows-i586-p.exe and install it (by default, to C:/Program Files/Java/jdk1.5.0_06).

Set the following environment variables:

NUTCH_JAVA_HOME: C:/Program Files/Java/jdk1.5.0_06

JAVA_HOME: C:/Program Files/Java/jdk1.5.0_06
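
If you would rather set these variables inside the Cygwin shell than through the Windows system dialog, a minimal sketch might look like the following. It assumes the default JDK path used in this guide. Falling back to that path, and keeping the lines in ~/.bashrc, are illustrative choices and not part of the original steps.

```shell
# Sketch: export the JDK location for the current Cygwin session.
# The default path below is the one used in this guide; adjust it if
# your JDK lives elsewhere. Add these lines to ~/.bashrc to persist them.
JAVA_HOME="${JAVA_HOME:-C:/Program Files/Java/jdk1.5.0_06}"
NUTCH_JAVA_HOME="${NUTCH_JAVA_HOME:-$JAVA_HOME}"
export JAVA_HOME NUTCH_JAVA_HOME

# The path contains a space, so always quote it when it is expanded.
echo "JAVA_HOME=$JAVA_HOME"
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

Values that are already set in the environment are left untouched, so the snippet is safe to source more than once.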

3 Install Tomcat

Download apache-tomcat-6.0.13.exe and install it (by default, to C:/Program Files/Apache Software Foundation/Tomcat 6.0). Note the port, account name, and password you choose during installation.

4 Pre-Install nutch

Download nutch-0.9.tar.gz and extract it to nutch-0.9 (for example, C:/dev/search/netch/nutch-0.9).

Start the Tomcat service and open http://localhost:8080/manager/html

Go to “WAR file to deploy” and upload the file C:/dev/search/netch/nutch-0.9/nutch-0.9.war.

Optionally, to serve Nutch from the site root: stop the Tomcat service, then in “C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps” rename the directory “ROOT” to “ROOT-backup” and rename the directory “nutch-0.9” to “ROOT”. If you skip this step, the search page is served from the webapp's own path instead of the site root.
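
The optional rename can also be done from the Cygwin shell. The sketch below is one way to do it safely. The function name promote_to_root is hypothetical, and the refusal to overwrite an existing ROOT-backup is an added safeguard that the original steps do not mention.

```shell
# Sketch: make a deployed webapp the Tomcat root application.
# Usage: promote_to_root <webapps-dir> <app-name>
# Run it only while the Tomcat service is stopped.
promote_to_root() {
  webapps="$1"
  app="$2"
  if [ ! -d "$webapps/$app" ]; then
    echo "no such webapp: $webapps/$app" >&2
    return 1
  fi
  # Refuse to clobber an earlier backup of the original ROOT.
  if [ -e "$webapps/ROOT-backup" ]; then
    echo "backup already exists: $webapps/ROOT-backup" >&2
    return 1
  fi
  # Keep the original ROOT application so the change can be undone.
  if [ -d "$webapps/ROOT" ]; then
    mv "$webapps/ROOT" "$webapps/ROOT-backup" || return 1
  fi
  mv "$webapps/$app" "$webapps/ROOT"
}

# Example (paths from this guide):
# promote_to_root "C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps" nutch-0.9
```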

5 Configure and run nutch

Create a directory “urls” in “C:/dev/search/netch/nutch-0.9”.

Create a file “testurlfile” in the “urls” directory.

Add the line “http://www.bokee.com” to “testurlfile”.

Open “C:/dev/search/netch/nutch-0.9/conf/crawl-urlfilter.txt” and replace “MY.DOMAIN.NAME” with “bokee.com”.
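
The seed list and the filter edit can be scripted from the Cygwin shell, roughly as sketched below. NUTCH_HOME and DOMAIN are illustrative shell variables, not Nutch settings. When NUTCH_HOME is unset the script works in a scratch directory, and the stand-in filter line it writes there is an assumption about what the placeholder rule looks like. A real installation already ships its own crawl-urlfilter.txt.

```shell
# Sketch: create the seed list and point the URL filter at one domain.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"   # scratch dir unless you set it
DOMAIN="bokee.com"
FILTER="$NUTCH_HOME/conf/crawl-urlfilter.txt"

# 1. Seed list: one start URL per line.
mkdir -p "$NUTCH_HOME/urls"
printf 'http://www.%s\n' "$DOMAIN" > "$NUTCH_HOME/urls/testurlfile"

# 2. Stand-in filter file for dry runs only (assumed placeholder line).
if [ ! -f "$FILTER" ]; then
  mkdir -p "$NUTCH_HOME/conf"
  printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$FILTER"
fi

# 3. Replace the placeholder, going through a temp file so the edit
#    does not depend on a particular sed's in-place option.
sed "s/MY\.DOMAIN\.NAME/$DOMAIN/g" "$FILTER" > "$FILTER.tmp" \
  && mv "$FILTER.tmp" "$FILTER"

# Show the results for a quick visual check.
cat "$NUTCH_HOME/urls/testurlfile"
grep -n "$DOMAIN" "$FILTER"
```

To apply it to the installation from this guide, set NUTCH_HOME to C:/dev/search/netch/nutch-0.9 before running it.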

Open “C:/dev/search/netch/nutch-0.9/conf/nutch-site.xml” and edit it to read:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>http.agent.name</name>
    <value>nutch</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>liming agent.description</value>
    <description>Further description of our bot - this text is used in
    the User-Agent header. It appears in parentheses after the agent name.
    </description>
  </property>

  <property>
    <name>http.agent.url</name>
    <description>A URL to advertise in the User-Agent header. This will
    appear in parentheses after the agent name. Custom dictates that this
    should be a URL of a page explaining the purpose and behavior of this
    crawler.
    </description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>agent.email</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
  </property>

</configuration>
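
Because the description says http.agent.name must not be empty, a quick check before crawling can save a failed run. The sketch below is a plain-text check with awk, not a real XML parser. The function name agent_name_is_set is hypothetical, and it assumes the value element is the first non-blank line after the name element, as in the file above.

```shell
# Sketch: succeed only if http.agent.name has a non-empty <value>.
# Usage: agent_name_is_set <path-to-nutch-site.xml>
agent_name_is_set() {
  awk '
    # Remember that the property name was seen, then look at what follows.
    /<name>http\.agent\.name<\/name>/ { seen = 1; next }
    # The first non-blank line after the name must carry a non-empty value.
    seen && NF { ok = ($0 ~ /<value>[^<]+<\/value>/); exit }
    END { exit ok ? 0 : 1 }
  ' "$1"
}

# Example, run from the Nutch directory:
# agent_name_is_set conf/nutch-site.xml && echo "agent name OK"
```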

In “C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/ROOT/WEB-INF/classes/”, open the Nutch configuration file and edit it to read:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>searcher.dir</name>
    <value>C:/dev/search/netch/nutch-0.9/crawl.demo</value>
  </property>

</configuration>

Find file “C:/Program Files/Apache Software Foundation/Tomcat 6.0/conf/server.xml”. Edit the item “<Connector port="8080" …/>” to this:

Start the Tomcat service.

Start Cygwin, cd to “C:/dev/search/netch/nutch-0.9”, and run:

bin/nutch crawl urls -dir crawl.demo -depth 2 -topN 50
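
The directory passed to -dir is the same crawl.demo directory that searcher.dir points to, so it is worth confirming that the crawl actually filled it before you open the search page. The sketch below does that. The function name check_crawl_dir is hypothetical, and the list of subdirectories is an assumption based on what a Nutch 0.9 crawl typically writes.

```shell
# Sketch: report whether a crawl directory has the expected layout.
# Usage: check_crawl_dir <crawl-dir>
check_crawl_dir() {
  dir="$1"
  missing=0
  # Assumed layout of a finished crawl; adjust if your output differs.
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, run from the Nutch directory after the crawl finishes:
# check_crawl_dir crawl.demo && echo "crawl output looks complete"
```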

6 Begin search

Open http://localhost:8080 in Internet Explorer and you will see a working search engine.

(Or open http://localhost:8080/nutch)

7 References

http://www.javaeye.com/topic/81627 Nutch 0.8 in Practice (1), X.D.Hua

http://www.ideagrace.com/club/simple/index.php?t312.html Nutch on WinXP, Kevin

http://blog.csdn.net/pwlazy/archive/2006/08/23/1109868.aspx A First Look at Nutch 0.8 on Windows, pwlazy

Liming Liu:

Liu Liming (刘黎明), M.S. in Computer Science, University of Science and Technology Beijing, liuliming2008@126.com
