A Quick and Shallow Guide to Using Nutch 0.8 on Windows

xiaodaoxiaodao

Article link: https://blog.csdn.net/xiaodaoxiaodao/article/details/1342433

Copyright: please credit the source and author when reposting.

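Nutch is a search engine. I only heard about it from a friend yesterday. Having worked with Lucene a while ago, I was itching to try something search-related, so I gave it a spin over the weekend and found it quite refreshing. Most examples online target version 0.7, and the few 0.8 ones I found would not run. After half a day of fumbling around, here are my notes.

System environment: Tomcat 5.0.12 / JDK 1.5 / nutch 0.8.1 / cygwin-cd-release-20060906.iso

Steps:

1. Nutch needs a Unix environment to run, so Windows users must first install Cygwin, free software that emulates a Unix environment on Windows. You can download the online installer from http://www.cygwin.com, or the full installation image from http://www-inst.eecs.berkeley.edu/~instcd/iso/ (my download was 1.27 GB, so make sure you have enough disk space). During installation, just click Next all the way through.

2. Download nutch 0.8.1 from http://apache.justdn.org/lucene/nutch/. I unpacked it to D:/nutch-0.8.1.

3. Create a new folder named urls under nutch-0.8.1, and create a text file inside it (any file name will do) containing a single line: http://lucene.apache.org/nutch. This is the site to crawl.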
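Step 3 can also be done from the Cygwin prompt. This is only a minimal sketch: it assumes the current directory is nutch-0.8.1, and the file name seed.txt is my own arbitrary choice, since any name works.

```shell
# Run from the nutch-0.8.1 directory.
# "seed.txt" is a hypothetical name picked for this sketch; Nutch reads
# every file in the urls folder, so the name itself does not matter.
mkdir -p urls
printf '%s\n' 'http://lucene.apache.org/nutch' > urls/seed.txt

# Show what was written, one URL per line.
cat urls/seed.txt
```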

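4. Open the conf folder under nutch-0.8.1, find crawl-urlfilter.txt, and locate these two lines:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The second line is a regular expression (the leading + means "accept"), and the URLs you want to crawl must match it. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/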
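Before starting a crawl, it can help to check that the seed URL really matches the filter. The sketch below uses grep -E as a stand-in for the Java regular expressions Nutch applies; for a pattern this simple the two should agree, but that equivalence is my assumption, and check_url is a helper name invented for this example. The pattern is written with the escaped dot (\.) and without the leading +, which is the accept marker rather than part of the expression.

```shell
# Filter pattern from crawl-urlfilter.txt, minus the leading '+'.
pattern='^http://([a-z0-9]*\.)*apache.org/'

# check_url is a hypothetical helper: prints "accept URL" when the URL
# matches the pattern and "reject URL" otherwise.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url 'http://lucene.apache.org/nutch'
check_url 'http://apache.org/'
check_url 'http://www.example.com/'
```

Note that the dot in apache.org is not escaped in the pattern, so it matches any character there; that is harmless for this test crawl but worth knowing.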

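5. OK, now let's build the index for the site. Start Cygwin, which opens a command window, and enter "cd /cygdrive/d/nutch-0.8.1" to switch to the nutch-0.8.1 directory.

6. Run "bin/nutch crawl urls -dir crawled -depth 2 -threads 5 >& crawl.log"

The parameters mean the following (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html):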

-dir dir names the directory to put the crawl in.

-threads threads determines the number of threads that will fetch in parallel.

-depth depth indicates the link depth from the root page that should be crawled.

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

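crawl.log: the log file; ">&" sends both standard output and standard error to it.

After the command finishes, a new crawled folder appears under nutch-0.8.1, containing five folders:

①/② crawldb / linkdb: the web link directories. They store the URLs and the link relationships between them, and serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (configurable in nutch-site.xml, as mentioned later).

③ segments: stores the fetched pages. It is tied to the link depth set above: with depth set to 2, two subfolders named by timestamp are created under segments, for example "20061014163012". Open one and you will see six more subfolders (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html):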

crawl_generate: names a set of urls to be fetched

crawl_fetch: contains the status of fetching each url

content: contains the content of each url

parse_text: contains the parsed text of each url

parse_data: contains outlinks and metadata parsed from each url

crawl_parse: contains the outlink urls, used to update the crawldb

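④ indexes: the index directory. When I ran the crawl, it produced a folder named "part-00000".

⑤ index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-0.8.1/lib, and a brief guide to the Luke tool is at the end). It is the complete index produced by merging all the indexes under indexes. Note that the index only indexes page content and does not store it, so a query has to read the segments directory to get the page content.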
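If a crawl stops early, some of these five folders may be missing, which is one possible reason a later search returns nothing. Here is a rough sketch for checking the layout described above; check_crawl_dir is a helper name made up for this example, and the folder name crawled is simply the value passed to -dir in step 6.

```shell
# check_crawl_dir is a hypothetical helper: for each of the five expected
# subfolders, report whether it exists under the given crawl directory.
check_crawl_dir() {
  crawl_dir="$1"
  for name in crawldb linkdb segments indexes index; do
    if [ -d "$crawl_dir/$name" ]; then
      echo "ok: $name"
    else
      echo "missing: $name"
    fi
  done
}

# Run from the nutch-0.8.1 directory after the crawl has finished.
check_crawl_dir crawled
```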

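7. Run a simple test. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache", which calls the main method of NutchBean to search for the keyword "apache". Cygwin shows the result: Total hits: 29 (hits are roughly the counterpart of results in JDBC).

Note: if the search always returns 0 results, configure nutch-site.xml under nutch-0.8.1/conf with the same content as in step 9 below, then run step 6 again. (Setting depth to 1 in step 6 can also lead to 0 results.)

8. Next we test under Tomcat. nutch-0.8.1 contains nutch-0.8.1.war; copy it to Tomcat/webapps. You can unpack it directly into that directory with WinRAR; I unpacked it after starting Tomcat. The unpacked folder is named nutch.

9. Open nutch-site.xml under nutch/WEB-INF/classes. The two property blocks below are what needs to be added; everything else is the original content of nutch-site.xml.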

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
  <name>http.agent.name</name>
  <value>*</value>
  <description></description>
</property>

<property>
  <name>searcher.dir</name>
  <value>D:/nutch-0.8.1/crawled</value>
  <description></description>
</property>

</configuration>

http.agent.name: required. If this property is removed, queries always return 0 results.

searcher.dir: the path of the crawled folder generated earlier in Cygwin.

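We can also set the re-crawl interval here (step 6 mentioned that pages expire after 30 days by default):

<property>
  <name>fetcher.max.crawl.delay</name>
</property>

Note: judging by the descriptions in nutch-default.xml, fetcher.max.crawl.delay appears to limit the robots.txt Crawl-Delay the fetcher will accept, while the 30-day expiry seems to correspond to db.default.fetch.interval (in days), so the latter is probably the one to change. Check both descriptions before relying on either.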

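Many more parameters can be looked up in nutch-default.xml under nutch-0.8.1/conf. Every property in nutch-default.xml comes with a description; if you are interested, copy them one at a time into Tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml and experiment.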
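Scrolling through nutch-default.xml by hand gets tedious, so here is a small sketch for pulling out one property at a time. It assumes the usual layout of that file, with the name tag on its own line followed by the value and description lines; show_property is a name made up for this example. It relies on the -A option of grep, which GNU grep (the one shipped with Cygwin) supports.

```shell
# show_property is a hypothetical helper: print the <name> line of a
# property plus the three lines after it (normally <value>, the start of
# <description>, and whatever follows).
# -F matches the name literally, so dots are not treated as wildcards.
show_property() {
  grep -F -A 3 "<name>$2</name>" "$1"
}

# Run from the nutch-0.8.1 directory; skipped if the file is not there.
if [ -f conf/nutch-default.xml ]; then
  show_property conf/nutch-default.xml searcher.dir
fi
```

Long descriptions span several lines, so raise the number after -A if the output is cut off.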

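10. Open http://localhost:8081/nutch and enter "apache". You should see "共有 29 项查询结果" (29 results in total), which matches the simple test in step 7.

About Luke:

Luke is a graphical tool for inspecting Lucene index files. It gives a fairly direct view of how the index was built, and it has to be used together with the Lucene jar.

Steps:

1. Download it from http://www.getopt.org/luke, which offers three download options: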

standalone full JAR: lukeall.jar

standalone minimal JAR: lukemin.jar

separate JARs:

    luke.jar (~113kB)

    lucene-1.9-rc1-dev.jar (~380kB)

    analyzers-dev.jar (~348kB)

    snowball-1.1-dev.jar (~88kB)

    js.jar (~492kB)

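We only need luke.jar from the "separate JARs" option.

2. After downloading, create a new folder, say "luke_run", put luke.jar in it, and copy lucene-core-1.9.1.jar from nutch-0.8.1/lib into the same folder.

3. In a cmd window, switch to the "luke_run" directory and enter "java -classpath luke.jar;lucene-core-1.9.1.jar org.getopt.luke.Luke". The Luke GUI opens. Choose "File" ==> "Open Lucene index" and open the "nutch-0.8.1/crawled/index" folder (created in step 6 above); Luke then shows the details of the index.

4. One more aside :) I ran into a problem while using it (it does not occur with lucene-core-1.9.1.jar, so with that jar Luke does not throw this exception): clicking the "Reconstruct&Edit" button under "Documents" always throws an exception: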

Exception in thread "Thread-12" java.lang.NoSuchMethodError: org.apache.lucene.document.Field.<init>(Ljava/lang/String;Ljava/lang/String;ZZZZ)V

    at org.getopt.luke.Luke$2.run(Unknown Source)

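Heh, I was using lucene-core-2.0.0.jar. The signature in the message refers to a Field constructor taking (String, String, boolean, boolean, boolean, boolean), so it looks like that constructor was removed in this version. New releases often bring small problems like this.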

Copyright: (xiaodaoxiaodao) 蓝小刀 http://blog.csdn.net/xiaodaoxiaodao/archive/2006/10/20/1342433.aspx