nutch-0.9

xiajing12345

Article link: https://blog.csdn.net/xiajing12345/article/details/1619311

A Quick Beginner's Guide to Using Nutch 0.9 on Windows

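Nutch is a search engine that I only heard about yesterday from a friend. I had been working with Lucene a while back and was eager to try out search tools, so I spent the weekend experimenting with it, and it felt quite fresh. Most examples online are based on version 0.7, and the few I found for 0.8 would not run. After fumbling around for half a day, here are my notes.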

System environment: Tomcat 6.0.13 / JDK 1.6 / nutch 0.9 / cygwin-cd-release-20060906.iso

Procedure:

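1. Nutch needs a Unix environment to run, so Windows users must first install Cygwin, a free software package that emulates a Unix environment on Windows. You can get the online installer from http://www.cygwin.com/ or download the complete installation image from http://www-inst.eecs.berkeley.edu/~instcd/iso/ (my download was 1.27 GB, so make sure you have enough disk space). During installation, just click "Next" all the way through.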

2. Download nutch 0.9 from http://apache.justdn.org/lucene/nutch/ . I extracted it to D:/nutch-0.9.

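3. Create a folder named urls under nutch-0.9, and inside urls create a text file (any file name will do; I used nutch) containing one line: http://lucene.apache.org/nutch/ . This is the site to be crawled. The URL in urls/nutch must end with a trailing "/".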
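As a minimal sketch, the seed file from step 3 could also be created from the Cygwin prompt. The NUTCH_HOME variable is an assumption introduced here for illustration (it defaults to the current directory); the folder name urls, the file name nutch and the seed URL come from the step above.

```shell
# Assumption: NUTCH_HOME points at the extracted nutch-0.9 directory;
# if it is unset, the current directory is used instead.
NUTCH_HOME="${NUTCH_HOME:-.}"

# Create the seed folder and write a single seed URL (note the trailing "/").
mkdir -p "$NUTCH_HOME/urls"
printf '%s\n' 'http://lucene.apache.org/nutch/' > "$NUTCH_HOME/urls/nutch"

# Show the result so the seed list can be checked by eye.
cat "$NUTCH_HOME/urls/nutch"
```

Further seed URLs, if wanted, would go on additional lines of the same file.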

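4. Open the conf folder under nutch-0.9, find crawl-urlfilter.txt, and locate these two lines:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The second line is a regular expression (the leading "+" marks it as an accept rule), and the URLs you want to crawl must match it. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/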
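To see which URLs such a rule would let through, the pattern can be tried out with grep. This is only a rough sketch: Nutch evaluates the filter with Java regular expressions, and grep -E is used here as a stand-in that behaves the same for this simple pattern. The helper name check_url and the sample URLs other than the seed are made up for illustration, and the leading "+" is left out because it is the filter's accept marker, not part of the regex.

```shell
# The regex part of the accept rule from crawl-urlfilter.txt (without the "+").
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: prints "accept" or "reject" for one URL.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url 'http://lucene.apache.org/nutch/'   # sub-domain of apache.org
check_url 'http://apache.org/'                # bare domain, zero sub-domain parts
check_url 'http://www.example.com/'           # different domain
```

The group ([a-z0-9]*\.)* matches any number of sub-domain labels, which is why both the bare domain and lucene.apache.org pass while an unrelated host does not.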

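Next, edit nutch-site.xml in the conf directory. This file tells the crawled sites who the crawler is; Nutch will not run unless it is configured.

By default the file looks like this: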
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

Below is an example of the file after my changes:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
        <property>
          <name>http.agent.name</name>
          <value>myfirsttest</value>
          <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.

          NOTE: You should also check other related properties:

          http.robots.agents
          http.agent.description
          http.agent.url
          http.agent.email
          http.agent.version

          and set their values appropriately.

          </description>
        </property>

        <property>
          <name>http.agent.description</name>
          <value>myfirsttest</value>
          <description>Further description of our bot- this text is used in
          the User-Agent header. It appears in parenthesis after the agent name.
          </description>
        </property>

        <property>
          <name>http.agent.url</name>
          <value>myfirsttest.com</value>
          <description>A URL to advertise in the User-Agent header. This will
           appear in parenthesis after the agent name. Custom dictates that this
           should be a URL of a page explaining the purpose and behavior of this
           crawler.
          </description>
        </property>

        <property>
          <name>http.agent.email</name>
          <value>test@test.com</value>
          <description>An email address to advertise in the HTTP 'From' request
           header and User-Agent header. A good practice is to mangle this
           address (e.g. 'info at example dot com') to avoid spamming.
          </description>
        </property>

</configuration>
The file above declares the crawler's name, its description, the website it comes from, and a contact email address.

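5. To run under Cygwin, you must add lines to /etc/profile that point JAVA_HOME at the JDK installation directory:
JAVA_HOME=/usr/java/jdk1.6.0_01
export JAVA_HOME
OK, now we can build an index for the site. Start Cygwin, which opens a command window, and enter "cd /cygdrive/d/nutch-0.9" to switch to the nutch-0.9 directory.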
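Before starting the crawl it can help to confirm that JAVA_HOME really points at a JDK. The following is a small sketch of such a check; the function name check_java_home is invented for this example, and it only looks for a java executable under bin, which is an assumption about how the JDK is laid out.

```shell
# Hypothetical helper: reports whether a directory looks like a usable JDK.
# It checks for bin/java, and also bin/java.exe as found on Windows installs.
check_java_home() {
  if [ -n "$1" ] && { [ -x "$1/bin/java" ] || [ -x "$1/bin/java.exe" ]; }; then
    echo "ok"
  else
    echo "missing"
  fi
}

# Check the value that /etc/profile exported (empty if it was never set).
check_java_home "${JAVA_HOME:-}"
```

If this prints "missing", recheck the path written in /etc/profile and open a new Cygwin window so the profile is read again.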

6. Run "bin/nutch crawl urls -dir crawled -depth 3 -topN 50 -threads 10 >& crawl.log"

The parameters mean the following (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html ):

-dir dir names the directory to put the crawl in.

-threads threads determines the number of threads that will fetch in parallel.

-depth depth indicates the link depth from the root page that should be crawled.

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

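crawl.log: the log file

After the command finishes, a new crawled folder appears under nutch-0.9. It contains 5 folders:

(1)/(2) crawldb and linkdb: the web link databases. They store the URLs and the link relationships between them, and serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (this can be configured in nutch-site.xml, as mentioned later).

(3) segments: stores the fetched pages. Its contents depend on the link depth given by depth above: with depth set to 2, two sub-folders named by timestamp, such as "20061014163012", are created under segments. Inside each one you will find 6 more sub-folders (from the Apache site, http://lucene.apache.org/nutch/tutorial8.html ):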

crawl_generate: names a set of urls to be fetched

crawl_fetch: contains the status of fetching each url

content: contains the content of each url

parse_text: contains the parsed text of each url

parse_data: contains outlinks and metadata parsed from each url

crawl_parse: contains the outlink urls, used to update the crawldb
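A quick way to confirm that a crawl produced the layout described above is to walk the segments and report which of the six sub-folders exist. This is an illustrative sketch only: the function name check_segments is made up, and the directory name crawled is the one passed to -dir in step 6.

```shell
# Hypothetical helper: for every segment under <crawl dir>/segments, report
# whether each of the six expected sub-folders is present.
check_segments() {
  crawl_dir="$1"
  for seg in "$crawl_dir"/segments/*/; do
    # Skip the unexpanded glob when there are no segments at all.
    [ -d "$seg" ] || continue
    name=$(basename "$seg")
    for part in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
      if [ -d "$seg$part" ]; then
        echo "$name/$part: present"
      else
        echo "$name/$part: missing"
      fi
    done
  done
}

# Run from the nutch-0.9 directory after the crawl has finished.
check_segments crawled
```

A segment with missing sub-folders usually means that the fetch or parse for that round did not complete, in which case crawl.log is the place to look.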

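(4) indexes: the index directory. My run produced a folder named "part-00000" here.

(5) index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-0.9/lib, and a short guide to the Luke tool is given at the end). It is the complete index obtained by merging all the indexes under indexes. Note that the index only indexes page content and does not store it, so queries have to read the segments directory to retrieve page content.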

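7. Run a simple test. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache", which calls the main method of NutchBean to search for the keyword "apache". Cygwin then shows the result: Total hits: 29 (hits are analogous to a JDBC result set).

Note: if the search always returns 0 hits, you need to configure nutch-site.xml under nutch-0.9/conf with the same settings as in step 9 below (also, setting depth to 1 in step 6 may lead to 0 hits), and then run step 6 again.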

8. Next we test under Tomcat. The nutch-0.9 directory contains nutch-0.9.war. Copy it to Tomcat/webapps; you can extract it there directly with WinRAR, but I let Tomcat unpack it on startup. The extracted folder is named nutch.

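9. Open nutch-site.xml under nutch/WEB-INF/classes and add the two properties shown below inside the configuration element (http.agent.name reuses the agent name from step 4); everything else is the original content of nutch-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>http.agent.name</name>
<value>myfirsttest</value>
</property>

<property>
<name>searcher.dir</name>
<value>D:/nutch-0.9/crawled</value>
</property>

</configuration>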

http.agent.name: required. If this property is removed, queries always return 0 results.

searcher.dir: the path of the crawled directory generated earlier in Cygwin.

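Other settings can be overridden here in the same way, each as a property with a name and a value, for example:

<property>
<name>fetcher.max.crawl.delay</name>
</property>

Note that fetcher.max.crawl.delay limits the Crawl-Delay (in seconds) that the fetcher will accept from a site's robots.txt; it is not the re-crawl interval. The 30-day page expiry mentioned in step 6 is controlled by db.default.fetch.interval, which is listed in nutch-default.xml.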

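Many more parameters can be looked up in nutch-default.xml under nutch-0.9/conf. Every property in nutch-default.xml comes with a description, so if you are interested you can copy them one at a time into Tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml and experiment.

10. Open http://localhost:8081/nutch and search for "apache". The page reports 29 results in total, which matches the simple test in step 7.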

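Introducing Luke:

Luke is a graphical tool for inspecting Lucene index files. It gives a fairly direct view of how the index was built, and it has to be used together with the Lucene jar.

Procedure:

1. Download it from http://www.getopt.org/luke . Three kinds of download are offered: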

standalone full JAR: lukeall.jar

standalone minimal JAR: lukemin.jar

separate JARs:
    luke.jar (~113kB)
    lucene-1.9-rc1-dev.jar (~380kB)
    analyzers-dev.jar (~348kB)
    snowball-1.1-dev.jar (~88kB)
    js.jar (~492kB)

We only need luke.jar from the "separate JARs" group.

2. After downloading, create a new folder, for example "luke_run", put luke.jar in it, and copy lucene-core-1.9.1.jar from nutch-0.9/lib into the same folder.

3. In a cmd window, change to the "luke_run" directory and enter "java -classpath luke.jar;lucene-core-1.9.1.jar org.getopt.luke.Luke". The Luke GUI opens. From "File" ==> "Open Lucene index", open the "nutch-0.9/crawled/index" folder (created in step 6 above), and Luke will show the details of the index.

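4. A side note :) I ran into one problem while using it (it does not occur with lucene-core-1.9.1.jar, so Luke set up as described above will not throw this exception): clicking the "Reconstruct&Edit" button on the "Documents" tab always throws an exception:

Exception in thread "Thread-12" java.lang.NoSuchMethodError: org.apache.lucene.document.Field.<init>(Ljava/lang/String;Ljava/lang/String;ZZZZ)V

at org.getopt.luke.Luke$2.run(Unknown Source)

Heh, I was using lucene-core-2.0.0.jar at the time. Judging by the error, the Field constructor that Luke calls was removed in that version. As so often happens, a new version brings ...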
