nutch搜索引擎架设之抓取模块

最新推荐文章于 2024-08-07 07:30:00 发布

hackerwin7

最新推荐文章于 2024-08-07 07:30:00 发布

阅读量1.7k

点赞数

分类专栏：项目_校园搜索引擎文章标签：搜索引擎 generator eclipse thread character java

本文链接：https://blog.csdn.net/hackerwin7/article/details/8097125

版权

项目_校园搜索引擎专栏收录该内容

1 篇文章 0 订阅

订阅专栏

一、准备工作

jdk：http://www.oracle.com/technetwork/java/javase/downloads/index.html

tomcat：http://tomcat.apache.org/

eclipse：http://www.eclipse.org/downloads/

Cygwin：http://www.cygwin.com/ 下载里面的setup.exe （用于模拟linux环境，nutch 是适应于linux平台的,）

nutch0.9：http://www.kuaipan.cn/file/id_16410726440632359.htm（开源的基于java的搜索引擎）

缺省jar下载：rtf-parser.jar：http://www.kuaipan.cn/file/id_16410726440632361.htm

jid3lib-0.5.1.jar：http://www.kuaipan.cn/file/id_16410726440632360.htm

二、安装

将下面下载到的（除了缺省jar）进行安装

jdk 安装：点击下载下来的安装包进行安装，别忘了配环境变量，这里不再详述

tomcat 安装：安装好以后，启动startup.bat 在浏览器端测试 http://localhost:8080/ 如出现欢迎界面即安装成功，如果安装不成功在cmd 里运行startup.bat 查看错误信息，如果一闪而过，可手动修改startup.bat 将里面的这句 call "%EXECUTABLE%" start %CMD_LINE_ARGS%

修改为 call "%EXECUTABLE%" run %CMD_LINE_ARGS%

然后再在cmd里运行startup根据提供的具体错误信息来处理。

eclipse：这个没什么好说的。

cygwin：按提示一步一步install即可

nutch 0.9：解压到你的工程根目录里

缺省jar：将rtf-parser.jar 放到nutch目录 nutch-0.9\src\plugin\parse-rtf\lib 里面将 jid3lib-0.5.1.jar 放到 nutch-0.9\src\plugin\parse-mp3\lib 里面

三、配置与加载nutch

首先打开Cygwin 会出现一个类似cmd的命令行的窗口，其命令格式与dos几乎差不多，不懂得可以输入help看详细的命令说明

然后进入你的nutch-0.9 目录

然后运行nutch 命令：

bin/nutch

如图所示会出现关于nutch命令的说明即表示nutch运行环境搭建成功。

接着开始导入和配置nutch-0.9：

导入：

首先将nutch-0.9 作为工程项目倒入eclipse中。

打开eclipse，新建java项目，添加nutch-0.9 目录作为项目源

然后不要急着点完成

在Source选项卡配置默认输出文件路径my_build

在Libraries选项卡将nutch目录的conf文件添加进来

在Order and Export选项卡将conf文件置顶

刷新项目工程，这下基本将nutch完整地导入到eclipse中了

配置：

1.修改nutch-default.xml文件

在eclipse打开刚才倒入的项目 -〉 Referenced Libraries -〉conf 找到其中的nutch-default.xml文件

找到<name>plugin.folders</name>这个属性将下面以行的value值修改为<value>./src/plugin</value>

2.配置nutch抓取策略

在nutch-0.9 工程根目录建立weburl.txt 用以存放搜索引擎要抓取内容得网址，里面填入要抓取的网址（例如：http://www.buct.edu.cn/）

然后在conf文件中找到crawl_urlfilter.txt

找到：

# accept hosts in MY.DOMAIN.NAME

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

修改为：

# accept hosts in MY.DOMAIN.NAME

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://www.buct.edu.cn/

即配置仅抓取此网站的内容

最后配置conf里的nutch-site.xml文件

将下面一段内容完全覆盖nutch-site.xml文件中的内容：

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->




<configuration>
<property>
<name>http.agent.name</name>
<value>ahathinkingSpider</value>
<description>My Search Engine</description>
</property>

<property>
<name>http.agent.description</name>
<value>My web</value>
<description>Further description of our bot- this text is used in
User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
<name>http.agent.url</name>
<value>ahathinking.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior. of this
</description>
</property>

<property>
<name>http.agent.email</name>
<value>yx.coding@gmail.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

</configuration>

3.配置运行参数：

选取 nutch 项目，在运行栏里选择 Run Configurations... 修改相关运行参数

修改如下：

注意修改完后点击Apply

最后点击Run运行

crawl started in: Local
rootUrlDir = weburl.txt
threads = 5
depth = 5
topN = 100
Injector: starting
Injector: crawlDb: Local/crawldb
Injector: urlDir: weburl.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: Local/segments/20121022110809
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: Local/segments/20121022110809
Fetcher: threads: 5
fetching http://www.buct.edu.cn/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: Local/crawldb
CrawlDb update: segments: [Local/segments/20121022110809]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: Local/segments/20121022110815
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: Local/segments/20121022110815
Fetcher: threads: 5
fetching http://www.buct.edu.cn/kxyj/index.htm
fetching http://www.buct.edu.cn/zjbh/ldjt/index.htm
fetching http://www.buct.edu.cn/bhxy/index.htm
fetching http://www.buct.edu.cn/zjbh/xycf/index.htm
fetching http://www.buct.edu.cn/zjbh/lrld/index.htm
fetching http://www.buct.edu.cn/xywh/index.htm
fetching http://www.buct.edu.cn/tjtp/25314.htm
fetching http://www.buct.edu.cn/xysz/index.htm
fetching http://www.buct.edu.cn/zsjy/index.htm
fetching http://www.buct.edu.cn/kslj/23583.htm
fetching http://www.buct.edu.cn/js/jquery.flexslider-min.js
fetching http://www.buct.edu.cn/rcpy/index.htm
............

出现上述信息即表示nutch的抓取模块已架设成功。

备注：

nutch架设常见错误：

1.1.1 Crawl抓取出现hadoop出错提示

配置完成nutch在cygwin中运行nutch的crawl命令时：

[Fatal Error] hadoop-site.xml:15:7: The content of elements must consist of well

-formed character data or markup.

Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseExcep

tion: The content of elements must consist of well-formed character data or mark

up.

问题解决：

hadoop-site.xml、hadoop-site.xml：其中一个标签</property>前面多了一个尖括号

1.1.2 运行crawl报错Job failed

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java

:439)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

问题解决：

此多为crawl-urlfilter.txt：MY.DOMAIN.NAME的修改不正确

1.1.3 又一个Job failed

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java

:439)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

问题解决：

多为crawl-urlfilter.txt的MY.DOMAIN.NAME修改不正确

1.1.4 Eclipse中运行nutch：Job failed

Exception in thread "main" java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

at org.apache.nutch.crawl.Injector.inject(Injector.java:162)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

问题解决：

此问题是eclipse的java版本设置问题，解决方法：

如原来使用java1.4，需要改为1.6

project-》properties-》java compiler

右 jdk compliance

compiler compliance level：改为6.0

1.1.5当修改配置文件指定索引库.( nutch-site.xml）

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
         <name>http.agent.name</name>
         <value>HD nutch agent</value>
     </property>
     <property>
       <name>http.agent.version</name>
       <value>1.0</value>
      </property>
   <property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
</configuration>

复制上述配置文件时，如果出现下列错误，是因为复制文件时带有空格或编码格式，重敲一遍即可: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence

总的来说常见错误多见于配置文件配置不当……

参考文章：

http://www.ahathinking.com/archives/140.html

http://hi.baidu.com/orminknckdbkntq/item/020a8bc6e4d275000bd93a2d

hackerwin7

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
nutch搜索引擎架设之抓取模块

一、准备工作jdk：http://www.oracle.com/technetwork/java/javase/downloads/index.htmltomcat：http://tomcat.apache.org/eclipse：http://www.eclipse.org/downloads/Cygwin：http://www.cygwin.com/ 下载里面的setup.e
复制链接

扫一扫