nutch搜索引擎架设 之 抓取模块

一、准备工作

jdk:http://www.oracle.com/technetwork/java/javase/downloads/index.html

tomcat:http://tomcat.apache.org/

eclipse:http://www.eclipse.org/downloads/

Cygwin:http://www.cygwin.com/  下载里面的setup.exe (用于模拟linux环境,nutch 是适应于linux平台的,)

nutch0.9:http://www.kuaipan.cn/file/id_16410726440632359.htm(开源的基于java的搜索引擎)

缺省jar下载:rtf-parser.jar:http://www.kuaipan.cn/file/id_16410726440632361.htm

jid3lib-0.5.1.jar:http://www.kuaipan.cn/file/id_16410726440632360.htm

二、安装

将下面下载到的(除了缺省jar)进行安装

jdk 安装:点击下载下来的安装包进行安装,别忘了配环境变量,这里不再详述

tomcat 安装:安装好以后,启动startup.bat  在浏览器端测试  http://localhost:8080/  如出现欢迎界面即安装成功,如果安装不成功  在cmd 里运行startup.bat 查看错误信息,如果一闪而过,可手动修改startup.bat     将里面的这句  call "%EXECUTABLE%" start %CMD_LINE_ARGS%  

 修改为  call "%EXECUTABLE%" run %CMD_LINE_ARGS%

然后再在cmd里运行startup根据提供的具体错误信息来处理。

eclipse:这个没什么好说的。

cygwin:按提示一步一步install即可

nutch 0.9:解压到你的工程根目录里

缺省jar:将rtf-parser.jar 放到nutch目录  nutch-0.9\src\plugin\parse-rtf\lib 里面  将 jid3lib-0.5.1.jar 放到  nutch-0.9\src\plugin\parse-mp3\lib 里面

三、配置与加载nutch

首先打开Cygwin 会出现一个类似cmd的命令行的窗口,其命令格式与dos几乎差不多,不懂得可以输入help看详细的命令说明

然后进入你的nutch-0.9 目录

然后运行nutch 命令:

bin/nutch

如图所示会出现关于nutch命令的说明 即表示nutch运行环境搭建成功。


接着开始导入和配置nutch-0.9:


导入:

首先将nutch-0.9 作为工程项目倒入eclipse中。

打开eclipse,新建java项目,添加nutch-0.9 目录作为项目源

然后不要急着点完成

在Source选项卡 配置默认输出文件路径my_build

在Libraries选项卡将nutch目录的conf文件添加进来

在Order and Export选项卡 将conf文件置顶


刷新项目工程,这下基本将nutch完整地导入到eclipse中了


配置:


1.修改nutch-default.xml文件

在eclipse打开刚才倒入的项目 -〉 Referenced Libraries -〉conf 找到其中的nutch-default.xml文件




找到<name>plugin.folders</name>这个属性  将下面以行的value值修改为<value>./src/plugin</value>


2.配置nutch抓取策略

在nutch-0.9 工程根目录建立weburl.txt 用以存放搜索引擎要抓取内容得网址,里面填入要抓取的网址(例如:http://www.buct.edu.cn/)

然后在conf文件中找到crawl_urlfilter.txt   

找到  :

# accept hosts in MY.DOMAIN.NAME

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

修改为:


# accept hosts in MY.DOMAIN.NAME

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://www.buct.edu.cn/


即配置仅抓取此网站的内容


最后配置conf里的nutch-site.xml文件


将下面一段内容完全覆盖nutch-site.xml文件中的内容:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->




<configuration>
<property>
<name>http.agent.name</name>
<value>ahathinkingSpider</value>
<description>My Search Engine</description>
</property>

<property>
<name>http.agent.description</name>
<value>My web</value>
<description>Further description of our bot- this text is used in
User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
<name>http.agent.url</name>
<value>ahathinking.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior. of this
</description>
</property>

<property>
<name>http.agent.email</name>
<value>yx.coding@gmail.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

</configuration>

3.配置运行参数:


选取 nutch 项目,在运行栏里选择 Run Configurations... 修改相关运行参数


修改如下:


注意修改完后点击Apply




注意修改完后点击Apply




最后点击Run运行


crawl started in: Local
rootUrlDir = weburl.txt
threads = 5
depth = 5
topN = 100
Injector: starting
Injector: crawlDb: Local/crawldb
Injector: urlDir: weburl.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: Local/segments/20121022110809
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: Local/segments/20121022110809
Fetcher: threads: 5
fetching http://www.buct.edu.cn/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: Local/crawldb
CrawlDb update: segments: [Local/segments/20121022110809]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: Local/segments/20121022110815
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: Local/segments/20121022110815
Fetcher: threads: 5
fetching http://www.buct.edu.cn/kxyj/index.htm
fetching http://www.buct.edu.cn/zjbh/ldjt/index.htm
fetching http://www.buct.edu.cn/bhxy/index.htm
fetching http://www.buct.edu.cn/zjbh/xycf/index.htm
fetching http://www.buct.edu.cn/zjbh/lrld/index.htm
fetching http://www.buct.edu.cn/xywh/index.htm
fetching http://www.buct.edu.cn/tjtp/25314.htm
fetching http://www.buct.edu.cn/xysz/index.htm
fetching http://www.buct.edu.cn/zsjy/index.htm
fetching http://www.buct.edu.cn/kslj/23583.htm
fetching http://www.buct.edu.cn/js/jquery.flexslider-min.js
fetching http://www.buct.edu.cn/rcpy/index.htm
............


出现上述信息即表示nutch的抓取模块已架设成功。




备注:

nutch架设 常见错误:

1.1.1   Crawl抓取出现hadoop出错提示

配置完成nutch在cygwin中运行nutch的crawl命令时:

[Fatal Error] hadoop-site.xml:15:7: The content of elements must consist of well

-formed character data or markup.

Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseExcep

tion: The content of elements must consist of well-formed character data or mark

up.

问题解决:

hadoop-site.xml、hadoop-site.xml:其中一个标签</property>前面多了一个尖括号

1.1.2   运行crawl报错Job failed

Exception in thread "main" java.io.IOException: Job failed!

        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java

:439)

        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

问题解决:

此多为crawl-urlfilter.txt:MY.DOMAIN.NAME的修改不正确

1.1.3   又一个Job failed

Exception in thread "main" java.io.IOException: Job failed!

        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java

:439)

        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

问题解决:

多为crawl-urlfilter.txt的MY.DOMAIN.NAME修改不正确

1.1.4   Eclipse中运行nutch:Job failed

Exception in thread "main" java.io.IOException: Job failed!

       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

       at org.apache.nutch.crawl.Injector.inject(Injector.java:162)

       at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

问题解决:

此问题是eclipse的java版本设置问题,解决方法:

如原来使用java1.4,需要改为1.6

project-》properties-》java compiler

右 jdk compliance

compiler compliance level:改为6.0

1.1.5当修改配置文件指定索引库.( nutch-site.xml)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
         <name>http.agent.name</name>
         <value>HD nutch agent</value>
     </property>
     <property>
       <name>http.agent.version</name>
       <value>1.0</value>
      </property>
   <property> 
<name>urlfilter.order</name> 
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value> 
<description>The order by which url filters are applied. 
If empty, all available url filters (as dictated by properties 
plugin-includes and plugin-excludes above) are loaded and applied in system 
defined order. If not empty, only named filters are loaded and applied 
in given order. For example, if this property has value: 
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter 
then RegexURLFilter is applied first, and PrefixURLFilter second. 
Since all filters are AND'ed, filter ordering does not have impact 
on end result, but it may have performance implication, depending 
on relative expensiveness of filters. 
</description> 
</property>
</configuration>

复制上述配置文件时,如果出现下列错误,是因为复制文件时带有空格或编码格式,重敲一遍即可: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence



总的来说常见错误多见于配置文件配置不当……




参考文章:

http://www.ahathinking.com/archives/140.html


http://hi.baidu.com/orminknckdbkntq/item/020a8bc6e4d275000bd93a2d













  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值