Setting up Nutch

1.1 Download and install the Java JDK 1.7.0
from: http://www.oracle.com/
Install directory: C:\Program Files (x86)\Java\jdk1.7.0
1.2 Set the environment variables
JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0
classpath=.;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar
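The Nutch script commands later in this article are run from Cygwin's bash shell (installed in section 3), where the same variables can be exported. A minimal sketch, assuming the install directory above; the `/cygdrive/c/...` form is Cygwin's view of the `C:` drive, and the `;` classpath separator is what the Windows JVM expects:

```shell
# Hypothetical lines for a Cygwin ~/.bashrc; the JDK path mirrors the
# install directory chosen above -- adjust it if yours differs.
export JAVA_HOME="/cygdrive/c/Program Files (x86)/Java/jdk1.7.0"
export CLASSPATH=".;$JAVA_HOME/lib/dt.jar;$JAVA_HOME/lib/tools.jar"
echo "$JAVA_HOME"
```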
1.3 Test the installation
java -version

2. Tomcat 7.0
2.1 Download Tomcat 7.0
from: http://tomcat.apache.org/

2.2 Unzip it to the C: drive and rename the directory
Install directory: C:\tomcat7

3. Install Cygwin
from: http://www.cygwin.cn/
The script commands that ship with Nutch need a Linux-like environment, so Cygwin must be installed first to emulate one.

Copy the cygwin folder to a suitable location, such as the root of the C: drive, then open the folder. (Screenshot omitted.)

Double-click the setup icon to start the installation. (Screenshot omitted.)

Click "Next". (Screenshot omitted.)

"C:\cygwin" is the location the installed files are placed in. (Screenshot omitted.)

Click "Finish"; the installation is complete.

4. Running Nutch

4.1 Create a seed-URL directory: mkdir -p urls

Inside this directory, create a file named url and put some URLs in it, for example:

http://nutch.apache.org/
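The step above can be sketched as two shell commands, run from the nutch-1.3 root (the file name url is the one used in this article):

```shell
# Create the seed directory and put one seed URL in a file named "url".
mkdir -p urls
echo "http://nutch.apache.org/" > urls/url
cat urls/url   # prints: http://nutch.apache.org/
```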

 

4.2 Then run the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Here -dir names the output directory, -depth is how many link levels to crawl from the seed URLs, and -topN limits the number of top-scoring pages fetched in each round.



 

Note that the command above does not build an index. To also index the fetched data, run:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

(Screenshots omitted.)

4.3 The Nutch crawl workflow

4.3.1 Initialize the crawlDb and inject the seed URLs

bin/nutch inject
Usage: Injector <crawldb> <url_dir>

Running this command on my machine produces the following output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
Injector: starting at 2011-08-22 10:50:01
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03

4.3.2 Generate a new list of URLs to fetch

bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

Output on my machine:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb/ db/segments
Generator: starting at 2011-08-22 10:52:41
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: db/segments/20110822105243   // a new segment is created here
Generator: finished at 2011-08-22 10:52:44, elapsed: 00:00:03

4.3.3 Fetch the URLs generated above

bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]

Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-22 10:56:07
Fetcher: segment: db/segments/20110822105243
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit: 0
fetching http://www.baidu.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02

Let's look at the structure of this segment directory:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate

4.3.4 Parse the fetched results

bin/nutch parse
Usage: ParseSegment segment

Output on my machine:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
ParseSegment: starting at 2011-08-22 10:58:19
ParseSegment: segment: db/segments/20110822105243
ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02

Now look at the directory structure again after parsing:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

Three new directories produced by parsing have appeared.

4.3.5 Update the crawl database

bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]

Output on my machine:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
CrawlDb update: starting at 2011-08-22 11:00:09
CrawlDb update: db: db/crawldb
CrawlDb update: segments: [file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01

At this point the crawldb link database has been updated; here it is stored on the file system. By contrast, the link database of Taobao's crawler is kept in Redis, a key-value NoSQL database.

4.3.6 Compute inverted links

bin/nutch invertlinks
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]

Local output:

lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
LinkDb: starting at 2011-08-22 11:02:49
LinkDb: linkdb: db/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01
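The single crawl command from section 4.2 is essentially a loop over the individual steps above. The sketch below only lays out the command sequence as strings (Nutch itself is not invoked here); "$segment" stands for the segment directory that bin/nutch generate prints, shown as a placeholder:

```shell
# The individual steps of sections 4.3.1-4.3.6, in order. Run each line
# from the nutch-1.3 root; replace the segment placeholder with the
# directory name that "bin/nutch generate" actually prints.
segment='db/segments/<timestamp>'
steps=(
  "bin/nutch inject db/crawldb urls/"
  "bin/nutch generate db/crawldb/ db/segments"
  "bin/nutch fetch $segment"
  "bin/nutch parse $segment"
  "bin/nutch updatedb db/crawldb/ -dir db/segments/"
  "bin/nutch invertlinks db/linkdb -dir db/segments/"
)
printf '%s\n' "${steps[@]}"
```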

