1.1 Download and install the Java JDK 1.7.0
from: http://www.oracle.com/
Install directory: C:\Program Files (x86)\Java\jdk1.7.0
1.2 Set the environment variables
JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0
CLASSPATH=.;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar
1.3 Test the installation
java -version
2. Tomcat 7.0
2.1 Download Tomcat 7.0
from: http://tomcat.apache.org/
2.2 Unpack it to the C: drive and rename the directory
Install directory: C:\tomcat7
3. Install Cygwin
from: http://www.cygwin.cn/
The scripts that ship with Nutch require a Linux-like environment, so Cygwin must be installed first to provide one.
Copy the cygwin folder to a suitable location, such as the root of the C: drive, then open the folder.
Double-click the installer icon to start the installation.
Click "Next".
"C:\cygwin" is the directory where Cygwin will be installed.
Click "Finish" to complete the installation.
4. Running a Nutch crawl
4.1 Create a seed directory: mkdir -p urls
In this directory create a url file listing a few seed URLs, e.g.:
http://nutch.apache.org/
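The two steps above can be sketched as a short shell session. The file name seed.txt is an assumption made here; Nutch reads every file placed under the urls/ directory:

```shell
# Create the seed directory and write one URL per line into a seed file.
# The name seed.txt is arbitrary; any file under urls/ will be read.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://nutch.apache.org/
EOF
```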
4.2 Then run the following command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Note that this does not build an index. To also index the fetched data, run:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
(Screenshots omitted.)
4.3 Nutch's crawl workflow
4.3.1 Initialize the crawlDb and inject the seed URLs
bin/nutch inject
Usage: Injector <crawldb> <url_dir>
Running this command locally gives the following output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
Injector: starting at 2011-08-22 10:50:01
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03
4.3.2 Generate a new list of URLs to fetch
bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb/ db/segments
Generator: starting at 2011-08-22 10:52:41
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: db/segments/20110822105243 // a new segment is created here
Generator: finished at 2011-08-22 10:52:44, elapsed: 00:00:03
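As the output shows, each generate run creates a segment named after a timestamp, and the later fetch and parse steps need that name. Here is a small helper sketch for picking up the newest segment; it assumes segment names sort chronologically, which timestamp names do:

```shell
# Return the most recent segment under the given segments directory.
# Timestamp-based names (e.g. 20110822105243) sort chronologically,
# so the lexicographically largest entry is the newest one.
latest_segment() {
  ls -d "$1"/* | sort | tail -n 1
}
# Typical use after generate:
#   seg=$(latest_segment db/segments)
#   bin/nutch fetch "$seg"
```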
4.3.3 Fetch the generated URLs
bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-22 10:56:07
Fetcher: segment: db/segments/20110822105243
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.baidu.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02
Now look at the segment's directory structure:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content crawl_fetch crawl_generate
4.3.4 Parse the fetched content
bin/nutch parse
Usage: ParseSegment segment
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
ParseSegment: starting at 2011-08-22 10:58:19
ParseSegment: segment: db/segments/20110822105243
ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02
Look at the directory structure again after parsing:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content crawl_fetch crawl_generate crawl_parse parse_data parse_text
Three new directories produced by the parse step have appeared.
4.3.5 Update the crawl database
bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
CrawlDb update: starting at 2011-08-22 11:00:09
CrawlDb update: db: db/crawldb
CrawlDb update: segments: [file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01
This updates the crawldb link database, which here is stored in the file system. For comparison, Taobao's crawler keeps its link database in redis, a key-value NoSQL store.
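To make the key-value idea concrete, here is a toy sketch that maps URLs to crawl states using one file per key. It only mimics the put/get interface a store like redis would give a crawler; every name in it (kv_put, kv_get, the db_fetched state string) is invented for illustration:

```shell
# Toy key-value store: each key (a URL) is hashed with the POSIX cksum
# utility to obtain a safe file name, and the value is that file's content.
kv_dir=kvdb
mkdir -p "$kv_dir"
kv_key() { printf '%s' "$1" | cksum | cut -d' ' -f1; }
kv_put() { printf '%s' "$2" > "$kv_dir/$(kv_key "$1")"; }
kv_get() { cat "$kv_dir/$(kv_key "$1")"; }

kv_put "http://nutch.apache.org/" "db_fetched"
kv_get "http://nutch.apache.org/"   # prints db_fetched
```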
4.3.6 Compute inverted links
bin/nutch invertlinks
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
LinkDb: starting at 2011-08-22 11:02:49
LinkDb: linkdb: db/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01
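Sections 4.3.1 through 4.3.6 form one crawl cycle; a deeper crawl simply repeats the generate/fetch/parse/updatedb steps. The sketch below writes the planned command sequence for three rounds to a file instead of executing it (the paths match the examples above; swap the echo-style run helper for a real bin/nutch invocation to actually execute the cycle):

```shell
# Dry-run plan of the crawl cycle: record each bin/nutch command in
# crawl_plan.txt rather than executing it.
plan=crawl_plan.txt
: > "$plan"
run() { printf 'bin/nutch %s\n' "$*" >> "$plan"; }

run inject db/crawldb urls/
round=1
while [ "$round" -le 3 ]; do
  run generate db/crawldb db/segments -topN 5
  # In a real run, pick up the segment that generate just created, e.g.:
  #   seg=db/segments/$(ls db/segments | sort | tail -n 1)
  seg="db/segments/<timestamp>"
  run fetch "$seg"
  run parse "$seg"
  run updatedb db/crawldb "$seg"
  round=$((round + 1))
done
run invertlinks db/linkdb -dir db/segments
```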