Nutch Study Notes, Part 3

Steps for crawling web pages with Nutch

1. Create the seed URL list

http://www.qq.com/
http://www.sina.com.cn/
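
Nutch reads every file inside the directory passed to the inject command, one URL per line, so the list above just needs to be saved as a plain-text file in a directory such as urls/. A minimal sketch (the file name seed.txt is arbitrary):

mkdir -p urls
cat > urls/seed.txt <<EOF
http://www.qq.com/
http://www.sina.com.cn/
EOF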

2. Inject the seed URLs into the Nutch crawldb

hadoop@slave5:~/nutch$ nutch inject crawl/crawldb urls/
Injector: starting at 2013-07-14 17:19:07
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 2
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-14 17:19:10, elapsed: 00:00:02
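
The log reports how many URLs survived normalization and the URL filters (by default the filter rules live in conf/regex-urlfilter.txt). To verify what actually ended up in the crawldb, the readdb tool can print basic statistics, e.g.:

bin/nutch readdb crawl/crawldb -stats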

3. Generate a fetch list so the content can be fetched and parsed

hadoop@slave5:~/nutch$ nutch generate crawl/crawldb crawl/segments
Generator: starting at 2013-07-14 17:19:14
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130714171917
Generator: finished at 2013-07-14 17:19:18, elapsed: 00:00:03

The command above creates a new segment directory under crawl/segments, which holds the URLs that are due to be fetched.
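
Each generate run creates one timestamped directory under crawl/segments, which a plain listing confirms:

ls crawl/segments/
# expected here: 20130714171917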

4. The following steps take the newest segment directory as a parameter, so store it in the environment variable SEGMENT

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

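Here ls -tr sorts the segment directories by modification time, oldest first, so tail -1 picks the most recently generated one. A quick sanity check:

echo $SEGMENT
# should print crawl/segments/20130714171917 for this run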

5. Run the fetcher to actually start fetching content

nutch fetch $SEGMENT -noParsing

6. Parse the fetched content

bin/nutch parse $SEGMENT

hadoop@slave5:~/nutch$ bin/nutch fetch $SEGMENT -noParsing
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-07-14 17:20:18
Fetcher: segment: crawl/segments/20130714171917
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.qq.com/ (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.sina.com.cn/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-07-14 17:20:21, elapsed: 00:00:02
hadoop@slave5:~/nutch$ nutch parse $SEGMENT
ParseSegment: starting at 2013-07-14 17:21:35
ParseSegment: segment: crawl/segments/20130714171917
Parsed (10ms):http://www.qq.com/
http://www.sina.com.cn/ skipped. Content of size 135311 was truncated to 65536
ParseSegment: finished at 2013-07-14 17:21:37, elapsed: 00:00:01
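
Two things in the log above are worth noting. The warning about http.agent.name means the crawler's agent name should also be listed first in the http.robots.agents property, and http://www.sina.com.cn/ was skipped by the parser because its content exceeded the default 64 KB download limit (the http.content.limit property, 65536 bytes). Both properties can be overridden in conf/nutch-site.xml; a sketch with illustrative values only:

<property>
  <name>http.agent.name</name>
  <value>MyNutchSpider</value> <!-- hypothetical agent name, pick your own -->
</property>
<property>
  <name>http.content.limit</name>
  <value>262144</value> <!-- raises the 64 KB default; a negative value disables truncation -->
</property>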

7. Update the Nutch crawldb

The updatedb command takes the new URLs obtained by fetching and parsing the latest segment in the two steps above and merges them into the Nutch crawldb, so they can be crawled in later rounds. Besides the URLs themselves, Nutch also records each page's fetch state, which prevents the same URLs from being fetched over and over again.

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

At this point a complete crawl cycle is finished; you can repeat the cycle several more times to crawl more content (a minimal loop sketch follows the updatedb log below).

hadoop@slave5:~/nutch$ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
CrawlDb update: starting at 2013-07-14 17:33:55
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130714171917]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-07-14 17:33:56, elapsed: 00:00:01
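
As noted above, generate, fetch, parse and updatedb form one crawl cycle. A minimal sketch for running a few more cycles from the shell, assuming the same crawl/ layout as above (the number of rounds is arbitrary):

for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done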

8. Build the link database (invert the links)

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

hadoop@slave5:~/nutch$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2013-07-14 17:37:43
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/hadoop/apache-nutch-1.7/crawl/segments/20130714171917
LinkDb: finished at 2013-07-14 17:37:44, elapsed: 00:00:01
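
To check what the link inversion produced, the readlinkdb tool can dump the linkdb (the inlinks recorded for each URL) to plain text; the output directory name below is arbitrary:

bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump
cat linkdb_dump/part-* | head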

Step where the error occurred

9. Index the content of all segments into Solr

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Now all the content fetched by Nutch has been indexed by Solr, and you can run queries through the Solr admin interface:

http://127.0.0.1:8983/solr/admin
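
If the indexing step succeeds, the documents can also be queried directly over Solr's HTTP API; a sketch assuming the default core and the url/title/content fields from the Nutch schema.xml:

curl 'http://127.0.0.1:8983/solr/select?q=content:news&fl=url,title&wt=json'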






