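N.B. The most important part of this article is the comparison in Section 3!
1. Overview of the Nutch crawl process
1.1 Crawl process diagram
1.2 Crawl process steps
2. Crawling step by step with commands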
Round 1
2.0 Seed URL
<!--
Location:
~/Nutch/data/urls/seed.txt
-->
http://sannahkvist.se/
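The seed URL is a small, fresh-looking Swedish photography site. Another reason for choosing it is that the site is simple and clean, without a lot of junk external links.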
2.1 Inject
This step injects the seed URLs into the data store named crawlId + 'webpage'. Each record starts out as the bare URL plus various default values.
2.2 Generate
marker _gnmrk_ : 1440144780-494854742
batchId: 1440144780-494854742
Removed since the previous step
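The batchId written by generate looks like an epoch timestamp in seconds followed by a numeric suffix. The sketch below splits it under that assumption. `split_batch_id` is a hypothetical helper name, and the reading of the two parts is inferred from the value shown here, not confirmed by this article.

```python
from datetime import datetime, timezone


def split_batch_id(batch_id: str):
    """Split 'seconds-suffix' into (UTC datetime, suffix); layout is assumed."""
    seconds, suffix = batch_id.split("-", 1)
    created = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return created, int(suffix)


created, suffix = split_batch_id("1440144780-494854742")
print(created.isoformat(), suffix)
```

Read this way, the first part falls between the inject and fetch timestamps of this round, which is where a generate step would sit.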
2.3 Fetch
Changed since the previous step
status: 2 (status_fetched) [0(null)]
fetchTime: 1440144966010 [1440144429269]
prevFetchTime: 1440144429269 [0]
protocolStatus: SUCCESS, args=[] [(null)]
Added since the previous step
marker _ftcmrk_ : 1440144780-494854742
metadata _rs_ : ####
header: [xxx]
contentType: text/html
content:start:
[xxx]
content:end:
Removed since the previous step
metadata _csh_ : ?�##
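The fetchTime and prevFetchTime values read naturally as epoch milliseconds, and the fetchInterval of 2592000 reported for these rows reads as seconds. The following sketch decodes them under those assumptions; `from_millis` is a hypothetical helper, and the units are inferred from the magnitudes, not stated in the dump.

```python
from datetime import datetime, timedelta, timezone


def from_millis(ms: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime without float rounding."""
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return base + timedelta(milliseconds=ms % 1000)


# Values in brackets in the dump are the pre-fetch values.
before = {"fetchTime": 1440144429269, "prevFetchTime": 0}
after = {"fetchTime": 1440144966010, "prevFetchTime": 1440144429269}

# Fetch moves the old fetchTime into prevFetchTime.
rotated = after["prevFetchTime"] == before["fetchTime"]
interval = timedelta(seconds=2592000)

print(from_millis(after["fetchTime"]).isoformat(), rotated, interval.days)
```

Under this reading the fetch interval is 30 days, and the fetch happened about nine minutes after the inject.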
2.4 Parse
Changed since the previous step
parseStatus: success/ok (1/0), args=[] [(null)]
title: Sannah Kvist [null]
Added since the previous step
signature: 261c653e067e097acc6dd5dc68072e91
marker __prsmrk__ : 1440144780-494854742
metadata [xxx]
outlink: http://sannahkvist.se/
text:start:
[xxx]
text:end:
Removed since the previous step
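The signature that parse adds is 32 hexadecimal characters, the length of an MD5 digest. A minimal sketch, assuming the signature is an MD5 hash over the page content; which bytes actually get hashed is an assumption here, and `content_signature` is a hypothetical name.

```python
import hashlib


def content_signature(content: bytes) -> str:
    """Return a 32-character hex digest of the content (assumed MD5-based)."""
    return hashlib.md5(content).hexdigest()


page = b"<html><title>Sannah Kvist</title></html>"  # stand-in content
sig = content_signature(page)
print(len(sig), sig == content_signature(page))
```

A digest like this lets two fetches of the same page be compared cheaply: identical content yields an identical signature.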
2.5 Updatedb
Changed since the previous step
marker _updmrk_ : 1440144780-494854742
metadata _csh_ : ####
inlink: http://sannahkvist.se/
http://sannahkvist.se/commissioned/ key: se.sannahkvist:http/commissioned/
baseUrl: null
status: 1 (status_unfetched) [compare to inject: 0 (null)]
fetchTime: 1440146083913 [compare to inject: 1440144429269]
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0 [compare to inject: 0]
[compare to inject: marker _injmrk_ : y]
marker dist : 1 [compare to inject: 0]
reprUrl: null
metadata _csh_ : ####
inlink: http://sannahkvist.se/ commissioned [compare to inject: new]
Removed since the previous step
marker __prsmrk__ : 1440144780-494854742
marker _gnmrk_ : 1440144780-494854742
marker _ftcmrk_ : 1440144780-494854742
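The row keys in these dumps (for example se.sannahkvist:http/commissioned/) look like the URL with its host labels reversed, followed by the scheme and the path. The sketch below rebuilds a key on that assumption; `reversed_key` is a hypothetical helper, and the handling of ports and query strings is a guess that the dumps here do not exercise.

```python
from urllib.parse import urlsplit


def reversed_key(url: str) -> str:
    """Build 'reversed.host:scheme[:port]/path' from a URL (assumed layout)."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:  # explicit port only; not seen in the dumps
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


print(reversed_key("http://sannahkvist.se/commissioned/"))
```

Reversing the host keeps all pages of one domain next to each other when rows are sorted by key.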
2.6 Solrindex
2.6.1 The solrindex command
Changed since the previous step
Added since the previous step
marker _idxmrk_ : 1440144780-494854742
Removed since the previous step
Round 2
2.2 generate
2.3 fetch
lhd@master:~/Nutch/apache-nutch-2.3/runtime/local/bin$ ./nutch fetch -all -crawlId photo
FetcherJob: starting at 2015-08-21 21:52:26
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 3 records. Hit by time limit :0
fetching http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ (queue crawldelay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 10 10 kb/s, 2 URLs in 1 queues
* queue: http://sannahkvist.se
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1440165160919
now = 1440165157995
0. http://sannahkvist.se/commissioned/
1. http://sannahkvist.se/commissioned/flickan/
fetching http://sannahkvist.se/commissioned/ (queue crawldelay=5000ms)
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0 pages/s, 9 7 kb/s, 1 URLs in 1 queues
* queue: http://sannahkvist.se
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1440165167259
now = 1440165162997
0. http://sannahkvist.se/commissioned/flickan/
fetching http://sannahkvist.se/commissioned/flickan/ (queue crawldelay=5000ms)
-finishing thread FetcherThread5, activeThreads=9
-finishing thread FetcherThread8, activeThreads=8
-finishing thread FetcherThread6, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread1, activeThreads=5
-finishing thread FetcherThread3, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread9, activeThreads=1
0/1 spinwaiting/active, 2 pages, 0 errors, 0.1 0 pages/s, 6 0 kb/s, 0 URLs in 1 queues
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 3 pages, 0 errors, 0.2 0 pages/s, 8 13 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-08-21 21:52:53, time elapsed: 00:00:27
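The two queue dumps in this log show why ten threads still fetch one page at a time: the single host queue has maxThreads = 1 and crawlDelay = 5000, so the other threads spin-wait. A rough sketch that pulls nextFetchTime and now out of the dump and computes how long the queue still has to wait; reading the difference as "remaining wait in milliseconds" is my interpretation of the field names, and `remaining_waits` is a hypothetical helper.

```python
import re

# The relevant lines copied from the two queue dumps in the log above.
LOG = """\
crawlDelay = 5000
nextFetchTime = 1440165160919
now = 1440165157995
crawlDelay = 5000
nextFetchTime = 1440165167259
now = 1440165162997
"""


def remaining_waits(log: str):
    """Pair each nextFetchTime with the following now and subtract them."""
    nexts = [int(v) for v in re.findall(r"nextFetchTime\s*=\s*(\d+)", log)]
    nows = [int(v) for v in re.findall(r"\bnow\s*=\s*(\d+)", log)]
    return [n - t for n, t in zip(nexts, nows)]


waits = remaining_waits(LOG)
print(waits)
```

Both differences stay below the 5000 ms crawl delay, which fits a queue that is partway through its politeness pause each time the status is printed.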
2.4 parse
2.5 updatedb
2.6 solrindex
Round 3
……
!!!3. Comparison
Round 1
Table 1 | Inject | Generate | Fetch | Parse | Updatedb | Solrindex |
key: | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ |
baseUrl: | null | null | null | null | null | null |
status: | 0 (null) | 0 (null) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) |
fetchTime: | 1440144429269 | 1440144429269 | 1440144966010 | 1440144966010 | 1440144966010 | 1440144966010 |
prevFetchTime: | 0 | 0 | 1440144429269 | 1440144429269 | 1440144429269 | 1440144429269 |
fetchInterval: | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 | 0 | 0 | 0 | 0 |
modifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
prevModifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
protocolStatus: | (null) | (null) | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] |
signature: | | | | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 |
parseStatus: | (null) | (null) | (null) | success/ok (1/0), args=[] | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
title: | null | null | null | Sannah Kvist | Sannah Kvist | Sannah Kvist |
score: | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
marker _injmrk_ | y | y | y | y | y | y |
marker _updmrk_ | | | | | 1440144780-494854742 | 1440144780-494854742 |
marker __prsmrk__ | | | | 1440144780-494854742 | | |
marker _gnmrk_ | | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | | |
marker _ftcmrk_ | | | 1440144780-494854742 | 1440144780-494854742 | | |
marker _idxmrk_ | | | | | | 1440144780-494854742 |
marker dist : | 0 | 0 | 0 | 0 | 0 | 0 |
reprUrl: | null | null | null | null | null | null |
batchId: | | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 |
metadata _csh_ : | ?�## | ?�## | | | #### | #### |
metadata _rs_ : | | | #### | #### | #### | #### |
metadata: | | | | xxx | xxx | xxx |
outlink: | | | | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ |
inlink: | | | | | http://sannahkvist.se/ | http://sannahkvist.se/ |
header: | | | xxx | xxx | xxx | xxx |
contentType: | | | text/html | text/html | text/html | text/html |
content:start: content:end: | | | xxx | xxx | xxx | xxx |
text:start: text:end: | | | | xxx | xxx | xxx |
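The marker rows of the table above amount to a small lifecycle: each step adds its own marker, and updatedb clears the generate, fetch, and parse markers. The sketch below replays exactly that, with the add and remove sets read off the table; the names `ADDS`, `REMOVES`, and `markers_after_each_step` are hypothetical, and this models the observed dump, not Nutch's internals.

```python
STEPS = ["inject", "generate", "fetch", "parse", "updatedb", "solrindex"]

# Marker each step adds, as observed in the table.
ADDS = {
    "inject": {"_injmrk_"},
    "generate": {"_gnmrk_"},
    "fetch": {"_ftcmrk_"},
    "parse": {"__prsmrk__"},
    "updatedb": {"_updmrk_"},
    "solrindex": {"_idxmrk_"},
}

# Markers that vanish at a step, as observed in the table.
REMOVES = {"updatedb": {"_gnmrk_", "_ftcmrk_", "__prsmrk__"}}


def markers_after_each_step():
    """Return {step: set of markers present once that step has run}."""
    present = set()
    history = {}
    for step in STEPS:
        present |= ADDS[step]
        present -= REMOVES.get(step, set())
        history[step] = set(present)
    return history


history = markers_after_each_step()
for step in STEPS:
    print(step, sorted(history[step]))
```

Seen this way, the markers record how far a row has progressed within the current batch, and updatedb resets that progress for the next round.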
Change after updatedb round 1 & 2
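New seed URLs are produced.
This table (Table 2) compares the state of the seed URL after the inject step of round 1 with the state of the URLs newly added after the updatedb steps of rounds 1 and 2.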
Table 2 | Inject | NEW ADDED URL AFTER updatedb round 1 | NEW ADDED URL AFTER updatedb round 2 |
key: | se.sannahkvist:http/ | se.sannahkvist:http/commissioned/ | com.imdb.www:http/title/tt1342378/ |
baseUrl: | null | null | null |
status: | 0 (null) | 1 (status_unfetched) | 1 (status_unfetched) |
fetchTime: | 1440144429269 | 1440146083913 | 1440166605303 |
prevFetchTime: | 0 | 0 | 0 |
fetchInterval: | 2592000 | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 | 0 |
modifiedTime: | 0 | 0 | 0 |
prevModifiedTime: | 0 | 0 | 0 |
protocolStatus: | (null) | (null) | (null) |
signature: | | | |
parseStatus: | (null) | (null) | (null) |
title: | null | null | null |
score: | 1.0 | 0.0 | 0.0 |
marker _injmrk_ | y | | |
marker _updmrk_ | | | |
marker __prsmrk__ | | | |
marker _gnmrk_ | | | |
marker _ftcmrk_ | | | |
marker _idxmrk_ | | | |
marker dist : | 0 | 1 | 2 |
reprUrl: | null | null | null |
batchId: | | | |
metadata _csh_ : | ?�## | #### | #### |
metadata _rs_ : | | | |
metadata: | | | |
outlink: | | | |
inlink: | | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ |
header: | | | |
contentType: | | | |
content:start: content:end: | | | |
text:start: text:end: | | | |
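The "marker dist" values of 0, 1, and 2 in the table above line up with the number of link hops from the seed. A minimal sketch of that reading as a breadth-first walk follows. The link graph in `LINKS` is an assumption pieced together from the inlink column and the round 2 fetch queue, the imdb URL is reconstructed from its row key, and `link_distances` is a hypothetical helper, not Nutch code.

```python
from collections import deque

# Assumed link graph: page -> pages it links to.
LINKS = {
    "http://sannahkvist.se/": [
        "http://sannahkvist.se/commissioned/",
        "http://sannahkvist.se/commissioned/flickan/",
    ],
    "http://sannahkvist.se/commissioned/flickan/": [
        "http://www.imdb.com/title/tt1342378/",
    ],
}


def link_distances(seed: str, links: dict) -> dict:
    """Breadth-first walk assigning each URL its hop count from the seed."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for out in links.get(url, []):
            if out not in dist:  # first discovery gives the shortest hop count
                dist[out] = dist[url] + 1
                queue.append(out)
    return dist


dist = link_distances("http://sannahkvist.se/", LINKS)
for url, hops in dist.items():
    print(hops, url)
```

Each crawl round corresponds to one level of this walk: round 1 fetches distance 0, and its updatedb adds the distance 1 rows that round 2 then fetches.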
Round 2
Table 3 | NEW ADDED URL AFTER updatedb 1 | generate | fetch | parse | updatedb | solrindex |
key: | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ |
baseUrl: | null | null | null | null | null | null |
status: | 1 (status_unfetched) | 1 (status_unfetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) |
fetchTime: | 1440146083913 | 1440146083913 | 1440165162261 | 1440165162261 | 1440165162261 | 1440165162261 |
prevFetchTime: | 0 | 0 | 1440146083913 | 1440146083913 | 1440146083913 | 1440146083913 |
fetchInterval: | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 | 0 | 0 | 0 | 0 |
modifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
prevModifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
protocolStatus: | (null) | (null) | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] |
signature: | | | | c02daceb65a33aaba8fc075b4e1afe37 | c02daceb65a33aaba8fc075b4e1afe37 | c02daceb65a33aaba8fc075b4e1afe37 |
parseStatus: | (null) | (null) | (null) | success/ok (1/0), args=[] | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
title: | null | null | null | commissioned × Sannah Kvist | commissioned × Sannah Kvist | commissioned × Sannah Kvist |
score: | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
marker _injmrk_ | | | | | | |
marker _updmrk_ | | | | | 1440164632-449570503 | 1440164632-449570503 |
marker __prsmrk__ | | | | 1440164632-449570503 | | |
marker _gnmrk_ | | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | | |
marker _ftcmrk_ | | | 1440164632-449570503 | 1440164632-449570503 | | |
marker _idxmrk_ | | | | | | 1440164632-449570503 |
marker dist : | 1 | 1 | 1 | 1 | 1 | 1 |
reprUrl: | null | null | null | null | null | null |
batchId: | | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 |
metadata _csh_ : | #### | #### | | | #### | #### |
metadata _rs_ : | | | #### | #### | #### | #### |
metadata: | | | | xxx | xxx | xxx |
outlink: | | | | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ |
inlink: | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ http://sannahkvist.se/commissioned/ | http://sannahkvist.se/commissioned/flickan/ http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ http://sannahkvist.se/commissioned/ |
header: | | | xxx | xxx | xxx | xxx |
contentType: | | | text/html | text/html | text/html | text/html |
content:start: content:end: | | | xxx | xxx | xxx | xxx |
text:start: text:end: | | | | xxx | xxx | xxx |
Change after updatedb round 2
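After the updatedb step of round 2, the seed URL's record changed as well, because the seed URL appears among the outlinks of the URLs crawled in this round. The effect is as if a new URL identical to the seed URL had been produced.
This table (Table 4) compares the seed URL's state after updatedb in round 2 with its state after the solrindex step of round 1.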
Table 4 | Solrindex 1 | updatedb 2 |
key: | se.sannahkvist:http/ | se.sannahkvist:http/ |
baseUrl: | null | null |
status: | 2 (status_fetched) | 1 (status_unfetched) |
fetchTime: | 1440144966010 | 1440166605317 |
prevFetchTime: | 1440144429269 | 1440144429269 |
fetchInterval: | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 |
modifiedTime: | 0 | 0 |
prevModifiedTime: | 0 | 0 |
protocolStatus: | SUCCESS, args=[] | SUCCESS, args=[] |
signature: | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 |
parseStatus: | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
title: | Sannah Kvist | Sannah Kvist |
score: | 1.0 | 0.0 |
marker _injmrk_ | y | y |
marker _updmrk_ | 1440144780-494854742 | |
marker __prsmrk__ | | |
marker _gnmrk_ | | |
marker _ftcmrk_ | | |
marker _idxmrk_ | 1440144780-494854742 | |
marker dist : | 0 | 2 |
reprUrl: | null | null |
batchId: | 1440144780-494854742 | 1440144780-494854742 |
metadata _csh_ : | #### | #### |
metadata _rs_ : | #### | |
metadata: | xxx | xxx |
outlink: | http://sannahkvist.se/ | http://sannahkvist.se/ |
inlink: | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ http://sannahkvist.se/commissioned/ |
header: | xxx | xxx |
contentType: | text/html | text/html |
content:start: content:end: | xxx | xxx |
text:start: text:end: | xxx | xxx |
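The comparison above can also be done mechanically by diffing two snapshots of the same row. A small sketch with a handful of scalar fields copied from the two columns; `diff_records` and the dictionary layout are hypothetical conveniences, not a Nutch API.

```python
def diff_records(before: dict, after: dict) -> dict:
    """Return {field: (old, new)} for every field whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


# A subset of the seed URL's fields, taken from the two columns.
solrindex_1 = {
    "status": 2, "fetchTime": 1440144966010, "prevFetchTime": 1440144429269,
    "score": 1.0, "dist": 0, "title": "Sannah Kvist",
}
updatedb_2 = {
    "status": 1, "fetchTime": 1440166605317, "prevFetchTime": 1440144429269,
    "score": 0.0, "dist": 2, "title": "Sannah Kvist",
}

changes = diff_records(solrindex_1, updatedb_2)
for field, (old, new) in changes.items():
    print(f"{field}: {old} -> {new}")
```

For these fields the diff isolates status, fetchTime, score, and dist, which are the values that make the seed look like a freshly discovered, unfetched URL after round 2.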