[Nutch] The Nutch Crawl Workflow

N.B. The most important part of this article is the comparison in Section 3!


1. Overview of the Nutch Crawl Workflow


1.1 Workflow Diagram

Inject => Generate => Fetch => Parse => Updatedb => Solrindex

1.2 Workflow Steps

(1) Inject
Round 1...
(2) Generate
(3) Fetch
(4) Parse
(5) Updatedb
(6) Solrindex
Round 2...
(2) Generate
(3) Fetch
(4) Parse
(5) Updatedb
(6) Solrindex
Round 3...
……
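
The step list above can be sketched as a small generator: Inject runs once, and every later round repeats the same five steps. This is only an illustration of the ordering, not Nutch code; the names crawl_steps and ROUND_STEPS are made up for this example.

```python
from itertools import islice

# The five steps that repeat in every round, in the order listed above.
ROUND_STEPS = ["Generate", "Fetch", "Parse", "Updatedb", "Solrindex"]


def crawl_steps():
    """Yield (round_number, step) pairs; round 0 stands for the one-off Inject."""
    yield 0, "Inject"
    round_number = 1
    while True:
        for step in ROUND_STEPS:
            yield round_number, step
        round_number += 1


# Take the first eight steps: Inject, all of round 1, and the start of round 2.
first_eight = list(islice(crawl_steps(), 8))
for round_number, step in first_eight:
    print(round_number, step)
```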

2. Crawling Step by Step with Commands


Round 1


2.0 Seed URL

<!--
Location:
~/Nutch/data/urls/seed.txt
-->
seed.txt
http://sannahkvist.se/

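The seed URL is a small, clean Swedish photography site. Another reason for choosing it is that the site is very minimal and has no large volume of junk outbound links.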


2.1 Inject

This step injects the seed URLs into the database table named crawlId + '_webpage'. The table then holds the bare URL plus various default values.

2.1.1 The inject command

2.1.2 Inspecting the database

This shows what crawlId is used for.


The database now holds the bare URL plus various default values.
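
As a rough picture of what "the bare URL plus default values" means, the row for the seed URL can be written out as a Python dict. The values are copied from the Inject column of Table 1 in Section 3; the dict layout and the helper unset_fields are assumptions made for this sketch, and the binary _csh_ metadata is left out.

```python
# Values taken from the Inject column of Table 1; the structure is illustrative.
injected_row = {
    "key": "se.sannahkvist:http/",
    "baseUrl": None,
    "status": 0,
    "fetchTime": 1440144429269,
    "prevFetchTime": 0,
    "fetchInterval": 2592000,
    "retriesSinceFetch": 0,
    "modifiedTime": 0,
    "prevModifiedTime": 0,
    "protocolStatus": None,
    "parseStatus": None,
    "title": None,
    "score": 1.0,
    "markers": {"_injmrk_": "y", "dist": "0"},
    "reprUrl": None,
}


def unset_fields(row):
    """Return the names of top-level fields that are still empty (None)."""
    return sorted(name for name, value in row.items() if value is None)


# Fields that later steps (fetch, parse) are expected to fill in.
print(unset_fields(injected_row))
```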

2.2 Generate

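2.2.1 The generate command

2.2.2 Inspecting the database

Changed since the previous step

Added since the previous step

marker _gnmrk_ : 1440144780-494854742

batchId:   1440144780-494854742

Removed since the previous step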
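
The generate step stamps the row with one value that is used both as the _gnmrk_ marker and as the batchId. The sketch below splits that value into its two numeric parts. Reading the first part as Unix epoch seconds is an assumption: it is not stated in this article, but it lands on the same day as the timestamps in the fetch log. The function names are made up for the example.

```python
from datetime import datetime, timezone


def split_batch_id(batch_id):
    """Split '<number>-<number>' into two integers."""
    prefix, suffix = batch_id.split("-", 1)
    return int(prefix), int(suffix)


def prefix_as_utc(batch_id):
    """Interpret the prefix as Unix epoch seconds (an assumption, see above)."""
    seconds, _ = split_batch_id(batch_id)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


round1_batch = "1440144780-494854742"
print(split_batch_id(round1_batch))
print(prefix_as_utc(round1_batch).isoformat())
```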


2.3 Fetch

2.3.1 The fetch command

2.3.2 Inspecting the database

Changed since the previous step (previous value in brackets)

status:    2 (status_fetched)      [0 (null)]

fetchTime: 1440144966010            [1440144429269]

prevFetchTime:    1440144429269   [0]

protocolStatus:   SUCCESS, args=[] [(null)]

Added since the previous step

marker _ftcmrk_ :    1440144780-494854742

metadata _rs_ :   ####

header:    [xxx]

contentType:  text/html

content:start:

[xxx]

content:end:

Removed since the previous step

metadata _csh_ : ?�##
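
The changes observed in this step follow a simple pattern: the old fetchTime moves into prevFetchTime, fetchTime takes the time of the fetch, and the two status fields are set. The function below reproduces exactly that pattern on a plain dict. It is a sketch of what the database shows, not Nutch's implementation, and apply_fetch is a hypothetical name.

```python
def apply_fetch(row, fetched_at_ms):
    """Return a copy of row with the changes observed after a successful fetch."""
    updated = dict(row)
    updated["prevFetchTime"] = row["fetchTime"]  # old fetchTime is kept here
    updated["fetchTime"] = fetched_at_ms
    updated["status"] = 2  # shown as status_fetched in the dump
    updated["protocolStatus"] = "SUCCESS"
    return updated


before = {
    "status": 0,
    "fetchTime": 1440144429269,
    "prevFetchTime": 0,
    "protocolStatus": None,
}
after = apply_fetch(before, 1440144966010)
print(after)
```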


2.4 Parse

2.4.1 The parse command

2.4.2 Inspecting the database

Changed since the previous step (previous value in brackets)

parseStatus:  success/ok (1/0), args=[]   [(null)]

title: Sannah Kvist                      [null]

Added since the previous step

signature: 261c653e067e097acc6dd5dc68072e91

marker __prsmrk__ : 1440144780-494854742

metadata [xxx]

outlink:   http://sannahkvist.se/

text:start:

[xxx]

text:end:

Removed since the previous step


2.5 Updatedb

2.5.1 The updatedb command

2.5.2 Inspecting the database

Changed since the previous step


Added since the previous step

marker _updmrk_ :    1440144780-494854742

metadata _csh_ : ####

inlink:    http://sannahkvist.se/


Newly parsed URLs are also added to the database at this step. One of them is shown here, with the seed URL's value after inject in brackets wherever the two differ:

http://sannahkvist.se/commissioned/    key:   se.sannahkvist:http/commissioned/

baseUrl:   null

status:    1 (status_unfetched)   [inject: 0 (null)]

fetchTime: 1440146083913          [inject: 1440144429269]

prevFetchTime:    0

fetchInterval:    2592000

retriesSinceFetch:   0

modifiedTime: 0

prevModifiedTime: 0

protocolStatus:   (null)

parseStatus:  (null)

title: null

score: 0.0                        [inject: 1.0]

marker _injmrk_ : (absent)        [inject: y]

marker dist :     1               [inject: 0]

reprUrl:   null

metadata _csh_ : ####

inlink:    http://sannahkvist.se/   commissioned   [inject: none]

Removed since the previous step

marker __prsmrk__ : 1440144780-494854742

marker _gnmrk_ : 1440144780-494854742

marker _ftcmrk_ :    1440144780-494854742
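
The row keys in this section are the URL with its host name reversed: http://sannahkvist.se/commissioned/ is stored under se.sannahkvist:http/commissioned/. A minimal sketch of that mapping follows. It is written to match the keys shown in this article; the port and query-string handling are assumptions, since no such URL appears here, and reverse_url is a name chosen for the example. The IMDb URL in the usage lines is inferred from the key shown in Table 2.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key of the form 'reversed.host:scheme[:port]/path'."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        # Assumption: an explicit port is appended after the scheme.
        key += f":{parts.port}"
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


print(reverse_url("http://sannahkvist.se/"))
print(reverse_url("http://sannahkvist.se/commissioned/"))
print(reverse_url("http://www.imdb.com/title/tt1342378/"))
```

Keeping the host reversed means that pages of the same domain sort next to each other by key, which is convenient for range scans.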


2.6 Solrindex

2.6.1 The solrindex command


Changed since the previous step

Added since the previous step
marker _idxmrk_ : 1440144780-494854742
Removed since the previous step
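
Putting the six steps of round 1 together, the markers on the seed URL's row behave like a small state machine. The sketch below replays only what the database dumps in this section show: each of generate, fetch and parse adds its own marker, updatedb clears those three and adds _updmrk_, and solrindex adds _idxmrk_. It is a model of the observations, not Nutch code, and apply_step is a hypothetical helper.

```python
# Markers that are written during a batch and cleared again by updatedb.
BATCH_MARKERS = ("_gnmrk_", "_ftcmrk_", "__prsmrk__")


def apply_step(markers, step, batch_id):
    """Return the marker dict after one step, following the round-1 dumps."""
    result = dict(markers)
    if step == "generate":
        result["_gnmrk_"] = batch_id
    elif step == "fetch":
        result["_ftcmrk_"] = batch_id
    elif step == "parse":
        result["__prsmrk__"] = batch_id
    elif step == "updatedb":
        for name in BATCH_MARKERS:
            result.pop(name, None)
        result["_updmrk_"] = batch_id
    elif step == "solrindex":
        result["_idxmrk_"] = batch_id
    else:
        raise ValueError(f"unknown step: {step}")
    return result


markers = {"_injmrk_": "y"}  # state right after inject
history = {}
for step in ("generate", "fetch", "parse", "updatedb", "solrindex"):
    markers = apply_step(markers, step, "1440144780-494854742")
    history[step] = sorted(markers)
    print(step, history[step])
```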


Round 2


2.2 generate

2.3 fetch

lhd@master:~/Nutch/apache-nutch-2.3/runtime/local/bin$ ./nutch fetch -all -crawlId photo
FetcherJob: starting at 2015-08-21 21:52:26
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 3 records. Hit by time limit :0
fetching http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ (queue crawldelay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 10 10kb/s, 2 URLs in 1 queues
* queue: http://sannahkvist.se
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1440165160919
  now           = 1440165157995
  0. http://sannahkvist.se/commissioned/
  1. http://sannahkvist.se/commissioned/flickan/
fetching http://sannahkvist.se/commissioned/ (queue crawldelay=5000ms)
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0 pages/s, 9 7kb/s, 1 URLs in 1 queues
* queue: http://sannahkvist.se
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1440165167259
  now           = 1440165162997
  0. http://sannahkvist.se/commissioned/flickan/
fetching http://sannahkvist.se/commissioned/flickan/ (queue crawldelay=5000ms)
-finishing thread FetcherThread5, activeThreads=9
-finishing thread FetcherThread8, activeThreads=8
-finishing thread FetcherThread6, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread1, activeThreads=5
-finishing thread FetcherThread3, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread9, activeThreads=1
0/1 spinwaiting/active, 2 pages, 0 errors, 0.1 0 pages/s, 6 0 kb/s,0 URLs in 1 queues
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 3 pages, 0 errors, 0.2 0 pages/s, 8 13 kb/s,0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-08-21 21:52:53, time elapsed: 00:00:27
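
The log shows a single queue for the host with maxThreads = 1 and crawlDelay = 5000, and each status dump has nextFetchTime a few seconds later than now. The idea behind such a queue can be sketched as follows. This is a simplified, single-threaded illustration with made-up timestamps, not the actual Nutch fetcher; HostQueue and its methods are hypothetical names.

```python
class HostQueue:
    """A single-host fetch queue that spaces requests by a crawl delay."""

    def __init__(self, crawl_delay_ms):
        self.crawl_delay_ms = crawl_delay_ms
        self.next_fetch_time = 0  # earliest time (ms) the next fetch may start
        self.urls = []

    def add(self, url):
        self.urls.append(url)

    def poll(self, now_ms):
        """Return the next URL if the delay has elapsed, otherwise None."""
        if not self.urls or now_ms < self.next_fetch_time:
            return None
        return self.urls.pop(0)

    def finish(self, now_ms):
        """Record the end of a fetch; the next one must wait crawl_delay_ms."""
        self.next_fetch_time = now_ms + self.crawl_delay_ms


queue = HostQueue(crawl_delay_ms=5000)
queue.add("http://sannahkvist.se/commissioned/")
queue.add("http://sannahkvist.se/commissioned/flickan/")

first = queue.poll(now_ms=0)          # allowed immediately
queue.finish(now_ms=1000)             # fetch ended at t = 1000 ms
too_early = queue.poll(now_ms=3000)   # still inside the 5000 ms delay
second = queue.poll(now_ms=6000)      # delay has elapsed
print(first, too_early, second)
```

A delay of this kind would also explain why three small pages take 27 seconds in the log even though ten threads are configured.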


2.4 parse

2.5 updatedb

2.6 solrindex


Round 3

……


!!!3. Comparison

Round 1

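This table (Table 1) lists some of the data that was added, removed, or changed at each step of the first crawl round. Blue marks values that changed since the previous step and red marks values newly added at that step, while values that were removed simply no longer appear from that step on. Note that after the updatedb step, newly parsed URLs are also added to the database. The third column of Table 2 [NEW ADDED URL AFTER updatedb round 1] is one URL picked from these new URLs, compared against the Inject step of round 1.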

Table 1

| Field | Inject | Generate | Fetch | Parse | Updatedb | Solrindex |
| --- | --- | --- | --- | --- | --- | --- |
| key | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ |
| baseUrl | null | null | null | null | null | null |
| status | 0 (null) | 0 (null) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) |
| fetchTime | 1440144429269 | 1440144429269 | 1440144966010 | 1440144966010 | 1440144966010 | 1440144966010 |
| prevFetchTime | 0 | 0 | 1440144429269 | 1440144429269 | 1440144429269 | 1440144429269 |
| fetchInterval | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 |
| retriesSinceFetch | 0 | 0 | 0 | 0 | 0 | 0 |
| modifiedTime | 0 | 0 | 0 | 0 | 0 | 0 |
| prevModifiedTime | 0 | 0 | 0 | 0 | 0 | 0 |
| protocolStatus | (null) | (null) | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] |
| signature |  |  |  | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 |
| parseStatus | (null) | (null) | (null) | success/ok (1/0), args=[] | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
| title | null | null | null | Sannah Kvist | Sannah Kvist | Sannah Kvist |
| score | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| marker _injmrk_ | y | y | y | y | y | y |
| marker _updmrk_ |  |  |  |  | 1440144780-494854742 | 1440144780-494854742 |
| marker __prsmrk__ |  |  |  | 1440144780-494854742 |  |  |
| marker _gnmrk_ |  | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 |  |  |
| marker _ftcmrk_ |  |  | 1440144780-494854742 | 1440144780-494854742 |  |  |
| marker _idxmrk_ |  |  |  |  |  | 1440144780-494854742 |
| marker dist | 0 | 0 | 0 | 0 | 0 | 0 |
| reprUrl | null | null | null | null | null | null |
| batchId |  | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 |
| metadata _csh_ | ?�## | ?�## |  |  | #### | #### |
| metadata _rs_ |  |  | #### | #### | #### | #### |
| metadata |  |  |  | xxx | xxx | xxx |
| outlink |  |  |  | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ |
| inlink |  |  |  |  | http://sannahkvist.se/ | http://sannahkvist.se/ |
| header |  |  | xxx | xxx | xxx | xxx |
| contentType |  |  | text/html | text/html | text/html | text/html |
| content:start: / content:end: |  |  | xxx | xxx | xxx | xxx |
| text:start: / text:end: |  |  |  | xxx | xxx | xxx |
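
The comparison in Table 1 boils down to diffing two snapshots of the same row and sorting the fields into changed, added, and removed. A minimal sketch of that diff is given below, using a few fields from the Generate and Fetch columns. The helper diff_rows and the flattened dict layout are assumptions made for this example, and the binary _csh_ value is replaced by a placeholder string.

```python
def diff_rows(previous, current):
    """Classify fields as changed, added, or removed between two snapshots."""
    changed = {
        name: (previous[name], current[name])
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    }
    added = {name: current[name] for name in current.keys() - previous.keys()}
    removed = {name: previous[name] for name in previous.keys() - current.keys()}
    return changed, added, removed


# A subset of the Generate and Fetch columns of Table 1.
after_generate = {
    "status": "0 (null)",
    "fetchTime": 1440144429269,
    "prevFetchTime": 0,
    "_gnmrk_": "1440144780-494854742",
    "metadata _csh_": "(binary)",
}
after_fetch = {
    "status": "2 (status_fetched)",
    "fetchTime": 1440144966010,
    "prevFetchTime": 1440144429269,
    "_gnmrk_": "1440144780-494854742",
    "_ftcmrk_": "1440144780-494854742",
    "contentType": "text/html",
}

changed, added, removed = diff_rows(after_generate, after_fetch)
print("changed:", sorted(changed))
print("added:  ", sorted(added))
print("removed:", sorted(removed))
```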


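Changes after updatedb in rounds 1 and 2

New seed URLs are produced.

This table (Table 2) compares the state of the seed URL after the inject step of round 1 with the state of the URLs newly added after the updatedb steps of rounds 1 and 2.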


Table 2

| Field | Inject | NEW ADDED URL AFTER updatedb round 1 | NEW ADDED URL AFTER updatedb round 2 |
| --- | --- | --- | --- |
| key | se.sannahkvist:http/ | se.sannahkvist:http/commissioned/ | com.imdb.www:http/title/tt1342378/ |
| baseUrl | null | null | null |
| status | 0 (null) | 1 (status_unfetched) | 1 (status_unfetched) |
| fetchTime | 1440144429269 | 1440146083913 | 1440166605303 |
| prevFetchTime | 0 | 0 | 0 |
| fetchInterval | 2592000 | 2592000 | 2592000 |
| retriesSinceFetch | 0 | 0 | 0 |
| modifiedTime | 0 | 0 | 0 |
| prevModifiedTime | 0 | 0 | 0 |
| protocolStatus | (null) | (null) | (null) |
| signature |  |  |  |
| parseStatus | (null) | (null) | (null) |
| title | null | null | null |
| score | 1.0 | 0.0 | 0.0 |
| marker _injmrk_ | y |  |  |
| marker _updmrk_ |  |  |  |
| marker __prsmrk__ |  |  |  |
| marker _gnmrk_ |  |  |  |
| marker _ftcmrk_ |  |  |  |
| marker _idxmrk_ |  |  |  |
| marker dist | 0 | 1 | 2 |
| reprUrl | null | null | null |
| batchId |  |  |  |
| metadata _csh_ | ?�## | #### | #### |
| metadata _rs_ |  |  |  |
| metadata |  |  |  |
| outlink |  |  |  |
| inlink |  | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ |
| header |  |  |  |
| contentType |  |  |  |
| content:start: / content:end: |  |  |  |
| text:start: / text:end: |  |  |  |


Round 2

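This table (Table 3) takes the URL used for comparison in the third column of the previous table and compares its state after each of the remaining steps of round 2.
Note that, once again, newly parsed URLs are added to the database after the updatedb step. The fourth column of Table 2 [NEW ADDED URL AFTER updatedb round 2] is one URL picked from these new URLs, compared against the Inject step of round 1.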

Table 3

| Field | NEW ADDED URL AFTER updatedb 1 | generate | fetch | parse | updatedb | solrindex |
| --- | --- | --- | --- | --- | --- | --- |
| key | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ |
| baseUrl | null | null | null | null | null | null |
| status | 1 (status_unfetched) | 1 (status_unfetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) |
| fetchTime | 1440146083913 | 1440146083913 | 1440165162261 | 1440165162261 | 1440165162261 | 1440165162261 |
| prevFetchTime | 0 | 0 | 1440146083913 | 1440146083913 | 1440146083913 | 1440146083913 |
| fetchInterval | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 |
| retriesSinceFetch | 0 | 0 | 0 | 0 | 0 | 0 |
| modifiedTime | 0 | 0 | 0 | 0 | 0 | 0 |
| prevModifiedTime | 0 | 0 | 0 | 0 | 0 | 0 |
| protocolStatus | (null) | (null) | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] |
| signature |  |  |  | c02daceb65a33aaba8fc075b4e1afe37 | c02daceb65a33aaba8fc075b4e1afe37 | c02daceb65a33aaba8fc075b4e1afe37 |
| parseStatus | (null) | (null) | (null) | success/ok (1/0), args=[] | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
| title | null | null | null | commissioned × Sannah Kvist | commissioned × Sannah Kvist | commissioned × Sannah Kvist |
| score | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| marker _injmrk_ |  |  |  |  |  |  |
| marker _updmrk_ |  |  |  |  | 1440164632-449570503 | 1440164632-449570503 |
| marker __prsmrk__ |  |  |  | 1440164632-449570503 |  |  |
| marker _gnmrk_ |  | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 |  |  |
| marker _ftcmrk_ |  |  | 1440164632-449570503 | 1440164632-449570503 |  |  |
| marker _idxmrk_ |  |  |  |  |  | 1440164632-449570503 |
| marker dist | 1 | 1 | 1 | 1 | 1 | 1 |
| reprUrl | null | null | null | null | null | null |
| batchId |  | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 |
| metadata _csh_ | #### | #### |  |  | #### | #### |
| metadata _rs_ |  |  | #### | #### | #### | #### |
| metadata |  |  |  | xxx | xxx | xxx |
| outlink |  |  |  | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ |
| inlink | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/; http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/; http://sannahkvist.se/commissioned/ | http://sannahkvist.se/commissioned/flickan/; http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/; http://sannahkvist.se/commissioned/ |
| header |  |  | xxx | xxx | xxx | xxx |
| contentType |  |  | text/html | text/html | text/html | text/html |
| content:start: / content:end: |  |  | xxx | xxx | xxx | xxx |
| text:start: / text:end: |  |  |  | xxx | xxx | xxx |
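
The raw fetchTime and fetchInterval numbers in these tables are easier to compare once converted. The sketch below treats fetchTime as Unix epoch milliseconds and fetchInterval as seconds, which is how these fields are normally read in Nutch but is not stated in this article. The helper describe is a made-up name.

```python
from datetime import datetime, timedelta, timezone


def describe(fetch_time_ms, fetch_interval_s):
    """Return (UTC datetime, interval) for raw fetchTime / fetchInterval values."""
    seconds = fetch_time_ms // 1000  # drop the millisecond part
    when = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return when, timedelta(seconds=fetch_interval_s)


# fetchTime of the /commissioned/ page before and after the round-2 fetch.
before_fetch, interval = describe(1440146083913, 2592000)
after_fetch, _ = describe(1440165162261, 2592000)

print(before_fetch.isoformat(), after_fetch.isoformat())
print(interval.days, "days")
```

Under these assumptions the round-2 fetchTime corresponds to 13:52:42 UTC on 2015-08-21. The fetch log reports 21:52:26 to 21:52:53 for the same job, which would fit a machine whose local time is eight hours ahead of UTC.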


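Changes after updatedb in round 2

After the updatedb operation of round 2, the seed URL also changed. The reason is that the seed URL appeared among the outlinks of the URLs crawled in this round, so it is as if a new URL identical to the seed URL had been produced.

This table (Table 4) compares the seed URL's state after the updatedb operation of round 2 with its state after the solrindex step of round 1.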


Table 4

| Field | Solrindex 1 | updatedb 2 |
| --- | --- | --- |
| key | se.sannahkvist:http/ | se.sannahkvist:http/ |
| baseUrl | null | null |
| status | 2 (status_fetched) | 1 (status_unfetched) |
| fetchTime | 1440144966010 | 1440166605317 |
| prevFetchTime | 1440144429269 | 1440144429269 |
| fetchInterval | 2592000 | 2592000 |
| retriesSinceFetch | 0 | 0 |
| modifiedTime | 0 | 0 |
| prevModifiedTime | 0 | 0 |
| protocolStatus | SUCCESS, args=[] | SUCCESS, args=[] |
| signature | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 |
| parseStatus | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
| title | Sannah Kvist | Sannah Kvist |
| score | 1.0 | 0.0 |
| marker _injmrk_ | y | y |
| marker _updmrk_ | 1440144780-494854742 |  |
| marker __prsmrk__ |  |  |
| marker _gnmrk_ |  |  |
| marker _ftcmrk_ |  |  |
| marker _idxmrk_ | 1440144780-494854742 |  |
| marker dist | 0 | 2 |
| reprUrl | null | null |
| batchId | 1440144780-494854742 | 1440144780-494854742 |
| metadata _csh_ | #### | #### |
| metadata _rs_ | #### |  |
| metadata | xxx | xxx |
| outlink | http://sannahkvist.se/ | http://sannahkvist.se/ |
| inlink | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/; http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/; http://sannahkvist.se/commissioned/ |
| header | xxx | xxx |
| contentType | text/html | text/html |
| content:start: / content:end: | xxx | xxx |
| text:start: / text:end: | xxx | xxx |


Changes on the Solr Server


1. Before any Nutch job has run:


2. After the first crawl round:


3. After the second crawl round:


END!
