After the crawl completes successfully, run bin/nutch readdb data/crawldb -stats to view the crawl statistics:
TOTAL urls: 1843
retry 0: 1838
retry 1: 5
min score: 0.0
avg score: 5.425936E-4
max score: 1.0
status 1 (db_unfetched): 1748
status 2 (db_fetched): 63
status 3 (db_gone): 1
status 4 (db_redir_temp): 29
status 5 (db_redir_perm): 2
CrawlDb statistics: done
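The five status codes mean the following: 1 is a URL that has not been fetched yet, 2 is a successfully fetched URL (HTTP status 200), 3 corresponds to HTTP 404, 4 to HTTP 302 (temporary redirect), and 5 to HTTP 301 (permanent redirect).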
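To work with these numbers in a script, the statistics can be parsed into a dictionary. The following is a minimal sketch, assuming the output has exactly the line format shown above (other Nutch versions may add log prefixes or extra fields). The names SAMPLE and parse_stats are hypothetical helpers for illustration and are not part of Nutch.

```python
import re

# Sample text copied from the readdb -stats output shown above.
SAMPLE = """\
TOTAL urls: 1843
retry 0: 1838
retry 1: 5
min score: 0.0
avg score: 5.425936E-4
max score: 1.0
status 1 (db_unfetched): 1748
status 2 (db_fetched): 63
status 3 (db_gone): 1
status 4 (db_redir_temp): 29
status 5 (db_redir_perm): 2
CrawlDb statistics: done
"""

# Assumed line formats; \s* tolerates spaces or tabs after the colon.
TOTAL_LINE = re.compile(r"TOTAL urls:\s*(\d+)\s*$")
STATUS_LINE = re.compile(r"status (\d+) \((\w+)\):\s*(\d+)\s*$")


def parse_stats(text):
    """Return (total_urls, {status_name: count}) from readdb -stats text."""
    total = None
    statuses = {}
    for line in text.splitlines():
        m = TOTAL_LINE.search(line)
        if m:
            total = int(m.group(1))
            continue
        m = STATUS_LINE.search(line)
        if m:
            statuses[m.group(2)] = int(m.group(3))
    return total, statuses


total, statuses = parse_stats(SAMPLE)
print(total)
print(statuses)
# Sanity check: the per-status counts should add up to the total.
print(sum(statuses.values()) == total)
```

For the sample above, the per-status counts (1748 + 63 + 1 + 29 + 2) add up to the reported total of 1843, which is a quick way to confirm that no status line was missed.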