First, how do we know that crawl is the integration of inject, generate, fetch, parse, and update? (The exact meaning and function of each command will be explained in later articles.) Open NUTCH_HOME/runtime/local/bin/crawl.
The main parts of the script are pasted below:
# initial injection
echo "Injecting seed URLs"
__bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"
# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
...
echo "Generating a new fetchlist"
generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
$bin/nutch generate "${generate_args[@]}"
...
echo "Fetching : "
__bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
...
__bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
...
__bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
...
echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
__bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
...
echo "SOLR dedup -> $SOLRURL"
__bin_nutch solrdedup $commonOptions $SOLRURL
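The script above is driven by a few positional arguments. Its invocation in Nutch 2.3 looks roughly like the sketch below (run bin/crawl with no arguments to see the exact usage for your build; the Solr URL and round count here are example values):

```shell
# Sketch: invoking the bundled crawl script (Nutch 2.x). It runs inject once,
# then <numberOfRounds> rounds of generate/fetch/parse/updatedb.
#           <seedDir> <crawlId> <solrURL>                 <numberOfRounds>
./bin/crawl urls/     6vhao     http://localhost:8983/solr/ 2
```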
Next, let's execute the steps above by hand.
We will stay in the runtime/local/ directory throughout.
1. inject:
Of course, the seed file must be written first: put the sites you want to crawl into the urls/url file. I will use http://www.6vhao.com as an example.
During the crawl I don't want it to fetch any site other than 6vhao.com; this can be configured in conf/regex-urlfilter.txt, near the line
# accept anything else
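For example (a sketch — the exact host regex is my assumption and should match your seeds), the filter file applies the first matching rule per URL, so put a positive rule for 6vhao.com above the catch-all and turn the default catch-all `+.` into a reject:

```
# accept only 6vhao.com and its subdomains (assumed pattern)
+^http://([a-z0-9-]+\.)*6vhao\.com/
# accept anything else
-.
```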
Inject the seeds with the following command:
./bin/nutch inject urls/url -crawlId 6vhao
Running the list command in the hbase shell shows that a new table, 6vhao_webpage, has been created.
scan '6vhao_webpage' shows its contents:
ROW COLUMN+CELL
com.6vhao.www:http/ column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/ column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/ column=mk:_injmrk_, timestamp=1446135434505, value=y
com.6vhao.www:http/ column=mk:dist, timestamp=1446135434505, value=0
com.6vhao.www:http/ column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/ column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
As you can see, one HBase row was generated, with data in 4 column families (f, mk, mtdt, s); their exact meanings will be covered later.
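As an aside, the binary cell values are readable: f:fi is a big-endian 32-bit fetch interval in seconds, and \x00\x27\x8D\x00 decodes to the 30-day default (db.fetch.interval.default). A quick check in plain shell:

```shell
# Decode \x00\x27\x8D\x00 (the f:fi cell) as a big-endian 32-bit integer:
printf '%d\n' 0x00278D00        # → 2592000 (fetch interval in seconds)
echo $(( 0x00278D00 / 86400 ))  # → 30 (the same interval in days)
```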
2. generate
The options of the ./bin/nutch generate command are:
-topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)");
-noFilter - do not activate the filter plugin to filter the url, default is true
-noNorm - do not activate the normalizer plugin to normalize the url, default is true
-adddays - Adds numDays to the current time to facilitate crawling urls already
fetched sooner then db.fetch.interval.default. Default value is 0.
-batchId - the batch id
We specify -crawlId as 6vhao:
./bin/nutch generate -crawlId 6vhao
com.6vhao.www:http/ column=f:bid, timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/ column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/ column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/ column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/ column=mk:_injmrk_, timestamp=1446135900858, value=y
com.6vhao.www:http/ column=mk:dist, timestamp=1446135900858, value=0
com.6vhao.www:http/ column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/ column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
Comparing with the previous scan, two new columns have appeared: f:bid and mk:_gnmrk_.
3. Start fetching: fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N]
[-resume] [-numTasks N]
<batchId> - crawl identifier returned by Generator, or -all for all
generated batchId-s
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-threads N - number of fetching threads per task
-resume - resume interrupted job
-numTasks N - if N > 0 then use this many reduce tasks for fetching
(default: mapred.map.tasks)
./bin/nutch fetch -all -crawlId 6vhao -threads 8
The output is now fairly large — essentially the full content of the pages is there; check it yourself in HBase.
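To spot-check a single cell instead of scanning the whole table, the hbase shell can fetch one column of one row. In the Nutch 2.x HBase schema the raw page bytes live in the f family; the qualifier name used here, f:cnt, is my assumption — check conf/gora-hbase-mapping.xml for your build:

```shell
# Sketch: fetch just the raw content cell of the seed row (column name assumed)
echo "get '6vhao_webpage', 'com.6vhao.www:http/', 'f:cnt'" | hbase shell
```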
4. parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
<batchId> - symbolic batch ID created by Generator
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-all - consider pages from all crawl jobs
-resume - resume a previous incomplete job
-force - force re-parsing even if a page is already parsed
root@tong:/opt/nutch/nutch-2.3/runtime/local# ./bin/nutch parse -crawlId 6vhao -all
The parse results can be viewed in HBase.
5. updatedb
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
    <batchId> - crawl identifier returned by Generator, or -all for all
       generated batchId-s
    -crawlId <id> - the id to prefix the schemas to operate on,
       (default: storage.crawl.id)
./bin/nutch updatedb -all -crawlId 6vhao
The results can be viewed in HBase.
6. Repeat steps 2-5 to crawl the site to a depth of 2.
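Spelled out as a shell loop (a sketch mirroring what bin/crawl does; two rounds in total give a depth of 2):

```shell
# Sketch: two rounds of generate/fetch/parse/updatedb, as in steps 2-5
for round in 1 2; do
  ./bin/nutch generate -crawlId 6vhao
  ./bin/nutch fetch -all -crawlId 6vhao -threads 8
  ./bin/nutch parse -crawlId 6vhao -all
  ./bin/nutch updatedb -all -crawlId 6vhao
done
```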
solrindex will be covered in the next article....