First, how do we know that crawl is the integration of inject, generate, fetch, parse, and update? (The exact meaning and function of each command will be explained in later articles.) Open NUTCH_HOME/runtime/local/bin/crawl.
The main parts of the script are pasted below:
# initial injection
echo "Injecting seed URLs"
__bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"
# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
...
echo "Generating a new fetchlist"
generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
$bin/nutch generate "${generate_args[@]}"
...
echo "Fetching : "
__bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
...
__bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
...
__bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
...
echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
__bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
...
echo "SOLR dedup -> $SOLRURL"
__bin_nutch solrdedup $commonOptions $SOLRURL
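The script above is driven by a few positional arguments. Its invocation in Nutch 2.3 looks roughly like the sketch below (run bin/crawl with no arguments to see the exact usage for your build; the Solr URL and round count here are example values):

```shell
# Sketch: invoking the bundled crawl script (Nutch 2.x). It runs inject once,
# then <numberOfRounds> rounds of generate/fetch/parse/updatedb.
#           <seedDir> <crawlId> <solrURL>                 <numberOfRounds>
./bin/crawl urls/     6vhao     http://localhost:8983/solr/ 2
```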
Next, let's execute the steps above by hand.
We will stay in the runtime/local/ directory throughout.
1. inject:
Of course, the seed file must be written first: put the sites you want to crawl into the urls/url file. I will use http://www.6vhao.com as an example.
During the crawl I don't want it to fetch any site other than 6vhao.com; this can be configured in conf/regex-urlfilter.txt, near the line
# accept anything else
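For example (a sketch — the exact host regex is my assumption and should match your seeds), the filter file applies the first matching rule per URL, so put a positive rule for 6vhao.com above the catch-all and turn the default catch-all `+.` into a reject:

```
# accept only 6vhao.com and its subdomains (assumed pattern)
+^http://([a-z0-9-]+\.)*6vhao\.com/
# accept anything else
-.
```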
Inject the seeds with the following command:
./bin/nutch inject urls/url -crawlId 6vhao
Running the list command in the hbase shell shows that a new table, 6vhao_webpage, has been created.
scan '6vhao_webpage' shows its contents:
ROW COLUMN+CELL
com.6vhao.www:http/ column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/ column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/ column=mk:_injmrk_, timestamp=1446135434505, value=y
com.6vhao.www:http/ column=mk:dist, timestamp=1446135434505, value=0
com.6vhao.www:http/ column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/ column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
As you can see, one HBase row was generated, with data in 4 column families (f, mk, mtdt, s); their exact meanings will be covered later.
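As an aside, the binary cell values are readable: f:fi is a big-endian 32-bit fetch interval in seconds, and \x00\x27\x8D\x00 decodes to the 30-day default (db.fetch.interval.default). A quick check in plain shell:

```shell
# Decode \x00\x27\x8D\x00 (the f:fi cell) as a big-endian 32-bit integer:
printf '%d\n' 0x00278D00        # → 2592000 (fetch interval in seconds)
echo $(( 0x00278D00 / 86400 ))  # → 30 (the same interval in days)
```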
2. generate
The options of the ./bin/nutch generate command are:
-topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)");
-noFilter - do not activate the filter plugin to filter the url, default is true
-noNorm - do not activate the normalizer plugin to normalize the url, default is true
-adddays - Adds numDays to the current time to facilitate crawling urls already
fetched sooner then db.fetch.interval.default. Default value is 0.
-batchId - the batch id
We specify -crawlId as 6vhao:
./bin/nutch generate -crawlId 6vhao
com.6vhao.www:http/ column=f:bid, timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/ column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/ column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/ column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/ column=mk:_injmrk_, timestamp=1446135900858, value=y
com.6vhao.www:http/ column=mk:dist, timestamp=1446135900858, value=0
com.6vhao.www:http/ column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/ column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
Comparing with the previous scan, two new columns have appeared: f:bid and mk:_gnmrk_.
3. Start fetching: fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N]
[-resume] [-numTasks N]
<batchId> - crawl identifier returned by Generator, or -all for all
generated batchId-s
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-threads N - number of fetching threads per task
-resume - resume interrupted job
-numTasks N - if N > 0 then use this many reduce tasks for fetching
(default: mapred.map.tasks)
./bin/nutch fetch -all -crawlId 6vhao -threads 8
The output is now fairly large — essentially the full content of the pages is there; check it yourself in HBase.
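To spot-check a single cell instead of scanning the whole table, the hbase shell can fetch one column of one row. In the Nutch 2.x HBase schema the raw page bytes live in the f family; the qualifier name used here, f:cnt, is my assumption — check conf/gora-hbase-mapping.xml for your build:

```shell
# Sketch: fetch just the raw content cell of the seed row (column name assumed)
echo "get '6vhao_webpage', 'com.6vhao.www:http/', 'f:cnt'" | hbase shell
```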
4. parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
<batchId> - symbolic batch ID created by Generator
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-all - consider pages from all crawl jobs
-resume - resume a previous incomplete job
-force - force re-parsing even if a page is already parsed
root@tong:/opt/nutch/nutch-2.3/runtime/local# ./bin/nutch parse -crawlId 6vhao -all
The parse results can be viewed in HBase.
5. updatedb
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
    <batchId> - crawl identifier returned by Generator, or -all for all
       generated batchId-s
    -crawlId <id> - the id to prefix the schemas to operate on,
       (default: storage.crawl.id)
./bin/nutch updatedb -all -crawlId 6vhao
The results can be viewed in HBase.
6. Repeat steps 2-5 to crawl the site to a depth of 2.
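Spelled out as a shell loop (a sketch mirroring what bin/crawl does; two rounds in total give a depth of 2):

```shell
# Sketch: two rounds of generate/fetch/parse/updatedb, as in steps 2-5
for round in 1 2; do
  ./bin/nutch generate -crawlId 6vhao
  ./bin/nutch fetch -all -crawlId 6vhao -threads 8
  ./bin/nutch parse -crawlId 6vhao -all
  ./bin/nutch updatedb -all -crawlId 6vhao
done
```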
solrindex will be covered in the next article....