For download, build, and configuration, see the official manual. When using it, note the difference between the nutch and crawl scripts: nutch runs individual steps one at a time, while crawl drives the whole pipeline (from seed injection through indexing). Under the hood, crawl simply calls the nutch script for each step in batch. Also note that in local mode, HBase must already be running.
Inject seeds
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
$ bin/nutch inject ./urls -crawlId CRAWL_1
or
$ bin/nutch inject ./urls
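The `<url_dir>` argument is a directory of plain-text seed files, one URL per line. A minimal sketch (the directory name `urls`, the file name `seed.txt`, and the URL are example values):

```shell
# Create the seed directory that inject reads.
mkdir -p urls
# One URL per line; multiple seed files in the directory are fine.
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt
```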
Crawl
$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
$ bin/crawl ./urls CRAWL_1 1
(The crawlID here should match the one used at inject time; the HBase table is named <crawlID>_webpage, so CRAWL_1 gives the table CRAWL_1_webpage. The solrURL is omitted here, so no indexing is done.)
View the crawl results
$ hbase shell
hbase> scan 'CRAWL_1_webpage'
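The scan above runs inside the interactive hbase shell. A non-interactive alternative is to build the command as a string and pipe it into `hbase shell`; the `LIMIT` clause is an example to keep the output short:

```shell
# Build the HBase shell command as a string; LIMIT caps the rows printed.
SCAN_CMD="scan 'CRAWL_1_webpage', {LIMIT => 10}"
echo "$SCAN_CMD"   # run it for real with: echo "$SCAN_CMD" | hbase shell
```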
What the crawl script does:
1. Check arguments and configuration
2. Initial injection ==>> CLASS=org.apache.nutch.crawl.InjectorJob
3. Main loop (rounds 1..numberOfRounds)
3.1. Generate a new fetchlist ==>> CLASS=org.apache.nutch.crawl.GeneratorJob
3.2. Fetch ==>> CLASS=org.apache.nutch.fetcher.FetcherJob
3.3. Parse ==>> CLASS=org.apache.nutch.parse.ParserJob
3.4. Update the db ==>> CLASS=org.apache.nutch.crawl.DbUpdaterJob
3.5. Index (if configured; Solr by default) ==>> CLASS=org.apache.nutch.indexer.IndexingJob
3.6. solrdedup (if configured; removes duplicates) ==>> CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
4. Done
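The loop above can be sketched with the single-step nutch commands. Setting NUTCH=echo makes this a dry run that just prints each step; set NUTCH=bin/nutch to execute it for real. CRAWL_ID and ROUNDS are example values, and the exact flag names may differ between Nutch 2.x versions:

```shell
NUTCH=echo          # dry run; change to bin/nutch to really execute
CRAWL_ID=CRAWL_1
ROUNDS=2

# Step 2: initial injection
$NUTCH inject ./urls -crawlId "$CRAWL_ID"

# Step 3: main loop, one generate/fetch/parse/updatedb cycle per round
i=1
while [ "$i" -le "$ROUNDS" ]; do
  BATCH=$(date +%s)   # a unique batch id per round
  $NUTCH generate -crawlId "$CRAWL_ID" -batchId "$BATCH"
  $NUTCH fetch "$BATCH" -crawlId "$CRAWL_ID"
  $NUTCH parse "$BATCH" -crawlId "$CRAWL_ID"
  $NUTCH updatedb -crawlId "$CRAWL_ID"
  i=$((i + 1))
done
```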