The previous post covered downloading and building Nutch; this one walks through a simple crawl and the overall crawl workflow.
Running bin/nutch in the main directory prints the following help output:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
These usage hints list the subcommands Nutch provides.
For a simple crawl, run bin/nutch crawl to see its usage:
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
The parameters we need to supply are: urlDir (the directory holding the seed URLs), -dir for the output directory, -threads for the number of fetch threads, -depth for the crawl depth, i.e. how many generate/fetch rounds to run, and -topN to limit how many top-ranked URLs are fetched in each round.
In the current directory, mkdir a directory named urls, create a text file inside it, and write a seed link into that file.
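For example, the seed directory could be prepared like this (the file name seed.txt and the URL below are placeholders I chose, not from the original post):

mkdir urls
echo "http://nutch.apache.org/" > urls/seed.txt

Then start the crawl: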
bin/nutch crawl urls -dir data -depth 3 -threads 100
After the crawl finishes, three directories are created under data:
crawldb linkdb segments
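As a quick check of what was crawled, the readdb subcommand from the help output above can print statistics for the crawldb; assuming the data output directory used in the command above:

bin/nutch readdb data/crawldb -stats

This reports the total number of URLs in the crawldb and how many are in each fetch status.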
Each of these directories plays a different role. The diagram below shows Nutch's workflow, which covers both the crawl and the index-generation stages.
A simple crawl therefore runs as follows:
inject, run once at the start to seed the crawldb with the initial URLs
generate -> fetch -> parse -> updatedb, repeated in a loop for each crawl round
invertlinks at the end, to build the linkdb from the parsed segments (see the command sketch below)
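The same loop can also be driven step by step with the individual subcommands from the help output. Below is a rough sketch of a single round, assuming the urls seed directory and the data output directory used earlier (the paths and the -topN value are illustrative):

bin/nutch inject data/crawldb urls
bin/nutch generate data/crawldb data/segments -topN 50
# each generate creates a new timestamped segment; grab the latest one
segment=$(ls -d data/segments/2* | tail -1)
bin/nutch fetch $segment
bin/nutch parse $segment
bin/nutch updatedb data/crawldb $segment
# repeat generate -> fetch -> parse -> updatedb for more rounds, then build the linkdb
bin/nutch invertlinks data/linkdb -dir data/segments

Because every generate step produces a new segment named after the current timestamp under data/segments, the segment path is captured into a variable before fetch, parse and updatedb are run on it.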