Nutch Open Source Search Engine: Crawl Log Analysis and Working Directory Walkthrough

After reading the Nutch crawl source code, I went through a crawl log to get familiar with the whole fetch, parse, and index process. Nutch implements every stage of this process with Hadoop MapReduce,
so it is also a good way to study Hadoop programming in depth; the code is quite representative and well worth reading. I will blog about that part once I have finished studying it.

The crawl run is controlled by the parameters in nutch-default.xml; you also need to edit crawl-urlfilter.txt and nutch-site.xml for it to work. This was covered in earlier posts, so I won't repeat it here.
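For reference, a minimal version of these two files might look like the following (example.com and the agent name are placeholders; the property name follows the Nutch 0.9/1.0 distribution and may differ in other versions):

conf/crawl-urlfilter.txt
  # accept URLs under the target site
  +^http://([a-z0-9]*\.)*example.com/
  # reject everything else
  -.

conf/nutch-site.xml
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>my-nutch-spider</value>
    </property>
  </configuration>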

Below is the command I used to start the crawl:
> bin/nutch crawl urls -dir crawl10 -depth 10 -threads 10 >& nohup.out 
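
Here -dir names the output directory, -depth sets the number of generate/fetch/update rounds, -threads sets the number of fetch threads, and >& nohup.out redirects both stdout and stderr into a log file. The urls argument is a directory holding a plain-text seed list with one URL per line; a minimal (hypothetical) seed file could be:

urls/seed.txt (the file name is arbitrary, every file in the directory is read):
  http://www.example.com/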

crawl started in: crawl10                               //name of the crawl output directory
rootUrlDir = urls //file or directory holding the seed URL list
threads = 10 //10 fetch threads
depth = 10 //crawl depth of 10 levels
Injector: starting //inject the seed URL list
Injector: crawlDb: crawl10/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. //turn the injected URLs into crawl db entries (the database of URLs to fetch)
Injector: Merging injected urls into crawl db. //merge the injected URLs into the crawl db
Injector: done 
Generator: Selecting best-scoring urls due for fetch. //score the pages to decide fetch order
Generator: starting 
Generator: segment: crawl10/segments/20080904102201 //create the segment that will hold this round's fetch results
Generator: filtering: false 
Generator: topN: 2147483647 //topN was not specified, so Nutch uses the default (Integer.MAX_VALUE)
Generator: Partitioning selected urls by host, for politeness.  //partition the fetch list by host so that each host is fetched by a single task; the fetch tasks run on the slave nodes defined in Hadoop's slaves file
Generator: done. 
Fetcher: starting 
Fetcher: segment: crawl10/segments/20080904102201 //fetch the pages listed in this segment
Fetcher: done 
CrawlDb update: starting //after fetching, update the crawl db and add the newly discovered URLs
CrawlDb update: db: crawl10/crawldb 
CrawlDb update: segments: [crawl10/segments/20080904102201] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: Merging segment data into db. 
CrawlDb update: done 

//the generate/fetch/update cycle repeats
Generator: Selecting best-scoring urls due for fetch. 
Generator: starting 
Generator: segment: crawl10/segments/20080904102453 
Generator: filtering: false 
Generator: topN: 2147483647 
Generator: Partitioning selected urls by host, for politeness. 
Generator: done. 
Fetcher: starting 
Fetcher: segment: crawl10/segments/20080904102453 
Fetcher: done 
CrawlDb update: starting 
CrawlDb update: db: crawl10/crawldb 
CrawlDb update: segments: [crawl10/segments/20080904102453] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: Merging segment data into db. 
CrawlDb update: done 

...... //The loop runs 10 times in total. Nutch's local (intranet) crawl mode uses a breadth-first strategy: all second-level pages are fetched before third-level pages are started.
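
For reference, each iteration of this loop corresponds roughly to running the generate, fetch and updatedb tools by hand (a sketch based on the Nutch 0.9/1.0 command line; the segment path is simply whichever segment generate has just created):

> bin/nutch generate crawl10/crawldb crawl10/segments
> s=`ls -d crawl10/segments/* | tail -1`
> bin/nutch fetch $s -threads 10
> bin/nutch updatedb crawl10/crawldb $s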

LinkDb: starting //analyze the link relationships between pages
LinkDb: linkdb: crawl10/linkdb
LinkDb: URL normalize: true //URL normalization
LinkDb: URL filter: true //filter URLs according to crawl-urlfilter.txt
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102201
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102453 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904102841 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904104322 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904113511 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904132510 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904153615 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904175052 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904194724 
LinkDb: adding segment: /user/nutch/crawl10/segments/20080904211956 
LinkDb: done //link analysis finished
Indexer: starting //start building the index
Indexer: linkdb: crawl10/linkdb 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102201 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102453 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904102841 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904104322 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904113511 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904132510 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904153615 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904175052 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904194724 
Indexer: adding segment: /user/nutch/crawl10/segments/20080904211956 
Indexer: done //indexing finished
Dedup: starting //remove duplicate pages from the index
Dedup: adding indexes in: crawl10/indexes
Dedup: done 
merging indexes to: crawl10/index //merge the partial indexes into a single index
Adding /user/nutch/crawl10/indexes/part-00000 
Adding /user/nutch/crawl10/indexes/part-00001 
Adding /user/nutch/crawl10/indexes/part-00002 
Adding /user/nutch/crawl10/indexes/part-00003 
done merging //index merge finished
crawl finished: crawl10 //the whole run: seed injection, fetch loop, link analysis, indexing, dedup and index merge
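
The post-processing steps after the fetch loop (link inversion, indexing, dedup, merge) can also be run as stand-alone commands; roughly, under the Nutch 0.9/1.0 tool names (the exact arguments of merge may vary between versions):

> bin/nutch invertlinks crawl10/linkdb -dir crawl10/segments
> bin/nutch index crawl10/indexes crawl10/crawldb crawl10/linkdb crawl10/segments/*
> bin/nutch dedup crawl10/indexes
> bin/nutch merge crawl10/index crawl10/indexes

Running the steps separately like this is handy when you only want to re-index or re-merge without fetching everything again.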


The crawler automatically created the crawl10 directory under the HDFS user home directory (/user/nutch); inside it you can see the crawldb, segments, index, indexes and linkdb directories (commands for inspecting them are sketched after this list).
1) crawldb holds the URLs known to the crawler along with their fetch dates, which are used to decide when a page is due for an update check.
2) linkdb holds the link relationships between URLs; it is built during the analysis step after fetching, and this link graph can be used to implement PageRank-style scoring as Google does.
3) segments stores the fetched pages. The number of subdirectories matches the number of crawl rounds; since I specified -depth 10, there are 10 segment subdirectories here.
  Each segment contains 6 subdirectories:
  content: the raw content of the downloaded pages
  crawl_fetch: the fetch status of each URL
  crawl_generate: the set of URLs scheduled for fetching, produced by the generate step
  crawl_parse: the outlink data used to update the crawldb
  parse_data: the outlinks and metadata parsed from each URL
  parse_text: the parsed text of each URL
4) index holds the Lucene-format index, the result of merging everything under indexes. The file names here differ from those produced by the plain Lucene demo, which needs further study.
5) indexes holds the partial indexes produced by the indexing step, part-00000 through part-00003, which are deduplicated and then merged into index.
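
To look inside these directories, Nutch ships read-only dump tools (a rough sketch using the Nutch 0.9/1.0 command names; the dump_* output paths are placeholders):

> bin/nutch readdb crawl10/crawldb -stats
> bin/nutch readdb crawl10/crawldb -dump dump_crawldb
> bin/nutch readlinkdb crawl10/linkdb -dump dump_linkdb
> bin/nutch readseg -dump crawl10/segments/20080904102201 dump_segment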

Now that the crawl workflow and the working directories are clear, you can go on to study the Nutch source code and work out the exact purpose and format of the files under each directory. To be continued!
