Original English source:
DissectingTheNutchCrawler
Please credit the source when reposting: http://blog.csdn.net/pwlazy
Introduction
The open-source Nutch search engine consists, very roughly, of three components:
- the crawler, which discovers and retrieves web pages
- the WebDB, a custom database that stores known URLs and fetched page contents
- the indexer, which dissects pages and builds keyword-based indexes from them
This document attempts to describe the operation of the crawler. We begin with theory and then drill down into the details needed to create a customized crawler.
Nutch is implemented in Java, so basic knowledge of the language is assumed.
Note: my English is limited, so please point out any mistranslations. Thank you.