After studying larbin yesterday, I thought it over on the way home last night: larbin crawls from a single machine, so its speed is limited. Companies like Baidu and Google certainly do not crawl with a single crawler; otherwise they could never process that volume of data. If larbin could be turned into a distributed web crawler, with different machines crawling different sites and processing different information at the same time, crawling would be much faster and throughput would go up considerably.
There is not much material online about distributed crawlers. From my own thinking, designing a distributed web crawler raises at least the following problems:
1. Data transfer between machines.
To avoid crawling the same pages twice, the machines in the distributed system have to exchange data with each other, or a central master host has to schedule them.
Also, if the crawled results feed a single index file, then the results have to be shipped periodically to the machine that holds the index. That is a lot of traffic, and the machine managing the index will be heavily loaded. Whether a distributed indexing technique exists is something I still need to look up in the literature.
2. Task allocation between machines.
The first requirement for task allocation is that no task is assigned twice. Beyond that, the allocation strategy should be chosen case by case to get the best result.
For example, machines in different cities could crawl the pages of their own city: a host in Beijing would mainly crawl Beijing sites, which improves transfer speed (see the sketch after this list).
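To make the allocation idea in point 2 concrete, here is a minimal sketch (my own, not taken from larbin or any paper) of deciding which crawler node owns a URL by hashing its host name. The node names and the helper function are invented for the illustration; a city-based scheme would simply replace the hash with a host-to-region lookup.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical crawler machines; in a real system this list would come
# from configuration or a membership service.
NODES = ["crawler-beijing", "crawler-shanghai", "crawler-guangzhou"]

def node_for_url(url: str, nodes=NODES) -> str:
    """Map a URL to exactly one crawler node by hashing its host name.

    Hashing the host (rather than the full URL) keeps all pages of a
    site on the same machine and guarantees no URL is assigned twice.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

if __name__ == "__main__":
    for url in ["http://www.pku.edu.cn/index.html",
                "http://www.pku.edu.cn/news.html",
                "http://www.stanford.edu/"]:
        print(url, "->", node_for_url(url))
```

With a scheme like this, every machine can decide locally, from the URL alone, whether a URL is its responsibility, so no extra coordination is needed just to avoid duplicate assignment.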
I searched around and found the paper the Google founders published while studying at Stanford, "The Anatomy of a Large-Scale Hypertextual Web Search Engine". It contains this passage:
4.1 Google Architecture Overview
In this section, we will give a high level overview of how the whole system works as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux.
In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
I will only comment on the part about distributed crawling. The passage passes over it quickly: Google's page fetching is done by several distributed crawlers, a URLserver hands out lists of URLs for the crawlers to fetch, and the fetched pages are sent to a storeserver, which compresses the pages and stores them in a repository.
So Google's distributed crawling is centrally controlled: designated servers hand out the work. But Google is an enormously capable company; they have top-tier hardware and even optimize from the hardware layer up, with a lot of hardware specially modified or designed for the system. They can make things work that ordinary hardware cannot handle, and we do not have that luxury.
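To picture the centralized pipeline from the excerpt, here is a rough sketch of the control flow under my own simplifying assumptions: a URLserver that hands out batches from a frontier and collects newly parsed links, and a storeserver that compresses pages into a repository. The class and method names are mine; the paper does not spell out these interfaces.

```python
import zlib
from collections import deque

class URLServer:
    """Central component that hands out URLs to the crawler processes."""

    def __init__(self, seed_urls):
        self.frontier = deque(seed_urls)   # URLs waiting to be fetched
        self.seen = set(seed_urls)         # never hand out a URL twice

    def next_batch(self, n=10):
        """Give a crawler up to n URLs to fetch."""
        batch = []
        while self.frontier and len(batch) < n:
            batch.append(self.frontier.popleft())
        return batch

    def add_links(self, urls):
        """Links parsed out of fetched pages come back to the server."""
        for u in urls:
            if u not in self.seen:
                self.seen.add(u)
                self.frontier.append(u)

class StoreServer:
    """Compresses fetched pages and keeps them in a repository."""

    def __init__(self):
        self.repository = {}               # docID -> (url, compressed page)
        self._next_docid = 0

    def store(self, url, html):
        docid = self._next_docid
        self._next_docid += 1
        self.repository[docid] = (url, zlib.compress(html.encode("utf-8")))
        return docid

# One crawler iteration (the actual download is faked for the sketch):
url_server = URLServer(["http://example.com/"])
store_server = StoreServer()
for url in url_server.next_batch():
    html = "<html>...</html>"              # a real crawler would fetch here
    store_server.store(url, html)
    url_server.add_links(["http://example.com/about"])
```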
I keep wondering: instead of centralized control, could each node in the distributed system manage itself? If the nodes were autonomous, the system should be more robust: even if a server, or one of the nodes, crashed, the rest of the system would keep working. But that drags in a lot of hard problems, including the design of the protocol between the nodes, mutual exclusion, data consistency, and so on.
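As a thought experiment for that self-managing variant, here is a sketch of consistent hashing over the crawler nodes (again my own assumption, not something from the paper): every node keeps a copy of the membership list and can compute locally which peer owns a URL, and when a peer crashes its slice of the hash ring falls to the next node, so the remaining machines keep working without a central coordinator. The protocol, mutual-exclusion, and consistency problems mentioned above are deliberately left out.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Consistent-hashing ring shared by all crawler nodes.

    Each node holds its own copy of the ring, so ownership of a URL is
    decided locally without asking a central server.
    """

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def owner(self, url: str) -> str:
        """Return the node responsible for crawling this URL."""
        keys = [k for k, _ in self._ring]
        i = bisect.bisect(keys, _hash(url)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, node: str):
        """A crashed node's slice falls to its successor automatically."""
        self._ring = [(k, n) for k, n in self._ring if n != node]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("http://www.pku.edu.cn/"))
ring.remove("node-b")                       # the other nodes keep working
print(ring.owner("http://www.pku.edu.cn/"))
```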