The role of a crawler and where it sits within a search engine:
http://www.googleguide.com/google_works.html
Introduction to Googlebot, Google's crawler:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1061943
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182072
Introduction to RankDex, the early search engine built by Robin Li (李彦宏):
http://www.rankdex.com/about.html
Open-source crawlers:
Apache Nutch: http://nutch.apache.org/
Heritrix:
http://crawler.archive.org/
http://www.ibm.com/developerworks/cn/opensource/os-cn-heritrix/
WebSPHINX: http://www.cs.cmu.edu/~rcm/websphinx/
dyse: http://www.ibm.com/developerworks/cn/java/j-lo-dyse1/index.html?ca=drs-
crawler4j: http://code.google.com/p/crawler4j/
Implementing your own crawler:
http://www.developer.com/java/other/article.php/1573761/Programming-a-Spider-in-Java.htm
http://writingbots.javaprogramming4u.info/how-to-write-your-own-little-googlebot/
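
For orientation before reading the articles above, here is a minimal sketch of the core loop most simple crawlers share: fetch a page, extract its links, and queue the ones not yet seen. It is not taken from any of the linked projects. The names `LinkExtractor`, `extract_links`, `fetch_http`, and `crawl` are hypothetical, chosen for illustration, and the sketch assumes breadth-first order, a fixed page limit, and UTF-8 pages. It leaves out robots.txt handling, rate limiting, and content-type checks, all of which a real crawler would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs found in html, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch_http(url):
    """Download a page over HTTP; assumes UTF-8 text for simplicity."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=fetch_http, max_pages=10):
    """Breadth-first crawl from seed; returns URLs in the order fetched."""
    seen = {seed}            # URLs already queued, to avoid revisiting
    queue = deque([seed])    # frontier of URLs waiting to be fetched
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue         # skip pages that fail to download
        visited.append(url)
        for link in extract_links(url, html):
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the downloader in as the `fetch` parameter is a design choice made for this sketch: it lets the loop be tried against an in-memory dictionary of pages before pointing it at a real site with `crawl("http://nutch.apache.org/", max_pages=5)`.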