An Overview of Open-Source Web Spiders

The spider is an essential module of a search engine; the quality of the spider's crawl results directly affects the engine's evaluation metrics.

The first spider program (the World Wide Web Wanderer, 1993) was written by Matthew K. Gray of MIT; its purpose was to count the number of hosts on the Internet.

Definitions of Spider (the term has a narrow sense and a broad sense):

  • Narrow sense: a software program that traverses the Web's information space over the standard HTTP protocol, following hyperlinks and retrieving web documents.
  • Broad sense: any software that can retrieve web documents over HTTP is called a spider.
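The narrow-sense definition above can be sketched as a minimal breadth-first crawler. The `fetch` function is injected so the example runs without network access; all names and URLs here are illustrative, not from any of the projects below.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first traversal of the hyperlink graph.

    fetch(url) -> HTML string; injecting it keeps the sketch
    testable without touching the network.
    """
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order

# Tiny in-memory "web" standing in for real HTTP fetches.
pages = {
    "http://example.test/a": '<a href="/b">b</a><a href="/c">c</a>',
    "http://example.test/b": '<a href="/a">a</a>',
    "http://example.test/c": "",
}
print(crawl("http://example.test/a", lambda u: pages.get(u, "")))
# → ['http://example.test/a', 'http://example.test/b', 'http://example.test/c']
```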

The Robots Exclusion Protocol (see "Protocol Gives Sites Way To Keep Out The 'Bots", Jeremy Carl, Web Week, Volume 1, Issue 7, November 1995) is closely tied to spiders; see robotstxt.org for details.
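Python's standard library implements this protocol in `urllib.robotparser`. A minimal sketch (the robots.txt rules here are made up for illustration; a real crawler would use `set_url()` and `read()` against the live site):

```python
from urllib.robotparser import RobotFileParser

# A polite spider consults robots.txt before fetching a URL.
# Here the file's contents are supplied directly via parse().
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MySpider", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("MySpider", "http://example.com/public/page.html"))   # True
```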

Heritrix

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

Language: Java (download link)

WebLech URL Spider 

WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and comes with a GUI console.

Language: Java (download link)

JSpider

A Java implementation of a flexible and extensible web spider engine. Optional modules allow functionality to be added (finding dead links, testing the performance and scalability of a site, creating a sitemap, etc.).

Language: Java (download link)

WebSPHINX

WebSPHINX is a web crawler (robot, spider) Java class library, originally developed by Robert Miller of Carnegie Mellon University. It is multithreaded and offers tolerant HTML parsing, URL filtering and page classification, pattern matching, mirroring, and more.

Language: Java (download link)

PySolitaire

PySolitaire is a fork of PySol Solitaire that runs correctly on Windows and has a nice clean installer. PySolitaire (Python Solitaire) is a collection of more than 300 solitaire and Mahjongg games like Klondike and Spider.

Language: Python (download link)

The Spider Web Network Xoops Mod Team     

The Spider Web Network Xoops Module Team provides modules for the Xoops community, written in PHP. We develop mods and/or take existing PHP scripts and port them to the Xoops format. High-quality mods are our goal.

Language: PHP (download link)

Fetchgals

A multi-threaded web spider that finds free porn thumbnail galleries by visiting a list of known TGPs (Thumbnail Gallery Posts). It optionally downloads the located pictures and movies. A TGP list is included. Public-domain Perl script running on Linux.

Language: Perl (download link)

Where Spider

The purpose of the Where Spider software is to provide a database system for storing URL addresses. The software is used for both ripping links and browsing them offline. The software uses a pure XML database which is easy to export and import.

Language: XML (download link)

Sperowider

Sperowider Website Archiving Suite is a set of Java applications, the primary purpose of which is to spider dynamic websites, and to create static distributable archives with a full text search index usable by an associated Java applet.

Language: Java (download link)

SpiderPy

SpiderPy is a web crawling spider program written in Python that allows users to collect files and search web sites through a configurable interface.

Language: Python (download link)

Spidered Data Retrieval

Spider is a complete standalone Java application designed to easily integrate varied datasources.

  • XML-driven framework
  • Scheduled pulling
  • Highly extensible
  • Provides hooks for custom post-processing and configuration

Language: Java (download link)

webloupe

WebLoupe is a java-based tool for analysis, interactive visualization (sitemap), and exploration of the information architecture and specific properties of local or publicly accessible websites. Based on web spider (or web crawler) technology.

Language: Java (download link)

ASpider

A robust, featureful, multi-threaded CLI web spider written in Java, using Apache Commons HttpClient v3.0. ASpider downloads any files matching your given MIME types from a website. By default it tries to match e-mail addresses with regular expressions, logging all results with log4j.
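ASpider's e-mail matching pass could look roughly like the following sketch; the pattern shown is an illustrative approximation, not ASpider's actual expression.

```python
import re

# Rough e-mail pattern: local part, "@", domain, dot, TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return unique e-mail addresses found in a page, in first-seen order."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

print(extract_emails("contact alice@example.com or bob@test.org, alice@example.com again"))
# → ['alice@example.com', 'bob@test.org']
```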

Language: Java (download link)

larbin

Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).

Language: C++ (download link)

Years ago I roamed the network and accumulated a lot of source code… Freedom and sharing are in the Internet's genes, in its very bones. I am deeply grateful for the era without walls and miss it endlessly; think back to the BBS days, all gone now. Today's Internet, by contrast, filters, screens and deletes frantically, and walls have multiplied… I hardly know what to say. Many resources have been removed from today's notorious search engines for commercial reasons, presumably out of fear of up-and-coming rivals… The pity is that the Internet must be free, and moreover it is a land of innovation; the sturdy shoulders of the pioneers were offered selflessly for later generations to climb, never to be used to strangle them… The recent commercial squabbling online made me angry, so I started paying attention to web spiders again… Sadly, it is now hard to find free, valuable spider code on the net. Without a demo, where is any programmer (abroad included) supposed to start?

A laughable people, a laughable world: a free network has been bridled with invisible reins. The free and valuable resources online are being "faded out" by the search engines, Google included (one look at Google Maps years ago led me to conclude that the Internet destroys everything!). Isn't it so? Look at the walls all over the world. Seen through taiji, things reverse at their extremes: the extreme of freedom is hell, a cage… Everything carries its opposite; delicious food comes with its own "poison", yet people choose to tolerate and ignore that. Does anything exist that exacts no price?! So I dug out my old bag and put these programs back where they belong, so that more people may be inspired and set out on their own journey of innovation. I look forward to your brilliance, and I thank the once-free (and undervalued) network.

-------------------------------

This is the complete project source code, untouched and original; no further commentary is needed. To work on programs you must be able to read English, so dig in yourself.

Keep one thing in mind: everything that flows across the Internet is numbers; the non-numeric parts exist only to make it usable by more people. The "reverse lookup" schemes circulating online are merely clumsy methods. A spider actually needs none of that DNS-and-registry apparatus; it only needs to gnaw tirelessly through the IP ranges of different countries and regions, for not every IP is registered or recorded. First grab the "irregular" raw material; only afterwards comes the reverse-style data tidying. A spider's weaving and mending take time to accumulate, and it is this raw material that people are truly interested in; the "tidied and processed" version is merely a compromise under someone's rules, or a commercial necessity… So this spider only needs you to feed it (with a small modification) an IP list, and it will crawl everywhere without tiring. As for what it brings back: how you organize the data(base) is the real key to a search engine! Once the material is fetched, how you work it is up to you. One thing is certain: what the Internet needs is search engines of every odd shape and variety! The current number is far, far from enough; that will be its charm, and it calls for everyone's wisdom and volunteer labor. Before everything is destroyed, let us do our best to build harmony, heh.

=====================================

I almost forgot: a word on uses. Say you see pictures you want on some site, but they require registration to view, or there are other resources… What to do? Use the spider: enter the address precisely and it will crawl them out for you automatically. You can set the file types to grab, images for example. Dating sites have plenty of photos you just cannot view; use the crawler. Of course, a crawler can only crawl HTTP resources; what sits inside databases requires other methods, heh.
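The feed-it-an-IP-list, filter-by-file-type idea can be partially sketched. This hypothetical helper shows only the file-type filter; the IP addresses are documentation examples, and a real spider would pair this with actual HTTP fetches against each host.

```python
import posixpath
from urllib.parse import urlparse

def wanted(url, extensions=(".jpg", ".jpeg", ".png", ".gif")):
    """Keep only URLs whose path ends in one of the wanted extensions."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower() in extensions

# Candidate URLs discovered on a raw-IP host (203.0.113.0/24 is a
# documentation range; nothing real is contacted here).
urls = [
    "http://203.0.113.7/gallery/photo1.JPG",
    "http://203.0.113.7/index.html",
    "http://203.0.113.7/clip.avi",
]
print([u for u in urls if wanted(u)])
# → ['http://203.0.113.7/gallery/photo1.JPG']
```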