开源爬虫项目

最新推荐文章于 2024-08-07 16:17:44 发布

福海鑫森

最新推荐文章于 2024-08-07 16:17:44 发布

阅读量1.6k

点赞数 3

分类专栏： docker 文章标签：安全运维爬虫

原文链接：https://zhuanlan.zhihu.com/p/24844250

版权

docker 专栏收录该内容

18 篇文章 0 订阅

订阅专栏

1、WebMagic
码云地址：webmagic: webmagic 是一个无须配置、便于二次开发的爬虫框架，它提供简单灵活的API，只需少量代码即可实现一个爬虫。

官方网站WebMagic

webmagic是一个开源的Java垂直爬虫框架，目标是简化爬虫的开发流程，让开发者专注于逻辑功能的开发。webmagic的核心非常简单，但是覆盖爬虫的整个流程，也是很好的学习爬虫开发的材料。

webmagic的主要特色：

完全模块化的设计，强大的可扩展性。
核心简单但是涵盖爬虫的全部流程，灵活而强大，也是学习爬虫入门的好材料。
提供丰富的抽取页面API。
无配置，但是可通过POJO+注解形式实现一个爬虫。
支持多线程。
支持分布式。
支持爬取js动态渲染的页面。
无框架依赖，可以灵活的嵌入到项目中去。

webmagic的架构和设计参考了以下两个项目，感谢以下两个项目的作者：

python爬虫 scrapy https://github.com/scrapy/scrapy

Java爬虫 Spiderman Spiderman: 强力 Java 爬虫，列表分页、详细页分页、ajax、微内核高扩展、配置灵活

webmagic的github地址：https://github.com/code4craft/webmagic。

2、Gecco

项目地址：Gecco: Gecco 是一款用java语言开发的轻量化的易用的网络爬虫，整合了jsoup、httpclient、fastjson、spring、htmlunit、redission等优秀框架。

Gecco 整合了 jsoup、httpclient、fastjson、spring、htmlunit、redission 等优秀框架，让您只需要配置一些 jquery 风格的选择器就能很快的写出一个爬虫。

3、WebCollector

项目地址：https://github.com/CrawlScript/WebCollector

WebCollector is an open source web crawler framework based on Java.It provides some simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.

In addition to a general crawler framework, WebCollector also integrates CEPF, a well-designed state-of-the-art web content extraction algorithm proposed by Wu, et al.:

Wu GQ, Hu J, Li L, Xu ZH, Liu PC, Hu XG, Wu XD. Online Web news extraction via tag path feature fusion. Ruan Jian Xue Bao/Journal of Software, 2016,27(3):714-735 (in Chinese). 基于标签路径特征融合的在线Web新闻内容抽取

WebCollector是一个无须配置、便于二次开发的JAVA爬虫框架（内核），它提供精简的的API，只需少量代码即可实现一个功能强大的爬虫。

4、Spiderman
码云地址：Spiderman2: 二代蜘蛛侠，此版本完全重新开发，比上一代更加强大（性能，易用，架构，分布式，简洁，成熟）
使用案例：【最新更新支持频道分页、文章分页】【抛砖引玉】抓取OSC的问答数据展现垂直爬虫的能力 - 像风一样自由 - OSCHINA - 中文开源技术交流社区

Spiderman 是一个基于微内核+插件式架构的网络蜘蛛，它的目标是通过简单的方法就能将复杂的目标网页信息抓取并解析为自己所需要的业务数据。

5、Heritrix
github地址：https://github.com/internetarchive/heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

Heritrix是一个开源，可扩展的web爬虫项目。用户可以使用它来从网上抓取想要的资源。Heritrix设计成严格按照robots.txt文件的排除指示和META robots标签。其最出色之处在于它良好的可扩展性,方便用户实现自己的抓取逻辑。

6、crawler4j
github地址：https://github.com/yasserg/crawler4j

Crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can setup a multi-threaded web crawler in few minutes.

7、Nutch

github地址：https://github.com/apache/nutch

Nutch 是一个开源Java 实现的搜索引擎。它提供了我们运行自己的搜索引擎所需的全部工具。包括全文搜索和Web爬虫。

在Nutch的进化过程中，产生了Hadoop、Tika、Gora和Crawler Commons四个Java开源项目。如今这四个项目都发展迅速，极其火爆，尤其是Hadoop，其已成为大规模数据处理的事实上的标准。Tika使用多种现有的开源内容解析项目来实现从多种格式的文件中提取元数据和结构化文本，Gora支持把大数据持久化到多种存储实现，Crawler Commons是一个通用的网络爬虫组件。

8、SeimiCrawler

github地址：https://github.com/zhegexiaohuozi/SeimiCrawler

SeimiCrawler是一个敏捷的，独立部署的，支持分布式的Java爬虫框架，希望能在最大程度上降低新手开发一个可用性高且性能不差的爬虫系统的门槛，以及提升开发爬虫系统的开发效率。在SeimiCrawler的世界里，绝大多数人只需关心去写抓取的业务逻辑就够了，其余的Seimi帮你搞定。设计思想上SeimiCrawler受Python的爬虫框架Scrapy启发，同时融合了Java语言本身特点与Spring的特性，并希望在国内更方便且普遍的使用更有效率的XPath解析HTML，所以SeimiCrawler默认的HTML解析器是JsoupXpath(独立扩展项目，非jsoup自带),默认解析提取HTML数据工作均使用XPath来完成（当然，数据处理亦可以自行选择其他解析器）。并结合SeimiAgent彻底完美解决复杂动态页面渲染抓取问题。

9、Jsoup

github地址：https://github.com/jhy/jsoup/

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe-list, to prevent XSS attacks
output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.

See jsoup.org for downloads and the full API documentation.

中文指南：jsoup开发指南,jsoup中文文档

jsoup 是一款Java 的HTML解析器，可直接解析URL地址、HTML文本内容。它提供了一套非常省力的API，可通过DOM，CSS以及类似于jQuery的操作方法来取出和操作数据

10、雅虎开源的web爬虫工具anthelion

项目地址：https://github.com/YahooArchive/anthelion

Anthelion is a Nutch plugin for focused crawling of semantic data. The project is an open-source project released under the Apache License 2.0.

Note: This project contains the complete Nutch 1.6 distribution. The plugin itself can be found in /src/plugin/parse-anth

11、spider-flow

项目地址：spider-flow: 新一代爬虫平台，以图形化方式定义爬虫流程，不写代码即可完成爬虫。