Diffbot: Crawling with Visual Machine Learning

culi3118

Original article: https://www.sitepoint.com/diffbot-crawling-visual-machine-learning/

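Diffbot is a crawling tool that uses visual machine learning: it renders a web page in full and extracts data from it, and it can even handle JavaScript content. Rather than relying on complex regular expressions, Diffbot uses visual analysis to produce more accurate URL previews, extracting images, authors, tags, and similar information. The article explains how to work with the Diffbot API, including creating a project, adding rules, and fetching data, and shows how to use Diffbot from PHP and JavaScript. Although the PHP library may be out of date, developers can build their own library on top of raw API calls. Diffbot offers a free trial and commercial plans, and suits use cases such as news aggregation and URL previews.

Summary generated automatically by CSDN.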

Have you ever wondered how social networks do URL previews so well when you share links? How do they know which images to grab, whom to cite as an author, or which tags to attach to the preview? Is it all crawling with complex regexes over source code? Actually, more often than not, it isn't. Meta information defined in the source can be unreliable, and sites with a less-than-stellar reputation often use it as a keyword carrier, attempting to get search engines to rank them higher. Isn't what we humans see in front of us what matters anyway?
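
To make the "meta information defined in the source" concrete, here is a minimal sketch of the source-based approach that the paragraph above contrasts with visual extraction. It uses only Python's standard `html.parser`; the `MetaTagCollector` class, the `extract_meta` helper, and the sample HTML are hypothetical and written for illustration, not taken from the article or from Diffbot.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> name/property -> content pairs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph tags use "property"; classic tags use "name".
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            self.meta[key] = content


def extract_meta(html):
    """Return a dict of the meta tags declared in the page source."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


# Hypothetical page: the declared metadata says little about what a
# visitor actually sees in the rendered body.
SAMPLE_HTML = """
<html><head>
  <meta property="og:title" content="Example post">
  <meta name="author" content="Jane Doe">
  <meta name="keywords" content="cheap, best, free, viral">
</head><body><h1>Something else entirely</h1></body></html>
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_HTML))
```

A crawler built this way can only trust whatever the page declares about itself, which is exactly the weakness described above when the declared keywords or title do not match the visible content.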

If you want to build a URL preview snippet or a news aggregator, there are many automatic crawlers available online, both proprietary and open source, but you seldom find something as niche as visual machine learning. This is exactly what Diffbot is: a "visual learning robot" which renders a URL you request in full and then visually extracts data, supplementing that with some metadata from the page source as needed.

After covering some theory, in this post we'll make a demo API call against one of SitePoint's posts.

PHP Library

The PHP library for Diffbot is somewhat out of date, and as such we won’t be using it in this demo. We’ll be performing raw API calls, and in some future posts we’ll build our own library for API interaction.
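
As a rough idea of what a "raw API call" involves, the sketch below builds a request URL for Diffbot's Article API in Python and, optionally, fetches it. The endpoint path (`https://api.diffbot.com/v3/article`) and the `token` and `url` query parameters are assumptions based on Diffbot's public API and may differ from the version used in this post; `YOUR_TOKEN`, `build_article_request`, and `fetch_article` are hypothetical names for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; check Diffbot's documentation for the current version path.
DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(token, page_url):
    """Build the GET URL for an article-extraction request."""
    params = {"token": token, "url": page_url}
    # urlencode percent-escapes the target URL so it survives as a query value.
    return DIFFBOT_ARTICLE_ENDPOINT + "?" + urlencode(params)


def fetch_article(token, page_url, timeout=30):
    """Perform the request and decode the JSON response (needs a valid token)."""
    with urlopen(build_article_request(token, page_url), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # "YOUR_TOKEN" is a placeholder; only the URL is printed, no request is sent.
    print(build_article_request("YOUR_TOKEN", "https://www.sitepoint.com/"))
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before any token is spent on it.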

If you’d like to take a look at the PHP library nonetheless, see here, and if you’re interested in libraries for other languages, Diffbot has a directory.

Update, July 2015: A PHP library has been developed since this article was published. See its entire development process here, or the source code here.

JavaScript Content

We said in the introductory section that Diffbot renders the requested page in full and then analyzes it. But what about JavaScript content? Nowadays, websites often render some HTML above the fold, and then finish loading the CSS, JS, and dynamic content afterwards. Can the Diffbot API see that?

As a matter of fact, yes. Diffbot literally renders the page in full, and then inspects it visually, as explained in my StackOverflow Q&A here. There are some caveats, though, so make sure you read the answer carefully.
