Diffbot: Crawling with Visual Machine Learning

Have you ever wondered how social networks do URL previews so well when you share links? How do they know which images to grab, whom to cite as an author, or which tags to attach to the preview? Is it all crawling with complex regexes over source code? Actually, more often than not, it isn’t. Meta information defined in the source can be unreliable, and sites with a less-than-stellar reputation often use it as a keyword carrier, attempting to get search engines to rank them higher. Isn’t what we humans see in front of us what matters anyway?

If you want to build a URL preview snippet or a news aggregator, there are many automatic crawlers available online, both proprietary and open source, but you seldom find something as niche as visual machine learning. This is exactly what Diffbot is – a “visual learning robot” which renders a URL you request in full and then visually extracts data, helping itself with some metadata from the page source as needed.

After covering some theory, in this post we’ll make a demo API call against one of SitePoint’s posts.

PHP Library

The PHP library for Diffbot is somewhat out of date, and as such we won’t be using it in this demo. We’ll be performing raw API calls, and in some future posts we’ll build our own library for API interaction.
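
To make "raw API call" concrete, here is a minimal sketch in Python of what such a request could look like. The endpoint path, the `token` and `url` query parameters, and the helper names are assumptions made for illustration and are not taken from this article, so check Diffbot's current documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the path and version may differ from what Diffbot serves today.
DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request_url(token, page_url, endpoint=DIFFBOT_ARTICLE_ENDPOINT):
    """Build the GET URL for a raw request (hypothetical helper).

    The page to analyze travels as a query parameter, so it must be
    percent-encoded; urlencode takes care of that.
    """
    query = urlencode({"token": token, "url": page_url})
    return f"{endpoint}?{query}"


def fetch_article(token, page_url, timeout=30):
    """Send the request and decode the JSON body (hypothetical helper)."""
    request_url = build_article_request_url(token, page_url)
    with urlopen(request_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_article("YOUR_TOKEN", "https://www.sitepoint.com/")` would perform the network request; building the URL separately keeps the encoding step easy to inspect without touching the network.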

If you’d like to take a look at the PHP library nonetheless, see here, and if you’re interested in libraries for other languages, Diffbot has a directory.

Update, July 2015: A PHP library has been developed since this article was published. See its entire development process here, or the source code here.

JavaScript Content

We said in the introductory section that Diffbot renders the request in full and then analyzes it. But, what about JavaScript content? Nowadays, websites often render some HTML above the fold, and then finish the CSS, JS, and dynamic content loading afterwards. Can the Diffbot API see that?

As a matter of fact, yes. Diffbot literally renders the page in full, and then inspects it visually, as explained in my StackOverflow Q&A here. There are some caveats, though, so make sure you read the answer carefully.
