How to Create a Custom API from Any Website Using Puppeteer

You often come across a website where you have to perform a series of actions before you finally get to the data you need. You are then faced with a dilemma: how do you make this data available in a form that your application can easily consume?

Scraping comes to the rescue in such cases, and selecting the right tool for the job is important.

Puppeteer: Not Just Another Scraping Library

Puppeteer is a Node.js library maintained by the Chrome DevTools team at Google. It runs a Chromium or Chrome (perhaps the more recognizable name) instance, headless by default but configurable to run with a visible window, and exposes a set of high-level APIs for controlling it.
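
As a rough illustration of the headless-by-default behaviour described above, the sketch below launches a browser, opens a page, and reads its title. The helper name `getPageTitle` and the choice to pass the `puppeteer` module in as an argument are assumptions made for this example (they keep the snippet easy to stub out); they are not taken from the article's own code.

```javascript
// Minimal sketch: open a URL in a (headless by default) browser and return the page title.
// `puppeteer` is passed in as an argument; this is an illustrative choice, not a Puppeteer requirement.
async function getPageTitle(puppeteer, url, options = {}) {
  // headless: true runs without a visible window; pass { headless: false } to watch the browser.
  const browser = await puppeteer.launch({ headless: options.headless ?? true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}

// Usage (assumes `npm install puppeteer` has been run):
//   const puppeteer = require('puppeteer');
//   getPageTitle(puppeteer, 'https://example.com').then(console.log);
```

Closing the browser in a `finally` block is a deliberate choice here: a failed navigation would otherwise leave a Chromium process running in the background.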

According to its official documentation, Puppeteer is commonly used for tasks including, but not limited to, the following:

  • Generating screenshots and PDFs
  • Crawling an SPA and generating pre-rendered content (i.e. Server Side Rendering)
  • Testing Chrome extensions
  • Automation testing of Web Interfaces
  • Diagnosis of performance issues through techniques like capturing the timeline trace of a website
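
To make the first item in the list concrete, here is a minimal sketch that saves a page as both a PNG screenshot and a PDF. The helper name `capturePage`, the `outputBase` parameter, and passing `puppeteer` in as an argument are assumptions introduced for this example only.

```javascript
// Minimal sketch: save a full-page screenshot and an A4 PDF of a URL.
// `outputBase` is a hypothetical file path without extension, e.g. './shirts'.
async function capturePage(puppeteer, url, outputBase) {
  // Default launch options are used, so the browser runs headless.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled so late-loading content is captured.
    await page.goto(url, { waitUntil: 'networkidle2' });
    const files = { screenshot: `${outputBase}.png`, pdf: `${outputBase}.pdf` };
    await page.screenshot({ path: files.screenshot, fullPage: true });
    await page.pdf({ path: files.pdf, format: 'A4' });
    return files;
  } finally {
    await browser.close();
  }
}

// Usage (assumes `npm install puppeteer` has been run):
//   const puppeteer = require('puppeteer');
//   capturePage(puppeteer, 'https://example.com', './example').then(console.log);
```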

In our case, we need to access a website and map its data into a form that our application can easily consume.

Sounds simple? The implementation is not that complex, either. Let's start.

Stringing the Code Along

My fondness for Amazon products prompts me to use one of their product listing pages as a sample here. We will implement our use case in two steps:

  • Extract data from the page and map it in an easily consumable JSON form
  • Add a little sprinkle of automation to make our lives a little bit easier
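
The first of these steps can be sketched roughly as follows. This is not the code from the repository: the CSS selectors in `SELECTORS` are assumptions about the listing page's markup and will likely need adjusting after inspecting the live page, and the helper names `parsePrice`, `toProduct`, and `extractListings` are hypothetical.

```javascript
// Assumed selectors; inspect the target page and replace them as needed.
const SELECTORS = {
  item: '[data-component-type="s-search-result"]',
  title: 'h2',
  price: '.a-price .a-offscreen',
  link: 'a',
};

// Turn a price string such as "₹1,299" into a number; returns null if no number is found.
function parsePrice(text) {
  const match = String(text ?? '').match(/\d[\d,]*(?:\.\d+)?/);
  return match ? Number(match[0].replace(/,/g, '')) : null;
}

// Map one raw scraped record to a clean, JSON-friendly object.
function toProduct(raw) {
  return {
    title: String(raw.title ?? '').trim(),
    price: parsePrice(raw.price),
    url: raw.url ?? null,
  };
}

// Collect raw text from every listing on an already-loaded Puppeteer page,
// then normalize it in Node rather than inside the browser context.
async function extractListings(page, selectors = SELECTORS) {
  const rawItems = await page.$$eval(
    selectors.item,
    (nodes, sel) =>
      nodes.map((node) => ({
        title: node.querySelector(sel.title)?.textContent ?? '',
        price: node.querySelector(sel.price)?.textContent ?? '',
        url: node.querySelector(sel.link)?.href ?? null,
      })),
    selectors
  );
  return rawItems.map(toProduct);
}
```

Keeping the normalization (`toProduct`) outside the browser callback is a design choice: the callback passed to `page.$$eval` runs inside the page, so it can only use values passed in as arguments, whereas plain Node functions are easier to test on their own.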

You can find the complete code in this repository.

We will extract the data from this link, https://www.amazon.in/s?k=Shirts&ref=nb_sb_noss_2 (a listing of the top searched shirts, as shown in the image), and turn it into a form that can be served by an API.

Before we start using Puppeteer extensively in this section, we need to understand the two primary classes it provides.
