It often happens that you come across a website and are forced to perform a set of actions to finally get some data. You are then faced with a dilemma: how do you make this data available in a form which can easily be consumed by your application?
Scraping comes to the rescue in such cases, and selecting the right tool for the job is important.
Puppeteer: Not Just Another Scraping Library
Puppeteer is a Node.js library maintained by the Chrome DevTools team at Google. It runs a Chromium or Chrome (perhaps the more recognizable name) instance, headless by default but configurable to show the full browser window, and exposes a set of high-level APIs for controlling it.
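As a rough illustration of the headless (or configurable) launch described above, here is a minimal sketch. It assumes `puppeteer` has been installed with npm; `buildLaunchOptions` and `readTitle` are hypothetical helper names introduced only for this example.

```javascript
// Sketch only: assumes `npm install puppeteer` has been run.
// `buildLaunchOptions` and `readTitle` are hypothetical helper names.

// Translate a simple flag into Puppeteer launch options.
// headless: true runs the browser without a visible window.
function buildLaunchOptions({ showBrowser = false } = {}) {
  return { headless: !showBrowser };
}

// Launch a browser, open a page, and return the page title.
async function readTitle(url, options = {}) {
  // Required lazily so buildLaunchOptions works even without a browser installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(buildLaunchOptions(options));
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    // Always close the browser so no Chromium process is left behind.
    await browser.close();
  }
}
```

Calling `readTitle('https://example.com', { showBrowser: true })` should open a visible browser window, which can be handy while debugging a scraper.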
According to its official documentation, Puppeteer is commonly used for tasks that include, but are not limited to, the following:
- Generating screenshots and PDFs
- Crawling an SPA and generating pre-rendered content (i.e. server-side rendering)
- Testing Chrome extensions
- Automated testing of web interfaces
- Diagnosing performance issues through techniques like capturing a timeline trace of a website
In our case, we need to access a website and map its data into a form that our application can easily consume.
Sounds simple? The implementation is not that complex, either. Let's start.
Stringing the Code Along
My fondness for Amazon products prompts me to use one of their product listing pages as a sample here. We will implement our use case in two steps:
- Extract data from the page and map it into an easily consumable JSON form
- Add a little sprinkle of automation to make our lives a little bit easier
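The first step above could be sketched roughly as follows. This is not the code from the article's repository: the CSS selectors, the output field names, and the helpers `toProduct` and `extractProducts` are all assumptions made for illustration, and the real page markup may differ or change over time.

```javascript
// Sketch only: the selectors and field names below are assumptions,
// not taken from the article's repository.
const SELECTORS = {
  card: 'div[data-component-type="s-search-result"]',
  title: 'h2',
  price: '.a-price-whole',
};

// Normalize one raw record (strings read from the DOM) into a JSON-friendly object.
function toProduct(raw) {
  // Keep only digits and dots so "₹1,299" becomes "1299".
  const digits = (raw.price || '').replace(/[^0-9.]/g, '');
  return {
    title: (raw.title || '').trim(),
    price: digits === '' ? null : Number(digits),
  };
}

// Collect raw text from every result card on an already-loaded Puppeteer page.
async function extractProducts(page) {
  const rawItems = await page.$$eval(
    SELECTORS.card,
    // This callback runs inside the browser, so it only uses DOM APIs.
    (cards, sel) =>
      cards.map((card) => {
        const text = (query) => {
          const el = card.querySelector(query);
          return el ? el.textContent : '';
        };
        return { title: text(sel.title), price: text(sel.price) };
      }),
    SELECTORS
  );
  return rawItems.map(toProduct);
}
```

Keeping the normalization in a plain function such as `toProduct` means the mapping logic can be checked without launching a browser, while `extractProducts` stays a thin wrapper around `page.$$eval`.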
You can find the complete code in this repository.
We will extract the data from this link: https://www.amazon.in/s?k=Shirts&ref=nb_sb_noss_2 (a listing of the top searched shirts, as shown in the image) and return it in an API-servable form.
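Should you want to scrape a listing for a different search term, the link can be assembled from its parts. The sketch below uses Node's built-in `URL` class; `buildSearchUrl` is a hypothetical helper, and the `ref` value is simply copied from the link above without any claim about what it means to Amazon.

```javascript
// Sketch only: `buildSearchUrl` is a hypothetical helper. The `ref` value is
// copied verbatim from the article's link.
function buildSearchUrl(keyword) {
  const url = new URL('https://www.amazon.in/s');
  // searchParams takes care of encoding the keyword.
  url.searchParams.set('k', keyword);
  url.searchParams.set('ref', 'nb_sb_noss_2');
  return url.toString();
}
```

With this helper, `buildSearchUrl('Shirts')` reproduces the link used in this article.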
Before we get started using puppeteer extensively in this section, we need to understand the two primary classes provided by it.