Web Scraping Using Node.js and Puppeteer

cuk0051

Original article: https://flaviocopes.com/web-scraping/

Web Scraping is the task of downloading a web page and extracting some kind of information from it.

I recently made a little project with an Arduino board with an LCD display attached. Using Johnny-Five, which lets us program the Arduino using Node.js, I wanted to fetch the temperature measured at the top of a mountain and show it on the Arduino board.

I used Puppeteer to do the task of scraping. Puppeteer is a great tool built by Google. It’s a Node library we can use to control a headless Chrome instance.

This means we are basically using Chrome, but programmatically.

There are many practical uses for Puppeteer, including automating tests, taking screenshots, creating server-side rendered versions of single-page apps, and more.

Start by installing it with npm:

npm install puppeteer

In a Node.js file, require it:

const puppeteer = require('puppeteer');

Then we can use the launch() method to create a browser instance:

(async () => {
  const browser = await puppeteer.launch()
})()

We use await, and so we must wrap this method call in an async function, which we immediately invoke.
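
One detail the snippet above leaves out is closing the browser once the work is done or has failed. The following is a minimal sketch, not the article's own code: `withBrowser` is a hypothetical helper (it is not part of Puppeteer), and it receives the launch function as a parameter so the cleanup logic can be read on its own.

```javascript
// Hypothetical helper, not part of Puppeteer: runs `task` with a browser
// and always closes the browser afterwards, even if `task` throws.
// `launch` is any function returning a promise of a browser-like object,
// for example () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    await browser.close()
  }
}

// Intended usage with Puppeteer (assumes `npm install puppeteer` was run):
//
//   const puppeteer = require('puppeteer')
//   withBrowser(() => puppeteer.launch(), async (browser) => {
//     const page = await browser.newPage()
//     // ...work with the page...
//   }).catch(console.error)
```

The try/finally block is the design choice here: without it, an exception thrown while scraping could leave a Chrome process running in the background.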

Next we can use the newPage() method on the browser object to get the page object:

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
})()

Next up we call the goto() method on the page object to load that page:

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')
})()
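
goto() also accepts an options object and resolves with the main response, which can help on pages that fill in their content after the initial load. The sketch below is an illustration under assumptions: `openPage` is a hypothetical helper name, and the 30-second timeout is an arbitrary choice. `waitUntil: 'networkidle2'` asks Puppeteer to consider the navigation finished once there have been no more than two network connections for at least 500 ms.

```javascript
// Hypothetical helper: opens `url` in a new tab of an already launched browser.
async function openPage(browser, url) {
  const page = await browser.newPage()
  const response = await page.goto(url, {
    waitUntil: 'networkidle2', // wait for network activity to quiet down
    timeout: 30000,            // milliseconds; an arbitrary choice for this sketch
  })
  // goto() can resolve with null in some cases, so check before using it.
  if (response && !response.ok()) {
    throw new Error(`Request for ${url} failed with status ${response.status()}`)
  }
  return page
}
```

It would be called as `const page = await openPage(browser, 'https://website.com')` inside the async function shown above.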

Finally, we can get the page content by calling the evaluate() method of page. This method takes a callback function, where we add the code needed to retrieve the elements of the page we need. The function is executed in the context of the page, so it has access to document and all the browser APIs. Whatever the callback returns (here, a new object) becomes the result of our evaluate() method call.

We can use the Selectors API and retrieve data from the page.

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')

  const result = await page.evaluate(() => {
    //...
  })
})()
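
Because the callback runs inside the page, it cannot see variables from the surrounding Node.js code. Values have to be passed as extra arguments to evaluate(), and the return value has to be serializable. A minimal sketch of that pattern, where `readText` and `scrapeText` are hypothetical names and `page` is a page object like the one created above:

```javascript
// Runs inside the page: `document` here is the page's document,
// not something defined by Node.js. The function must be self-contained.
function readText(selector) {
  const element = document.querySelector(selector)
  return element ? element.innerText.trim() : null
}

// Runs in Node.js: extra arguments given to evaluate() are serialized
// and handed to the page function as its parameters.
async function scrapeText(page, selector) {
  return page.evaluate(readText, selector)
}
```

Returning `null` for a missing element avoids the TypeError that reading `innerText` from a failed `querySelector()` call would otherwise throw inside the page.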

Let’s get to the particular problem I have. This is the page that hosts the data of a weather station located on top of a mountain, at 3315 m: http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/

I want to get that -9°C text. Using the browser inspector I can see that its element has a column-4 class attached. It’s not an ideal class name, as it’s not meaningful and might change if they decide to add a new column, but this is what we’ve got.
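
Since the scraped value is a string such as `-9°C`, it usually has to be turned into a number before it can be compared or formatted for a display. A small sketch, assuming the text is an optionally signed number followed by a unit; `parseTemperature` is a hypothetical helper, and it returns `null` when no number is found so that an empty or unexpected cell can be detected:

```javascript
// Hypothetical helper: extracts the numeric part of a temperature string.
// Returns a number, or null if the text contains no recognizable number.
function parseTemperature(text) {
  if (typeof text !== 'string') return null
  // Normalize the Unicode minus sign (U+2212) and a decimal comma.
  const normalized = text.replace('\u2212', '-').replace(',', '.')
  const match = normalized.match(/-?\d+(\.\d+)?/)
  return match ? Number(match[0]) : null
}

console.log(parseTemperature('-9°C'))   // -9
console.log(parseTemperature('3,5 °C')) // 3.5
console.log(parseTemperature('n/a'))    // null
```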

This is the complete code up to now:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/')

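  // The callback runs inside the page, so `document` is the page's DOM
  const result = await page.evaluate(() => {
    const temperature = document.querySelector('.column-4').innerText
    return {
      temperature
    }
  })

  console.log(result)

  // close() returns a promise, so wait for Chrome to shut down
  await browser.close()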
})()

If we run this code, result will have this value:
