Web Scraping with Node.js and Puppeteer

Web Scraping is the task of downloading a web page and extracting some kind of information from it.

I recently made a little project with an Arduino board with an LCD display attached. Using Johnny-Five, which lets us program the Arduino using Node.js, I wanted to fetch the temperature measured at the top of a mountain and show it on the Arduino board.

I used Puppeteer to do the task of scraping. Puppeteer is a great tool built by Google. It’s a Node library we can use to control a headless Chrome instance.

This means we are basically using Chrome, but programmatically.

There are many practical uses for Puppeteer, including automating tests, taking screenshots, creating server-side rendered versions of single-page apps, and more.

Start by installing it with npm:

npm install puppeteer

In a Node.js file, require it:

const puppeteer = require('puppeteer');

Then we can use the launch() method to create a browser instance:

(async () => {
  const browser = await puppeteer.launch()
})()

We use await, and so we must wrap this method call in an async function, which we immediately invoke.
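
Because everything happens inside an async function, it is also a natural place to make sure the browser gets closed when something goes wrong; otherwise a failed run can leave a headless Chrome process behind. Here is a minimal sketch of that idea. The name withBrowser is a hypothetical helper, not part of Puppeteer, and the launcher is passed in as a parameter (in real use it would be the puppeteer module).

```javascript
// Hypothetical helper (not part of Puppeteer): runs `task` with a browser
// and closes the browser afterwards, even if `task` throws.
// `launcher` is any object with an async launch() method,
// for example the module returned by require('puppeteer').
async function withBrowser(launcher, task) {
  const browser = await launcher.launch()
  try {
    return await task(browser)
  } finally {
    // Runs on success and on failure alike
    await browser.close()
  }
}

// Usage sketch, assuming puppeteer is installed:
// const puppeteer = require('puppeteer')
// withBrowser(puppeteer, async (browser) => {
//   // work with the browser here
// }).catch((err) => console.error(err))
```

The try/finally block is the whole point: the value returned by the task is passed through, and an error thrown by the task still reaches the caller after the browser has been closed.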

Next we can use the newPage() method on the browser object to get the page object:

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
})()

Next up we call the goto() method on the page object to load that page:

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')
})()
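
goto() also accepts an options object and resolves to the response for the main document, which is handy when the target site is slow or returns an error page. The sketch below is one possible way to use that, not something the original project did: loadPage is a hypothetical helper name, and the 60-second timeout and the choice of 'networkidle2' are arbitrary assumptions.

```javascript
// Hypothetical helper: loads `url` in a Puppeteer page and fails loudly
// on HTTP error statuses instead of silently scraping an error page.
async function loadPage(page, url) {
  const response = await page.goto(url, {
    waitUntil: 'networkidle2', // wait until the network has been mostly idle
    timeout: 60000 // milliseconds; arbitrary value chosen for this sketch
  })
  // goto() can resolve to null in some cases, so guard before calling ok()
  if (response && !response.ok()) {
    throw new Error(`Failed to load ${url}: HTTP ${response.status()}`)
  }
  return response
}

// Usage sketch inside the async function:
// await loadPage(page, 'https://website.com')
```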

Finally, we can get the page content by calling the evaluate() method of page. This method takes a callback function where we can add the code that retrieves the page elements we need. The function is executed in the context of the page, so we have access to document and all the browser APIs. The value we return from it (a new object, in our case) becomes the result of our evaluate() method call.

We can use the Selectors API (document.querySelector() and document.querySelectorAll()) to retrieve data from the page.

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')

  const result = await page.evaluate(() => {
    //...
  })
})()
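
Since the callback runs in the page rather than in Node.js, it cannot see variables from the surrounding script; values have to be handed over as extra arguments to evaluate(). The following sketch shows that, plus a guard for the case where the selector matches nothing. Both readText and getText are hypothetical names introduced for illustration.

```javascript
// Runs in the page context when passed to page.evaluate(): it can only use
// its arguments and browser globals such as document, not Node.js variables.
function readText(selector) {
  const element = document.querySelector(selector)
  // querySelector() returns null when nothing matches
  return element ? element.innerText.trim() : null
}

// Hypothetical wrapper: `page` is a Puppeteer Page. The selector is passed
// to the page as an argument of evaluate().
async function getText(page, selector) {
  const text = await page.evaluate(readText, selector)
  if (text === null) {
    throw new Error(`No element matches selector "${selector}"`)
  }
  return text
}

// Usage sketch inside the async function:
// const text = await getText(page, '.some-class')
```

Returning null from the page and throwing on the Node.js side gives a readable error message, instead of a TypeError raised inside the page when innerText is read from a missing element.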

Let’s get to the particular problem I had. This is the page that hosts the weather station, located on top of a mountain at 3315m: http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/

I want to get that -9°C text. Using the browser inspector I can see it has a column-4 class attached. It’s not an ideal class name: it’s not meaningful, and it might change if they decide to add a new column, but this is what we’ve got.

This is the complete code up to now:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/')

  // This callback runs inside the page, so document is available
  const result = await page.evaluate(() => {
    const temperature = document.querySelector('.column-4').innerText
    return {
      temperature
    }
  })

  console.log(result)

  // close() returns a promise; await it so Chrome shuts down cleanly
  await browser.close()
})()

If we run this code, result will have this value:

{
  temperature: '-9°C'
}

or whatever the temperature is right now.
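
The scraped value is a string, which is fine for printing, but a number is easier to compare or format before sending it to a display. As a minimal sketch, assuming the text always looks like '-9°C' (an optional sign, digits, an optional decimal part, then °C), it could be converted like this. The name parseCelsius is hypothetical.

```javascript
// Hypothetical helper: turns a string such as '-9°C' into the number -9.
// Returns null when the text does not look like a Celsius reading.
function parseCelsius(text) {
  // Accepts '-', '+' or the Unicode minus sign (U+2212), and '.' or ',' decimals
  const match = /^\s*([-+\u2212]?\d+(?:[.,]\d+)?)\s*°\s*C\s*$/.exec(text)
  if (!match) return null
  const normalized = match[1].replace('\u2212', '-').replace(',', '.')
  return Number(normalized)
}

console.log(parseCelsius('-9°C')) // -9
```

Returning null for unexpected input lets the caller decide what to show when the page layout changes and the scraped text is no longer a temperature.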

Translated from: https://flaviocopes.com/web-scraping/
