Running JavaScript in a page with Puppeteer
Web scraping is the task of downloading a web page and extracting information from it.
I recently made a little project with an Arduino board with an LCD display attached. Using Johnny-Five, which lets us program the Arduino using Node.js, I wanted to fetch the temperature measured at the top of a mountain, and show it on the Arduino board.
I used Puppeteer to do the task of scraping. Puppeteer is a great tool built by Google. It’s a Node library we can use to control a headless Chrome instance.
This means we are basically using Chrome, but programmatically.
There are many practical uses for Puppeteer, including automating tests, taking screenshots, creating server-side rendered versions of single-page apps, and more.
Start by installing it with npm:
npm install puppeteer
In a Node.js file, require it:
const puppeteer = require('puppeteer');
Then we can use the launch() method to create a browser instance:
(async () => {
  const browser = await puppeteer.launch()
})()
We use await, so we must wrap this method call in an async function, which we immediately invoke.
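To see why the wrapper is needed, it helps to look at what an async call gives us without await. This is a minimal sketch that runs without a browser: fakeLaunch() is a hypothetical stand-in I made up for illustration, and the only assumption is that puppeteer.launch() returns a Promise in the same way.

```javascript
// fakeLaunch is a hypothetical stand-in for puppeteer.launch():
// it returns a Promise that resolves to a browser-like object.
function fakeLaunch() {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ name: 'fake-browser' }), 10)
  })
}

// Without await we only get the pending Promise, not the browser.
const pending = fakeLaunch()
console.log(pending instanceof Promise) // true

// Inside an async function, await unwraps the resolved value.
async function main() {
  const browser = await fakeLaunch()
  return browser.name
}

// A named function plus .catch() is an alternative to the
// immediately invoked wrapper, and it surfaces errors explicitly.
main()
  .then((name) => console.log(name)) // fake-browser
  .catch((err) => console.error(err))
```

Either form works; the named main() variant is just a little easier to extend with error handling.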
Next we can use the newPage() method on the browser object to get the page object:
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
})()
Next up we call the goto() method on the page object to load that page:
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')
})()
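Loading a page can fail or finish before the content we want is there, so it can be worth checking the response. The following is a rough sketch, not the only way to do it: openPage is a hypothetical helper name, and the 30 second timeout and the networkidle2 setting are assumptions to tune for the site being scraped.

```javascript
// openPage is a hypothetical helper; `browser` is the object
// returned by puppeteer.launch().
async function openPage(browser, url) {
  const page = await browser.newPage()
  // waitUntil and timeout are goto() options: wait until network
  // activity has mostly settled, and give up after 30 seconds.
  const response = await page.goto(url, {
    waitUntil: 'networkidle2',
    timeout: 30000
  })
  // goto() resolves to the main response (it can be null),
  // so a failed HTTP status can be detected before scraping.
  if (response && !response.ok()) {
    throw new Error(`Failed to load ${url}: HTTP ${response.status()}`)
  }
  return page
}
```

Inside the async wrapper this would be used as const page = await openPage(browser, 'https://website.com').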
Finally, we can get the page content by calling the evaluate() method of page. This method takes a callback function, where we can add the code that retrieves the elements of the page we need. The function is executed in the context of the page, so we have access to document and all the browser APIs. We return a new object, and this will be the result of our evaluate() method call.
We can use the Selectors API and retrieve data from the page.
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')
  const result = await page.evaluate(() => {
    //...
  })
})()
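One thing that is easy to trip over: the callback runs inside the page, not in Node.js, so it cannot see variables from the surrounding script. Values have to be passed as extra arguments to evaluate(), and whatever comes back should be plain data (strings, numbers, plain objects) rather than a DOM element. Here is a small sketch of that idea; readText and scrapeText are hypothetical names of my own.

```javascript
// Runs inside the page: `document` exists there, not in Node.js.
// Returns a plain string (or null), which can cross back to Node.js.
function readText(selector) {
  const element = document.querySelector(selector)
  return element ? element.innerText.trim() : null
}

// Runs in Node.js: `page` is the object returned by browser.newPage().
// Extra arguments after the function are passed through to it.
async function scrapeText(page, selector) {
  const text = await page.evaluate(readText, selector)
  return { text }
}
```

Returning null for a missing element avoids a TypeError inside the page when the selector matches nothing.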
Let’s get to the particular problem I have. This is the page for the weather station, located on top of a mountain at 3315 m: http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/
I want to get that -9°C text. Using the browser inspector I can see it has a column-4 class attached. It’s not an ideal class name, as it’s not meaningful and might change if they decide to add a new column, but this is what we have to work with.
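Because the class name may change, it can help to fail with a clear message instead of a cryptic error. This is one possible sketch: readFirstMatch is a hypothetical helper, and the 5 second default timeout is an arbitrary assumption.

```javascript
// readFirstMatch is a hypothetical helper; `page` comes from browser.newPage().
async function readFirstMatch(page, selector, timeout = 5000) {
  try {
    // waitForSelector() rejects if nothing matches within the timeout.
    await page.waitForSelector(selector, { timeout })
  } catch (err) {
    throw new Error(
      `No element found for "${selector}" (the markup may have changed): ${err.message}`
    )
  }
  // $eval() runs the callback in the page on the first matching element.
  return page.$eval(selector, (element) => element.innerText.trim())
}
```

Calling await readFirstMatch(page, '.column-4') would then either return the text or point straight at the selector that stopped matching.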
This is the complete code up to now:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/')
  // This callback runs inside the page, so document is available
  const result = await page.evaluate(() => {
    const temperature = document.querySelector('.column-4').innerText
    return {
      temperature
    }
  })
  console.log(result)
  // close() returns a Promise, so we await it to let Chrome shut down cleanly
  await browser.close()
})()
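If anything in the script throws (a timeout, a missing element), the close() call at the end is never reached and the Chrome process can be left running. A try/finally block guards against that. Here is a minimal sketch, with withBrowser being a hypothetical helper name of my own:

```javascript
// withBrowser is a hypothetical helper. `launch` is any function that
// resolves to a browser, for example () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    return await task(browser)
  } finally {
    // Runs on success and on error, so Chrome is not left running.
    await browser.close()
  }
}
```

With Puppeteer this would be called as withBrowser(() => puppeteer.launch(), async (browser) => { ... }), with the scraping steps inside the callback.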
If we run this code, result will have this value:
{
temperature: '-9°C'
}
or whatever the temperature is right now.
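The temperature comes back as a string. To compare it or format it for a display, it may be handy to turn it into a number first. This is a small sketch under my own assumptions about the text format (an optional minus sign, digits, an optional decimal part); parseTemperature is a hypothetical helper, not part of Puppeteer.

```javascript
// parseTemperature is a hypothetical helper, not part of Puppeteer.
// It extracts the first signed number from text such as '-9°C'.
function parseTemperature(text) {
  // Some pages use the Unicode minus sign (U+2212) instead of '-'.
  const normalized = String(text).replace('\u2212', '-')
  const match = normalized.match(/-?\d+(?:[.,]\d+)?/)
  if (!match) return null
  // Accept a decimal comma as well as a decimal point.
  return Number(match[0].replace(',', '.'))
}

console.log(parseTemperature('-9°C'))   // -9
console.log(parseTemperature('3,5 °C')) // 3.5
console.log(parseTemperature('n/a'))    // null
```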