Puppeteer Usage Summary

Pixiao1997

Article link: https://blog.csdn.net/m0_54531505/article/details/114641645

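Node.js Environment Setup

Node.js download page: http://nodejs.cn/download/

macOS Setup

Option 1: download the installer from the official site and run it.

Option 2: install with Homebrew: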

# Search for the available node packages
brew search node

# Install node
brew install node

# Check that the installation succeeded
node -v
npm -v

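Managing Node.js versions with n:

npm install -g n

# List the Node.js versions that n has already downloaded
n ls

# Pick one of the downloaded versions interactively
n

# Install (and switch to) Node.js 12
n 12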
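
After switching versions with n, it can be useful to confirm from inside a script which Node.js version is actually running. The following is only a minimal sketch: the helper name `isNodeAtLeast` and the threshold of 12 are illustrative assumptions, not requirements taken from the Puppeteer documentation.

```javascript
// Hypothetical helper: compare the running Node.js major version to a minimum.
function isNodeAtLeast(minMajor, version = process.versions.node) {
  // process.versions.node looks like "12.22.1"; take the part before the first dot.
  const major = Number.parseInt(version.split('.')[0], 10);
  return major >= minMajor;
}

if (!isNodeAtLeast(12)) {
  console.warn(`Node.js ${process.versions.node} is older than 12; consider running "n 12".`);
}
```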

Linux Setup

sudo apt-get install nodejs

The version manager n can be used here as well to switch Node.js versions.

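Windows Setup

Download the installer.
Accept the defaults, clicking Next until Finish (the install path can be changed; I usually install to the D: drive).
Open CMD and run node -v and npm -v to check the node and npm versions.
Configure the global module path and the cache path: create two folders, node_global and node_cache, under the Node.js installation directory.

In CMD, run the following (adjust the paths to your own installation):

npm config set prefix "D:\Program Files\nodejs\node_global"
npm config set cache  "D:\Program Files\nodejs\node_cache"

Create a system environment variable NODE_PATH that points to a node_modules folder created inside the node_global folder from above:
```
D:\Program Files\nodejs\node_global\node_modules
```
Edit the user variable Path and add the node_global path:
```
D:\Program Files\nodejs\node_global
```

Reference for the Windows setup: https://www.cnblogs.com/hshdexy/p/13605176.html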

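Initializing the Development Environment

Basic environment

Create a new folder somewhere, for example test.
Create a JavaScript file in it, for example test.js.

Open a command line in the test folder and run:

npm init
# Press Enter through the prompts (project description and so on); this generates a package.json file

Install Puppeteer for use in the project (see the GitHub README):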
```
npm i puppeteer
# or "yarn add puppeteer"
```
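Note: installing Puppeteer also downloads a recent version of Chromium that is guaranteed to work with that Puppeteer release (about 170 MB on Mac, 282 MB on Linux, 280 MB on Windows). (Translated from the GitHub README.)

I develop on a Mac and use this package as is, because I want to watch the script run in a browser window. For the later deployment to a Linux server no visible browser is needed (and the browser to use can be chosen freely), so puppeteer-core can be installed there instead: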
```
npm i puppeteer-core
# or "yarn add puppeteer-core"
```
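Note: since version 1.7.0 the project also publishes the puppeteer-core package, a version of Puppeteer that does not download any browser by default. puppeteer-core is meant as a lightweight version of Puppeteer for launching an existing browser installation or connecting to a remote one. Make sure the puppeteer-core version you install is compatible with the browser you intend to connect to. (Translated from the GitHub README.)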

Official screenshot example

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

Save test.js after editing, then run the script from the command line:
```
node test.js
```

GitHub repository: https://github.com/puppeteer/puppeteer

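Integrating Baidu OCR

Note: my own program has to recognize captchas, so I use Baidu's OCR service.

Download the text recognition Node.js SDK: https://ai.baidu.com/sdk#ocr

Unpack the SDK into its own folder, for example aip-node-sdk-version, and copy that folder into the test folder.
Inside the aip-node-sdk-version folder, run the following command to install the SDK's dependencies:
```
npm install
```
Use the SDK as a module dependency.
In the test folder, install it with npm:
```
npm install baidu-aip-sdk
```

Baidu's sample code for the general text recognition API can then be used: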

var fs = require('fs');
var AipOcrClient = require('baidu-aip-sdk').ocr;

// Placeholders: replace with the APPID/AK/SK of your own Baidu application
var client = new AipOcrClient("your APP_ID", "your API_KEY", "your SECRET_KEY");

var image = fs.readFileSync("assets/example.jpg").toString("base64");

// General text recognition; the image is a local file
client.generalBasic(image).then(function(result) {
    console.log(JSON.stringify(result));
}).catch(function(err) {
    // A network error occurred
    console.log(err);
});

// Optional parameters
var options = {};
options["language_type"] = "CHN_ENG";
options["detect_direction"] = "true";
options["detect_language"] = "true";
options["probability"] = "true";

// General text recognition with options; the image is a local file
client.generalBasic(image, options).then(function(result) {
    console.log(JSON.stringify(result));
}).catch(function(err) {
    // A network error occurred
    console.log(err);
});

var url = "https://www.x.com/sample.jpg";

// General text recognition; the image is a remote URL
client.generalBasicUrl(url).then(function(result) {
    console.log(JSON.stringify(result));
}).catch(function(err) {
    // A network error occurred
    console.log(err);
});

// General text recognition with the same options; the image is a remote URL
client.generalBasicUrl(url, options).then(function(result) {
    console.log(JSON.stringify(result));
}).catch(function(err) {
    // A network error occurred
    console.log(err);
});
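
The script later in this article reads the recognized text from `result.words_result[0].words`. A result with no recognized lines would make that expression throw, so a small guard can help. This is a minimal sketch: the helper name `extractFirstWords` is made up for illustration, and the only thing assumed about the response is the `words_result` shape used in this article.

```javascript
// Hypothetical helper: pull the first recognized line out of an OCR result.
// Assumes the result shape { words_result: [{ words: "..." }, ...] }.
function extractFirstWords(result) {
  const lines = result && Array.isArray(result.words_result) ? result.words_result : [];
  if (lines.length === 0 || !lines[0] || typeof lines[0].words !== 'string') {
    return null; // nothing was recognized
  }
  // Captcha text may come back with stray spaces; strip all whitespace.
  return lines[0].words.replace(/\s+/g, '');
}

const sample = { words_result: [{ words: ' 8 A3k ' }] };
console.log(extractFirstWords(sample)); // prints "8A3k"
console.log(extractFirstWords({ words_result: [] })); // prints null
```

Returning null instead of throwing lets the caller decide whether to retry the captcha or give up on the current run.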

My example code

const puppeteer = require('puppeteer');

// Site user name and login password
const userName = "xxxxxxxxxx";
const passWord = "xxxxxxxxxx";
// URLs of the site home page and of the job list page
const gotoUrl = "xxxxxxxxxxxx";
const listUrl = "xxxxxxxxxxxx";
// Path where the captcha image is saved
const verCodeImgPath = "verCodeImg.png";
// Recognized captcha text
let code;
// Load the Baidu OCR client
let AipOcrClient = require('baidu-aip-sdk').ocr;

// Baidu OCR APPID/AK/SK
let APP_ID = "xxxxx";
let API_KEY = "xxxxxxxxx";
let SECRET_KEY = "xxxxxxxxx";

// Create one client; it is recommended to keep a single instance for all API calls
let client = new AipOcrClient(APP_ID, API_KEY, SECRET_KEY);
// Used to read the local captcha image
let fs = require('fs');
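// Refresh function
async function refresh() {
  const browser = await puppeteer.launch({
    // Headless mode: no browser window is shown; set to false while debugging to watch the script run
    headless: true,
    // Viewport size of the browser page
    defaultViewport: {
      width: 1000,
      height: 2000,
    }
  });
  const page = await browser.newPage();
  try {
    // Open the login page and continue once the network is nearly idle
    // (networkidle2: at most 2 open connections for at least 500 ms)
    await page.goto(gotoUrl, {
      waitUntil: "networkidle2",
    });
  } catch(e) {
    console.log("The login page is unreachable!");
    // Close the browser and stop; this refresh attempt has failed
    await browser.close();
    return;
  }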
  
  // Enter the user name
  try {
    // Wait up to 3 seconds for the user name input to appear
    // (page.$x takes no options object, so waitForXPath does the waiting)
    const accountInput = await page.waitForXPath('//input[@id="UserName"]', {
      timeout: 3000
    });
    await accountInput.type(userName);
  } catch (e) {
    console.log("Failed to enter the user name!");
    await browser.close();
    return;
  }
  await page.waitForTimeout(2000);
  // Enter the password
  try {
    // Wait up to 3 seconds for the password input to appear
    const pwdInput = await page.waitForXPath('//input[@id="UserPass"]', {
      timeout: 3000
    });
    await pwdInput.type(passWord);
  } catch (e) {
    console.log("Failed to enter the password!");
    await browser.close();
    return;
  }

  // Select the user type
  await page.click('#RadioC');
  // Submit the form
  await page.click('#Denglu');
  // Wait 5 seconds for the page to load
  await page.waitForTimeout(5000);

  // Go to the job list page
  try {
    // Continue once the network is nearly idle
    await page.goto(listUrl, {
      waitUntil: "networkidle2"
    });
  } catch (e) {
    console.log("The job list page is unreachable!");
    await browser.close();
    return;
  }

  // Click "select all"
  try {
    await page.click("#CheckAll");
  } catch (e) {
    console.log("Select all failed!");
    await browser.close();
    return;
  }

  // Find the captcha image element
  const verCodeImg = await page.$('body > div:nth-child(5) > table > tbody > tr > td:nth-child(3) > form > table:nth-child(5) > tbody > tr > td:nth-child(2) > img');
  // Only handle the captcha when the element exists
  if (verCodeImg) {
    // Screenshot the captcha and save it locally
    let image;
    try {
      await verCodeImg.screenshot({
        path: verCodeImgPath
      });
      image = fs.readFileSync(verCodeImgPath).toString("base64");
    } catch (e) {
      console.log("Failed to capture the captcha!");
      await browser.close();
      return;
    }
    // Baidu OCR: general text recognition on the local image.
    // The call is awaited so that `code` is set before it is typed below.
    try {
      const result = await client.generalBasic(image);
      code = result.words_result[0].words;
    } catch (err) {
      console.log(err);
      console.log("The Baidu OCR request failed or returned no text!");
      await browser.close();
      return;
    }
    // Enter the captcha
    try {
      // Find the captcha input element
      let codeInput = await page.$("#Tel");
      // Wait 2 seconds
      await page.waitForTimeout(2000);
      // Type the captcha
      await codeInput.type(code);
      // Wait 2 seconds
      await page.waitForTimeout(2000);
    } catch (e) {
      console.log("Failed to enter the captcha!");
      await browser.close();
      return;
    }
  }

  // Click "refresh jobs"
  try {
    // Click the refresh button
    await page.click("#btn_tigao");
    // Wait 3 seconds; in testing the site needed time to respond, so this wait is required
    await page.waitForTimeout(3000);
    console.log("Refresh succeeded!");
  } catch (e) {
    console.log("Refresh failed!");
    await browser.close();
    return;
  }
  await browser.close();
}
// Run the refresh function once
refresh();
// Then refresh every 2 minutes
setInterval(() => {
  refresh();
}, 120 * 1000);
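
The script above repeats `await browser.close(); return;` in every catch block. One possible alternative is a wrapper built on try/finally that closes the browser on every path. This is only a sketch under assumptions: `withBrowser` is a hypothetical name, and the launcher is passed in as a function so the same helper would work with puppeteer or puppeteer-core. The demo uses a stand-in browser object so the control flow can be followed without starting Chromium.

```javascript
// Hypothetical helper: run `task` with a fresh browser and always close it.
// `launch` is any function returning a browser-like object, for example
// () => puppeteer.launch({ headless: true }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs whether the task resolved or threw, so no path leaks a browser.
    await browser.close();
  }
}

// Demo with a stand-in browser object instead of a real one.
const events = [];
const fakeLaunch = async () => ({
  close: async () => { events.push('closed'); },
});

const demo = withBrowser(fakeLaunch, async () => {
  events.push('task');
  throw new Error('login page unreachable');
}).catch((err) => {
  events.push(`caught: ${err.message}`);
  return events;
});

demo.then((log) => console.log(log.join(' -> ')));
```

With this shape, each step of the refresh can simply throw on failure, and a single catch around the `withBrowser` call can log the reason.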

Docker Deployment

To be added.

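Tips

What if you don't know how to write an XPath expression for an element?

In the browser's developer tools, locate the element, right-click it, choose Copy, then Copy XPath, and paste the result between the quotes of page.$x(''). (Copy selector produces a CSS selector instead, which belongs in page.$('').) It's that simple!
To take a screenshot of a single element, first get the element handle, then call element.screenshot.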
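
The element screenshot tip can be wrapped in a small helper. The following is a sketch rather than a fixed recipe: `screenshotElement` is a hypothetical name, and it relies only on `page.$` and `element.screenshot({ path })`, both of which appear in the example script above. The page is passed in as a parameter, so the helper does not launch anything itself.

```javascript
// Hypothetical helper: screenshot one element chosen by a CSS selector.
// `page` is a Puppeteer Page (or any object with a compatible `$` method).
async function screenshotElement(page, selector, path) {
  const element = await page.$(selector); // resolves to null when nothing matches
  if (!element) {
    return false;
  }
  await element.screenshot({ path });
  return true;
}

// Usage inside an async function (the selector '#captcha' is illustrative only):
//   const saved = await screenshotElement(page, '#captcha', 'verCodeImg.png');
//   if (!saved) console.log('captcha element not found');
```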

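Puppeteer Syntax

To be added.

Function	Description
waitForTimeout	Waits n milliseconds before continuing; similar to the older waitFor
page.screenshot	Takes a screenshot of the page
element.screenshot	Takes a screenshot of a single element once its handle has been obtained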
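
waitForTimeout is tied to particular Puppeteer versions: it replaced waitFor, and later releases deprecated it in turn. A pause that does not depend on the Puppeteer version can be written with standard JavaScript alone. A minimal sketch (the name `sleep` is my own choice, not a Puppeteer API):

```javascript
// Hypothetical helper: resolve after `ms` milliseconds, independent of Puppeteer.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function demo() {
  const start = Date.now();
  await sleep(200); // pause for roughly 200 ms
  return Date.now() - start;
}

const elapsed = demo();
elapsed.then((ms) => console.log(`waited about ${ms} ms`));
```

Inside a script, `await sleep(2000)` would then stand in for `await page.waitForTimeout(2000)`.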

Error Summary

Error when running on a Raspberry Pi (Ubuntu):

Error: Failed to launch the browser process puppeteer
Solution:

sudo apt-get install chromium-browser

Using puppeteer-core on a Raspberry Pi (Ubuntu):

Install puppeteer-core.

Require puppeteer-core in the JavaScript file:

const puppeteer = require('puppeteer-core');
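
Because puppeteer-core does not download a browser, launch has to be told where the installed one lives through the executablePath option. The following is a sketch under assumptions: the path /usr/bin/chromium-browser is where I would expect apt to place the binary (check with `which chromium-browser`), and both the helper names and the CHROMIUM_PATH environment variable are made up for illustration.

```javascript
// Assumption: apt installed Chromium here; verify with `which chromium-browser`.
const DEFAULT_CHROMIUM_PATH = '/usr/bin/chromium-browser';

// Hypothetical helper: build launch options for puppeteer-core.
// CHROMIUM_PATH is an illustrative variable name, not one Puppeteer reads itself.
function buildLaunchOptions(env = process.env) {
  return {
    headless: true,
    executablePath: env.CHROMIUM_PATH || DEFAULT_CHROMIUM_PATH,
  };
}

async function launchSystemChromium() {
  // Required lazily so the options helper can be used without the package installed.
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(buildLaunchOptions());
}

console.log(buildLaunchOptions({}).executablePath); // prints "/usr/bin/chromium-browser"
```

Keeping the path in one place makes it easy to move the same script between the Mac used for development and the Raspberry Pi.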

Inside an async function, every asynchronous operation (any call that returns a Promise) needs an await in front of it.
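
A small demonstration of why this matters: without await, the variable holds a pending Promise rather than the value, which is exactly what goes wrong when a recognition result is used before the request has finished. The function `recognize` below is a stand-in invented for this sketch, not part of any SDK.

```javascript
// Stand-in for any asynchronous call, such as an OCR request (hypothetical).
async function recognize() {
  return '1234';
}

async function compare() {
  const missing = recognize();       // no await: a Promise, not the text
  const awaited = await recognize(); // await: the resolved string
  return { missing: missing instanceof Promise, awaited };
}

const outcome = compare();
outcome.then((r) => console.log(r)); // prints { missing: true, awaited: '1234' }
```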

My Blog

The blog is updated more promptly; leave a comment there if you have questions!

Blog: https://pixiao.gitee.io/blog
