[Puppeteer] Working around a website's anti-scraping restrictions when headless is true

 

Contents

Description:

Cause

Solution:


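Description:

With headless set to true, I could not get the target element. At first I assumed my selector was wrong and simply did not match anything, but searching for the same selector in the browser console found the element. Once I set headless to false, everything worked normally. The code is as follows: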

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless:true
    });

    const page = await browser.newPage();
    await page.goto('https://www.staples.com/paint/cat_CL140420');
    const bodyHandle = await page.waitForSelector("#searchTerm");
    const html = await page.evaluate(body => {
        return body.innerHTML;
    }, bodyHandle);

    console.log(html)

    await browser.close();
})();

Output:

(node:21380) UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector `#searchTerm` failed: timeout 30000ms exceeded
    at new WaitTask (I:\puppeteer\node_modules\puppeteer\lib\cjs\puppeteer\common\DOMWorld.js:609:34)
    at DOMWorld._waitForSelectorInPage (I:\puppeteer\node_modules\puppeteer\lib\cjs\puppeteer\common\DOMWorld.js:520:26)
    at Object.internalHandler.waitFor (I:\puppeteer\node_modules\puppeteer\lib\cjs\puppeteer\common\QueryHandler.js:34:29)
    at DOMWorld.waitForSelector (I:\puppeteer\node_modules\puppeteer\lib\cjs\puppeteer\common\DOMWorld.js:455:36)
    at Frame.waitForSelector (I:\puppeteer\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:1007:51)
    at Page.waitForSelector (I:\puppeteer\node_modules\puppeteer\lib\cjs\puppeteer\common\Page.js:2224:39)
    at I:\puppeteer\index.js:14:35
(Use `node --trace-warnings ...` to show where the warning was created)
(node:21380) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:21380) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

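Cause

After being puzzled for quite a while, I added a screenshot call right after the page loaded and looked at what the headless browser actually received.

In other words, when headless is true, this website's anti-scraping mechanism denies access.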
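The check described above can be sketched roughly as follows. This is only an illustration, not the author's exact code: the helper names `isHeadlessUserAgent` and `inspectPage` and the file name `debug.png` are made up here, and it assumes the default headless user agent contains the token `HeadlessChrome`, which is what the linked GitHub discussion suggests for this kind of setup.

```javascript
// Returns true when a user-agent string advertises headless Chrome.
function isHeadlessUserAgent(userAgent) {
    return /HeadlessChrome/i.test(userAgent);
}

// Hypothetical helper: load `url` headlessly, log what the site can see,
// and save a screenshot of what the site actually returned.
async function inspectPage(url, screenshotPath = 'debug.png') {
    // Loaded lazily so the pure helper above can be used without Puppeteer.
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        const response = await page.goto(url);
        const userAgent = await browser.userAgent();

        console.log('HTTP status:', response ? response.status() : 'no response');
        console.log('User agent:', userAgent);
        console.log('Advertises headless:', isHeadlessUserAgent(userAgent));

        await page.screenshot({ path: screenshotPath });
    } finally {
        // Close the browser even if navigation or the screenshot fails.
        await browser.close();
    }
}
```

Calling `inspectPage('https://www.staples.com/paint/cat_CL140420')` from an async context and opening the saved image should make it clear whether the site served the real page or a block page.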

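Solution:

This comes from the GitHub discussion: Different behavior between { headless: false } and { headless: true } · Issue #665 · puppeteer/puppeteer · GitHub

After creating the page and before navigating, set a regular browser user agent on it. Headless Chrome typically identifies itself as HeadlessChrome in its default user agent, which is presumably what the site is blocking: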

await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");

The complete code is as follows:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: true
    });

    const page = await browser.newPage();
    // Set a regular desktop Chrome user agent before navigating.
    await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
    await page.goto('https://www.staples.com/paint/cat_CL140420');

    // Save a screenshot to check what the page looks like in headless mode.
    await page.screenshot({path: 'exampl1.png'});

    // Wait for the target element, then read its inner HTML.
    const bodyHandle = await page.waitForSelector("#searchTerm");
    const html = await page.evaluate(body => {
        return body.innerHTML;
    }, bodyHandle);

    console.log(html)

    await browser.close();
})();

This time the script produces the expected output:

Painting Supplies
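The hard-coded user agent pins an old Chrome version. One possible variation, sketched here under the assumption that the browser's default user agent differs from a regular one only by the `HeadlessChrome` token, is to derive the string from the browser itself. The helper names `toRegularUserAgent` and `newPageWithRegularUserAgent` are hypothetical and not part of Puppeteer.

```javascript
// Replace the headless marker with the regular product name.
// Strings without the marker are returned unchanged.
function toRegularUserAgent(userAgent) {
    return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Hypothetical helper: open a page whose user agent matches the real
// browser version, minus the headless marker.
async function newPageWithRegularUserAgent(browser) {
    const page = await browser.newPage();
    const defaultUserAgent = await browser.userAgent();
    await page.setUserAgent(toRegularUserAgent(defaultUserAgent));
    return page;
}
```

In the complete script above, this would replace the `browser.newPage()` and `page.setUserAgent(...)` pair with `const page = await newPageWithRegularUserAgent(browser);`. Whether a given site accepts the result is not guaranteed.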

If the approach above still is not enough, you can additionally set the following request headers:

await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
});
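To confirm that the user agent and the extra headers are really being sent, it can help to log the headers of the main navigation request. The sketch below is an assumption-laden illustration: `pickFingerprintHeaders` and `logNavigationHeaders` are hypothetical helpers, and the choice of which headers to print is arbitrary. It relies on Puppeteer's `request` page event and on `request.headers()` returning lower-case header names.

```javascript
// Keep only the headers this article is concerned with.
function pickFingerprintHeaders(headers) {
    const names = ['user-agent', 'accept-language'];
    const picked = {};
    for (const name of names) {
        if (name in headers) {
            picked[name] = headers[name];
        }
    }
    return picked;
}

// Hypothetical helper: log the selected headers of every navigation request
// issued by `page`. Call it before page.goto().
function logNavigationHeaders(page, log = console.log) {
    page.on('request', request => {
        if (request.isNavigationRequest()) {
            log(pickFingerprintHeaders(request.headers()));
        }
    });
}
```

Call `logNavigationHeaders(page)` after `setUserAgent` and `setExtraHTTPHeaders` but before `page.goto(...)`, then compare the printed values with what a normal browser sends.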

Another option is to use puppeteer-extra:

GitHub - berstend/puppeteer-extra: 💯 Teach puppeteer new tricks through plugins.

In the words of an expert online:

  1. puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit AmIUnique it is unlikely to be enough to keep you from being identified as a repeat visitor.
  2. puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.

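I am listing this approach here for reference. Because it requires installing extra plugins, I find it more of a hassle, so I would only turn to it when the user-agent approach above does not solve the problem.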
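For completeness, here is a minimal sketch of the plugin route. It assumes `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` are all installed via npm; the wrapper name `launchStealthBrowser` is hypothetical. It has not been verified against the site used in this article.

```javascript
// Hypothetical helper: launch a browser through puppeteer-extra with the
// stealth plugin registered. Intended to be called once per process, since
// each call registers the plugin again.
async function launchStealthBrowser(launchOptions = { headless: true }) {
    // puppeteer-extra wraps the regular puppeteer package.
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');

    puppeteer.use(StealthPlugin());
    return puppeteer.launch(launchOptions);
}
```

The returned browser is used exactly like the one in the scripts above (`browser.newPage()`, `page.goto(...)`, and so on), so the rest of the code would not need to change.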
