Using the Puppeteer Crawler Framework

飞锡2024

Last modified: 2024-03-11 15:32:51

Category: Crawlers. Tags: javascript, front end, python

First published: 2022-02-19 14:36:17

Article link: https://blog.csdn.net/weixin_38235865/article/details/123017602

Puppeteer

pyppeteer official documentation
Puppeteer tutorial

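Puppeteer is a Node.js tool developed by Google. It calls Chrome's API so that JavaScript code can drive Chrome, and it is used for web crawling, automated testing of web applications, and similar tasks. Its API is comprehensive and powerful.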

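Introduction to pyppeteer

Pyppeteer is an efficient web automation and testing tool, and is the Python version of Puppeteer.

pyppeteer is built on asyncio, Python's asynchronous coroutine library, and can be combined with Scrapy for distributed crawling.

Advantages

It is easier to install and configure than selenium and generally runs faster
It supports asyncio coroutines, which makes it well suited to concurrent work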
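
To make the concurrency point concrete, here is a minimal sketch of running several page fetches at once while capping how many run in parallel. The names `crawl_all`, `fake_fetch`, and `limit` are hypothetical and not part of pyppeteer; `fake_fetch` only stands in for real page work so that the sketch runs without a browser.

```python
import asyncio


async def crawl_all(urls, fetch, limit=3):
    """Run fetch(url) for every URL, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore keeps the number of simultaneously open pages bounded.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for real work such as page.goto(url) followed by page.content().
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    pages = asyncio.run(crawl_all(["a", "b", "c", "d"], fake_fetch, limit=2))
    print(len(pages))
```

In a real crawler, `fetch` would presumably open a new tab with `browser.newPage()`, navigate, read the content, and close the tab; the cap matters because each open tab costs memory in Chromium.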

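Disadvantages

Scripts that run inside the page (for example through page.evaluate) can only be written in JavaScript
Browser support is narrow: only Chromium can be used
It is a third-party project that has not been updated for a long time and has quite a few bugs; the native JavaScript Puppeteer that Pyppeteer is based on is itself not very stable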

Using pyppeteer

Installation

python3 -m pip install pyppeteer

Common launch options

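
As a rough illustration of the commonly used options, the sketch below collects a few of them into a dictionary that can be passed to `pyppeteer.launch()`. The helper names `build_launch_options` and `open_browser` are hypothetical; the option keys (`headless`, `defaultViewport`, `dumpio`, `args`, `userDataDir`) are pyppeteer launch options, and the chosen defaults are assumptions rather than recommendations.

```python
def build_launch_options(headless=True, user_data_dir=None, proxy=None,
                         width=1280, height=720):
    """Collect keyword arguments for pyppeteer.launch()."""
    args = [f"--window-size={width},{height}"]
    if proxy:
        # Chromium command-line flag, e.g. socks5://127.0.0.1:1080
        args.append(f"--proxy-server={proxy}")
    options = {
        "headless": headless,  # False shows the browser window
        "defaultViewport": {"width": width, "height": height},
        "dumpio": True,        # pipe the browser's stdout/stderr to this process
        "args": args,
    }
    if user_data_dir:
        options["userDataDir"] = user_data_dir  # keeps cookies and cache
    return options


async def open_browser(**kwargs):
    # Imported lazily so build_launch_options() works without pyppeteer installed.
    import pyppeteer
    return await pyppeteer.launch(**build_launch_options(**kwargs))
```

Keeping the option building separate from the launch call makes it easy to inspect or log exactly what the browser is started with.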

Running a script in the page

dimensions = await page.evaluate('''() => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }''')

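Storing cookies, cache, and similar data in a custom directory

The snippet below uses Puppeteer's JavaScript syntax; pyppeteer's launch() accepts the same option as a keyword argument, userDataDir='./data'.

const browser = await puppeteer.launch({
    userDataDir: './data',
});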

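Avoiding webdriver detection

# Runs before the page's own scripts on every new document, so the override survives navigation.
await page.evaluateOnNewDocument('() =>{ Object.defineProperties(navigator,'
                                     '{ webdriver:{ get: () => false } }) }')

# Patches only the document that is currently loaded.
await page.evaluate('''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')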
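
One way to apply the override consistently is to wrap tab creation in a small helper, sketched below. `new_stealth_page` and `HIDE_WEBDRIVER_JS` are hypothetical names; the helper relies only on `browser.newPage()` and `page.evaluateOnNewDocument()`. Hiding `navigator.webdriver` is just one of the signals a site may check, so this is a starting point rather than a complete defence.

```python
import asyncio

# JavaScript that makes navigator.webdriver report false.
HIDE_WEBDRIVER_JS = """() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
}"""


async def new_stealth_page(browser):
    """Open a tab and register the override before any page script runs."""
    page = await browser.newPage()
    await page.evaluateOnNewDocument(HIDE_WEBDRIVER_JS)
    return page


async def demo(url):
    # Requires pyppeteer and a Chromium download; imported lazily on purpose.
    import pyppeteer
    browser = await pyppeteer.launch()
    page = await new_stealth_page(browser)
    await page.goto(url)
    value = await page.evaluate("() => navigator.webdriver")
    await browser.close()
    return value

# Usage (not run here): asyncio.run(demo("https://example.com"))
```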

A worked example


import asyncio

import pyppeteer

# Remove --enable-automation from the default arguments before calling launch(),
# so that sites are less likely to detect webdriver.
if "--enable-automation" in pyppeteer.launcher.DEFAULT_ARGS:
    pyppeteer.launcher.DEFAULT_ARGS.remove("--enable-automation")

# Placeholders: the original post does not give these values.
index_url = ''    # URL of the site's home page
delay_time = 100  # delay between mouse down and mouse up in milliseconds (assumed value)


async def main(key_word, start_page, is_last_key_word):
    # launch() starts the browser and returns a Browser object.
    browser = await pyppeteer.launch(headless=False,  # the site may sniff for headless/automation tools
                                     # devtools=True,
                                     # slowMo=100,
                                     # userDataDir='./pyppeteer_data',
                                     defaultViewport={"width": 1280,
                                                      "height": 720},
                                     dumpio=True,  # works around Chromium hanging when many pages are open
                                     args=['--disable-infobars',
                                           '--window-size=1920,1080',
                                           '--disable-features=TranslateUI',
                                           # '--proxy-server="socks5://127.0.0.1:1080"',
                                           # '--proxy-bypass-list=*',
                                           "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
                                           ])
    # 1. Open a new page and go to the home page.
    # Incognito mode: context = await browser.createIncognitoBrowserContext()
    index_page = await browser.newPage()
    await index_page.goto(index_url)
    # fullPage is a screenshot option (it is not a launch option).
    await index_page.screenshot(
        {'path': './PYPPETEER_crawl_screenshot.png', 'type': 'png',
         'fullPage': True})
    print(index_page.target.url)
    page_text = await index_page.content()
    # print(page_text)

    # 2. Click to close the prompt on the home page.
    index_close_button_selector = ''  # selector left blank in the original
    await index_page.waitForSelector(index_close_button_selector)
    await index_page.click(index_close_button_selector,
                           options={'delay': delay_time})

    # 3. Type the keyword and search.
    input_selector = ''  # selector left blank in the original
    await index_page.type(input_selector, key_word)

    search_bnt_selector = ''  # selector left blank in the original
    await index_page.click(search_bnt_selector, options={'delay': delay_time})
    page_text = await index_page.content()
    await index_page.close()  # close the current tab
    await browser.close()  # close the browser


if __name__ == '__main__':
    # Placeholder arguments: the original post does not show where they come from.
    key_word, start_page, is_last_key_word = '', 1, True
    asyncio.run(main(key_word, start_page, is_last_key_word))

Deployment

Installing pyppeteer on Linux

Install pyppeteer

pip3 install pyppeteer

Install Chromium

Online

pyppeteer-install

Offline
https://download-chromium.appspot.com/?platform=Linux_x64&type=snapshots
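
After an offline download, pyppeteer has to be told where the unpacked binary lives. The sketch below shows one way to do that through the `executablePath` launch option. The helper name `chromium_launch_kwargs` and the example path are hypothetical, and adding `--no-sandbox` is an assumption that often applies when running as root on a server; drop it if your environment does not need it.

```python
import os


def chromium_launch_kwargs(executable_path):
    """Return launch kwargs pointing at a manually unpacked Chromium binary."""
    path = os.path.abspath(os.path.expanduser(executable_path))
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Chromium binary not found: {path}")
    return {
        "executablePath": path,
        # Assumption: often needed when running as root or inside a container.
        "args": ["--no-sandbox"],
    }


async def launch_offline(executable_path):
    # Imported lazily so the helper above can be used without pyppeteer installed.
    import pyppeteer
    return await pyppeteer.launch(**chromium_launch_kwargs(executable_path))

# Usage (hypothetical path): await launch_offline("~/chrome-linux/chrome")
```

Checking the path up front gives a clearer error than letting the launch fail later.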

Common errors

https://www.jianshu.com/p/ef86d9963009 https://www.jianshu.com/p/f1a8fb7037d7

Execution context was destroyed, most likely because of a navigation.

// Add this after the login page redirects
await page.waitForNavigation(); // wait for the page navigation to finish

pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded

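The click runs very quickly and the page has already navigated, so by the time the program reaches the navigation wait it is sitting on the new page waiting for a navigation that never fires, until the 30-second timeout raises the error. The correct approach is to treat the click and the navigation wait as a single operation.
Reference: https://blog.csdn.net/qq_29570381/article/details/89735639

# Option 1:
await asyncio.gather(
    page.waitForNavigation(),
    page.click('…'),
)
# Option 2: asyncio.wait() needs tasks or futures; passing bare coroutines fails on Python 3.11+
await asyncio.wait([
    asyncio.ensure_future(page.waitForNavigation()),
    asyncio.ensure_future(page.click('…')),
])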
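
The click-plus-wait pattern can be wrapped in a small reusable helper, sketched below. `click_and_wait` is a hypothetical name; it relies only on `page.waitForNavigation()` and `page.click()`, and the 30000 ms default simply mirrors the timeout in the error message above.

```python
import asyncio


async def click_and_wait(page, selector, timeout=30000):
    """Click `selector` and wait for the navigation that the click triggers."""
    # The wait is listed first so it is already listening when the click fires.
    navigation, _ = await asyncio.gather(
        page.waitForNavigation({"timeout": timeout}),
        page.click(selector),
    )
    return navigation

# Usage inside a coroutine (hypothetical selector):
#     response = await click_and_wait(page, "#login-button")
```

Because both coroutines are awaited together, a navigation that starts immediately after the click cannot slip past the wait.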
