Advanced pyppeteer Techniques

柳柳's blog

Last modified: 2024-08-13 16:46:15

Category: anti-scraping. Tags: python

First published: 2021-03-02 11:58:22

Article link: https://blog.csdn.net/weixin_40303822/article/details/114282033

These are notes on some slightly more advanced usages that I gradually discovered while working with pyppeteer.

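1. Basic interceptor usage

An interceptor acts on a single Page, that is, one browser tab. Every time you create a Page you have to register the interceptors again. Under the hood, interceptors are simply callback functions attached to various events.

For the list of events, see pyppeteer.page.Page.Events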
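Because handlers belong to one Page, it can be convenient to wrap tab creation so that every new tab gets the same callbacks. The following is only a minimal sketch: new_page_with_handlers is a hypothetical helper (not part of pyppeteer) that relies on nothing beyond browser.newPage() and page.on().

```python
async def new_page_with_handlers(browser, handlers):
    """Open a tab and attach every (event name -> callback) pair to it.

    Hypothetical helper, not a pyppeteer API. `browser` is expected to be a
    pyppeteer Browser, or any object with an awaitable newPage() whose
    result exposes on(event_name, callback).
    """
    page = await browser.newPage()
    for event_name, callback in handlers.items():
        page.on(event_name, callback)
    return page
```

A call might look like `page = await new_page_with_handlers(browser, {"response": get_content, "dialog": handle_dialog})`. A "request" handler would additionally need `await page.setRequestInterception(True)` on the returned page.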

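Commonly used interceptors:

request: fired when a network request is sent
response: fired when a network response is received
dialog: fired when the page opens a dialog

Using the request interceptor to modify a request: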

    # coding: utf8
    import asyncio

    from pyppeteer import launch
    from pyppeteer.network_manager import Request

    launch_args = {
        "headless": False,
        "args": [
            "--start-maximized",
            "--no-sandbox",
            "--disable-infobars",
            "--ignore-certificate-errors",
            "--log-level=3",
            "--enable-extensions",
            "--window-size=1920,1080",
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
        ],
    }


    async def modify_url(request: Request):
        # Rewrite the home page request to a search URL; pass everything else through
        if request.url == "https://www.baidu.com/":
            await request.continue_({"url": "https://www.baidu.com/s?wd=ip&ie=utf-8"})
        else:
            await request.continue_()


    async def interception_test():
        # Launch the browser
        browser = await launch(**launch_args)
        # Open a new tab
        page = await browser.newPage()
        # Set the navigation timeout
        page.setDefaultNavigationTimeout(10 * 1000)
        # Set the viewport size
        await page.setViewport({"width": 1920, "height": 1040})

        # Enable request interception
        await page.setRequestInterception(True)

        # Register the interceptor
        # 1. Modify the request URL
        page.on("request", modify_url)
        await page.goto("https://www.baidu.com")

        await asyncio.sleep(10)

        # Close the browser
        await page.close()
        await browser.close()


    if __name__ == "__main__":
        loop = asyncio.get_event_loop()
        loop.run_until_complete(interception_test())
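A request interceptor can also drop requests instead of rewriting them. As a rough sketch of that idea, the handler below aborts requests for bulky resource types and lets the rest continue. The name block_heavy_resources and the chosen set of types are illustrative assumptions; the handler only uses the resourceType, abort() and continue_() members of the request object.

```python
# Resource types to drop; an illustrative choice, adjust to the target site.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


async def block_heavy_resources(request):
    """Abort requests for bulky resource types, let everything else through."""
    if request.resourceType in BLOCKED_RESOURCE_TYPES:
        await request.abort()
    else:
        await request.continue_()
```

It would be registered the same way as modify_url: enable `page.setRequestInterception(True)` and then call `page.on("request", block_heavy_resources)`. Every intercepted request must end in exactly one of continue_(), abort() or respond(), otherwise the page stalls.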
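Using the response interceptor to read the response to a particular request:

    import re

    from pyppeteer.network_manager import Response


    async def get_content(response: Response):
        """
        Note: page.setRequestInterception(True) is not needed here.

            page.on("response", get_content)

        :param response:
        :return:
        """
        if response.url == "https://www.baidu.com/":
            # response.text() returns str, so match with a str pattern
            content = await response.text()
            title = re.search(r"<title>(.*?)</title>", content)
            if title:
                print(title.group(1))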
Getting rid of every dialog on the page:

    from pyppeteer.dialog import Dialog


    async def handle_dialog(dialog: Dialog):
        """
            page.on("dialog", handle_dialog)

        :param dialog:
        :return:
        """
        await dialog.dismiss()
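Dismissing everything is the bluntest option. If it helps to see what a page tried to show, a variant along the following lines logs each dialog first. This is a sketch under assumptions: log_and_close_dialog is an illustrative name, and accepting "beforeunload" dialogs (so that navigation is not blocked) is my own choice rather than something the original setup requires.

```python
async def log_and_close_dialog(dialog):
    """Print the dialog type and message, then close the dialog."""
    print(f"[{dialog.type}] {dialog.message}")
    if dialog.type == "beforeunload":
        # Accept so that leaving the page is not blocked (assumed preference)
        await dialog.accept()
    else:
        await dialog.dismiss()
```

It is registered with `page.on("dialog", log_and_close_dialog)`.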

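2. Switching proxies with an interceptor

The usual way to give the browser a proxy is a launch argument:

--proxy-server=http://ip:port

(Chromium does not read a user:password pair embedded in this flag; for a proxy that requires authentication, supply the credentials with page.authenticate.)

For example:

    launch_args = {
        "headless": False,
        "args": [
            "--proxy-server=http://localhost:1080",
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
        ],
    }

The drawback of this approach is obvious: the proxy can only be set when the browser starts. Switching proxies means restarting the browser, which is far too expensive, so it is worth looking for another way.

The idea is simple:

The request interceptor can modify request properties and return a custom response.
Send the network request with a third-party library that goes through a proxy, then wrap the response and hand it back to the browser.

The code: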

    import asyncio

    import aiohttp
    from pyppeteer.network_manager import Request

    aiohttp_session = aiohttp.ClientSession(loop=asyncio.get_event_loop())

    proxy = "http://127.0.0.1:1080"


    async def use_proxy_base(request: Request):
        """
        Enable interception first:

            await page.setRequestInterception(True)
            page.on("request", use_proxy_base)

        :param request:
        :return:
        """
        # Build the request and attach the proxy
        req = {
            "headers": request.headers,
            "data": request.postData,
            "proxy": proxy,  # a global variable, so it can be switched at any time
            "timeout": aiohttp.ClientTimeout(total=5),
            "ssl": False,
        }
        try:
            # Fetch the response with the third-party library
            async with aiohttp_session.request(
                method=request.method, url=request.url, **req
            ) as response:
                body = await response.read()
        except Exception:
            await request.abort()
            return

        # Hand the data back to the browser
        resp = {"body": body, "headers": response.headers, "status": response.status}
        await request.respond(resp)
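Since the handler reads the proxy at call time, switching proxies comes down to changing one value. One possible way to organise that, sketched here with a hypothetical ProxyRotator class and placeholder proxy addresses:

```python
import itertools


class ProxyRotator:
    """Cycle through a fixed list of proxy URLs (hypothetical helper)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(list(proxies))
        self.current = next(self._cycle)

    def rotate(self):
        """Advance to the next proxy and return it."""
        self.current = next(self._cycle)
        return self.current


# Placeholder addresses, for illustration only
rotator = ProxyRotator(["http://127.0.0.1:1080", "http://127.0.0.1:1081"])
```

Inside the handler the request dict would then use `"proxy": rotator.current`, and any other coroutine can call `rotator.rotate()` when a proxy gets blocked, with no browser restart.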
Or add a cache on top to save some bandwidth:

    # Static resource cache
    static_cache = {}


    async def use_proxy_and_cache(request: Request):
        """
        Enable interception first:

            await page.setRequestInterception(True)
            page.on("request", use_proxy_and_cache)

        :param request:
        :return:
        """
        if request.url not in static_cache:
            # Build the request and attach the proxy
            req = {
                "headers": request.headers,
                "data": request.postData,
                "proxy": proxy,  # a global variable, so it can be switched at any time
                "timeout": aiohttp.ClientTimeout(total=5),
                "ssl": False,
            }
            try:
                # Fetch the response with the third-party library
                async with aiohttp_session.request(
                    method=request.method, url=request.url, **req
                ) as response:
                    body = await response.read()
            except Exception:
                await request.abort()
                return

            # Hand the data back to the browser
            resp = {"body": body, "headers": response.headers, "status": response.status}
            # Check the content type and cache the response if it is a static file
            content_type = response.headers.get("Content-Type")
            if content_type and ("javascript" in content_type or "/css" in content_type):
                static_cache[request.url] = resp
        else:
            resp = static_cache[request.url]

        await request.respond(resp)
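A plain dict like static_cache grows for as long as the crawler runs. If memory becomes a concern, one option is to cap the number of entries and evict the least recently used one. The BoundedCache class below is a hypothetical sketch of that idea, not part of the original code, and the default limit of 256 is arbitrary.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache with a fixed maximum number of entries."""

    def __init__(self, max_entries=256):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        """Return the cached value, or None when the key is not cached."""
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        """Store a value, evicting the oldest entries beyond the limit."""
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)
```

In the handler, `cache.get(request.url)` would replace the membership test and `cache.put(request.url, resp)` the assignment.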

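3. Countering anti-scraping measures

The whole point of driving a browser with pyppeteer is to disguise ourselves, so that the target site believes a real person is visiting. Yet a few annoying details always give us away. For example, if you try to simulate a Taobao login with the configuration above, you will find that the login never succeeds, because the browser's navigator.webdriver property exposes you. In a normal browser this property is absent or false, but with pyppeteer or selenium it is set to true by default.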
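To see what a site would see, the property can be read from inside the page. A minimal sketch, where webdriver_flag is an illustrative name and the only pyppeteer call involved is page.evaluate():

```python
async def webdriver_flag(page):
    """Return navigator.webdriver as seen by scripts running in the page."""
    return await page.evaluate("() => navigator.webdriver")
```

Calling `await webdriver_flag(page)` after page.goto(...) gives a quick check of whether either of the fixes in this section actually took effect.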

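There are two ways to get rid of this property.

The simple one first: pyppeteer adds --enable-automation to its launch arguments by default.

To remove it, change the default arguments before importing launch:

    from pyppeteer import launcher

    # Hook: drop the flag so that webdriver detection is not triggered
    launcher.AUTOMATION_ARGS.remove("--enable-automation")

    from pyppeteer import launch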
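The module attribute that holds this flag may not be the same in every pyppeteer release. The sketch below is a more defensive variant under that assumption: strip_automation_flag is a hypothetical helper that looks for the flag in launcher.AUTOMATION_ARGS and launcher.DEFAULT_ARGS (the second name is my assumption) and removes it wherever it is found.

```python
def strip_automation_flag(launcher_module, flag="--enable-automation"):
    """Remove `flag` from the launcher's default argument lists.

    Returns the names of the attributes that were modified. The attribute
    names checked here are assumptions about pyppeteer's launcher module.
    """
    modified = []
    for name in ("AUTOMATION_ARGS", "DEFAULT_ARGS"):
        args = getattr(launcher_module, name, None)
        if isinstance(args, list) and flag in args:
            args.remove(flag)
            modified.append(name)
    return modified
```

Usage would be `from pyppeteer import launcher` followed by `strip_automation_flag(launcher)`, before any browser is launched. An empty return value means the flag was not found and a different approach is needed.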
The slightly more complicated way is to inject JS code through an interceptor.

For the JS code, see:

https://github.com/dytttf/little_spider/blob/master/pyppeteer/pass_webdriver.js

The interceptor code:

    async def pass_webdriver(request: Request):
        """
        Enable interception first:

            await page.setRequestInterception(True)
            page.on("request", pass_webdriver)

        :param request:
        :return:
        """
        # Build the request and attach the proxy
        req = {
            "headers": request.headers,
            "data": request.postData,
            "proxy": proxy,  # a global variable, so it can be switched at any time
            "timeout": aiohttp.ClientTimeout(total=5),
            "ssl": False,
        }
        try:
            # Fetch the response with the third-party library
            async with aiohttp_session.request(
                method=request.method, url=request.url, **req
            ) as response:
                body = await response.read()
        except Exception:
            await request.abort()
            return

        if request.url == "https://www.baidu.com/":
            with open("pass_webdriver.js") as f:
                js = f.read()
            # Insert the JS at the top of the HTML source to patch the navigator property
            body = body.replace(b"<title>", b"<script>%s</script><title>" % js.encode())

        # Hand the data back to the browser
        resp = {"body": body, "headers": response.headers, "status": response.status}
        await request.respond(resp)
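pyppeteer does have a dedicated function for this job:

pyppeteer.page.Page.evaluateOnNewDocument

BUT in my experience its implementation is broken and it never takes effect. By contrast, the same function does work in the Node.js puppeteer.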

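4. Using Xvfb to get a headless effect

A big reason for choosing pyppeteer is Chromium's headless mode, which uses fewer resources and has fewer restrictions. Reality is harsh, though, especially for scrapers.

Things like navigator.webdriver can be used to detect a bot, and there are many more ways to detect headless mode specifically. For example, the window.chrome property does not exist in headless mode. I won't list them all here; there are a lot, and the links at the end of the article cover them. There is also plenty of useful material online about disguising headless mode so that it goes undetected. But there are simply too many details, and the outcome depends on the mood and skill of the target site's engineers. If they have plenty of time to check every corner case and keep raising the level of code obfuscation, fighting it out to the end stops being worth the effort.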