Playwright Automation Framework Series (7)

📖 Preface

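👋 Introduction

This chapter covers settings that are commonly needed when using Playwright for web scraping, such as request header parameters and proxies.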

💡 Main content

1 Setting the UA

Set the user-agent by passing the user_agent parameter when creating the page:

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    # Set the User-Agent for this page
    page = browser.new_page(
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    )
    page.goto("https://www.baidu.com/")
    page.wait_for_timeout(10000)
    browser.close()
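
Scrapers often rotate the UA between runs instead of hard-coding a single value. The following is a minimal sketch of that idea, not part of the original example: the USER_AGENTS pool, pick_user_agent() and demo() are hypothetical names introduced here, and the second UA string is only an illustrative value. It passes user_agent to browser.new_context() and reads navigator.userAgent back to check the result.

```python
import random

# Hypothetical pool of UA strings; replace with values suited to your targets.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
]


def pick_user_agent(pool, rng=random):
    """Return one UA string chosen at random from the pool."""
    return rng.choice(list(pool))


def demo():
    # Imported lazily so pick_user_agent() can be used without Playwright.
    from playwright.sync_api import sync_playwright

    ua = pick_user_agent(USER_AGENTS)
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        # The UA set on the context applies to every page created from it.
        context = browser.new_context(user_agent=ua)
        page = context.new_page()
        page.goto("https://www.baidu.com/")
        # Read the value back from the page to check that it was applied.
        print(page.evaluate("navigator.userAgent"))
        context.close()
        browser.close()
```

Calling demo() launches a headless Chromium; pick_user_agent() alone needs no browser.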

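2 Setting Headers

The UA can also be set this way, together with other request headers. Note that this only changes the User-Agent header sent with requests; the value of navigator.userAgent seen by page scripts is not affected.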

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Set extra HTTP headers sent with every request from this page
    page.set_extra_http_headers(
        headers={
            "Authorization": "Bearer ****************************",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
        }
    )
    page.goto("https://www.baidu.com/")
    page.wait_for_timeout(2000)
    context.close()
    browser.close()
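
If the same headers should apply to every page, they can also be supplied once when the context is created, through the extra_http_headers option of browser.new_context(). The sketch below assumes this context-level approach; build_headers(), demo() and the X-Demo header are hypothetical names used only for illustration, and the token is a placeholder you supply yourself.

```python
def build_headers(token, extra=None):
    """Build a header dict; header names and values must be strings."""
    headers = {"Authorization": f"Bearer {token}"}
    headers.update(extra or {})
    return headers


def demo(token):
    # Imported lazily so build_headers() can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        # Headers given here are sent with every request from every page
        # created in this context.
        context = browser.new_context(
            extra_http_headers=build_headers(token, {"X-Demo": "1"})
        )
        page = context.new_page()
        # httpbin echoes the request headers back in the response body.
        page.goto("https://httpbin.org/get")
        print(page.content())
        context.close()
        browser.close()
```

Keeping the header construction in a plain function makes it easy to check without starting a browser.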

3 Capturing API response data

from playwright.sync_api import sync_playwright, Response

def on_response(response: Response) -> None:
    print(f'Status {response.status}: {response.url}')

def get_data(response: Response) -> None:
    if '/api/movie/' in response.url and response.status == 200:
        print(response.json())

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.on('response', on_response)
    page.on('response', get_data)
    page.goto('http://spa6.scrape.center')
    page.wait_for_load_state('networkidle')
    browser.close()
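
The get_data handler above prints each payload. When the data is needed later, a variation is to append it to a list instead. This is a rough sketch of that variation: make_collector() and demo() are hypothetical helpers, and the stand-in objects at the end exist only so the handler can be tried without a browser.

```python
from types import SimpleNamespace


def make_collector(store, fragment="/api/movie/"):
    """Return a 'response' handler that appends matching JSON bodies to store."""
    def handler(response):
        if fragment in response.url and response.status == 200:
            store.append(response.json())
    return handler


def demo():
    # Imported lazily so make_collector() can be used without Playwright.
    from playwright.sync_api import sync_playwright

    movies = []
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.on("response", make_collector(movies))
        page.goto("http://spa6.scrape.center")
        page.wait_for_load_state("networkidle")
        browser.close()
    print(f"collected {len(movies)} payload(s)")


# Offline check with stand-in objects (no browser needed).
collected = []
handler = make_collector(collected)
handler(SimpleNamespace(url="http://example.com/api/movie/?limit=10",
                        status=200, json=lambda: {"count": 1}))
handler(SimpleNamespace(url="http://example.com/static/app.js",
                        status=200, json=lambda: {}))
print(collected)  # [{'count': 1}]
```

Only the first stand-in matches the URL fragment, so only its payload is stored.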

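The examples above rely on page.on(), which registers an event listener: whenever the named event fires, the callback is invoked. The callback can be a lambda or a regular named function.
Another use case for page.on() is reading what a site writes to the browser console, for example on Baidu: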

from playwright.sync_api import sync_playwright

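def console_msg(msg):
    # msg.args holds one JSHandle per argument passed to console.log()
    values = []
    for value in msg.args:
        print(value)
        # json_value() converts the handle into a plain Python value
        values.append(f'{value.json_value()}')
    # print(values)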

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.on('console', console_msg)
    page.goto('https://www.baidu.com')
    page.wait_for_load_state('networkidle')
    browser.close()

# Example console output from Baidu (kept verbatim):
# 这是一个最好的时代,
# 科技的发展给予了每个人创造价值的可能性;
# 这也是一个最充满想象的时代,
# 每一位心怀梦想的人,终会奔向星辰大海。
# 百度与你们一起仰望星辰大海,携手共筑未来!

# %c百度2023校园招聘简历投递:https://talent.baidu.com/jobs/list
# color:red
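
As mentioned earlier, the callback passed to page.on() can also be a lambda. The sketch below shows that form, along with page.once() (a listener that fires a single time) and page.remove_listener() for detaching a listener. format_console() and demo() are hypothetical names introduced for illustration.

```python
def format_console(msg_type, text):
    """Format one console message as '[type] text'."""
    return f"[{msg_type}] {text}"


def demo():
    # Imported lazily so format_console() can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()

        # A lambda works as a listener; keep a reference so it can be removed.
        log_console = lambda msg: print(format_console(msg.type, msg.text))
        page.on("console", log_console)

        # page.once() registers a listener that is called for one event only.
        page.once("response", lambda r: print("first response:", r.url))

        page.goto("https://www.baidu.com")
        page.wait_for_load_state("networkidle")

        # Detach the console listener once it is no longer needed.
        page.remove_listener("console", log_console)
        browser.close()
```

Here msg.text gives the message as a single string, which is often simpler than iterating over msg.args.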

4 Setting a proxy

When launching the browser, pass the proxy details through the proxy parameter:

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(proxy={
        'server': 'https://xxx.xxx.xxx.xxx:8080',
        # 'username': 'xxx',
        # 'password': 'xxx'
    })
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    context.close()
    browser.close()
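
Proxy providers often hand out a single URL of the form scheme://user:password@host:port, while the proxy parameter expects server, username and password as separate keys. The following sketch converts one into the other with the standard library. proxy_settings() and demo() are hypothetical helpers, and the proxy addresses mentioned are placeholders rather than working proxies.

```python
from urllib.parse import urlsplit


def proxy_settings(proxy_url):
    """Split a proxy URL into the dict shape used by the proxy parameter."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    settings = {"server": server}
    # Credentials are optional; include them only when present in the URL.
    if parts.username:
        settings["username"] = parts.username
    if parts.password:
        settings["password"] = parts.password
    return settings


def demo(proxy_url):
    # Imported lazily so proxy_settings() can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(proxy=proxy_settings(proxy_url))
        page = browser.new_page()
        # httpbin reports the origin IP, which should be the proxy's address.
        page.goto("https://httpbin.org/get")
        print(page.content())
        browser.close()
```

For example, proxy_settings("http://127.0.0.1:7890") yields a dict with only the server key, since that URL carries no credentials.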

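5 Executing JavaScript code

First read the local JS file into memory, then register it with page.add_init_script(). The script is evaluated on every navigation, before any of the page's own scripts run.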

from playwright.sync_api import sync_playwright

with open('./source.js', 'r', encoding='utf-8') as f:
    js = f.read()

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Register the script; it runs before the page's own scripts
    page.add_init_script(js)
    page.goto("https://www.baidu.com/")
    page.wait_for_timeout(10000)
    browser.close()
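
The init script does not have to come from a file; a string built in Python works as well. The sketch below assumes a common scraping scenario, overriding navigator.webdriver before the page loads, and then reads the value back with page.evaluate(). override_property_script() and demo() are hypothetical helpers, and the contents of the original source.js are not known, so this is not a reconstruction of that file.

```python
import json


def override_property_script(target, prop, value):
    """Build JS that redefines target[prop] with a getter returning value."""
    getter = f"() => {json.dumps(value)}"
    return "Object.defineProperty(%s, %s, {get: %s});" % (
        target, json.dumps(prop), getter
    )


def demo():
    # Imported lazily so the script builder can be used without Playwright.
    from playwright.sync_api import sync_playwright

    script = override_property_script("navigator", "webdriver", False)
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        # Must be registered before goto() so it runs ahead of page scripts.
        page.add_init_script(script)
        page.goto("https://www.baidu.com/")
        # Read the property back to see whether the override took effect.
        print(page.evaluate("navigator.webdriver"))
        browser.close()
```

Using json.dumps() for the property name and value keeps the generated JavaScript correctly quoted.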
