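📖 Preface
👋 Introduction
This chapter covers settings that come up often in web scraping, such as request-header parameters and proxies.
💡 Main text
1 Setting the UA
Set the user-agent parameter when creating the page: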
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    # Set the User-Agent for this page
    page = browser.new_page(
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    )
    page.goto("https://www.baidu.com/")
    page.wait_for_timeout(10000)  # keep the window open for 10 seconds
    browser.close()
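2 Setting Headers
The UA can also be set as one of the extra request headers. Note that this only changes the User-Agent header sent with requests; the value the page sees in navigator.userAgent is not affected.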
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Set extra request headers for every request made by this page
    page.set_extra_http_headers(
        headers={
            "Authorization": "Bearer ****************************",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
        }
    )
    page.goto("https://www.baidu.com/")
    page.wait_for_timeout(2000)
    context.close()
    browser.close()
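Headers can also be applied once for a whole context instead of per page. The following is a minimal sketch under stated assumptions: `build_headers`, `DEFAULT_UA`, `run`, and the token value `my-token` are hypothetical names introduced for illustration, and the target URL is the httpbin endpoint used later in this chapter. It passes the headers through the `extra_http_headers` option of `browser.new_context()` and the UA through the `user_agent` option.

```python
from typing import Dict, Optional

# Hypothetical default UA, copied from the example above.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
)


def build_headers(token: Optional[str] = None,
                  extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the extra-headers dict; the token is only added when given."""
    headers: Dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if extra:
        headers.update(extra)
    return headers


def run(url: str = "https://httpbin.org/get") -> str:
    """Open `url` with context-wide headers and return the page HTML."""
    # Imported here so the helper above can be used without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        context = browser.new_context(
            user_agent=DEFAULT_UA,
            extra_http_headers=build_headers("my-token"),  # placeholder token
        )
        page = context.new_page()
        page.goto(url)
        body = page.content()
        context.close()
        browser.close()
    return body
```

Calling `print(run())` should show the headers echoed back by httpbin, which is a quick way to check what was actually sent.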
3 Capturing API response data
from playwright.sync_api import sync_playwright, Response


def on_response(response: Response) -> None:
    # Log the status and URL of every response
    print(f'Status {response.status}: {response.url}')


def get_data(response: Response) -> None:
    # Print the JSON body of successful movie API responses only
    if '/api/movie/' in response.url and response.status == 200:
        print(response.json())


with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Both handlers are called for each response
    page.on('response', on_response)
    page.on('response', get_data)
    page.goto('http://spa6.scrape.center')
    page.wait_for_load_state('networkidle')
    browser.close()
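Here page.on() registers an event listener: whenever the named event fires, the callback is invoked. The callback can be a lambda or a named function.
page.on() has other uses as well, for example capturing what the Baidu site writes to the browser console: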
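The two callback styles can be sketched side by side. This is an illustrative example, not the author's code: `make_collector` and `run` are hypothetical helpers, and the sketch assumes the same `/api/movie/` URL pattern and site as the example above. It registers a lambda for logging, a named handler that collects matching URLs, and then detaches the named handler with `page.remove_listener()`.

```python
from typing import Callable, List


def make_collector(keyword: str, sink: List[str]) -> Callable:
    """Return a handler that stores URLs of 200 responses containing `keyword`."""
    def handler(response) -> None:
        if keyword in response.url and response.status == 200:
            sink.append(response.url)
    return handler


def run(url: str = 'http://spa6.scrape.center') -> List[str]:
    """Load `url` and return the API URLs seen while the page was loading."""
    # Imported here so make_collector can be exercised without a browser.
    from playwright.sync_api import sync_playwright

    api_urls: List[str] = []
    collect = make_collector('/api/movie/', api_urls)
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        # Lambda callback: log every response
        page.on('response', lambda r: print(r.status, r.url))
        # Named callback: keep only the responses we care about
        page.on('response', collect)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        # Stop collecting once the initial load is done
        page.remove_listener('response', collect)
        browser.close()
    return api_urls
```

Keeping a reference to the named handler is what makes it removable later; a lambda registered inline cannot be detached this way unless it is first assigned to a variable.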
from playwright.sync_api import sync_playwright


def console_msg(msg):
    # Each console message carries its arguments as JS handles
    values = []
    for value in msg.args:
        print(value)
        values.append(f'{value.json_value()}')
    # print(values)


with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Call console_msg for every message written to the console
    page.on('console', console_msg)
    page.goto('https://www.baidu.com')
    page.wait_for_load_state('networkidle')
    browser.close()
# Sample output (Baidu's recruiting message, printed verbatim in Chinese):
# 这是一个最好的时代,
# 科技的发展给予了每个人创造价值的可能性;
# 这也是一个最充满想象的时代,
# 每一位心怀梦想的人,终会奔向星辰大海。
# 百度与你们一起仰望星辰大海,携手共筑未来!
# %c百度2023校园招聘简历投递:https://talent.baidu.com/jobs/list
# color:red
4 Setting a proxy
When launching the browser, pass the proxy details through the proxy parameter:
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(proxy={
        'server': 'https://xxx.xxx.xxx.xxx:8080',
        # Uncomment if the proxy requires authentication
        # 'username': 'xxx',
        # 'password': 'xxx'
    })
    context = browser.new_context()
    page = context.new_page()
    # httpbin echoes the request, so the origin IP shows whether the proxy is used
    page.goto('https://httpbin.org/get')
    print(page.content())
    context.close()
    browser.close()
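When credentials are optional, it can help to build the proxy dict in one place. The sketch below assumes a hypothetical helper `build_proxy` and a placeholder proxy address `http://127.0.0.1:8080`; neither comes from the original text. The keys it produces (`server`, `username`, `password`, `bypass`) are the ones Playwright's proxy setting accepts.

```python
from typing import Dict, Optional


def build_proxy(server: str,
                username: Optional[str] = None,
                password: Optional[str] = None,
                bypass: Optional[str] = None) -> Dict[str, str]:
    """Build a proxy settings dict, leaving out any option that was not given."""
    proxy: Dict[str, str] = {"server": server}
    if username is not None:
        proxy["username"] = username
    if password is not None:
        proxy["password"] = password
    if bypass is not None:
        # Comma-separated hosts that should not go through the proxy
        proxy["bypass"] = bypass
    return proxy


def run(server: str = "http://127.0.0.1:8080") -> str:
    """Fetch httpbin through the proxy and return the page HTML."""
    # Imported here so build_proxy can be used without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(proxy=build_proxy(server))
        page = browser.new_page()
        page.goto("https://httpbin.org/get")
        body = page.content()
        browser.close()
    return body
```

`run()` only succeeds if a proxy is actually listening at the given address, so replace the placeholder with a real server before trying it.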
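5 Executing JavaScript code
First read the local JS file into memory, then register it with page.add_init_script(). The script then runs on every navigation, before the page's own scripts.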
from playwright.sync_api import sync_playwright

# Read the local JS file into memory
with open('./source.js', 'r', encoding='utf-8') as f:
    js = f.read()

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # Register the JS so it runs before the page's own scripts
    page.add_init_script(js)
    page.goto("https://www.baidu.com/")
    page.wait_for_timeout(10000)
    browser.close()
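The contents of `source.js` are not shown, so the sketch below substitutes an assumed script purely for illustration: a snippet, common in scraping setups, that overrides `navigator.webdriver`. `HIDE_WEBDRIVER_JS` and `run` are hypothetical names. The script is registered on the context with `context.add_init_script()`, so it applies to every page opened from that context.

```python
# Assumed init script (not the author's source.js): make navigator.webdriver
# report undefined instead of true.
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined,
});
"""


def run(url: str = "https://www.baidu.com/"):
    """Open `url` with the init script applied and return navigator.webdriver."""
    # Imported here so the script constant can be inspected without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        context = browser.new_context()
        # Runs in every page of this context, before the page's own scripts
        context.add_init_script(HIDE_WEBDRIVER_JS)
        page = context.new_page()
        page.goto(url)
        # Read the value back to check whether the override took effect
        value = page.evaluate("navigator.webdriver")
        context.close()
        browser.close()
    return value
```

For a script kept in a file, `add_init_script` also accepts a `path` argument, for example `page.add_init_script(path='./source.js')`, which avoids reading the file by hand.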