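Introduction to Playwright
- Playwright is a web automation and testing tool developed by Microsoft. Compared with Selenium:
  - No separate WebDriver binary has to be installed.
  - Waits rarely need to be written by hand: most actions auto-wait until the target element is ready.
  - Both synchronous and asynchronous APIs are available.
  - Selenium's WebDriver protocol runs over HTTP (one request, one response), whereas Playwright talks to the browser over a WebSocket connection (bidirectional).
  - Key feature: a built-in recorder that generates code from the actions you perform in the browser:
      playwright codegen www.xxx.com
      playwright codegen -o script.py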
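- Setting up the Playwright environment:
  - Version requirement:
      Python >= 3.7 (recent Playwright releases require a newer Python version)
  - Install the package:
      pip install playwright
  - Install the bundled browsers and ffmpeg:
      playwright install
  - Documentation: https://playwright.bootcss.com/docs/why-playwright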
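Basic Playwright usage
- Import the module:
    from playwright.sync_api import sync_playwright
- Launch a visible (headed) browser, where <browser_type> is one of chromium, firefox, or webkit:
    browser = <browser_type>.launch(headless=False)
- Open a page:
    page = browser.new_page()
- Create a browser context; one context can hold several pages:
    context = browser.new_context()
    page_num = context.new_page()
- Pause for a fixed time in milliseconds (a plain delay, not a page-load timeout):
    page.wait_for_timeout(5000)
- Return the rendered HTML source:
    page.content()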
- Getting-started example:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        # Launch Chromium with a visible window
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        print(page.title())
        # Keep the window open for 3 seconds before closing
        page.wait_for_timeout(3000)
        browser.close()
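Selecting elements in Playwright
- Common element selectors:
  - Node (element handle) selectors:
      page.query_selector_all('xxx')
      page.query_selector('xxx')
  - Text selector:
      page.locator("text=text content")
  - CSS selector:
      page.locator("tag name")
    - When several elements match, page.query_selector() returns the first one. A locator is strict: an action on a locator that matches several elements raises an error, so narrow it with .first or nth.
    - A tag name can be used directly: button
    - id and class selectors: #x .y
    - Attribute selectors: "[xxx=yyy]"
  - XPath selector:
      page.locator("xpath=xxx")
  - Index selector (zero-based):
      page.locator("button >> nth=x")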
- Example:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.lagou.com/jobs/list_爬虫')
        # A selector starting with // is treated as XPath
        jobs_data_list = page.query_selector_all('//*[@id="s_position_list"]/ul/li')
        for jobs_data in jobs_data_list:
            # Query relative to each list item
            job_title = jobs_data.query_selector('xpath=./div[1]/div[1]/div[1]/a/h3').text_content()
            print(job_title)
        browser.close()
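- Common operations on a selected element:
  - .text_content(): returns the element's text content
  - .fill('text'): sets the value of an input field in one step
  - .type('text'): types the text character by character, firing keyboard events
  - .get_attribute('attribute name'): returns the value of the given attribute
  - .press('Shift+A'): presses a key or key combination
  - .wait_for(): waits until the element reaches the requested state (visible by default)
  - single mouse actions such as .click(), .dblclick(), and .hover(): see the mouse operations section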
- Example:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        # Methods like these can be called in two ways:
        #   1. page.locator('xpath=xxx').fill('test')
        #   2. page.fill('xpath=xxx', 'test')
        # Both have the same effect; use whichever you prefer.
        browser.close()
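The selector prefixes and locator operations listed in this section can be combined roughly as in the sketch below. The helper names (text_selector, xpath_selector, nth_selector, submit_query) are invented for illustration and are not part of Playwright; page is assumed to be a Playwright Page created as in the examples above.

```python
def text_selector(text):
    """Build a text selector, e.g. 'text=Login'."""
    return f"text={text}"


def xpath_selector(expression):
    """Build an explicit XPath selector, e.g. 'xpath=//form/input'."""
    return f"xpath={expression}"


def nth_selector(css, index):
    """Pick the index-th (zero-based) match of a CSS selector."""
    return f"{css} >> nth={index}"


def submit_query(page, input_selector, query):
    """Fill an input through a locator and press Enter (hypothetical helper)."""
    box = page.locator(input_selector)
    box.fill(query)      # replace the field's value in one step
    box.press("Enter")   # send a key press to the same element
```

With a real page, a call such as submit_query(page, nth_selector("input", 0), "playwright") would fill the first input on the page and submit it; whether that input is the one you want depends on the site.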
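Mouse operations in Playwright
- Single mouse events:
  - Left click:
      page.click('selector')
  - Left double-click:
      page.dblclick('selector')
  - Hover:
      page.hover('selector')
  - Right click:
      page.click('selector', button='right')
  - Shift + click:
      page.click('selector', modifiers=['Shift'])
  - Click a specific position inside the element (offset from its top-left corner):
      page.click('selector', position={'x': 0, 'y': 0})
- Press-and-hold mouse events:
  - Press the mouse button and hold it:
      page.mouse.down()
  - Move the mouse to the given coordinates:
      page.mouse.move(x, y, steps=10)
  - Release the mouse button:
      page.mouse.up()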
- Example:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        # These methods can also be called in two ways:
        #   1. page.locator('xpath=xxx').click()
        #   2. page.click('xpath=xxx')
        # Both have the same effect; use whichever you prefer.
        browser.close()
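The press, move, and release calls are usually chained to drag something, for example a slider handle. The sketch below shows one way to do that; drag is a hypothetical helper name, and the coordinates are assumed to be page coordinates that you have already worked out.

```python
def drag(page, start, end, steps=10):
    """Press at `start`, move to `end` in `steps` increments, then release.

    `start` and `end` are (x, y) tuples in page coordinates.
    """
    page.mouse.move(start[0], start[1])            # position the pointer
    page.mouse.down()                              # press and hold the left button
    page.mouse.move(end[0], end[1], steps=steps)   # move while holding
    page.mouse.up()                                # release the button
```

A larger steps value makes the pointer travel through more intermediate positions, which some pages need in order to register the drag.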
Asynchronous concurrency in Playwright
    import asyncio
    from playwright.async_api import async_playwright

    async def main():
        async with async_playwright() as pw:
            # Every browser operation is awaited in the async API
            browser = await pw.chromium.launch()
            page = await browser.new_page()
            await page.goto('https://www.baidu.com')
            print(await page.title())
            await browser.close()

    asyncio.run(main())
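The example above loads a single page, so nothing actually runs concurrently. A minimal sketch of concurrent loading with asyncio.gather might look like the following. The function names fetch_title, fetch_titles, and run_demo are made up for illustration; the Playwright import sits inside run_demo so the helpers can be read and exercised on their own.

```python
import asyncio


async def fetch_title(browser, url):
    """Open one page, load `url`, and return its title."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def fetch_titles(browser, urls):
    """Load all URLs concurrently; results keep the order of `urls`."""
    return await asyncio.gather(*(fetch_title(browser, url) for url in urls))


async def run_demo(urls):
    """Hypothetical entry point; needs Playwright and its browsers installed."""
    from playwright.async_api import async_playwright  # imported lazily on purpose

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        try:
            return await fetch_titles(browser, urls)
        finally:
            await browser.close()
```

Calling asyncio.run(run_demo(['https://www.baidu.com', 'https://mail.163.com/'])) would open one page per URL in the same browser and wait for all of them together.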
Other Playwright operations
Opening several pages at once:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser_type = p.chromium
        browser = browser_type.launch(headless=False)
        # Pages created from the same context share cookies and storage
        context = browser.new_context()
        page1 = context.new_page()
        page1.goto('https://mail.163.com/')
        page2 = context.new_page()
        page2.goto("https://www.baidu.com/")
        context.close()
        browser.close()
Taking screenshots:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.webkit.launch()
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        # Visible viewport only
        page.screenshot(path="baidu.png")
        # Entire scrollable page
        page.screenshot(path="screenshot.png", full_page=True)
        # A single element ('selector' is a placeholder)
        page.locator('selector').screenshot(path="test.png")
        browser.close()
Entering a frame:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        # There are four ways to get a frame:
        #   1. By URL: page.frame(url='www.title.com')
        #   2. By name: page.frame('title')
        #   3. Through the frame's element: page.query_selector('.title').content_frame()
        #   4. By index, after inspecting the list of all frames: page.frames[index]
        # '.title' is a placeholder selector for the frame element
        frame = page.query_selector('.title').content_frame()
        browser.close()
Loading a page without images (request interception):
    import re
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()

        def cancel_request(route, request):
            # Abort the intercepted request so the image is never downloaded
            route.abort()

        # Intercept every request whose URL contains .png or .jpg
        page.route(re.compile(r"(\.png)|(\.jpg)"), cancel_request)
        page.goto("https://movie.douban.com/")
        page.wait_for_load_state('networkidle')
        page.screenshot(path='move_douban.png')
        browser.close()
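Matching on file extensions misses images served without .png or .jpg in the URL. An alternative, sketched below, is to decide by the request's resource type instead. This swaps the URL regex for a different check and is not the approach used above; block_images and install_image_blocker are hypothetical names.

```python
def block_images(route):
    """Abort image requests and let everything else through."""
    if route.request.resource_type == "image":
        route.abort()
    else:
        route.continue_()


def install_image_blocker(page):
    """Register the handler for every request made by `page`."""
    page.route("**/*", block_images)
```

Call install_image_blocker(page) before page.goto(...) so the handler is in place when the first requests go out.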
Event listeners, which can capture data loaded through Ajax:
    from playwright.sync_api import sync_playwright

    def on_response(response):
        # Print the JSON body of successful movie API responses
        if '/api/movie/' in response.url and response.status == 200:
            print(response.json())

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        # Register the listener before navigating so no response is missed
        page.on('response', on_response)
        page.goto('https://spa6.scrape.center/')
        page.wait_for_load_state('networkidle')
        browser.close()
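Printing inside the listener is fine for a quick look, but the data is usually needed afterwards. One possible pattern, shown here as a sketch, is a small factory that returns a handler together with the list it fills. The name make_json_collector is invented for illustration.

```python
def make_json_collector(url_fragment):
    """Return (handler, results); the handler stores matching JSON bodies."""
    results = []

    def on_response(response):
        # Keep only successful responses whose URL contains the fragment
        if url_fragment in response.url and response.status == 200:
            results.append(response.json())

    return on_response, results
```

Typical use would be handler, movies = make_json_collector('/api/movie/'), then page.on('response', handler) before page.goto(...); once the page has settled, movies holds the parsed payloads.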
Preventing Playwright from being detected through navigator.webdriver:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.webkit.launch(headless=False)
        page = browser.new_page()
        # The init script runs before any script of the page itself
        page.add_init_script(
            """
            Object.defineProperties(navigator, {
                webdriver: {
                    get: () => undefined
                }
            });
            """
        )
        page.goto('https://www.baidu.com')
        page.wait_for_timeout(100000)
        browser.close()
Emulating a mobile device:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        # Built-in device descriptor (user agent, viewport, touch support, ...)
        mobile_type = pw.devices['iPhone 12']
        browser = pw.webkit.launch(headless=False)
        context = browser.new_context(
            **mobile_type,
            locale='zh-CN',
            geolocation={'longitude': 115.725177, 'latitude': 34.404329},
            permissions=['geolocation']
        )
        page = context.new_page()
        page.goto('https://amap.com')
        page.wait_for_load_state(state='networkidle')
        page.screenshot(path='mobile_web.png')
        browser.close()
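When several contexts need the same device with different locations, the keyword arguments can be assembled in one place. The helper below is a sketch; mobile_context_options is a hypothetical name, and device is assumed to be a descriptor dictionary such as pw.devices['iPhone 12'].

```python
def mobile_context_options(device, longitude, latitude, locale="zh-CN"):
    """Merge a device descriptor with locale and geolocation settings."""
    options = dict(device)  # copy so the descriptor itself is not modified
    options.update(
        locale=locale,
        geolocation={"longitude": longitude, "latitude": latitude},
        permissions=["geolocation"],
    )
    return options
```

The result would be passed on as browser.new_context(**mobile_context_options(pw.devices['iPhone 12'], 115.725177, 34.404329)).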
Getting an element's coordinates relative to the browser viewport:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        s = page.locator('xpath=xxx')
        # bounding_box() returns the element's position relative to the
        # viewport and its size, as a dictionary:
        #   {
        #       'x': 837.5375366210938,
        #       'y': 190.31250762939453,
        #       'width': 56,
        #       'height': 56
        #   }
        box = s.bounding_box()
        # Center point of the element
        x = int(box["x"] + box["width"] / 2)
        y = int(box["y"] + box["height"] / 2)
        browser.close()
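The center calculation can be pulled into a small function and combined with a coordinate-based click. This is a sketch: center_of and click_center are hypothetical helper names, and the None check is there because bounding_box() returns None when the element is not visible.

```python
def center_of(box):
    """Return the integer (x, y) center of a bounding-box dictionary."""
    x = int(box["x"] + box["width"] / 2)
    y = int(box["y"] + box["height"] / 2)
    return x, y


def click_center(page, selector):
    """Click the center of the first element matching `selector`."""
    box = page.locator(selector).first.bounding_box()
    if box is None:  # bounding_box() returns None for invisible elements
        raise ValueError(f"element not visible: {selector}")
    x, y = center_of(box)
    page.mouse.click(x, y)
    return x, y
```

For the sample dictionary shown above, center_of gives (865, 218).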