Playwright: A Next-Generation Automation Tool > Writing Crawlers This Way?

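Introduction to Playwright

  • Playwright is a next-generation web automation and testing tool developed by Microsoft. Compared with Selenium:
    • No separate WebDriver installation is needed
    • No manual waits are needed in most cases, because actions wait for their target element automatically
    • Playwright supports async usage
    • Selenium communicates over HTTP (one-way), whereas Playwright communicates over WebSocket (two-way)
    • Highlight: a built-in recorder that generates code from the actions you perform while recording
      • playwright codegen www.xxx.com
      • playwright codegen -o script.py
  • Setting up Playwright:
    • Version requirement: Python >= 3.7 (at the time of writing; newer Playwright releases require a newer Python version)
    • Install the package: pip install playwright
    • Install the bundled browsers and ffmpeg: playwright install
      • Bundled browsers:
        • chromium
        • firefox
        • webkit
  • Documentation: https://playwright.bootcss.com/docs/why-playwright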

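Basic Playwright usage

  • Import the module:
    • from playwright.sync_api import sync_playwright
  • Launch a visible (headed) browser:
    • browser = <browser type>.launch(headless=False)
      • headless=False opens a browser window; the default is headless
        • Browser types: chromium, firefox, webkit
  • Pages:
    • page = browser.new_page()
      • new_page(): creates a page (in a new context of its own)
    • context = browser.new_context()
      • new_context(): creates a browser context, which can hold multiple pages
      • page_num = context.new_page()
        • new_page(): creates a new page inside that context
  • Pause for a fixed time (in milliseconds):
    • page.wait_for_timeout(5000)
  • Get the rendered HTML source:
    • page.content()
  • Getting-started example: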
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # Launch a headed (visible) browser
    browser = pw.chromium.launch(headless=False)
    # Create a page
    page = browser.new_page()
    # Open Baidu
    page.goto('https://www.baidu.com')
    # Print the page title
    print(page.title())
    page.wait_for_timeout(3000)
    browser.close()
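
The three browser types listed above are exposed as attributes of the Playwright object, so the engine can be chosen by name. The following is a minimal sketch under that assumption; `ENGINES`, `launch` and `demo` are hypothetical names introduced here, not part of Playwright.

```python
ENGINES = ("chromium", "firefox", "webkit")


def launch(pw, engine="chromium", headless=True):
    """Launch the browser named by `engine` from a Playwright instance `pw`."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    # pw.chromium, pw.firefox and pw.webkit are the three browser types.
    return getattr(pw, engine).launch(headless=headless)


def demo():
    # Imported here so that launch() can be reused and tested without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        for engine in ENGINES:
            browser = launch(pw, engine)
            page = browser.new_page()
            page.goto('https://www.baidu.com')
            print(engine, page.title())
            browser.close()
```

Calling `demo()` opens the same page in each engine in turn, which is a quick way to check that `playwright install` downloaded all three browsers.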

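Selecting elements in Playwright

  • Common element selectors:
    • Node selectors:
      • page.query_selector_all('xxx')
        • Returns all nodes on the page that match xxx
      • page.query_selector('xxx')
        • Returns the node that matches xxx; if several match, returns the first (None if nothing matches)
    • Text selector:
      • page.locator("text=some text")
        • The text can also be a regular expression, written as text=/pattern/
    • CSS selectors:
      • page.locator("tag name")
        • If several elements match, narrow the locator (for example with .first or nth=); recent Playwright versions raise a strict-mode error when an action targets a locator with multiple matches
          • By tag name: button
          • By id or class: #x .y
          • By attribute: "[xxx=yyy]"
    • XPath selector:
      • page.locator("xpath=xxx")
    • Index selector:
      • page.locator("button >> nth=x")
        • Selects the button at index x (zero-based)
  • Example: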
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.lagou.com/jobs/list_爬虫')
    # Get every li in the job list
    jobs_data_list = page.query_selector_all('//*[@id="s_position_list"]/ul/li')
    # Iterate over the li elements and read the title of each one
    for jobs_data in jobs_data_list:
        job_title = jobs_data.query_selector('xpath=./div[1]/div[1]/div[1]/a/h3').text_content()
        print(job_title)
    browser.close()
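
`query_selector` returns `None` when nothing matches, so chaining `.text_content()` directly, as the loop above does, raises an `AttributeError` on any list item that lacks the heading. Here is a minimal sketch of a more defensive loop. `extract_titles` and `demo` are hypothetical names introduced here, and the selectors are the ones from the example, which may no longer match the live site.

```python
def extract_titles(page, item_selector, title_selector):
    """Return the stripped title text found under each matched item."""
    titles = []
    for item in page.query_selector_all(item_selector):
        node = item.query_selector(title_selector)
        if node is None:
            # No title under this item: skip it instead of crashing.
            continue
        text = node.text_content()
        if text:
            titles.append(text.strip())
    return titles


def demo():
    # Imported here so that extract_titles() can be reused without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.lagou.com/jobs/list_爬虫')
        print(extract_titles(
            page,
            '//*[@id="s_position_list"]/ul/li',
            'xpath=./div[1]/div[1]/div[1]/a/h3',
        ))
        browser.close()
```

Because the helper only relies on `query_selector_all`, `query_selector` and `text_content`, it works with a page or with any element handle as the starting point.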
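  • Common operations on a selected element:
    • .text_content()
      • Returns the element's text content
    • .fill('text')
      • Enters the text all at once
    • .type('text')
      • Enters the text one character at a time
    • .get_attribute('attribute name')
      • Returns the value of the given attribute
    • .press('Shift+A')
      • Presses a key or key combination on the element
    • .wait_for()
      • Waits until the element is ready (visible, by default)
    • .<single mouse action>()
      • Performs a mouse action directly on the selected element, for example .click()
  • Example: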
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.baidu.com')
    """
    Methods like these can be called in two ways:
        1. page.locator('xpath=xxx').fill('test')
        2. page.fill('xpath=xxx', 'test')
        Both do the same thing; use whichever you prefer.
    """
    browser.close()
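
The example above only describes the two calling styles, so here is a rough sketch that combines `fill` and `press` into a search. `search` and `demo` are hypothetical names, and the default selector `#kw` is an assumption about the id of Baidu's search input; check it in the browser's developer tools before relying on it.

```python
def search(page, keyword, input_selector="#kw"):
    """Fill the search box with `keyword` and submit it with Enter."""
    box = page.locator(input_selector)
    box.fill(keyword)    # sets the whole value in one step
    box.press("Enter")   # sends the key press to that input


def demo():
    # Imported here so that search() can be reused without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        search(page, 'playwright')   # '#kw' is assumed, see the note above
        page.wait_for_timeout(3000)
        print(page.title())
        browser.close()
```

The same helper could be written with `page.fill(selector, keyword)` and `page.press(selector, "Enter")`; the locator form simply avoids repeating the selector.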

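Mouse operations in Playwright

  • Single mouse actions:
    • Left click:
      • page.click('selector')
    • Double click:
      • page.dblclick('selector')
    • Hover:
      • page.hover('selector')
    • Right click:
      • page.click('selector', button='right')
    • Shift + click:
      • page.click('selector', modifiers=['Shift'])
    • Click at a specific position inside the element:
      • page.click('selector', position={'x': 0, 'y': 0})
  • Press-and-hold mouse actions:
    • Press the mouse button and hold it:
      • page.mouse.down()
    • Move the mouse to a position:
      • page.mouse.move(x, y, steps=10)
        • steps: optional; the number of intermediate move events. Larger values give a slower, smoother movement
    • Release the mouse button:
      • page.mouse.up()
  • Example: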
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.baidu.com')
    """
    These can also be called in two ways:
        1. page.locator('xpath=xxx').click()
        2. page.click('xpath=xxx')
        Both do the same thing; use whichever you prefer.
    """
    browser.close()
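
The press-and-hold actions listed above are usually combined into a drag, for example to move a slider. A minimal sketch, where `drag` is a hypothetical helper and the coordinates in the usage note are made-up values:

```python
def drag(mouse, start, end, steps=10):
    """Press at `start`, move to `end` in `steps` increments, then release.

    `mouse` is expected to be `page.mouse`; `start` and `end` are (x, y) tuples.
    """
    mouse.move(start[0], start[1])           # position the pointer first
    mouse.down()                             # press and hold the left button
    mouse.move(end[0], end[1], steps=steps)  # drag with intermediate events
    mouse.up()                               # release
```

With a real page this would be called as `drag(page.mouse, (100, 200), (300, 200))`. Moving to the start point before pressing matters, because `mouse.down()` presses at the pointer's current position.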

Async concurrency in Playwright

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://www.baidu.com')
        print(await page.title())
        await browser.close()

asyncio.run(main())
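
The listing above uses the async API for a single page. To actually load several pages concurrently, one option is `asyncio.gather`. This is a sketch under that assumption; `fetch_title`, `fetch_titles` and `demo` are hypothetical names, and the URLs are taken from other examples in this post.

```python
import asyncio


async def fetch_title(browser, url):
    """Open `url` in its own page and return the page title."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        # Close the page even if navigation fails.
        await page.close()


async def fetch_titles(browser, urls):
    """Load all URLs concurrently; results keep the order of `urls`."""
    return await asyncio.gather(*(fetch_title(browser, u) for u in urls))


async def demo():
    # Imported here so the helpers above can be reused without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        titles = await fetch_titles(
            browser, ['https://www.baidu.com', 'https://movie.douban.com/']
        )
        print(titles)
        await browser.close()
```

Run it with `asyncio.run(demo())`. For a large URL list it would be sensible to cap the number of simultaneous pages, for example with an `asyncio.Semaphore`.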

Other Playwright operations

Opening multiple pages at once:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser_type = p.chromium
    browser = browser_type.launch(headless=False)
    context = browser.new_context()
    page1 = context.new_page()
    page1.goto('https://mail.163.com/')
    page2 = context.new_page()
    page2.goto("https://www.baidu.com/")
    context.close()
    browser.close()

Taking screenshots:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.webkit.launch()
    page = browser.new_page()
    page.goto('https://www.baidu.com')
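    # Screenshot of the current viewport
    page.screenshot(path="baidu.png")
    # Full-page screenshot, covering the page from top to bottom
    page.screenshot(path="screenshot.png", full_page=True)
    # Screenshot of a single element ('selector' is a placeholder)
    page.locator('selector').screenshot(path="test.png")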
    browser.close()

Working with frames:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.baidu.com')
    """
    There are four ways to get a frame:
        1. By URL: page.frame(url='www.title.com')
        2. By name: page.frame('title')
        3. Through its element: page.query_selector('.title').content_frame()
        4. By listing all frames with page.frames, then indexing: page.frames[index]
    """
    # '.title' is a placeholder selector; query_selector returns None if nothing matches
    frame = page.query_selector('.title').content_frame()
    browser.close()
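
The fourth approach, going through `page.frames`, is easier to use with a small lookup helper than with a hard-coded index. A minimal sketch, where `find_frame` and its parameters are hypothetical names introduced here:

```python
def find_frame(page, name=None, url_part=None):
    """Return the first frame whose name equals `name` or whose URL contains
    `url_part`; return None when no frame matches."""
    for frame in page.frames:
        if name is not None and frame.name == name:
            return frame
        if url_part is not None and url_part in frame.url:
            return frame
    return None
```

For example, `find_frame(page, url_part='login')` returns the first frame whose URL contains "login", after which the usual `frame.locator(...)` calls work inside it. Checking the result for `None` avoids the `AttributeError` that a missing frame would otherwise cause.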

Skipping images when loading a page (request interception):

from playwright.sync_api import sync_playwright
import re
 
with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    
    # Handler that aborts the intercepted request
    def cancel_request(route, request):
        route.abort()

    # Intercept requests whose URL contains '.png' or '.jpg'
    page.route(re.compile(r"(\.png)|(\.jpg)"), cancel_request)
 
    page.goto("https://movie.douban.com/")
    page.wait_for_load_state('networkidle')
    # Save a screenshot to check the result
    page.screenshot(path='move_douban.png')
    browser.close()
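
Matching on the URL misses images served without a `.png` or `.jpg` extension (for example `.webp`, or URLs with no extension at all). An alternative sketch filters on the request's resource type instead; `block_images` and `demo` are hypothetical names, and the output filename is made up for illustration.

```python
def block_images(route):
    """Abort image requests and let every other request through."""
    if route.request.resource_type == "image":
        route.abort()
    else:
        route.continue_()


def demo():
    # Imported here so that block_images() can be reused without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        # "**/*" matches every request, so the handler decides what to block.
        page.route("**/*", block_images)
        page.goto("https://movie.douban.com/")
        page.wait_for_load_state('networkidle')
        page.screenshot(path='douban_no_images.png')
        browser.close()
```

Since the handler now sees every request, it has to call `route.continue_()` for the ones it does not block; otherwise those requests would hang.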

Event listeners, which can capture data loaded via Ajax:

from playwright.sync_api import sync_playwright

def on_response(response):
    if '/api/movie/' in response.url and response.status == 200:
        print(response.json())

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.on('response', on_response)
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    browser.close()
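
Printing inside the handler is fine for a quick look, but a crawler usually wants to keep the data. One way, sketched here, is to build the handler as a closure that appends each matching JSON body to a list. `make_collector` and `demo` are hypothetical names; the URL marker and site are the ones used in the example above.

```python
def make_collector(url_marker, sink):
    """Build a 'response' handler that appends matching JSON bodies to `sink`."""
    def on_response(response):
        if url_marker in response.url and response.status == 200:
            sink.append(response.json())
    return on_response


def demo():
    # Imported here so that make_collector() can be reused without a browser.
    from playwright.sync_api import sync_playwright

    movies = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.on('response', make_collector('/api/movie/', movies))
        page.goto('https://spa6.scrape.center/')
        page.wait_for_load_state('networkidle')
        browser.close()
    print(len(movies), 'responses captured')
```

After `demo()` finishes, `movies` holds one parsed JSON object per captured API response, ready to be saved or processed further.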

Preventing Playwright from being detected as a webdriver:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.webkit.launch(headless=False)
    page = browser.new_page()
    page.add_init_script(
        """
            Object.defineProperties(navigator, {
                webdriver:{
                    get:()=>undefined
                }
            });
        """
    )
    page.goto('https://www.baidu.com')
    page.wait_for_timeout(100000)
    browser.close()

Emulating a mobile device:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # Choose the device profile
    mobile_type = pw.devices['iPhone 12']
    browser = pw.webkit.launch(headless=False)
    context = browser.new_context(
        **mobile_type,
        # Set the locale
        locale='zh-CN',
        # Set the geolocation (longitude and latitude)
        geolocation={'longitude': 115.725177, 'latitude': 34.404329},
        permissions=['geolocation']
    )
    page = context.new_page()
    page.goto('https://amap.com')
    page.wait_for_load_state(state='networkidle')
    page.screenshot(path='mobile_web.png')
    browser.close()

Getting an element's coordinates relative to the browser viewport:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.baidu.com')
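    # Locate the slider ('xpath=xxx' is a placeholder)
    s = page.locator('xpath=xxx')
    """
    xxx.bounding_box()
    Returns the element's coordinates relative to the viewport together with its size, as a dict:
    {
        'x': 837.5375366210938,
        'y': 190.31250762939453,
        'width': 56,
        'height': 56
    }
    """
    box = s.bounding_box()
    # Compute the element's center point: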
    x = int(box["x"] + box["width"] / 2)
    y = int(box["y"] + box["height"] / 2)
    browser.close()
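
The center-point arithmetic above can be wrapped in a small function that also handles the case where `bounding_box()` returns `None`, which happens when the element is not visible. A minimal sketch; `box_center` is a hypothetical name introduced here.

```python
def box_center(box):
    """Return the integer (x, y) center of a bounding box dict, or None."""
    if box is None:
        # bounding_box() returns None when the element is not visible.
        return None
    x = int(box["x"] + box["width"] / 2)
    y = int(box["y"] + box["height"] / 2)
    return (x, y)


# The sample box from the listing above:
sample = {'x': 837.5375366210938, 'y': 190.31250762939453, 'width': 56, 'height': 56}
print(box_center(sample))  # (865, 218)
```

The resulting coordinates can be passed straight to `page.mouse.move(x, y)` as the starting point of a slider drag.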