Web scraping: playwright + BeautifulSoup is just too good

Latest recommended article published on 2025-03-31 08:30:00

哟哟-

Category: Web scraping. Tags: web scraping, beautifulsoup, frontend, python, programming languages

Article link: https://blog.csdn.net/qq_29517595/article/details/141187119

It really is that good!!

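playwright: scrapes dynamic pages. It handles things like page navigation and fetching an element's HTML, and it can simulate user actions.
BeautifulSoup: parses static pages. Its strength is that it can walk HTML elements in order. On an article page, for example, each paragraph is usually wrapped in its own <p></p> tag, with images, tables, and so on in between. Hand the page content fetched by playwright to BeautifulSoup and parse it in order.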
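
To make the "parse in order" idea concrete, here is a minimal sketch. The function name `parse_in_order`, the `sample_html` string, and the `(kind, value)` output format are all made up for illustration and are not from the original post. It assumes the paragraphs, images, and tables are direct children of the HTML fragment you pass in.

```python
from bs4 import BeautifulSoup, Tag


def parse_in_order(html: str) -> list[tuple[str, object]]:
    """Return (kind, value) pairs for the top-level elements, in document order.

    Hypothetical helper for illustration; only <p>, <img> and <table> are handled.
    """
    soup = BeautifulSoup(html, "html.parser")
    items: list[tuple[str, object]] = []
    for child in soup.children:
        if not isinstance(child, Tag):
            continue  # skip whitespace and bare text between tags
        if child.name == "p":
            text = child.get_text(strip=True)
            if text:
                items.append(("text", text))
        elif child.name == "img":
            items.append(("image", child.get("src", "")))
        elif child.name == "table":
            # Flatten the table into a list of rows, each a list of cell texts
            rows = [
                [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
                for tr in child.find_all("tr")
            ]
            items.append(("table", rows))
    return items


# Made-up fragment standing in for the HTML that playwright would return
sample_html = """
<p>First paragraph.</p>
<img src="/images/a.png">
<p>Second paragraph.</p>
<table><tr><th>Name</th><th>Value</th></tr><tr><td>x</td><td>1</td></tr></table>
"""

for kind, value in parse_in_order(sample_html):
    print(kind, value)
```

Because the loop follows `soup.children`, the text, image, and table entries come out in the same order they appear on the page, which is what you want when rebuilding an article.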

Here is a simple example:

Fetching a dynamic page with playwright

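from playwright.async_api import async_playwright


async def page_crawler(url: str, logger) -> tuple[int, dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)

        # Grab the HTML to parse; this could also be the whole page's elements
        info_html = await page.locator('div.con-bd').inner_html()

        # The HTML is now a plain string, so the browser can be closed
        await browser.close()

        # Run the page parsing (process_element is defined separately)
        result = await process_element(info_html)

        return result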