Getting Started with scrapy-playwright (A Crawler Tutorial)


1. Why This Article

Note: in my testing, this plugin only ran reliably on macOS and Linux.
While working with the Scrapy framework, I needed to capture a page's response data. I learned that Playwright supports this, and that the scrapy-playwright plugin integrates the two. In practice, however, I ran into many unexpected bugs; even the examples in the GitHub repository would not run as-is, until I discovered the missing pieces described below. (It did not work on Windows for me, and my local virtual machine also failed; it finally ran on a remote Linux server.)


2. Prerequisites

Getting started with Scrapy: https://blog.csdn.net/xw1680/article/details/135089951
Playwright overview: https://blog.csdn.net/gitblog_00100/article/details/137667697
Getting started with Playwright: https://blog.csdn.net/weixin_43845191/article/details/108271962
scrapy-playwright repository: https://gitcode.com/scrapy-plugins/scrapy-playwright


3. Environment Setup

Set up a remote server, or a local virtual machine (Ubuntu 20.04 or later recommended, for reference): https://blog.csdn.net/YYSonic407/article/details/139422632
Install and configure Python: https://blog.csdn.net/weixin_64079883/article/details/129352508
Install PyCharm (the IDE used throughout this article): https://blog.csdn.net/weixin_47556601/article/details/121159698
Configure PyCharm for convenient use: https://blog.csdn.net/a_cherry_blossoms/article/details/123421990

4. Steps in Detail

1. Create a Scrapy project

In your project folder, run the following commands one at a time:

# Install Scrapy (or: pip3 install scrapy)
pip install scrapy
# Create a Scrapy project
scrapy startproject myscrapy
# Enter the project directory
cd myscrapy
# Generate a spider
scrapy genspider csdn blog.csdn.net

2. Configure the project environment

Open the Scrapy project in PyCharm.

  • Configure a virtual environment
    File -> Settings -> Project: xxx -> Python Interpreter -> gear icon -> Add Local Interpreter (defaults to venv or .venv)
  • Install the project's libraries
    In PyCharm's built-in terminal, run the following one by one:
pip install scrapy
pip install playwright
pip install scrapy-playwright
playwright install
  • Check the installed libraries:
pip list
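Besides pip list, you can confirm that the packages are importable from Python itself. The helper below is my own sketch (not part of any of these libraries); it reports which modules are missing without actually importing them:

```python
import importlib.util

def check_installed(mods):
    """Return the subset of mods that cannot be found by the import system."""
    return [m for m in mods if importlib.util.find_spec(m) is None]

# Note: the plugin's import name is scrapy_playwright (underscore, not hyphen)
missing = check_installed(["scrapy", "playwright", "scrapy_playwright"])
print("missing:", missing)
```

If the list is empty, the environment is ready.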

3. Use the scrapy-playwright plugin

The goal of this program is to listen for the target site's response data. Playwright provides the on() method for this, e.g. page.on("response", handler), where handler is the callback that handles the event.
To dig deeper into event handling, see:
https://blog.csdn.net/weixin_44104090/article/details/128991945
https://playwright.dev/python/docs/api/class-page
https://github.com/scrapy-plugins/scrapy-playwright?tab=readme-ov-file#handling-page-events
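At its core, page.on() is a plain publish/subscribe pattern: callbacks are registered under an event name and invoked whenever that event fires. The toy emitter below is my own sketch of the idea, not Playwright's actual implementation:

```python
import asyncio

class TinyEmitter:
    """Toy publish/subscribe emitter, sketching the page.on() idea."""

    def __init__(self):
        self._handlers = {}  # event name -> list of registered callbacks

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    async def emit(self, event, payload):
        # Dispatch the payload to every handler registered for this event
        for handler in self._handlers.get(event, []):
            await handler(payload)

async def demo():
    page = TinyEmitter()
    seen = []

    async def handle_response(url):
        seen.append(url)

    page.on("response", handle_response)  # analogous to page.on("response", handler)
    await page.emit("response", "https://example.org/data.json")
    return seen

print(asyncio.run(demo()))  # prints ['https://example.org/data.json']
```

scrapy-playwright wires this same mechanism up for you, as shown next.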

Event handling is wired up through the plugin's playwright_page_event_handlers meta key.
Copy the code below to the end of csdn.py in the Scrapy project, then run the scrapy crawl events command in the terminal.
Here is a partial code example:

import scrapy
from playwright.async_api import Dialog, Response as PlaywrightResponse

class EventsSpider(scrapy.Spider):
    """Handle page events."""

    name = "events"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://blog.csdn.net/csdngeeknews/article/details/139284722",
            meta={
                "playwright": True,
                "playwright_page_event_handlers": {
                    "dialog": self.handle_dialog,
                    "response": "handle_response",
                },
            },
        )

    async def handle_dialog(self, dialog: Dialog) -> None:
        print(f"Handled dialog with message: {dialog.message}")
        await dialog.dismiss()

    async def handle_response(self, response: PlaywrightResponse) -> None:
        if ".json" in response.url:
            # Open in append mode so earlier matches are not overwritten
            with open("json.txt", "a", encoding="utf-8") as f:
                f.write(f"Received response with URL {response.url}\n")

    def parse(self, response, **kwargs):
        pass

Reading the source shows that, under the hood, the plugin simply calls the page's on() method; it adds a layer of wrapping that simplifies the code you have to write. The plugin has many other clever uses that are not covered here.
Below is the source where the playwright_page_event_handlers key is applied:
(screenshot of the source code omitted)
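In the spider above, the reactor and download handlers are set via custom_settings on a single spider. The same configuration can instead live in the project's settings.py so that every spider picks it up, as documented in the scrapy-playwright README (PLAYWRIGHT_LAUNCH_OPTIONS is optional and shown here just for illustration):

```python
# settings.py — project-wide scrapy-playwright configuration

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: options passed to Playwright's browser launch (headless is the default)
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

With this in place, requests still need meta={"playwright": True} to be routed through Playwright.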


Summary

That is everything this article set out to cover; again, in my testing the plugin only worked on macOS and Linux.
Use search engines often, and read the documentation and source code; it helps a great deal when building projects.
