Getting Started with scrapy-playwright (A Crawler Tutorial)


1. Why This Article

Note: in my testing, this plugin only ran reliably on macOS and Linux.
While working with the Scrapy framework, I needed to capture a page's response data. I learned that Playwright supports this, and that the scrapy-playwright plugin integrates the two. In practice, however, I ran into many unexpected bugs; even the examples in the GitHub repository would not run as-is, until I discovered the missing pieces described below. (It did not work on Windows for me, and my local virtual machine also failed; it finally ran on a remote Linux server.)


2. Prerequisites

Getting started with Scrapy: https://blog.csdn.net/xw1680/article/details/135089951
Playwright overview: https://blog.csdn.net/gitblog_00100/article/details/137667697
Getting started with Playwright: https://blog.csdn.net/weixin_43845191/article/details/108271962
scrapy-playwright repository: https://gitcode.com/scrapy-plugins/scrapy-playwright


3. Environment Setup

Set up a remote server, or a local virtual machine (Ubuntu 20.04 or later recommended, for reference): https://blog.csdn.net/YYSonic407/article/details/139422632
Install and configure Python: https://blog.csdn.net/weixin_64079883/article/details/129352508
Install PyCharm (the IDE used throughout this article): https://blog.csdn.net/weixin_47556601/article/details/121159698
Configure PyCharm for convenient use: https://blog.csdn.net/a_cherry_blossoms/article/details/123421990

4. Steps in Detail

1. Create a Scrapy project

In your project folder, run the following commands one at a time:

# Install Scrapy (or: pip3 install scrapy)
pip install scrapy
# Create a Scrapy project
scrapy startproject myscrapy
# Enter the project directory
cd myscrapy
# Generate a spider
scrapy genspider csdn blog.csdn.net

2. Configure the project environment

Open the Scrapy project in PyCharm.

  • Configure a virtual environment
    File -> Settings -> Project: xxx -> Python Interpreter -> gear icon -> Add Local Interpreter (defaults to venv or .venv)
  • Install the project's libraries
    In PyCharm's built-in terminal, run the following one by one:
pip install scrapy
pip install playwright
pip install scrapy-playwright
playwright install
  • Check the installed libraries:
pip list
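Besides pip list, you can confirm that the packages are importable from Python itself. The helper below is my own sketch (not part of any of these libraries); it reports which modules are missing without actually importing them:

```python
import importlib.util

def check_installed(mods):
    """Return the subset of mods that cannot be found by the import system."""
    return [m for m in mods if importlib.util.find_spec(m) is None]

# Note: the plugin's import name is scrapy_playwright (underscore, not hyphen)
missing = check_installed(["scrapy", "playwright", "scrapy_playwright"])
print("missing:", missing)
```

If the list is empty, the environment is ready.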

3. Use the scrapy-playwright plugin

The goal of this program is to listen for the target site's response data. Playwright provides the on() method for this, e.g. page.on("response", handler), where handler is the callback that handles the event.
To dig deeper into event handling, see:
https://blog.csdn.net/weixin_44104090/article/details/128991945
https://playwright.dev/python/docs/api/class-page
https://github.com/scrapy-plugins/scrapy-playwright?tab=readme-ov-file#handling-page-events
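At its core, page.on() is a plain publish/subscribe pattern: callbacks are registered under an event name and invoked whenever that event fires. The toy emitter below is my own sketch of the idea, not Playwright's actual implementation:

```python
import asyncio

class TinyEmitter:
    """Toy publish/subscribe emitter, sketching the page.on() idea."""

    def __init__(self):
        self._handlers = {}  # event name -> list of registered callbacks

    def on(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    async def emit(self, event, payload):
        # Dispatch the payload to every handler registered for this event
        for handler in self._handlers.get(event, []):
            await handler(payload)

async def demo():
    page = TinyEmitter()
    seen = []

    async def handle_response(url):
        seen.append(url)

    page.on("response", handle_response)  # analogous to page.on("response", handler)
    await page.emit("response", "https://example.org/data.json")
    return seen

print(asyncio.run(demo()))  # prints ['https://example.org/data.json']
```

scrapy-playwright wires this same mechanism up for you, as shown next.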

Event handling is wired up through the plugin's playwright_page_event_handlers meta key.
Copy the code below to the end of csdn.py in the Scrapy project, then run the scrapy crawl events command in the terminal.
Here is a partial code example:

import scrapy
from playwright.async_api import Dialog, Response as PlaywrightResponse

class EventsSpider(scrapy.Spider):
    """Handle page events."""

    name = "events"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://blog.csdn.net/csdngeeknews/article/details/139284722",
            meta={
                "playwright": True,
                "playwright_page_event_handlers": {
                    "dialog": self.handle_dialog,
                    "response": "handle_response",
                },
            },
        )

    async def handle_dialog(self, dialog: Dialog) -> None:
        print(f"Handled dialog with message: {dialog.message}")
        await dialog.dismiss()

    async def handle_response(self, response: PlaywrightResponse) -> None:
        if ".json" in response.url:
            # Open in append mode so earlier matches are not overwritten
            with open("json.txt", "a", encoding="utf-8") as f:
                f.write(f"Received response with URL {response.url}\n")

    def parse(self, response, **kwargs):
        pass

Reading the source shows that, under the hood, the plugin simply calls the page's on() method; it adds a layer of wrapping that simplifies the code you have to write. The plugin has many other clever uses that are not covered here.
Below is the source where the playwright_page_event_handlers key is applied:
(screenshot of the source code omitted)
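In the spider above, the reactor and download handlers are set via custom_settings on a single spider. The same configuration can instead live in the project's settings.py so that every spider picks it up, as documented in the scrapy-playwright README (PLAYWRIGHT_LAUNCH_OPTIONS is optional and shown here just for illustration):

```python
# settings.py — project-wide scrapy-playwright configuration

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: options passed to Playwright's browser launch (headless is the default)
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

With this in place, requests still need meta={"playwright": True} to be routed through Playwright.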


Summary

That is everything this article set out to cover; again, in my testing the plugin only worked on macOS and Linux.
Use search engines often, and read the documentation and source code; it helps a great deal when building projects.
