I. Why This Article
Note up front: this plugin works only on macOS and Linux.
While working with the Scrapy framework, I needed to capture a page's response data. I learned that Playwright supports this and that a scrapy-playwright plugin exists, but using it surfaced many unexpected bugs: even the examples from the GitHub project would not run until I discovered the missing pieces. (It does not work on Windows; my local virtual machine also failed, and it only ran successfully on a remote Linux server.)
II. Prerequisites
Getting started with Scrapy: https://blog.csdn.net/xw1680/article/details/135089951
Playwright overview: https://blog.csdn.net/gitblog_00100/article/details/137667697
Getting started with Playwright: https://blog.csdn.net/weixin_43845191/article/details/108271962
scrapy-playwright repository: https://gitcode.com/scrapy-plugins/scrapy-playwright
III. Environment Setup
Set up a remote server
Local virtual machine (Ubuntu 20 or later recommended; for reference only): https://blog.csdn.net/YYSonic407/article/details/139422632
Install and configure Python: https://blog.csdn.net/weixin_64079883/article/details/129352508
Install PyCharm (it is simply the IDE I know best): https://blog.csdn.net/weixin_47556601/article/details/121159698
Configure PyCharm for convenient use: https://blog.csdn.net/a_cherry_blossoms/article/details/123421990
IV. Steps in Detail
1. Create a Scrapy project
In the project folder, run the following commands one by one:
# install scrapy (or: pip3 install scrapy)
pip install scrapy
# create a scrapy project
scrapy startproject myscrapy
# enter the project directory
cd myscrapy
# generate a new spider
scrapy genspider csdn blog.csdn.net
2. Configure the project environment
Open the Scrapy project in PyCharm.
- Configure a virtual environment
File ——> Settings ——> Project: xxx ——> Python Interpreter ——> gear icon ——> Add Local Interpreter (defaults to venv or .venv)
- Install the project libraries
In PyCharm's built-in terminal, run the following one by one:
pip install scrapy
pip install playwright
pip install scrapy-playwright
playwright install
# on a headless Linux server, `playwright install --with-deps` also installs the required system packages
- Check the installed libraries
pip list
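Besides pip list, you can confirm from within Python that the packages are importable. This is just a convenience sketch (the is_installed helper is my own, not part of any of these libraries):

```python
# Quick sanity check that the required packages are importable.
# A convenience sketch; `pip list` works just as well.
import importlib.util


def is_installed(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


for pkg in ("scrapy", "playwright", "scrapy_playwright"):
    print(pkg, "OK" if is_installed(pkg) else "MISSING")
```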
3. Use the scrapy-playwright plugin
The goal of this spider is to listen for the target page's response data. Playwright provides an on() method for exactly this, e.g. page.on("response", handler), where handler is the event-listener callback.
Further reading on event listening:
https://blog.csdn.net/weixin_44104090/article/details/128991945
https://playwright.dev/python/docs/api/class-page
https://github.com/scrapy-plugins/scrapy-playwright?tab=readme-ov-file#handling-page-events
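As a mental model, page.on follows the observer pattern: callbacks are registered per event name, and every registered callback is invoked when that event fires. A minimal pure-Python sketch (the MiniPage class is hypothetical, not Playwright's actual implementation):

```python
# Hypothetical MiniPage illustrating the observer pattern behind page.on;
# a simplified mental model, not Playwright's real implementation.
from collections import defaultdict


class MiniPage:
    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        # register a callback for an event name, e.g. "response"
        self._handlers[event].append(handler)

    def emit(self, event, payload):
        # invoke every callback registered for this event
        for handler in self._handlers[event]:
            handler(payload)


page = MiniPage()
seen = []
page.on("response", lambda url: seen.append(url))
page.emit("response", "https://example.com/data.json")
print(seen)  # ['https://example.com/data.json']
```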
Event listening is set up through the plugin's playwright_page_event_handlers request meta key.
Paste the code below at the end of csdn.py in the Scrapy project, then run the scrapy crawl events command in the terminal.
Here is a partial code example:
import scrapy
from playwright.async_api import Dialog, Response as PlaywrightResponse


class EventsSpider(scrapy.Spider):
    """Handle page events."""

    name = "events"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://blog.csdn.net/csdngeeknews/article/details/139284722",
            meta={
                "playwright": True,
                "playwright_page_event_handlers": {
                    "dialog": self.handle_dialog,
                    # a string is resolved to the spider method of the same name
                    "response": "handle_response",
                },
            },
        )

    async def handle_dialog(self, dialog: Dialog) -> None:
        print(f"Handled dialog with message: {dialog.message}")
        await dialog.dismiss()

    async def handle_response(self, response: PlaywrightResponse) -> None:
        if ".json" in response.url:
            # append, so earlier matching responses are not overwritten
            with open("json.txt", "a", encoding="utf-8") as f:
                f.write(f"Received response with URL {response.url}\n")

    def parse(self, response, **kwargs):
        pass
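Note that the substring check ".json" in response.url also matches query strings such as ?cb=x.json. If that matters for your target site, a path-based check is stricter; here is a sketch using only the standard library (the is_json_url helper is my own naming):

```python
# Stricter filter: only match URLs whose path component ends in .json.
# A sketch using only the standard library; adapt the predicate as needed.
from urllib.parse import urlsplit


def is_json_url(url: str) -> bool:
    return urlsplit(url).path.endswith(".json")


print(is_json_url("https://example.com/api/data.json"))  # True
print(is_json_url("https://example.com/page?cb=x.json"))  # False
```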
Reading the plugin's source shows that, at bottom, it simply calls the page's on() method for each entry in that dict; the plugin just adds a good deal of wrapping that simplifies the code (for instance, resolving string values to spider methods), and it offers many other clever features not covered here.
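Conceptually, the wiring boils down to iterating the handlers dict, resolving string values to spider methods with getattr, and calling page.on. A simplified sketch (FakePage and attach_handlers are hypothetical names, not scrapy-playwright's actual code):

```python
# Simplified sketch of wiring a handlers dict onto a page;
# FakePage and attach_handlers are hypothetical, not the plugin's API.
class FakePage:
    def __init__(self):
        self.registered = []

    def on(self, event, handler):
        self.registered.append((event, handler))


def attach_handlers(page, spider, handlers):
    for event, handler in handlers.items():
        if isinstance(handler, str):
            # a string is looked up as a method on the spider
            handler = getattr(spider, handler)
        page.on(event, handler)


class DemoSpider:
    def handle_response(self, response):
        pass


spider = DemoSpider()
page = FakePage()
attach_handlers(page, spider, {"response": "handle_response"})
print(page.registered[0][0])  # response
```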
Summary
That is all this article set out to cover; once again, the plugin can only be used on macOS and Linux.
Make frequent use of search engines, and read the documentation and source code often; it helps a great deal when building projects.