Scrapy commands and project debugging - Scrapy framework 4 - Python

Latest recommended article published on 2024-07-03 16:18:05

gaog2zh


Category: Python. Tags: scrapy

Article link: https://blog.csdn.net/gaogzhen/article/details/123156174


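While writing a project you need to debug the code again and again. Sending a large number of requests to the target site in quick succession can also trigger its security policies, such as an IP ban. This is where a few debugging techniques help. Before getting to them, let's look at some scrapy commands.

1. Scrapy commands

Scrapy commands fall into two groups: global commands and project commands.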

# Show usage help and the available commands
scrapy
scrapy -h
# Show detailed help for a single command
scrapy <command> -h

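Global commands:

Command	Format	Description
startproject	scrapy startproject <project_name> [project_dir]	Create a Scrapy project
genspider	scrapy genspider [-t template] <name> <domain>	Generate a spider
settings	scrapy settings [options]	Get the value of a Scrapy setting (inside a project, the values from settings.py)
runspider	scrapy runspider [options] <spider_file>	Run a standalone spider file
shell	scrapy shell [url|file]	Start the Scrapy shell console
fetch	scrapy fetch [options] <url>	Download a URL and print the result to standard output
view	scrapy view [options] <url>	Download a URL and open the result in a browser
version	scrapy version [-v]	Print the Scrapy version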

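Project commands:

Command	Format	Description
crawl	scrapy crawl <spider>	Run a spider from the project
check	scrapy check [-l] <spider>	Run contract checks
list	scrapy list	List the spiders in the project
edit	scrapy edit <spider>	Edit a spider with the editor defined in EDITOR
parse	scrapy parse <url> [options]	Fetch the given URL and process the response with the spider (the parse method by default)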

For custom commands, refer to the Scrapy documentation.
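The commands above can also be started from a Python script, which is convenient when you want to set breakpoints in an IDE. The following is a minimal sketch, not part of the original project: `build_argv` and `run_scrapy` are hypothetical helper names, and the spider name `usinfo` is taken from the shell output later in this article.

```python
import shlex


def build_argv(command_line):
    """Split a command line such as 'scrapy crawl usinfo' into an argv list."""
    return shlex.split(command_line)


def run_scrapy(command_line):
    """Run a Scrapy command inside the current Python process."""
    # Imported lazily so this module can be loaded without Scrapy installed.
    from scrapy.cmdline import execute

    # execute() expects argv[0] to be the program name ("scrapy").
    # It calls sys.exit() when the command finishes, so code after it will not run.
    execute(build_argv(command_line))


# Example (run from the project root, next to scrapy.cfg):
# run_scrapy("scrapy crawl usinfo")
```

Running such a script under a debugger lets you step through the spider's callbacks, which the command line alone does not allow.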

2. Project debugging

2.1 Debugging in the shell console

To make debugging easier, enable the HTTP cache by uncommenting this line in settings.py:

HTTPCACHE_ENABLED = True
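With the cache enabled, repeated requests to the same URL are answered from disk instead of hitting the target site again. As a rough sketch, the related cache settings could look like this in settings.py; the values shown are illustrative assumptions, not the author's configuration.

```python
# settings.py (fragment): HTTP cache options for debugging
HTTPCACHE_ENABLED = True        # turn on the HTTP cache middleware
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"     # relative paths live under the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # illustrative: do not cache server errors
```

Remember to disable the cache, or clear the cache directory, when you need fresh responses.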

Debugging is usually done in the scrapy shell console. On entering the console you see the following:

2022-02-26 20:14:28 [asyncio] DEBUG: Using selector: SelectSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002392F0D2278>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x0000023932DE42E8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2022-02-26 20:14:29 [asyncio] DEBUG: Using selector: SelectSelector

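When you first enter the shell there are no request and response objects. They are created only after fetch runs, and only the most recent request and response are kept.

Run fetch('http://www.qianmu.org/ranking/1528.htm'). Once the request succeeds, run shelp() to see the updated objects: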

2022-02-26 20:36:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.qianmu.org/ranking/1528.htm> (referer: None)

[s]   request    <GET http://www.qianmu.org/ranking/1528.htm>
[s]   response   <200 http://www.qianmu.org/ranking/1528.htm>

You can now work with the response object, for example check its type with type(response) or extract all the university links:

In [4]: type(response)
Out[4]: scrapy.http.response.html.HtmlResponse
In [5]: links = response.xpath('//div[@class="rankItem"]/table//tr[position()>1]/td[2]/a/@href').getall()
In [6]: links
Out[6]: 
['http://www.qianmu.org/%E9%BA%BB%E7%9C%81%E7%90%86%E5%B7%A5%E5%AD%A6%E9%99%A2',
 'http://www.qianmu.org/%E7%89%9B%E6%B4%A5%E5%A4%A7%E5%AD%A6',
...
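The extracted links are percent-encoded, so the university names are hard to read at a glance. A small sketch using only the standard library decodes them; `university_name` is a hypothetical helper, and the two links are copied from the output above.

```python
from urllib.parse import unquote, urlsplit

links = [
    "http://www.qianmu.org/%E9%BA%BB%E7%9C%81%E7%90%86%E5%B7%A5%E5%AD%A6%E9%99%A2",
    "http://www.qianmu.org/%E7%89%9B%E6%B4%A5%E5%A4%A7%E5%AD%A6",
]


def university_name(link):
    """Return the decoded last path segment of a link."""
    path = urlsplit(link).path               # the percent-encoded path
    return unquote(path.rsplit("/", 1)[-1])  # decode the UTF-8 escapes


names = [university_name(link) for link in links]
print(names)
```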

Next, we start the shell with scrapy shell <url> to simulate the execution flow of the Scrapy framework.

scrapy shell http://www.qianmu.org/ranking/1528.htm

[s]   request    <GET http://www.qianmu.org/ranking/1528.htm>
[s]   response   <200 http://www.qianmu.org/ranking/1528.htm>
[s]   settings   <scrapy.settings.Settings object at 0x000002485CF24048>
[s]   spider     <UsinfoSpider 'usinfo' at 0x2485d21ff28>

# All of these come from our project's configuration

The next step is to call the parse method, which belongs to the spider class:

In [1]: result = spider.parse(response)

In [2]: type(result)
Out[2]: generator

In [3]: result
Out[3]: <generator object UsinfoSpider.parse at 0x000002485F1F19E8>

Take the first element of the result, which is actually a Request object:
```
# list() exhausts the generator, so keep the full list for later use
reqs = list(result)
req = reqs[0]
fetch(req)
```
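One thing to keep in mind here: parse returns a generator, and a generator can be consumed only once. Calling list(result) a second time yields an empty list. A minimal sketch with a stand-in generator (`fake_parse` is hypothetical and only imitates the shape of spider.parse):

```python
def fake_parse(urls):
    """Stand-in for spider.parse: lazily yields one value per URL."""
    for url in urls:
        yield "request for " + url


result = fake_parse(["/a", "/b", "/c"])

first_pass = list(result)    # consumes the generator
second_pass = list(result)   # nothing is left to yield

print(len(first_pass), len(second_pass))
```

If you need the requests more than once, store the list in a variable or call parse again on the original response.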

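Get the data for the first university link. The response itself is not an iterator, so call the request's callback on it and take the first item (this assumes the request was created with a callback):

In [8]: data = next(req.callback(response))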

In [9]: data
Out[9]: {'name': '麻省理工学院 '}

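That crawls only one page. To crawl all of them:

# result was consumed and response now holds a detail page, so fetch the ranking page again
fetch('http://www.qianmu.org/ranking/1528.htm')
reqs = list(spider.parse(response))
for req in reqs:
    fetch(req)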

At this point we have run the whole flow by hand.
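The manual steps mirror what the framework does automatically: take a request, download it, pass the response to the request's callback, then schedule any new requests and collect any items the callback yields. The following toy model illustrates that loop in plain Python. It is a simplified sketch, not Scrapy's actual engine; `ToyRequest`, `toy_download`, `parse_ranking`, `parse_university`, and `crawl` are all hypothetical names.

```python
from collections import deque


class ToyRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def toy_download(url):
    """Pretend download: returns a fake response string."""
    return "<response for %s>" % url


def parse_ranking(response):
    # Like a list page: yield one follow-up request per detail link.
    for name in ("mit", "oxford"):
        yield ToyRequest("/" + name, parse_university)


def parse_university(response):
    # Like a detail page: yield one item (a dict).
    yield {"source": response}


def crawl(start_request):
    """Process requests in FIFO order and collect the yielded items."""
    queue = deque([start_request])
    items = []
    while queue:
        request = queue.popleft()
        response = toy_download(request.url)
        for output in request.callback(response):
            if isinstance(output, ToyRequest):
                queue.append(output)   # schedule the follow-up request
            else:
                items.append(output)   # collect the scraped item
    return items


items = crawl(ToyRequest("/ranking", parse_ranking))
print(items)
```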

Throughout, the shell console uses the project's configuration.

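2.2 The parse command

The parse command shows what a spider outputs for a given URL. It is flexible and simple to use, but it cannot be used to debug the code inside a method. In the example below, --spider selects the spider, -c names the callback method, -d sets how many levels of requests to follow, and -v prints the output of each depth level.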

scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>

2.3 Logging

Check the run logs.
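How much ends up in the logs, and where they are written, is controlled from settings.py. A short sketch of the relevant settings follows; the values are illustrative assumptions and the file name is hypothetical.

```python
# settings.py (fragment): logging options
LOG_LEVEL = "INFO"       # one of CRITICAL, ERROR, WARNING, INFO, DEBUG
LOG_FILE = "crawl.log"   # hypothetical file name; omit it to log to standard error
```

The same values can be set for a single run with the -L/--loglevel and --logfile command-line options.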

Reference video: https://www.bilibili.com/video/BV1R7411F7JV p559~p560

Source code repository: https://gitee.com/gaogzhen/python-study

QQ group: 433529853
