php spider shell: using the scrapy shell

DGGs

Published 2021-03-10 07:36:39


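The scrapy shell is an interactive console. It lets us try out and debug scraping code without starting a spider, and it is also a convenient place to test XPath expressions.

Usage: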

scrapy shell http://www.baidu.com

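Commonly used attributes in the scrapy shell

response.url: the URL of the current response

response.request.url: the URL of the request that produced the current response

response.headers: the response headers

response.body: the response body (the HTML source), as bytes

response.request.headers: the headers of the request that produced the current response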

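Configuring the console used by the shell

If IPython is installed on the system, the scrapy shell uses it in place of the standard Python console. The console can also be specified in the project's scrapy.cfg file, for example: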

[settings]

shell = bpython
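The scrapy.cfg file uses the INI format, so the setting above can be inspected with Python's standard configparser module. This is only a sketch for checking what the file contains; it is not how Scrapy itself reads the setting, and the fallback value "python" is an assumption made for the example.

```python
import configparser

# Same content as the scrapy.cfg fragment above, inlined so the
# example does not depend on a file on disk.
CFG_TEXT = """
[settings]
shell = bpython
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# The fallback is a hypothetical default used when the key is absent.
shell_name = parser.get("settings", "shell", fallback="python")
print(shell_name)
```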

Launching the shell

Use the command:

scrapy shell <url>

to enter the shell, where <url> is the page to fetch. The url may also point to a local file:

# UNIX-style

scrapy shell ./path/to/file.html

scrapy shell ../other/path/to/file.html

scrapy shell /absolute/path/to/file.html

# File URI

scrapy shell file:///absolute/path/to/file.html
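To see how a local path relates to the file:// form shown above, the conversion can be sketched with the standard pathlib module. The helper name to_file_uri is hypothetical, and the paths are the placeholder paths from the examples; they do not need to exist.

```python
from pathlib import Path, PurePosixPath


def to_file_uri(path_text: str) -> str:
    """Turn a relative or absolute local path into a file:// URI."""
    # resolve() makes the path absolute relative to the current directory;
    # as_uri() only accepts absolute paths.
    return Path(path_text).resolve().as_uri()


# An absolute POSIX path maps directly onto the File URI form.
absolute_uri = PurePosixPath("/absolute/path/to/file.html").as_uri()
print(absolute_uri)

# A relative path is first resolved against the current directory.
relative_uri = to_file_uri("./path/to/file.html")
print(relative_uri)
```

Either form can then be passed to scrapy shell; the relative-path result depends on the directory the script is run from.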

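Using the shell

Available shortcuts

shelp(): print the available objects and shortcuts

fetch(url[, redirect=True]): fetch a new URL and update the related objects

fetch(request): fetch the given request and update the related objects

view(response): open the fetched page in the local browser

Available objects

crawler: the current Crawler object

spider: the spider used for crawling

request: the Request object of the last fetched page

response: the Response object of the last fetched page

settings: the current Scrapy settings

Example: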

$ scrapy shell 'http://scrapy.org' --nolog

[s] Available Scrapy objects:

[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)

[s] crawler

[s] item {}

[s] request

[s] response <200 https://scrapy.org/>

[s] settings

[s] spider

[s] Useful shortcuts:

[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)

[s] fetch(req) Fetch a scrapy.Request and update local objects

[s] shelp() Shell help (print this help)

[s] view(response) View response in a browser

In [1]: response.xpath('//title/text()').extract_first()

Out[1]: 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

In [2]: fetch("http://reddit.com")

In [3]: response.xpath('//title/text()').extract()

Out[3]: ['reddit: the front page of the internet']

In [4]: request = request.replace(method="POST")

In [5]: fetch(request)

In [6]: response.status

Out[6]: 404

Invoking the shell from a spider

Use the scrapy.shell.inspect_response function:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response

            inspect_response(response, self)

        # Rest of parsing code.
