python爬虫 -06- scrapy的安装和简单实用

最新推荐文章于 2024-03-26 13:44:58 发布

迷雾总会解

最新推荐文章于 2024-03-26 13:44:58 发布

阅读量519

点赞数

分类专栏：爬虫 python 文章标签： python

本文链接：https://blog.csdn.net/qq_44766883/article/details/108090112

版权

python 同时被 2 个专栏收录

67 篇文章 6 订阅

订阅专栏

爬虫

15 篇文章 1 订阅

订阅专栏

scrapy的安装和简单实用

安装

👉 https://blog.csdn.net/qq_44766883/article/details/107790504

基本使用

import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # quotes = response.css("div.quote")
        quotes = response.xpath("//div[@class='quote']")
        for quote in quotes:
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.xpath("./span/small/text()").extract_first(),
            }
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_page:
            yield response.follow(next_page, self.parse)

运行命令

控制台输出
```
scrapy runspider quotes_spider.py
```

保存到指定文件

scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.json

指定文件类型

scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.csv -t csv

常用命令

创建一个项目
```
scrapy startproject qianmu
```

初始化一个爬虫文件

# scrapy genspider [爬虫名字] [目标网站域名]
scrapy genspider qianmu_new qianmu.iguye.com

运行爬虫

# 运行名为qianmu_new的爬虫
scrapy crawl qianmu_new
scrapy crawl qianmu_new -o qianmu_new.json
scrapy crawl qianmu_new -o qianmu_new.csv -t csv

# 单独运行爬虫文件
scrapy runspider quotes_spider.py
scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.json
scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.csv -t csv

创建以下文件，便于直接运行

在这里插入图片描述

调试爬虫

# 进入到scrapy控制台，使用的是项目的环境
scrapy shell
# 带一个URL参数，将会自动请求这个url，并在请求成功后进入控制台
scrapy shell http://www.qianmu.org/ranking/1528.html

# 调用parse方法
result = spider.parse(response)
# result是一个生成器，没什么疑惑好吧
type(result)：<generator object QianmuNewSpider.parse at 0x0000025096AEF200>

# one其实就是一个Request对象
one = next(result)
one：<GET http://www.qianmu.org/%E9%BA%BB%E7%9C%81%E7%90%86%E5%B7%A5%E5%AD%A6%E9%99%A2>
type(one)：<class 'scrapy.http.request.Request'>

# callback其实就是yield response.follow(link, self.parse_university)中的 parse_university
one.callback：<bound method QianmuNewSpider.parse_university of <QianmuNewSpider 'qianmu_new' at 0x25096aa3640>>

# 继续请求
fetch(one)  # 输出：2020-08-04 20:54:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.qianmu.org/%E9%BA%BB%E7%9C%81%E7%90%86%E5%B7%A5%E5%AD%A6%E9%99%A2> (referer: None) ['cached']
data = next(response)  # 输出了:18 26
data  # 输出一个请求抓取的数据

# 可以进行循环爬取
for req in result:
...   fetch(req)

进入到控制台以后，可以使用以下函数和对象

A	B
fetch	请求url或者Requesrt对象，注意：请求成功以后会自动将当前作用域内的request和responsne对象重新赋值
view	用浏览器打开response对象内的网页
shelp	打印帮助信息
spider	相应的Spider类的实例
settings	保存所有配置信息的Settings对象
crawler	当前Crawler对象
scrapy	scrapy模块

# 用项目配置下载网页，然后用浏览器打开网页
scrapy view url
# 用项目配置下载网页，然后输出至控制台
scrapy fetch url

一张图

这张图一定要熟记呀！！！

The data flow in Scrapy is controlled by the execution engine, and goes like this:

The Engine gets the initial Requests to crawl from the Spider.
The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
The Scheduler returns the next Requests to the Engine.
The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks for possible next Requests to crawl.
The process repeats (from step 1) until there are no more requests from the Scheduler.

迷雾总会解

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
python爬虫 -06- scrapy的安装和简单实用

scrapy的安装和简单实用安装  ???? https://blog.csdn.net/qq_44766883/article/details/107790504基本使用import scrapyclass QuoteSpider(scrapy.Spider): name = "quote" start_urls = ["http://quotes.toscrape.com/"] def parse(self, response): # q
复制链接

扫一扫