Python Crawler -06- Scrapy Installation and Basic Usage

Scrapy installation and basic usage

Installation

  👉 https://blog.csdn.net/qq_44766883/article/details/107790504

Basic usage

import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote sits in a <div class="quote"> block; CSS and XPath both work here
        # quotes = response.css("div.quote")
        quotes = response.xpath("//div[@class='quote']")
        for quote in quotes:
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.xpath("./span/small/text()").extract_first(),
            }
        # Follow the "Next" pagination link, if there is one
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_page:
            yield response.follow(next_page, self.parse)

 Run commands

  • Output to the console

    scrapy runspider quotes_spider.py
    
  • Save the output to a specified file (an in-spider alternative is sketched after this list)

    scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.json
    
  • Specify the output file type

    scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.csv -t csv
    
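Instead of passing -o on the command line, the output file can also be configured inside the spider itself. A minimal sketch, assuming Scrapy 2.1+ (which introduced the FEEDS setting; older releases use FEED_URI / FEED_FORMAT instead):

import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    start_urls = ["http://quotes.toscrape.com/"]

    # Equivalent to "scrapy runspider quotes_spider.py -o quotes.json";
    # the output path here is just an example.
    custom_settings = {
        "FEEDS": {
            "quotes.json": {"format": "json"},
        },
    }

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.xpath("./span/small/text()").extract_first(),
            }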

Common commands

  • Create a project

    scrapy startproject qianmu
    
  • Generate a spider file

    # scrapy genspider [spider name] [target site domain]
    scrapy genspider qianmu_new qianmu.iguye.com
    
  • Run a spider

    # Run the spider named qianmu_new
    scrapy crawl qianmu_new
    scrapy crawl qianmu_new -o qianmu_new.json
    scrapy crawl qianmu_new -o qianmu_new.csv -t csv
    
    # Run a standalone spider file (without a project)
    scrapy runspider quotes_spider.py
    scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.json
    scrapy runspider scrapy_learn/quotes_spider.py -o ./scrapy_learn/quotes.csv -t csv
    
  • Create the following launcher file so the spider can be run directly, e.g. from an IDE (a sketch follows)

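The original screenshot is not available here; a common pattern (a minimal sketch, with the file name run.py and the spider name qianmu_new assumed) is a small launcher script at the project root, next to scrapy.cfg:

# run.py
from scrapy.cmdline import execute

# Same effect as running "scrapy crawl qianmu_new" in a terminal,
# but the spider can now be started and debugged from an IDE.
execute(["scrapy", "crawl", "qianmu_new"])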


Debugging a spider

# Enter the Scrapy shell, using the project's settings
scrapy shell
# With a URL argument, the shell requests that URL automatically and drops into the console once the request succeeds
scrapy shell http://www.qianmu.org/ranking/1528.html

# Call the parse method
result = spider.parse(response)
# result is a generator, as expected
type(result)  # <generator object QianmuNewSpider.parse at 0x0000025096AEF200>

# one is simply a Request object
one = next(result)
one        # <GET http://www.qianmu.org/%E9%BA%BB%E7%9C%81%E7%90%86%E5%B7%A5%E5%AD%A6%E9%99%A2>
type(one)  # <class 'scrapy.http.request.Request'>

# one.callback is the parse_university passed in yield response.follow(link, self.parse_university)
one.callback  # <bound method QianmuNewSpider.parse_university of <QianmuNewSpider 'qianmu_new' at 0x25096aa3640>>

# Fetch that request; on success the shell rebinds request and response
fetch(one)  # 2020-08-04 20:54:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.qianmu.org/%E9%BA%BB%E7%9C%81%E7%90%86%E5%B7%A5%E5%AD%A6%E9%99%A2> (referer: None) ['cached']
# Run the callback on the new response and take its first yielded result
data = next(spider.parse_university(response))  # (the method's own debug prints appeared here: 18 26)
data  # the data scraped from that single request

# The remaining requests can be crawled in a loop
for req in result:
    fetch(req)

Once inside the shell, the following functions and objects are available:

  • fetch: request a URL or a Request object; note that after a successful request, the request and response objects in the current scope are rebound to the new ones
  • view: open the page held by a response object in the browser
  • shelp: print the shell's help text
  • spider: an instance of the relevant Spider class
  • settings: the Settings object holding all configuration
  • crawler: the current Crawler object
  • scrapy: the scrapy module

The same functionality is also available as standalone commands outside the shell:

# Download a page using the project's settings, then open it in a browser
scrapy view url
# Download a page using the project's settings, then print it to the console
scrapy fetch url



The architecture diagram

You really need to know this diagram by heart!

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.
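
To make the hook names in steps 4 to 8 concrete, here is a minimal sketch of a downloader middleware and an item pipeline that would sit at those points in the data flow (the class names are made up for illustration; they only take effect once listed in DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES in settings.py):

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Step 4: called for every Request on its way from the Engine to the Downloader.
        # Returning None lets the request continue through the remaining middlewares.
        request.headers.setdefault("User-Agent", "Mozilla/5.0")
        return None

    def process_response(self, request, response, spider):
        # Step 5: called for every Response on its way back from the Downloader to the Engine.
        spider.logger.debug("Downloaded %s (%s)", response.url, response.status)
        return response


class ExamplePipeline:
    def process_item(self, item, spider):
        # Step 8: every item the spider yields passes through here before it is exported or stored.
        return item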