Learning the Scrapy Framework (Part 1)

1. Scrapy Overview

1. Why learn the Scrapy framework?

  • It is a must-know piece of crawling technology, and interview questions often touch on it.
  • It makes our crawlers faster and more powerful (it supports asynchronous crawling).

2. What is Scrapy?

  • An asynchronous crawling framework: Scrapy is a crawling framework developed in Python for crawling websites and extracting structured data from their pages. It is currently the most popular crawling framework in the Python ecosystem. Its architecture is clean and highly extensible, and it can meet all kinds of crawling needs flexibly and efficiently.
    Program state-transition diagram: (figure omitted)

3. How to learn Scrapy?

4. The Scrapy workflow

(workflow diagrams omitted)

Division of labour:

  • Scrapy Engine: the coordinator; passes data and signals between all the other components. (Already implemented by Scrapy.)
  • Scheduler: a queue that holds the requests handed over by the engine. (Already implemented by Scrapy.)
  • Downloader: downloads the requests sent by the engine and returns the page source (the response) to the engine. (Already implemented by Scrapy.)
  • Spider: processes the responses handed over by the engine, extracts data and URLs, and gives them back to the engine. (You write this yourself.)
  • Item Pipeline: processes the data handed over by the engine, for example by storing it. (You write this yourself.)
  • Downloader Middlewares: customizable download extensions, e.g. setting a proxy. (Usually no need to write by hand.)
  • Spider Middlewares: customize requests and filter responses. (Usually no need to write by hand.)
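To make the division of labour concrete, here is a minimal sketch (not part of the tutorial project; the names are made up) of the two pieces you write yourself: a spider whose parse() yields items and follow-up requests, and a pipeline that receives those items. The engine, scheduler and downloader wire everything together for you.

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extracted data goes, via the engine, to the item pipeline:
        yield {'title': response.css('title::text').get()}
        # new requests go, via the engine, back to the scheduler:
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


class DemoPipeline:
    def process_item(self, item, spider):
        # called once for every item the spider yields (e.g. to store it)
        return item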

2. Scrapy Quick Start (a small example)

1. Installation

pip install scrapy
pip install scrapy==2.5.1   # install a specific version (2.5.1) of Scrapy

Type the "scrapy" command in a terminal to verify that the installation succeeded:
(screenshot omitted)
If output like the above appears, Scrapy is installed.

2. Create a project

  • Open a cmd window in the directory where the project should live.
# scrapy startproject <project name>
scrapy startproject my_Scrapy

(screenshot omitted)

3. Project structure

  • my_Scrapy
    • my_Scrapy
      • spiders
        • __init__.py
      • __init__.py
      • items.py
      • middlewares.py
      • pipelines.py
      • settings.py
    • scrapy.cfg

What each file does:

  • scrapy.cfg: the Scrapy project configuration file; it records the path to the project settings and deployment information. (Usually left untouched.)
  • items.py: defines the Item data structures; all Item definitions can live here. (It declares which fields you intend to scrape.)
  • pipelines.py: defines the Item Pipeline implementations.
  • settings.py: the project's global configuration.
  • middlewares.py: the middleware file; it defines the Spider Middlewares and Downloader Middlewares.
  • spiders: contains the individual spiders; each spider is a separate .py file.

4. Create a spider

# first change into the project directory:
cd my_Scrapy
# scrapy genspider <spider file name> <domain to crawl>
scrapy genspider spider1 www.baidu.com

(screenshots omitted)

  • Edit the generated spider1.py as follows:
import scrapy


class Spider1Spider(scrapy.Spider):
    # spider (crawler) name; remember it, because the spider is started by this name:
    name = 'spider1'
    # domains the spider is allowed to crawl (keeps it from wandering off to other sites); can be changed:
    allowed_domains = ['http://quotes.toscrape.com/']
    # the initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse method:
    def parse(self, response):
        print(response.text)

Demo site used by the official tutorial: http://quotes.toscrape.com/

5. Create an Item

  • An Item is the container that holds the scraped data; it defines the structure of what you scrape.
    Modify the project's items.py as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MyScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # the target content to collect: the quote, the author, and the tags
    # quote text:
    text = scrapy.Field()
    # author:
    author = scrapy.Field()
    # tags:
    tags = scrapy.Field()
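An Item behaves like a dictionary with a fixed set of fields, so (as a quick illustration only, not part of the project code) it can be filled and read back like this:

item = MyScrapyItem()
item['text'] = 'Some quote'
item['author'] = 'Somebody'
item['tags'] = ['tag1', 'tag2']
print(item['text'])     # read a single field back
print(dict(item))       # convert the item to a plain dict
# item['foo'] = 'bar'   # would raise KeyError, because 'foo' was not declared as a Field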

6. Parse the Response

1. Scraping only the first page
  • Modify the parse() method in spider1.py; this method extracts the target content from the page source.
import scrapy
from lxml import etree


class Spider1Spider(scrapy.Spider):
    # spider (crawler) name; remember it, because the spider is started by this name:
    name = 'spider1'
    # domains the spider is allowed to crawl (keeps it from wandering off to other sites); can be changed:
    allowed_domains = ['http://quotes.toscrape.com/']
    # the initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is a selector object)
            # old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() returns it as a string
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # new API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            print(text, tags, '    ------', author)

Result of running the start.py launcher (how it is created is shown in section 8 below):

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 19:24:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 19:24:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet Password: e2250e171a87ebd6
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 19:24:47 [scrapy.core.engine] INFO: Spider opened
2022-04-03 19:24:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 19:24:47 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 19:24:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2582,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.264597,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 11, 24, 48, 658256),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 11, 24, 47, 393659)}
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Spider closed (finished)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.['change', 'deep-thoughts', 'thinking', 'world']     ------ Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.['abilities', 'choices']     ------ J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.['inspirational', 'life', 'live', 'miracle', 'miracles']     ------ Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.['aliteracy', 'books', 'classic', 'humor']     ------ Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” ['be-yourself', 'inspirational']     ------ Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.['adulthood', 'success', 'value']     ------ Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.['life', 'love']     ------ André Gide
“I have not failed. I've just found 10,000 ways that won't work.['edison', 'failure', 'inspirational', 'paraphrased']     ------ Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” ['misattributed-eleanor-roosevelt']     ------ Eleanor Roosevelt
“A day without sunshine is like, you know, night.['humor', 'obvious', 'simile']     ------ Steve Martin

Process finished with exit code 0

2. Scraping data across pages
  • The changes are mainly to a few statements in spider1.py.
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider1Spider(scrapy.Spider):
    # spider (crawler) name; remember it, because the spider is started by this name:
    name = 'spider1'
    # # domains the spider is allowed to crawl (keeps it from wandering off to other sites); can be changed:
    # allowed_domains = ['http://quotes.toscrape.com/']   # with no restriction, the spider can keep following the next page
    # the initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is a selector object)
            # old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() returns it as a string
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # new API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # put the data into the Item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)

            # hand each item over to the pipeline by yielding it
            yield item

        # paging
        next = response.css('ul.pager li.next a::attr("href")').get()   # href of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]   # get the URL currently being crawled (option 1)
        # print(url)
        url = response.url    # get the URL currently being crawled (option 2)
        # print(url)
        # join it with the relative link to build the next page's URL
        url = response.urljoin(next)
        print(url)
        # hand the new request to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new request is parsed by the same parse() method

Partial screenshot of the output:
(screenshot omitted)

7. Save the data

1. Saving data with a scrapy command
1. Option 1: run the command in a terminal
# scrapy crawl <spider name> -o <output file name>
scrapy crawl spider1 -o demo.csv

(screenshot omitted)
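The output format is inferred from the file extension, so (with the same spider) the feed export can just as well write JSON, JSON Lines or XML:

scrapy crawl spider1 -o demo.json
scrapy crawl spider1 -o demo.jl    # JSON Lines: one item per line
scrapy crawl spider1 -o demo.xml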

2. Option 2: change the cmdline call in the start.py launcher
# A Scrapy spider is part of a project; you cannot just right-click and run the spider file.
# It has to be started from a terminal with the command "scrapy crawl <spider name>".
# If you don't want to type the command in a terminal, create this start.py file instead.
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # invoke the terminal command
cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())

# The red text is not an error; it is the initialization info Scrapy prints itself. The white text is what print() outputs.

(screenshot omitted)

2. Saving data your own way (by editing pipelines.py)
  1. Modify pipelines.py as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyScrapyPipeline:
    def process_item(self, item, spider):
        with open('demo.txt', 'a', encoding="utf-8") as f:
            f.write(item['text'] + '           ——' + item['author'] + "\n")
        return item
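process_item() above reopens demo.txt for every single item. An optional refinement (a sketch, not a required change) is to open the file once in open_spider() and close it in close_spider(), two hooks that Scrapy calls when the spider starts and finishes:

class MyScrapyPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('demo.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(item['text'] + '           ——' + item['author'] + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: release the file handle
        self.file.close()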

  2. In settings.py, uncomment the ITEM_PIPELINES entry for this pipeline (otherwise the data will not be written to the txt file):
# Scrapy settings for my_Scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'my_Scrapy'

SPIDER_MODULES = ['my_Scrapy.spiders']
NEWSPIDER_MODULE = 'my_Scrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'my_Scrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'my_Scrapy.pipelines.MyScrapyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

  3. Screenshot of the result:
    (screenshot omitted)

8. Run the project

1. Run it in a terminal
# scrapy crawl <spider name>
scrapy crawl spider1

The very beginning of the output is the crawler's startup information:
(screenshot omitted)
The middle is the page source:
(screenshot omitted)
The end is the information printed while the spider shuts down:
(screenshot omitted)

2. Run it from PyCharm

Create a launcher file, start.py, in the project directory:

# A Scrapy spider is part of a project; you cannot just right-click and run the spider file.
# It has to be started from a terminal with the command "scrapy crawl <spider name>".
# If you don't want to type the command in a terminal, create this start.py file instead.
from scrapy import cmdline

cmdline.execute('scrapy crawl spider1'.split())   # invoke the terminal command

# The red text is not an error; it is the crawler's initialization info. The white text is what print() outputs.

Result:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 16:34:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 16:34:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet Password: b9d4a8fccbb5b978
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 16:34:35 [scrapy.core.engine] INFO: Spider opened
2022-04-03 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 16:34:35 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > 
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
        <span>by <small class="author" itemprop="author">J.K. Rowling</small>
        <a href="/author/J-K-Rowling">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > 
            
            <a class="tag" href="/tag/abilities/page/1/">abilities</a>
            
            <a class="tag" href="/tag/choices/page/1/">choices</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" /    > 
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
            <a class="tag" href="/tag/life/page/1/">life</a>
            
            <a class="tag" href="/tag/live/page/1/">live</a>
            
            <a class="tag" href="/tag/miracle/page/1/">miracle</a>
            
            <a class="tag" href="/tag/miracles/page/1/">miracles</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.</span>
        <span>by <small class="author" itemprop="author">Jane Austen</small>
        <a href="/author/Jane-Austen">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > 
            
            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
            
            <a class="tag" href="/tag/books/page/1/">books</a>
            
            <a class="tag" href="/tag/classic/page/1/">classic</a>
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring.”</span>
        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
        <a href="/author/Marilyn-Monroe">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" /    > 
            
            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="adulthood,success,value" /    > 
            
            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
            
            <a class="tag" href="/tag/success/page/1/">success</a>
            
            <a class="tag" href="/tag/value/page/1/">value</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.</span>
        <span>by <small class="author" itemprop="author">André Gide</small>
        <a href="/author/Andre-Gide">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="life,love" /    > 
            
            <a class="tag" href="/tag/life/page/1/">life</a>
            
            <a class="tag" href="/tag/love/page/1/">love</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span>
        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
        <a href="/author/Thomas-A-Edison">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" /    > 
            
            <a class="tag" href="/tag/edison/page/1/">edison</a>
            
            <a class="tag" href="/tag/failure/page/1/">failure</a>
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it&#39;s in hot water.”</span>
        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
        <a href="/author/Eleanor-Roosevelt">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > 
            
            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A day without sunshine is like, you know, night.</span>
        <span>by <small class="author" itemprop="author">Steve Martin</small>
        <a href="/author/Steve-Martin">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > 
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
            <a class="tag" href="/tag/obvious/page/1/">obvious</a>
            
            <a class="tag" href="/tag/simile/page/1/">simile</a>
            
        </div>
    </div>

    <nav>
        <ul class="pager">
            
            
            <li class="next">
                <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
            </li>
            
        </ul>
    </nav>
    </div>
    <div class="col-md-4 tags-box">
        
            <h2>Top Ten tags</h2>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
            </span>
            
        
    </div>
</div>

    </div>
    <footer class="footer">
        <div class="container">
            <p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
            </p>
            <p class="copyright">
                Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
            </p>
        </div>
    </footer>
</body>
</html>
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 16:34:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2578,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.29309,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 8, 34, 36, 608493),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 8, 34, 35, 315403)}
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

3. Using the scrapy shell

1. Use the scrapy shell command in a terminal to test extraction on a single request

URL to fetch: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Enter the following command in the terminal:

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Microsoft Windows [Version 10.0.19042.1586]
(c) Microsoft Corporation. All rights reserved.

(base) C:\Users\吕成鑫\Desktop\scrapy框架的学习\my_Scrapy>scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-04 20:04:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-04 20:04:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet Password: 358ca5f9dee7f2d7
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled item pipelines:
['my_Scrapy.pipelines.MyScrapyPipeline']
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-04 20:04:11 [scrapy.core.engine] INFO: Spider opened
2022-04-04 20:04:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2022-04-04 20:04:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000229978B6E80>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x00000229978B6A20>
[s]   spider     <DefaultSpider 'default' at 0x22997d98898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response
Out[1]: <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>

In [2]: response.text
Out[2]: "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>\n\n"

In [3]: response.xpath('//a')
Out[3]: 
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

In [4]: response.xpath('//a').xpath('./img')
Out[4]: 
[<Selector xpath='./img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image5_thumb.jpg">'>]

In [5]: response.xpath('//a').xpath('./img')[0]
Out[5]: <Selector xpath='./img' data='<img src="image1_thumb.jpg">'>

In [6]: response.xpath('//a').xpath('./img').getall()
Out[6]: 
['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

In [7]: response.xpath('//a').xpath('./img').get()
Out[7]: '<img src="image1_thumb.jpg">'

In [8]: result = response.xpath('//a')

In [9]: result
Out[9]: 
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

In [10]: result.xpath('./img').getall()
Out[10]: 
['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

In [11]: response.xpath("//img")
Out[11]: 
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

In [12]: response.css('a')
Out[12]: 
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image ...'>]

In [13]: response.css('div#images')
Out[13]: [<Selector xpath="descendant-or-self::div[@id = 'images']" data='<div id="images">\n   <a href="image1....'>]

In [14]: response.css('div#images').get()
Out[14]: '<div id="images">\n   <a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>\n   <a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>\n   <a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>\n   <a h[...]html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>\n  </div>'

In [15]: response.xpath('//a/text()').re('Name:\s(.*)')
Out[15]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

In [16]: response.re('.*')      # re() cannot be called on the response directly; it has to be chained after a selector
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-a22dedc07090> in <module>()
----> 1 response.re('.*')

AttributeError: 'HtmlResponse' object has no attribute 're'

In [17]: 

4. Implementing pagination

How do we move to the next page?

  • Recall:

    • How did we send a next-page request with the requests module?
      • 1. Find the URL of the next page
      • 2. Then call requests.get(url)
  • Approach here:

    • 1. Find the URL of the next page
    • 2. Build a request for that URL and hand it to the scheduler

1. Paging by joining the next-page URL at the end of parse() and calling back into it

import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider2Spider(scrapy.Spider):
    # spider (crawler) name; remember it, because the spider is started by this name:
    name = 'spider2'
    # # domains the spider is allowed to crawl (keeps it from wandering off to other sites); can be changed:
    # allowed_domains = ['quotes.toscrape.com/']   # with no restriction, the spider can keep following the next page
    # the initial request:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is a selector object)
            # old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() returns it as a string
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # new API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # put the data into the Item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # hand each item over to the pipeline by yielding it
            yield item

        self.page += 1
        # note: the paging has to stop somewhere
        if self.page < 11:
            # build the next request (option 1):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # build the next request (option 2):
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        # the paging logic defined earlier:
        next = response.css('ul.pager li.next a::attr("href")').get()   # href of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]   # get the URL currently being crawled (option 1)
        # print(url)
        url = response.url    # get the URL currently being crawled (option 2)
        # print(url)
        # join it with the relative link to build the next page's URL
        url = response.urljoin(next)
        print(url)
        # hand the new request to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new request is parsed by the same parse() method
        """

2. Paging by overriding the start_requests() method

import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider3Spider(scrapy.Spider):
    # spider (crawler) name; remember it, because the spider is started by this name:
    name = 'spider3'
    # # domains the spider is allowed to crawl (keeps it from wandering off to other sites); can be changed:
    # allowed_domains = ['quotes.toscrape.com/']   # with no restriction, the spider can keep following the next page
    # the initial request:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # build the paging by overriding this method:
    def start_requests(self):   # runs when the spider starts making requests
        for page in range(1, 11):
            url = self.base_url.format(page)
            yield scrapy.Request(url, callback=self.parse)

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the returned response: extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is a selector object)
            # old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() returns it as a string
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # new API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # put the data into the Item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # hand each item over to the pipeline by yielding it
            yield item

        """
        self.page += 1
        # note: the paging has to stop somewhere
        if self.page < 11:
            # build the next request (option 1):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # build the next request (option 2):
            # follow_all() was added in Scrapy 2.0: it joins the URLs and registers the callback
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        """
        # the paging logic defined earlier:
        next = response.css('ul.pager li.next a::attr("href")').get()   # href of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]   # get the URL currently being crawled (option 1)
        # print(url)
        url = response.url    # get the URL currently being crawled (option 2)
        # print(url)
        # join it with the relative link to build the next page's URL
        url = response.urljoin(next)
        print(url)
        # hand the new request to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new request is parsed by the same parse() method
        """

3. Modify start.py to run the new spider and save the data

# A Scrapy spider is part of a project; you cannot just right-click and run the spider file.
# It has to be started from a terminal with the command "scrapy crawl <spider name>".
# If you don't want to type the command in a terminal, create this start.py file instead.
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # invoke the terminal command
# cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())
# cmdline.execute('scrapy crawl spider2'.split())   # invoke the terminal command
cmdline.execute('scrapy crawl spider3'.split())   # invoke the terminal command

# The red text is not an error; it is the crawler's initialization info. The white text is the printed output.

5. Scrapy framework - case study 2

1. Analyse the site

  1. Target site: the Tencent careers (recruitment) site
  2. Goals:
    1. Scrape the job posting information
    2. Handle pagination
      The address-bar URL, which does not itself serve the data: https://talent.antgroup.com/off-campus
  3. Data loading: a mix of dynamic and static
    The data URLs captured from the network traffic:
    Page 1:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
    Page 2:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
    Detail page:
    url: https://careers.tencent.com/jobdesc.html?postId=1310124481703845888
    data-url: https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId=1310124481703845888&language=zh-cn
  4. Crawling approach:
    1. Start from the page-1 URL
    2. Parse the postId of each job on the page
    3. Build the detail-page URLs from those postIds

2. Implementation steps

  1. Create the project
scrapy startproject tencent
  2. Create the spider
cd tencent
scrapy genspider spider1 tencent.com

Output:

C:\Users\lv\Desktop\scrapy框架的学习>scrapy startproject tencent
New Scrapy project 'tencent', using template directory 'd:\anaconda\lib\site-packages\scrapy\templates\project', created in: C:\Users\lv\Desktop\scrapy框架的学习\tencent

You can start your first spider with:
    cd tencent
    scrapy genspider example example.com

C:\Users\lv\Desktop\scrapy框架的学习>cd tencent

C:\Users\lv\Desktop\scrapy框架的学习\tencent>scrapy genspider spider1 tencent.com
Created spider 'spider1' using template 'basic' in module:
  tencent.spiders.spider1

C:\Users\lv\Desktop\scrapy框架的学习\tencent>
  3. Open the tencent project in PyCharm:
    (screenshot omitted)
  4. Use the following command on the command line to generate a spider1.py file:
scrapy genspider spider1 tencent.com
  5. Edit spider1.py as follows:
import scrapy
import json
from tencent.items import TencentItem


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['tencent.com']
    # URL of one page of data (10 records); change the page number (pageIndex) to move through the pages
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # URL of a single job's detail data; change the postId to get a different posting
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId={}&language=zh-cn"

    start_urls = [one_url.format(1)]

    # parse the listing page
    def parse(self, response):
        # parse the data (what comes back is not page source but a data packet, i.e. dict/JSON)
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()

            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']

            # build the detail-page url
            detail_url = self.two_url.format(post_id)
            print(detail_url)

            # build the request:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

        # paging
        for page in range(2, 5):
            url = self.one_url.format(page)
            yield scrapy.Request(url, callback=self.parse)   # after paging we are again parsing a listing page, not a detail page


    # parse the data on the detail page
    def parse_detail(self, response):
        item = response.meta.get('item')
        # print(item)
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Requirement']

        yield item



  6. Uncomment the ITEM_PIPELINES entry in settings.py:
    (screenshot omitted)
  7. Run the start.py launcher:
from scrapy import cmdline

# cmdline.execute("scrapy crawl spider1".split())
cmdline.execute("scrapy crawl spider1 -o demo.csv".split())
# cmdline.execute("scrapy crawl spider2".split())

The result is as follows (a demo.csv file is generated):
(screenshot omitted)

Appendix 1: Using the Spider class

1. What a Spider does
  1. Defines the logic for crawling the site
  2. Parses the pages that were crawled
2. Anatomy of the Spider class
  • name: the spider's name.
  • allowed_domains: the domains the spider may visit; keeps it from wandering off to other sites.
  • start_urls: the list of URLs to request first.
  • custom_settings: a dictionary of settings specific to this spider; it overrides the project-wide configuration and must be defined as a class attribute.
  • crawler: set by the from_crawler() method; it is the Crawler object this spider is bound to and can be used to read the project configuration.
  • closed: called when the spider closes; use it to release resources.
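As a small sketch of how these attributes and methods fit together (the values here are made up for illustration):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'                              # used by: scrapy crawl example
    allowed_domains = ['quotes.toscrape.com']     # off-site requests are filtered out
    start_urls = ['http://quotes.toscrape.com/']
    # per-spider settings; defined as a class attribute because it is read
    # before the spider instance is created
    custom_settings = {'DOWNLOAD_DELAY': 1}

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # called when the spider closes; release resources here
        self.logger.info('spider closed: %s', reason)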

Appendix 2: The Request object

1. Introduction
  • The Request object is the Scrapy object you use when building a new request.
    For example:
yield scrapy.Request(url=detail_url, callback=self.parse_detail)
2. Parameters
  • url: the URL of the new request. It is placed into the queue.
  • callback: the function that will parse the response.
  • priority: the request's priority (lets you decide which queued URL should be fetched first). Defaults to 0; the scheduler uses it when ordering requests, and a larger value is scheduled earlier.
  • method: the HTTP method, "GET" by default.
  • dont_filter: whether the request should bypass the duplicate filter (so the same URL can be requested more than once); defaults to False.
  • errback: a method to run when the request fails; defaults to None. (Rarely used.)
    For example:
    def parse(self, response):
    	...
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, errback=self.func)

    def func(self):
        print("method run after the request fails")
  • body: the request body.
  • headers: the request headers.
  • cookies
  • meta: extra information carried along with the request and handed back on the response.
    For example:
    def parse(self, response):
        # parse the data (what comes back is not page source but a data packet, i.e. dict/JSON)
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()

            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']

            # build the detail-page url
            detail_url = self.two_url.format(post_id)
            print(detail_url)

            # build the request:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

    # parse the data on the detail page
    def parse_detail(self, response):
        item = response.meta.get('item')
        print(item)
  • encoding: the encoding, "utf-8" by default.
  • cb_kwargs: extra keyword arguments to pass to the callback, supplied as a dictionary.
    For example:
	def parse(self, response):
			...
            # build the request:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={"num": 1})

    # parse the data on the detail page
    def parse_detail(self, response, num):
        print(num)
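Related to the method parameter above: when a site expects a POST form rather than a GET, Scrapy also provides scrapy.FormRequest, which URL-encodes the form fields for you. A minimal sketch (the demo site's login page is used; the field values are made up):

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # send the login form as a POST request
        yield scrapy.FormRequest(
            url='http://quotes.toscrape.com/login',
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        print(response.status)

For forms that carry hidden fields (such as a CSRF token), FormRequest.from_response(response, formdata={...}, callback=...) can pre-fill those fields from the page before submitting.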

Appendix 3: CSS selectors


"""
Parsing tools:
    1. Regular expressions                 fast           hardest syntax
    2. XPath                               medium speed   medium-difficulty syntax
    3. BS4 (bs syntax and CSS selectors)   slow           simplest syntax
"""
from bs4 import BeautifulSoup
# a recommended third-party library: parsel
import parsel   # ships with all three selector styles: regex, xpath and css


html = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/titllie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a wel.</p>

<p class="story">...</p>
"""
# 1) Using CSS selectors via the BeautifulSoup module:
# parse the document
# lxml is a third-party parser and is much faster than the default html.parser
soup = BeautifulSoup(html, features="lxml")   # BeautifulSoup completes the missing parts of the HTML automatically (e.g. adds <body>, </html>)
# print(soup)

# 1. look up by tag name
a_tags = soup.select('a')
print(a_tags)

# 2. look up by class name
sister_class = soup.select('.sister')
print(sister_class)

# 3. look up by id
link1_id = soup.select("#link1")
print(link1_id)

# 4. combined lookups
a_link2 = soup.select("p #link2")
print(a_link2)
a_link2 = soup.select("p > #link2")  # > means a direct child
print(a_link2)
p_sister_class = soup.select("p > .sister")
print(p_sister_class)
# the id and the class of the same tag cannot be combined here
# p_sister_class_id = soup.select("p > .sister#link1")
# print(p_sister_class_id)

# 5. look up by attribute
a_href = soup.select('a[href="http://example.com/elsie"]')
print(a_href)

# 6. get the text inside a tag
text1 = soup.select('title')[0].get_text()
print(text1)

# 7. get the value of a tag attribute (e.g. the href attribute)
href = soup.select('a#link1')[0]['href']
print(href)
print("---"*20)


# 2) Using CSS selectors via the parsel module:
selector = parsel.Selector(html)   # create a selector object
# selector.re()
# selector.xpath()
# selector.css()

# 1. look up by tag name
object_list = selector.css("a")
print(object_list.getall())   # getall() returns every match
# for item in object_list:
#     print(item.get())

# 2. look up by class name
print(selector.css('.sister').get())  # get() returns the first match
print(selector.css('.sister').getall())

# 3. look up by id
print(selector.css('#link1').getall())

# 4. combined lookups
print(selector.css('p.story a#link2').getall())


# 5. look up by attribute
print(selector.css('.story').get())

# 6. get the text inside a tag
print(selector.css('p > #link1::text').get())

# 7. get the value of a tag attribute (e.g. the href attribute)
print(selector.css('p > #link1::attr(href)').get())

# 8. pseudo-class selectors
print(selector.css('a').getall()[1])
print(selector.css('a:nth-child(1)').getall())  # select the n-th child
