Python Web Scraping — Scrapy (Part 4)

Disclaimer

This article is for learning and exchange only; no commercial use.

Some images are from 尚硅谷.

An introduction to meta

        In Scrapy, a Request's meta attribute carries extra data along with the request. Because the data rides on the request, it is visible wherever the request (or its response) is handled — in spider callbacks and in middlewares — and from there it is usually copied onto an item, which is how it reaches the pipelines.

        In a spider, meta passes data from one request to its callback. For example:

yield scrapy.Request(url, callback=self.parse_details, meta={'item': item})

        In the example above, setting meta hands the item object to the next request's callback, parse_details, which can read it back as response.meta['item'].
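Under the hood, meta is just a dict attached to the request that Scrapy exposes again on the matching response. A minimal plain-Python model of that hand-off (the FakeRequest/FakeResponse classes are simplified stand-ins for illustration, not the real Scrapy classes):

```python
# Simplified stand-ins for scrapy.Request / scrapy.http.Response, only to
# illustrate how meta travels from a request to its response callback.
class FakeRequest:
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta if meta is not None else {}

class FakeResponse:
    def __init__(self, request):
        self.url = request.url
        # Scrapy exposes the originating request's meta on the response
        self.meta = request.meta

item = {'name': 'some movie'}
req = FakeRequest('https://example.com/detail', meta={'item': item})
resp = FakeResponse(req)
print(resp.meta['item']['name'])  # the callback reads back what parse() stored
```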

In a middleware, meta can be read and modified through the request. For example:

def process_request(self, request, spider):
    # assumes `from datetime import datetime` at the top of the middleware module
    item = request.meta['item']
    item['timestamp'] = datetime.now()
    request.meta['item'] = item

        In the example above, process_request pulls the item object out of the request's meta, adds a timestamp field, and stores the modified item back into meta.
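The stamping step itself is ordinary dict manipulation; here is the same logic with the Scrapy plumbing stripped away so it can be run standalone (the function name and dict layout are illustrative, not Scrapy API):

```python
from datetime import datetime

def stamp_item(meta):
    # mirror of the process_request body above: pull the item out of meta,
    # add a timestamp field, and put it back
    item = meta['item']
    item['timestamp'] = datetime.now()
    meta['item'] = item
    return meta

meta = {'item': {'name': 'demo'}}
stamp_item(meta)
print(meta['item']['timestamp'])
```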

In a pipeline, the data arrives on the item itself (pipelines receive items, not requests, so there is no meta at this stage). For example:

def process_item(self, item, spider):
    timestamp = item['timestamp']
    # ... do something with timestamp ...
    return item

        In the example above, the timestamp set earlier is read from the item as an ordinary field: by the time the item reaches the pipeline, the middleware has already stored the value on the item, so it is no longer fetched through meta.

        In short, meta is a convenient and flexible way to move data between the components of a Scrapy crawl.


Scraping every domestic movie's name and image link from dygod.net (电影天堂, "Movie Heaven")

import scrapy
from scrapy_movie_070.items import ScrapyMovie070Item

class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["www.dygod.net"]
    start_urls = ["https://www.dygod.net/html/gndy/china/index.html"]

    def parse(self, response):
        print("============== reached parse ===============")
        # we want each movie's name from the list page and its image from the detail page
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

        for a in a_list:
            # grab the name and link from the list page
            name = a.xpath('./text()').extract_first()
            src = a.xpath('./@href').extract_first()

            url = 'https://www.dygod.net' + src
            print(name, url)
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name':name})

    def parse_second(self, response):
        print("============== reached parse_second ===============")
        # if no data comes back, double-check the XPath syntax
        img_src = response.xpath('//div[@id="Zoom"]//img[1]/@src').extract_first()
        img_url = 'https://www.dygod.net' + img_src
        # read the meta value attached to the originating request
        name = response.meta['name']

        movie = ScrapyMovie070Item(src=img_url, name=name)

        yield movie
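One detail worth noting in the spider above: the detail URL is built by string concatenation ('https://www.dygod.net' + src), which breaks if the site ever emits an absolute href. urljoin from the standard library handles both cases (the example paths below are made up for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.dygod.net/html/gndy/china/index.html'

# a site-relative href, like the ones on this listing page
print(urljoin(base, '/html/gndy/dyzz/20240308/12345.html'))
# -> https://www.dygod.net/html/gndy/dyzz/20240308/12345.html

# an absolute href passes through unchanged
print(urljoin(base, 'https://cdn.example.com/poster.jpg'))
# -> https://cdn.example.com/poster.jpg
```

Inside a spider you can simply call response.urljoin(src), which does the same thing using the response's own URL as the base.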

        CrawlSpider is a special spider class in the Scrapy framework that provides rule-based crawling. A CrawlSpider uses a set of rules to define its behaviour, and automatically follows and crawls the links on each page according to those rules.

        With CrawlSpider you can extract data from a site with very little code. The basic steps are:

  1. Create a subclass of CrawlSpider and set its name attribute (the spider's unique identifier) and allowed_domains attribute (which restricts the domains to crawl).

  2. Define a rules attribute containing one or more Rule objects; each Rule defines one crawling rule.

    • A Rule's link_extractor defines the link extractor used to pull links out of a page.

    • A Rule's callback names the function that processes the pages those links lead to.

  3. Write that callback function to handle each extracted page.

  4. Inside the callback, extract data with XPath or CSS selectors and yield Item objects, or yield new Request objects for further crawling.

Here is a simple CrawlSpider example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # 提取数据并返回Item对象
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('.content::text').getall(),
        }

        In the example above, allowed_domains restricts the crawl to pages under example.com, and start_urls defines the initial URLs.

  rules defines a single rule: a LinkExtractor pulls out every link matching the allow pattern and hands each resulting page to parse_page. follow=True means links found on those pages are followed as well.

  parse_page is the callback that processes each extracted page; inside it you can use XPath or CSS selectors to pull data out of the page and yield Item objects.

        With these steps you have a rule-based spider, and CrawlSpider handles the link following and crawling automatically.

        (The original post included a diagram from 尚硅谷 here; it is not reproduced.)

C:\Users\14059>scrapy shell https://www.dushu.com/book/1188.html
2024-03-08 17:00:29 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: scrapybot)
2024-03-08 17:00:29 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform Windows-10-10.0.22621-SP0
2024-03-08 17:00:29 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2024-03-08 17:00:29 [py.warnings] WARNING: d:\python\python375\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-03-08 17:00:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-08 17:00:29 [scrapy.extensions.telnet] INFO: Telnet Password: 13c50912dfa84ac1
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-08 17:00:29 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-08 17:00:29 [scrapy.core.engine] INFO: Spider opened
2024-03-08 17:00:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dushu.com/book/1188.html> (referer: None)
2024-03-08 17:00:30 [asyncio] DEBUG: Using selector: SelectSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000254496A38C8>
[s]   item       {}
[s]   request    <GET https://www.dushu.com/book/1188.html>
[s]   response   <200 https://www.dushu.com/book/1188.html>
[s]   settings   <scrapy.settings.Settings object at 0x00000254496A3748>
[s]   spider     <DefaultSpider 'default' at 0x25449bbdf88>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2024-03-08 17:00:30 [asyncio] DEBUG: Using selector: SelectSelector
2024-03-08 17:00:30 [asyncio] DEBUG: Using selector: SelectSelector
In [1]: from scrapy.linkextractors import LinkExtractor

2024-03-08 17:01:58 [asyncio] DEBUG: Using selector: SelectSelector
In [2]: link = LinkExtractor

2024-03-08 17:02:49 [asyncio] DEBUG: Using selector: SelectSelector
In [3]: from scrapy.linkextractors import LinkExtractor

2024-03-08 17:03:24 [asyncio] DEBUG: Using selector: SelectSelector
In [4]: link = LinkExtractor(allow=r'/book/1188_\d+\.html')

2024-03-08 17:04:45 [asyncio] DEBUG: Using selector: SelectSelector
In [5]: link
Out[6]: <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x2544d2ae508>

2024-03-08 17:05:01 [asyncio] DEBUG: Using selector: SelectSelector
In [7]: link.extract_links(response)
Out[7]:
[Link(url='https://www.dushu.com/book/1188_2.html', text='2', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_3.html', text='3', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_4.html', text='4', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_5.html', text='5', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_6.html', text='6', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_7.html', text='7', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_8.html', text='8', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_9.html', text='9', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_10.html', text='10', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_11.html', text='11', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_13.html', text='13', fragment='', nofollow=False)]

2024-03-08 17:05:20 [asyncio] DEBUG: Using selector: SelectSelector
In [8]: link1 = LinkExtractor

2024-03-08 17:17:12 [asyncio] DEBUG: Using selector: SelectSelector
In [9]: link1 = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a/@href')

2024-03-08 17:18:03 [asyncio] DEBUG: Using selector: SelectSelector
In [10]: link.extract_links(response)
Out[10]:
[Link(url='https://www.dushu.com/book/1188_2.html', text='2', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_3.html', text='3', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_4.html', text='4', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_5.html', text='5', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_6.html', text='6', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_7.html', text='7', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_8.html', text='8', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_9.html', text='9', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_10.html', text='10', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_11.html', text='11', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_13.html', text='13', fragment='', nofollow=False)]

That was my whole battle of wits with the shell, [○・`Д´・ ○]. Two slips in the session are worth flagging: restrict_xpaths should point at elements (e.g. //div[@class="pages"]/a), not at @href attributes, and the final extract_links call was made on link rather than link1, which is why it still returned results.


A CrawlSpider case study

Goal: scrape book data from dushu.com (读书网) and store it.

(1) Create a project:

scrapy startproject <project_name>

(2) Change into the spiders directory:

cd <project_name>\<project_name>\spiders

(3) Create the spider file:

scrapy genspider -t crawl <spider_name> <domain_to_crawl>

Note: always check whether the first page's URL follows the same structure as the other page numbers.
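For example, on dushu.com the first page is /book/1188.html while later pages are /book/1188_2.html, /book/1188_3.html, and so on — so an allow pattern of r"/book/1188_\d+\.html" silently skips page 1. Making the _\d+ part optional covers both; this can be checked with plain re, since allow patterns are ordinary regular expressions:

```python
import re

# one pattern that matches both the first page and the numbered pages
pattern = re.compile(r'/book/1188(_\d+)?\.html')

for path in ['/book/1188.html', '/book/1188_2.html', '/book/1188_13.html']:
    print(path, bool(pattern.search(path)))  # all True
```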

If you don't need to store the data in a database, the code looks like this:

read.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_readbook_090.items import ScrapyReadbook090Item

class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1188_1.html"]

    rules = (Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"),
                  callback="parse_item",
                  follow=True),)

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            name = img.xpath('./@alt').extract_first()
            img_src = img.xpath('./@data-original').extract_first()

            book = ScrapyReadbook090Item(name=name, src=img_src)
            yield book

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyReadbook090Pipeline:

    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
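One caveat about the pipeline above: str(item) writes Python repr-style text, not valid JSON, so book.json cannot be loaded back with json.load. To write real JSON, serialize each item explicitly; a sketch of just that step (the function name is illustrative, and it assumes items convert cleanly to dicts, as the Field-based item above does):

```python
import io
import json

def write_item_as_json_line(item, fp):
    # dict(item) works for scrapy.Item instances and plain dicts alike;
    # ensure_ascii=False keeps Chinese titles readable in the file
    fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
    return item

# demo against an in-memory file instead of book.json
buf = io.StringIO()
write_item_as_json_line({'name': '红楼梦', 'src': 'https://example.com/a.jpg'}, buf)
print(buf.getvalue())
```

Each item then occupies one line of valid JSON (the "JSON Lines" layout), which Scrapy's own feed exports also support via the jsonlines format.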

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyReadbook090Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

settings.py

# Scrapy settings for scrapy_readbook_090 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_readbook_090"

SPIDER_MODULES = ["scrapy_readbook_090.spiders"]
NEWSPIDER_MODULE = "scrapy_readbook_090.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_readbook_090 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_readbook_090.middlewares.ScrapyReadbook090SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_readbook_090.middlewares.ScrapyReadbook090DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "scrapy_readbook_090.pipelines.ScrapyReadbook090Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

Summary

        In the end there isn't much difference from a regular spider; once you get used to it, the code is quite simple. The tricky parts are the small details and working out the right extraction paths.
