Scrapy Framework: Creating a Project

LyaJpunov

Article link: https://blog.csdn.net/weixin_43903639/article/details/122748933

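Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python. It is asynchronous and favors XPath selectors, which are more readable than regular expressions. It can crawl several URLs concurrently and provides an interactive shell that makes it easy to debug selectors in isolation. It does not, however, support distributed crawling out of the box.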

1. Installing Scrapy

pip install Scrapy

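The official documentation does not recommend this route on every platform. On Windows in particular, it recommends installing Anaconda or Miniconda first and using the package from the conda-forge channel: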

conda install -c conda-forge scrapy

In our tests we found that two more libraries had to be installed:

pip install itemloaders
pip install protego
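
As a quick sanity check after installation, the sketch below reports whether the three packages mentioned above can be found by Python. The helper name `check_packages` is made up for this illustration; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def check_packages(names):
    """Return {package name: True/False} depending on whether it can be found."""
    # find_spec returns None for a top-level package that is not installed
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(["scrapy", "itemloaders", "protego"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can then be installed with pip as shown above.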

2. Starting a New Project

2.1 Creating the project

scrapy startproject tutorial

2.2 Writing the code

This is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of the project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

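    def start_requests(self):
        # Initial URLs the spider starts crawling from
        urls = [
            'https://www.xbiquge.la/32/32626/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Use the last path segment of the URL to name the output file
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')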

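name: identifies the spider. It must be unique within a project; that is, you cannot give the same name to different spiders.
start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) from which the spider will begin to crawl. Subsequent requests are generated successively from these initial requests.
parse(): the method called to handle the response downloaded for each request. The response parameter is an instance of TextResponse, which holds the page content.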
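
To make the filename logic in parse() concrete, here is a minimal standalone sketch of the same string handling. The function name `page_filename` is hypothetical and exists only for this illustration.

```python
def page_filename(url):
    """Build the output filename the same way parse() does."""
    # "https://www.xbiquge.la/32/32626/".split("/") ends with [..., "32626", ""],
    # so index -2 is the last non-empty path segment.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("https://www.xbiquge.la/32/32626/"))  # quotes-32626.html
```

This relies on the URL ending with a slash. Without the trailing slash, index -2 would pick the segment before the last one instead.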

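We can also simplify the code above. Instead of implementing start_requests(), define a start_urls class attribute: the default start_requests() builds the initial requests from it, and parse() is the default callback.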

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.xbiquge.la/32/32626/',
    ]
    
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

2.3 Running the code

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl quotes

We have now finished our first spider, although all it does so far is download the page.

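3. Selectors

As the previous example shows, response is the object that holds the downloaded page: response.url is the URL that was crawled, and response.body is the body of the page as bytes.

3.1 CSS selectors

We can try out selectors interactively in the Scrapy shell: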

scrapy shell "https://www.xbiquge.la/32/32626/"

>>> response.css("title")
[<Selector xpath='descendant-or-self::title' data='<title>叶辰孙怡'>]

>>> response.css("title::text").getall()
['笔趣阁']

>>> response.css('title::text').get()
'笔趣阁'

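Note that the result of calling .getall() is a list: a selector may match several elements, so all of them are extracted. When you know you only want the first result, as in this case, you can use .get(), which returns a single string, or None when nothing matches.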

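Most scraping code should be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you still get at least some data. Besides the getall() and get() methods, you can also extract data with regular expressions through the re() method. The examples below were run against http://quotes.toscrape.com/, whose title is "Quotes to Scrape":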

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

3.2 XPath selectors

Besides CSS, Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

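XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood, which you can see by reading the text representation of the selector objects in the shell closely. An earlier article in this column covered XPath in detail. XPath is arguably the foundation of all crawling work, so readers who are not yet familiar with it should study it before continuing.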

4. An Example That Follows the Next-Page Link

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

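The markup above shows that this page has a link to the next page. Any of the following expressions extracts it:

# Get the link to the next page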
response.css('li.next a::attr(href)').get()

response.css('li.next a').attrib['href']

response.xpath('//li[@class="next"]/a/@href').get()

Now let's look at our spider, modified to recursively follow the link to the next page and extract data from each page:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
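
The href extracted from the pager is relative ("/page/2/"), which is why the spider calls response.urljoin before building the request. Scrapy's documentation describes response.urljoin as resolving the link against the response's URL in the manner of `urllib.parse.urljoin` (HTML responses may also take a `<base>` tag into account). A minimal sketch with the standard library alone, under that assumption:

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href replaces the whole path of the page URL.
print(urljoin(page_url, "/page/2/"))   # http://quotes.toscrape.com/page/2/
# An absolute href is returned unchanged.
print(urljoin(page_url, "http://quotes.toscrape.com/page/3/"))
```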

4.1 Storing the data

The data yielded in parse can be stored in a file:

scrapy crawl quotes -O quotes.json

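This generates a quotes.json file containing all scraped items, serialized as JSON. The -O switch overwrites any existing file; use -o instead to append to an existing file.

For small projects such as the one in this tutorial, that should be enough. If you want to do more complex things with the scraped items, however, you need to write an item pipeline.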
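
As a rough idea of what a pipeline looks like, here is a minimal sketch. The class name `CleanQuotePipeline` and the cleaning rule are invented for this illustration; the only part taken from Scrapy is the convention that a pipeline is a plain class with a `process_item(item, spider)` method that returns the item.

```python
class CleanQuotePipeline:
    """Hypothetical item pipeline: strips surrounding quote marks and spaces."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text is not None:
            # Remove curly quotes and whitespace from both ends of the text
            item["text"] = text.strip("\u201c\u201d ")
        # Returning the item passes it on to the next pipeline or the exporter
        return item


# Pipelines are plain classes, so one can be exercised without running a crawl.
item = {"text": "\u201cAn example quote.\u201d", "author": "Someone", "tags": []}
print(CleanQuotePipeline().process_item(item, spider=None))
```

To use such a class in the project, it would typically live in tutorial/pipelines.py and be enabled through the ITEM_PIPELINES setting in tutorial/settings.py, for example `ITEM_PIPELINES = {"tutorial.pipelines.CleanQuotePipeline": 300}`.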

4.2 A shortcut for creating requests

In this example we created the request with:

next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)

As a shortcut, we can use response.follow instead:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow only returns a Request instance; you still have to yield this request.

5. Using Spider Arguments

You can pass command-line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

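In this example, the value provided for the tag argument is available through self.tag. You can use it to make your spider fetch only quotes with a specific tag, building the URL from the argument: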

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # Values passed with -a become spider attributes; fall back to None
        tag = getattr(self, 'tag', None)
        self.log('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')
        self.log(tag)
        # This demo only logs the tag; the URL itself does not depend on it
        url = 'https://www.xbiquge.la/32/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')

In the debug output we find these two lines (this run passed tag=123123123):

2022-01-29 20:57:05 [quotes] DEBUG: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2022-01-29 20:57:05 [quotes] DEBUG: 123123123

This shows that the argument was passed through successfully.
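
The spider above only logs the tag. To actually build the URL from the argument, one possible sketch follows. It assumes the URL layout of quotes.toscrape.com, where quotes for a tag are listed under /tag/<name>, and the function name `build_start_url` is hypothetical.

```python
def build_start_url(tag=None):
    """Return the listing URL, narrowed to one tag when a tag is given."""
    url = "http://quotes.toscrape.com/"
    if tag is not None:
        # e.g. -a tag=humor leads to http://quotes.toscrape.com/tag/humor
        url = url + "tag/" + tag
    return url


print(build_start_url())          # http://quotes.toscrape.com/
print(build_start_url("humor"))   # http://quotes.toscrape.com/tag/humor
```

Inside start_requests, the result of `build_start_url(getattr(self, 'tag', None))` could be passed to scrapy.Request in place of the fixed URL.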
